
Relocating Celerra Checkpoints

On our NS-960 Celerra we have multiple storage pools defined, one of which is dedicated to checkpoint storage. Of the 100 or so file systems that we run a checkpoint schedule on, about two-thirds were incorrectly writing their checkpoints to the production file system pool rather than the checkpoint pool, and the production pool was starting to fill up. I started to research how to change where checkpoints are stored. Unfortunately, you can’t relocate existing checkpoints; you need to start over.

To change where checkpoints are stored, you need to stop and delete any running replications (which automatically store root replication checkpoints) and delete all current checkpoints for the specific file system you’re working on. Depending on your checkpoint schedule and how often it runs, you may want to pause it temporarily. Having to stop and delete remote replications was painful for me, as I compete for bandwidth with our Data Domain backups to our DR site. Because of that, I’ve been working on these one at a time.

Once you’ve deleted the relevant checkpoints and replications, you can choose where to store the new checkpoints by creating a single checkpoint first before making a schedule. From the GUI, go to Data Protection | Snapshots | Checkpoints tab, and click create. If a checkpoint does not exist for the file system, it will give you a choice of which pool you’d like to store it in. Below is what you’ll see in the popup window.

Choose Data Mover [Drop Down List ▼]
Production File System [Drop Down List ▼]
Writeable Checkpoint [Checkbox]:
Data Movers: server_2
Checkpoint Name: [Fill in the Blank]

Configure Checkpoint Storage:
There are no checkpoints currently on this file system. Please specify how to allocate
storage for this and future checkpoints of this file system.

Create from:
*Storage Pool [Option Box]
*Meta Volume [Option Box]

Current Storage System: CLARiiON CX4-960 APM00900900999
Storage Pool: [Drop Down List ▼]    <—*Select the appropriate Storage Pool Here*
Storage Capacity (MB): [Fill in the Blank]

Auto Extend Configuration:
High Water Mark: [Drop Down List for Percentage ▼]

Maximum Storage Capacity (MB): [Fill in the Blank]



Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead appliance would fail when attempting to optimize SMBv2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

Adding a grep for Maximum and using awk to grab only the last column, this command will output only the number of open files, rather than the large output above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'

The output of that command would simply be ‘4915’ based on the sample full output I used above.
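
A minimal sketch of how that one-liner could be wrapped for logging. The function names, the log file, and the 100,000 limit are assumptions made for illustration, and the sketch reads the last field ($NF) instead of $10 so it does not depend on the exact column count.

```shell
#!/bin/bash
# Sketch only: function names, log path, and limit are hypothetical.

# Read server_stats cifs-std output on stdin and print the open-file count,
# taken from the last column of the "Maximum" summary row.
open_files_from_stats() {
    awk '$1 == "Maximum" { print $NF }'
}

# Append a timestamped reading to a log file and warn when it exceeds a limit.
log_open_files() {
    local count="$1" logfile="$2" limit="$3"
    echo "$(date '+%Y-%m-%d %H:%M:%S'),$count" >> "$logfile"
    if [ "$count" -gt "$limit" ]; then
        echo "WARNING: $count open files exceeds $limit"
    fi
}

# Demo using the Maximum row from the sample output above.
sample='Maximum      3379      905     9584        11        9      272        30         1856     4915   '
count=$(echo "$sample" | open_files_from_stats)
echo "$count"

# On the control station the input would come from the real command:
#   /nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | open_files_from_stats
```

Run from cron, log_open_files would build up a simple history that shows how quickly the open-file count climbs between data mover reboots.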

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least or 7.0.52.  You will also see in your steelhead logs a message similar to the one below indicating that the close request has failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Collecting info on active shares, clients, protocols, and authentication on the VNX

I had a comment in one of my Celerra/VNX posts asking for more specific info on listing and counting shares, clients, protocol and authentication information, as well as virus scan information.  I knew the answers to most of those questions; however, I’d need to pull out the Celerra documentation from EMC for virus scan info.  I thought it might be more useful to put this information into a new post rather than simply replying to a comment on an old post.

How to list and count shares/exports (by protocol):

You can use the server_export command to list and count how many shares you have.

server_export server_2 -Protocol cifs -list -all:  This command will give you a list of all CIFS Shares across all of your CIFS servers.  It will count a file system twice if it’s shared on more than one CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name>:  This will give you a list of all CIFS shares on a specific CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name> | wc:  This will give you the number of CIFS shares on a specific CIFS server.  The “wc” command will output three numbers (lines, words, and bytes); the first number listed is the number of shares.

server_export server_2 -Protocol nfs -list -all:  This command will give you a list of all NFS exports.  As with the previous commands, you can add “| wc” at the end to get a count.
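
Because wc prints three numbers, grep -c (or wc -l) is a handy way to get the share count as a single value. The sketch below assumes only that each share occupies one output line containing the CIFS server name; the function name is hypothetical and the sample lines are simplified stand-ins, not real server_export output.

```shell
#!/bin/bash
# Count the lines on stdin that contain the given CIFS server name.
# -F treats the name as a fixed string; -c prints only the match count.
count_shares_for() {
    grep -c -F -- "$1"
}

# Hypothetical stand-in lines; real server_export output will look different.
sample='share Users_A netbios=CIFSSERVER1
share Users_B netbios=CIFSSERVER1
share Backup netbios=CIFSSERVER2'

echo "$sample" | count_shares_for "CIFSSERVER1"

# Real usage (the server name is a placeholder):
#   server_export server_2 -Protocol cifs -list -all | count_shares_for "<cifs_server_name>"
```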

How to list client connections by OS type:

To obtain information for the number of client connections you have by OS type, you’d have to use the server_cifs audit command.

To get a full list of every active connection by client type, use this command:

server_cifs server_2 -option audit,full | grep "Client("

The output would look like this:

|||| AUDIT Ctx=0x022a6bdc08, ref=2, W2K Client( Port=49863/445
|||| AUDIT Ctx=0x01f18efc08, ref=2, XP Client( Port=3890/445
|||| AUDIT Ctx=0x02193a2408, ref=2, W2K Client( Port=59027/445
|||| AUDIT Ctx=0x01b89c2808, ref=2, Fedora6 Client( Port=17130/445
|||| AUDIT Ctx=0x0203ae3008, ref=2, Fedora6 Client( Port=55731/445

In this case, if I wanted to count only the number of Fedora6 clients, I’d use the command server_cifs server_2 -option audit,full | grep "Fedora6 Client".  I could then add "| wc" at the end to get a count.
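
Rather than running one grep per client type, a short awk tally can count every type in a single pass. This is a sketch under the assumption that the client type is always the word immediately before "Client(" on each AUDIT line, as in the sample output above; the function name is hypothetical.

```shell
#!/bin/bash
# Tally connections by client type from audit output read on stdin.
# The prefix match also covers lines where an address follows "Client(".
tally_clients() {
    awk '{
            for (i = 2; i <= NF; i++)
                if (index($i, "Client(") == 1) count[$(i - 1)]++
         }
         END { for (os in count) print os, count[os] }' | LC_ALL=C sort
}

# The five sample lines shown above.
sample='|||| AUDIT Ctx=0x022a6bdc08, ref=2, W2K Client( Port=49863/445
|||| AUDIT Ctx=0x01f18efc08, ref=2, XP Client( Port=3890/445
|||| AUDIT Ctx=0x02193a2408, ref=2, W2K Client( Port=59027/445
|||| AUDIT Ctx=0x01b89c2808, ref=2, Fedora6 Client( Port=17130/445
|||| AUDIT Ctx=0x0203ae3008, ref=2, Fedora6 Client( Port=55731/445'

result=$(echo "$sample" | tally_clients)
echo "$result"

# Real usage:
#   server_cifs server_2 -option audit,full | tally_clients
```

For the sample lines this prints one row per client type: Fedora6 2, W2K 2, and XP 1.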

To do a full audit report:

The command server_cifs server_2 -option audit,full will do a full, detailed audit report and should capture just about anything else you’d need.  Every connection will have a detailed audit included in the report. Based on that output, it would be easy to run the command with a grep statement to pull only the information out that you need to create custom reports.

Below is a subset of what the output looks like from that command:

|||| AUDIT Ctx=0x0177206407, ref=2, W2K8 Client( Port=65340/445
||| CIFSSERVER1[DOMAIN] on if=<interface_name>
||| CurrentDC 0x0169fee808=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=c2de9f99-1945-11e2-a512-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x0125fc9408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0230414c08), Name=FileSystem1, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem1
| NTFSExtInfo: shareFS:fsid=49, rc=830, listFS:nb=1 [fsid=49,rc=830]

|||| AUDIT Ctx=0x0210f89007, ref=2, W2K8 Client( Port=51607/445
||| CIFSSERVER1[DOMAIN] on <interface_name>
||| CurrentDC 0x01f79aa008=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K8, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=5b410977-bace-11e1-953b-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x01c2367408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0195d2a408), Name=Filesystem2, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem2
| NTFSExtInfo: shareFS:fsid=4214, rc=19, listFS:nb=0

|||| AUDIT Ctx=0x006aae8408, ref=2, XP Client( Port=1258/445
 ||| CIFSSERVER1[DOMAIN] on if=<interface_name>
 ||| CurrentDC 0x01f79aa008=<Domain_Controller>
 ||| Proto=NT1, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
 ||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
 ||| Uid=0x3f NTcred(0x01ccebc008 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
 || Cnxp(0x019edd7408), Name=Filesystem8, cUid=0x3f Tid=0x3f, Ref=1, Aborted=0
 | readOnly=0, umask=22, opened files/dirs=3
 | Absolute path of the share=\Filesystem8
 | NTFSExtInfo: shareFS:fsid=35, rc=43, listFS:nb=0
 | Fid=2901, FNN=0x0012d46b40(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1
 |    Notify commands received:
 |    Event=0x17, wt=1, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3bca, Uid=0x3f, size=0x20
 | Fid=3335, FNN=0x00193baaf0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0xe200, Uid=0x3f, size=0x20
 | Fid=3683, FNN=0x00290471c0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1\Subdirectory2
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3987, Uid=0x3f, size=0x20

Making a case for file archiving

We’ve been investigating options for archiving unstructured (file based) data that resides on our Celerra for a while now. There are many options available, but before looking into a specific solution I was asked to generate a report that showed exactly how much of the data has been accessed by users for the last 60 days and for the last 12 months.  As I don’t have permissions to the shared folders from my workstation I started looking into ways to run the report directly from the Celerra control station.  The method I used will also work on VNX File.

After a little bit of digging I discovered that you can access all of the file systems from the control station by navigating to /nas/quota/slot_<x>, where <x> is the data mover number.  The slot_2 folder would be for the server_2 data mover, slot_3 would be for server_3, etc.  With full access to all the file systems, I simply needed to write a script that scanned each folder and counted the number of files that had been modified within a certain time window.

I always use Excel for scripts I know are going to be long.  I copy the file system list from Unisphere then put the necessary commands in different columns, and end it with a concatenate formula that pulls it all together.  If you put echo -n in A1, “Users_A,” in B1, and >/home/nasadmin/scripts/Users_A.dat in C1, you’d just need to type the formula “=CONCATENATE(A1,B1,C1)” into cell D1.  D1 would then contain echo -n “Users_A,” > /home/nasadmin/scripts/Users_A.dat. It’s a simple and efficient way to make long scripts very quickly.
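
The same line-per-filesystem generation can also be done in the shell itself. The sketch below is an alternative to the spreadsheet approach, not the method used for this report; the function name is hypothetical, and it assumes the file system names arrive one per line.

```shell
#!/bin/bash
# Print one "echo -n" command per file system name read from stdin.
#   $1 = directory that will hold the generated .dat files
gen_label_cmds() {
    local outdir="$1" fs
    while IFS= read -r fs; do
        [ -n "$fs" ] || continue          # skip blank lines
        printf 'echo -n "%s," > %s/%s.dat\n' "$fs" "$outdir" "$fs"
    done
}

cmds=$(printf 'Users_A\nUsers_B\n' | gen_label_cmds /home/nasadmin/scripts)
echo "$cmds"
```

Redirecting the output to a .sh file gives the same result as copying the concatenated column out of the spreadsheet.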

In this case, the script needed four different sections.  All four of these sections I’m about to go over were copied into a single shell script and saved in my /home/nasadmin/scripts directory.  After creating the .sh file, I always do a chmod +x and chmod 777 on the file.  Be prepared for this to take a very long time to run.  It of course depends on the number of file systems on your array, but for me this script took about 23 hours to complete.

First, I create a text file for each file system that contains the name of the filesystem (and a comma) which is used later to populate the first column of the final csv output.  It’s of course repeated for each file system.

echo -n "Users_A," > /home/nasadmin/scripts/Users_A.dat
echo -n "Users_B," > /home/nasadmin/scripts/Users_B.dat

... <continued for each filesystem>

Second, I use the ‘find’ command to walk each directory tree and count the number of files that were last touched over 60 days ago.  Note that -mtime tests the modification time (-atime would test the access time instead), and -mtime +365 gives the 12-month version of the report.  The output is written to another text file that will be used in the csv output file later.

find /nas/quota/slot_2/Users_A -mtime +60 | wc -l > /home/nasadmin/scripts/Users_A_wc.dat

find /nas/quota/slot_2/Users_B -mtime +60 | wc -l > /home/nasadmin/scripts/Users_B_wc.dat

... <continued for each filesystem>

Third, I want to count the total number of files in each file system.  A third text file is written with that number, again for the final combined report that’s generated at the end.

find /nas/quota/slot_2/Users_A | wc -l > /home/nasadmin/scripts/Users_A_total.dat

find /nas/quota/slot_2/Users_B | wc -l > /home/nasadmin/scripts/Users_B_total.dat

... <continued for each filesystem>

Finally, each file is combined into the final report.  The output will show each filesystem with two columns: files accessed 60+ days ago and total files.  The comma.dat file is a helper that contains a single comma and only needs to be created once.  You can then easily update the report in Excel and add columns that show files accessed in the last 60 days, the percentage of files accessed in the last 60 days, etc., with some simple math.

echo -n "," > /home/nasadmin/scripts/comma.dat

cat /home/nasadmin/scripts/Users_A.dat /home/nasadmin/scripts/Users_A_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_A_total.dat | tr -d "\n" > /home/nasadmin/scripts/fsoutput.csv; echo "" >> /home/nasadmin/scripts/fsoutput.csv

cat /home/nasadmin/scripts/Users_B.dat /home/nasadmin/scripts/Users_B_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_B_total.dat | tr -d "\n" >> /home/nasadmin/scripts/fsoutput.csv; echo "" >> /home/nasadmin/scripts/fsoutput.csv

... <continued for each filesystem>

My final output looks like this:

Filesystem   Total Files   Accessed 60+ days ago   Accessed in last 60 days   % Accessed in last 60 days
Users_A          827,057                 734,848                     92,209                        11.15
Users_B           61,975                  54,727                      7,248                        11.70
Users_C          150,166                 132,457                     17,709                        11.79

The three example filesystems above show that only about 11% of the files have been accessed in the last 60 days.   Most user data has a very short lifecycle: it’s ‘hot’ for a month or less, then dramatically tapers off as the business value of the data drops.  These file systems would be prime candidates for archiving.
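
The counting and the percentage math can also be done in one step per file system. The following is a minimal sketch, not the script used for the report above: the function name and demo directory are hypothetical, it counts regular files only (-type f), and the demo relies on GNU touch -d to backdate files.

```shell
#!/bin/bash
# Sketch only: fs_report_line and the demo tree are assumptions.

# Print "name,total,older,recent,pct_recent" for one directory tree.
#   $1 = directory to scan, $2 = age threshold in days
fs_report_line() {
    local dir="$1" days="$2" total older
    total=$(find "$dir" -type f | wc -l | tr -d ' ')
    older=$(find "$dir" -type f -mtime +"$days" | wc -l | tr -d ' ')
    awk -v n="$(basename "$dir")" -v t="$total" -v o="$older" 'BEGIN {
        pct = (t > 0) ? 100 * (t - o) / t : 0
        printf "%s,%d,%d,%d,%.2f\n", n, t, o, t - o, pct
    }'
}

# Demo on a throwaway tree: two files dated 90 days back, one new file.
demo=$(mktemp -d)
mkdir "$demo/Users_A"
touch -d '90 days ago' "$demo/Users_A/old1" "$demo/Users_A/old2"
touch "$demo/Users_A/new1"
line=$(fs_report_line "$demo/Users_A" 60)
echo "$line"
rm -rf "$demo"

# On the control station this might be called as:
#   fs_report_line /nas/quota/slot_2/Users_A 60 >> /home/nasadmin/scripts/fsoutput.csv
```

The percentage is computed the same way as the last column of the table: files newer than the threshold divided by total files.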

My final report definitely supported the need for archiving, but we’ve yet to start a project to complete it.  I like the possibility of using EMC’s cloud tiering appliance, which can archive data directly to the cloud service of your choice.  I’ll make another post in the future about archiving solutions once I’ve had more time to research it.

Can’t join CIFS Server to domain – sasl protocol violation

I was running a live disaster recovery test of our Celerra CIFS Server environment last week and I was not able to get the CIFS servers to join the replica of the domain controller on the DR network.  I would get the error ‘Sasl protocol violation’ on every attempt to join the domain.

We have two interfaces configured on the data mover, one connects to production and one connects to the DR private network.  The default route on the Celerra points to the DR network and we have static routes configured for each of our remote sites in production to allow replication traffic to pass through.  Everything on the network side checked out: I could ping DCs and DNS servers, and NTP was configured to a DR network time server and was working.

I was able to ping the DNS Server and the domain controller:

[nasadmin@datamover1 ~]$ server_ping server_2
server_2 : is alive, time= 0 ms
[nasadmin@datamover1 ~]$ server_ping server_2
server_2 : is alive, time= 3 ms

When I tried to join the CIFS Server to the domain I would get this error:

[nasadmin@datamover1 ~]$ server_cifs prod_vdm_01 -Join compname=fileserver01,domain=company.net,admin=myadminaccount -option reuse
prod_vdm_01 : Enter Password:*********
Error 13157007706: prod_vdm_01 : DomainJoin::connect:: Unable to connect to the LDAP service on Domain Controller ‘domaincontroller.company.net’ (@ for compname ‘fileserver01’. Result code is ‘Sasl protocol violation’. Error message is Sasl protocol violation.

I also saw this error message during earlier tests:

Error 13157007708: prod_vdm_01 : DomainJoin::setAccountPassword:: Unable to set account password on Domain Controller ‘domaincontroller.company.net’ for compname ‘fileserver01’. Kerberos gssError is ‘Miscellaneous failure. Cannot contact any KDC for requested realm. ‘. Error message is d0000,-1765328228.

I noticed these error messages in the server log:

2012-06-21 07:03:00: KERBEROS: 3: acquire_accept_cred: Failed to get keytab entry for principal host/fileserver01.company.net@COMPANY.NET - error No principal in keytab matches desired name (39756033)
2012-06-21 07:03:00: SMB: 3: SSXAK=LOGON_FAILURE Client=x.x.x.x origin=510 stat=0x0,39756033
2012-06-21 07:03:42: KERBEROS: 5: Warning: send_as_request: Realm COMPANY.NET - KDC X.X.X.X returned error: Clients credentials have been revoked (18)

The final resolution to the problem was to reboot the data mover. EMC determined that the issue was because the Kerberos keytab entry for the CIFS server was no longer valid. It could be caused by corruption or because the machine account password expired. A reboot of the data mover causes the Kerberos keytab and SPN credentials to be resubmitted, thus resolving the problem.

Undocumented Celerra / VNX File commands


The .server_config command is not documented by EMC; I assume they don’t want customers messing with it. Use these commands at your own risk. 🙂

Below is a list of some of those undocumented commands, most are meant for viewing performance stats. I’ve had EMC support use the fcp command during a support call in the past.   When using the command for fcp stats,  I believe you need to run the ‘reset’ command first as it enables the collection of statistics.

There are likely other parameters that can be used with .server_config but I haven’t discovered them yet.

TCP Stats:

To view TCP info:
.server_config server_x -v "printstats tcpstat"
.server_config server_x -v "printstats tcpstat full"
.server_config server_x -v "printstats tcpstat reset"

Sample Output (truncated):
TCP stats :
connections initiated 8698
connections accepted 1039308
connections established 1047987
connections dropped 524
embryonic connections dropped 3629
conn. closed (includes drops) 1051582
segs where we tried to get rtt 8759756
times we succeeded 11650825
delayed acks sent 537525
conn. dropped in rxmt timeout 0
retransmit timeouts 823
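
Individual counters can be pulled out of this output for trending. The sketch below assumes the "label value" layout shown in the sample output; the function name is hypothetical, and the drop percentage is only an illustrative calculation, not an EMC-defined metric.

```shell
#!/bin/bash
# Print the value of one counter from "printstats tcpstat" output on stdin.
# The label must match the start of the line (after leading whitespace), so
# "connections dropped" does not also match "embryonic connections dropped".
tcp_counter() {
    awk -v label="$1" '{
        sub(/^[ \t]+/, "")
        if (index($0, label " ") == 1) { print $NF; exit }
    }'
}

# Lines taken from the sample output above.
sample='connections established 1047987
connections dropped 524
embryonic connections dropped 3629
retransmit timeouts 823'

established=$(echo "$sample" | tcp_counter "connections established")
dropped=$(echo "$sample" | tcp_counter "connections dropped")
awk -v d="$dropped" -v e="$established" \
    'BEGIN { printf "%.2f%% of established connections dropped\n", 100 * d / e }'

# Real usage:
#   .server_config server_2 -v "printstats tcpstat" | tcp_counter "retransmit timeouts"
```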

SCSI Stats:

To view SCSI IO info:
.server_config server_x -v "printstats scsi"
.server_config server_x -v "printstats scsi reset"

Sample Output:
This output needs to be in a fixed width font to view properly.
Ctlr:  IO-pending  Max-IO   IO-total   Idle(ms)   Busy(ms)  Busy(%)
0:              0      53   44925729  122348758   19159954      13%
1:              0       1          1  141508682          0       0%
2:              0       1          1  141508682          0       0%
3:              0       1          1  141508682          0       0%
4:              0       1          1  141508682          0       0%

File Stats:

.server_config server_x -v "printstats filewrite"
.server_config server_x -v "printstats filewrite full"
.server_config server_x -v "printstats filewrite reset"

Sample output (Full Output):
13108 writes of 1 blocks in 52105250 usec, ave 3975 usec
26 writes of 2 blocks in 256359 usec, ave 9859 usec
6 writes of 3 blocks in 18954 usec, ave 3159 usec
2 writes of 4 blocks in 2800 usec, ave 1400 usec
4 writes of 13 blocks in 6284 usec, ave 1571 usec
4 writes of 18 blocks in 7839 usec, ave 1959 usec
total 13310 blocks in 52397489 usec, ave 3936 usec

FCP Stats:

To view FCP stats, useful for checking SP balance:
.server_config server_x -v "printstats fcp"
.server_config server_x -v "printstats fcp full"
.server_config server_x -v "printstats fcp reset"

Sample Output (Truncated):
This output needs to be in a fixed width font to view properly.
Total I/O Cmds: +0%------25%-------50%-------75%-----100%+ Total 0
FCP HBA 0       |                                        |   0%  0
FCP HBA 1       |                                        |   0%  0
FCP HBA 2       |                                        |   0%  0
FCP HBA 3       |                                        |   0%  0
# Read Cmds:    +0%------25%-------50%-------75%-----100%+ Total 0
FCP HBA 0       |                                        |   0%  0
FCP HBA 1       |                                        |   0%  0
FCP HBA 2       |                                        |   0%  0
FCP HBA 3       |XXXXXXXXXXX                             |  25%  0


'fcp' options are:       bind ..., flags, locate, nsshow, portreset=n, rediscover=n
                         rescan, reset, show, status=n, topology, version

'fcp bind' options are:  clear=n, read, rebind, restore=n, show
                         showbackup=n, write


Commands for 'fcp' operations:
fcp bind <cmd> ......... Further fibre channel binding commands
fcp flags .............. Show online flags info
fcp locate ............. Show ScsiBus and port info
fcp nsshow ............. Show nameserver info
fcp portreset=n ........ Reset fibre port n
fcp rediscover=n ....... Force fabric discovery process on port n
                         Bounces the link, but does not reset the port
fcp rescan ............. Force a rescan of all LUNS
fcp reset .............. Reset all fibre ports
fcp show ............... Show fibre info
fcp status=n ........... Show link status for port n
fcp status=n clear ..... Clear link status for port n and then Show
fcp topology ........... Show fabric topology info
fcp version ............ Show firmware, driver and BIOS version

Commands for 'fcp bind' operations:
fcp bind clear=n ....... Clear the binding table in slot n
fcp bind read .......... Read the binding table
fcp bind rebind ........ Force the binding thread to run
fcp bind restore=n ..... Restore the binding table in slot n
fcp bind show .......... Show binding table info
fcp bind showbackup=n .. Show Backup binding table info in slot n
fcp bind write ......... Write the binding table

NDMP Stats:

To Check NDMP Status:
.server_config server_x -v "printstats vbb show"

CIFS Stats:

This will output a CIFS report, including all servers, DCs, IPs, interfaces, MAC addresses, and more.

.server_config server_x -v "cifs"

Sample Output:

1327007227: SMB: 6: 256 Cifs threads started
1327007227: SMB: 6: Security mode = NT
1327007227: SMB: 6: Max protocol = SMB2
1327007227: SMB: 6: I18N mode = UNICODE
1327007227: SMB: 6: Home Directory Shares DISABLED
1327007227: SMB: 6: Usermapper auto broadcast enabled
1327007227: SMB: 6:
1327007227: SMB: 6: Usermapper[0] = [] state:active (auto discovered)
1327007227: SMB: 6:
1327007227: SMB: 6: Default WINS servers =
1327007227: SMB: 6: Enabled interfaces: (All interfaces are enabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Disabled interfaces: (No interface disabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Unused Interface(s):
1327007227: SMB: 6:  if=172-168-1-84 l= b= mac=0:60:48:1c:46:96
1327007227: SMB: 6:  if=172-168-1-82 l= b= mac=0:60:48:1c:10:5d
1327007227: SMB: 6:  if=172-168-1-81 l= b= mac=0:60:48:1c:46:97
1327007227: SMB: 6:
1327007227: SMB: 6:
1327007227: SMB: 6:  SID=S-1-5-15-7c531fd3-6b6745cb-ff77ddb-ffffffff
1327007227: SMB: 6:  DC=DCAD01( ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD02( ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD03( ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD04( ref=2 time=0 ms
1327007227: SMB: 6: >DC=SERVERDCAD01( ref=334 time=1 ms (Closest Site)
1327007227: SMB: 6: >DC=SERVERDCAD02( ref=273 time=1 ms (Closest Site)
1327007227: SMB: 6:
1327007227: UFS: 7: inc ino blk cache count: nInoAllocs 361: inoBlk 0x0219f2a308
1327007227: SMB: 6:  Full computer name=SERVERFILESEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment='EMC-SNAS:T6.0.41.3'
1327007227: SMB: 6:  if=172-168-1-161 l= b= mac=0:60:48:1c:46:9c
1327007227: SMB: 6:   FQDN=SERVERFILESEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Fri Jan  7 19:25:30 2011 GMT
1327007227: SMB: 6:  Password versions: 2, 2
1327007227: SMB: 6:
1327007227: SMB: 6: CIFS Server SERVERBKUPEMC[DOMAIN_NAME] RC=2 (local users supported)
1327007227: SMB: 6:  Full computer name=SERVERbkupEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment='EMC-SNAS:T6.0.41.3'
1327007227: SMB: 6:  if=172-168-1-90 l= b= mac=0:60:48:1c:10:54
1327007227: SMB: 6:   FQDN=SERVERbkupEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Thu Sep 30 16:23:50 2010 GMT
1327007227: SMB: 6:  Password versions: 2
1327007227: SMB: 6:

Domain Controller Commands:

These commands are useful for troubleshooting a Windows domain controller connection issue from the control station.  Use them along with the normal server log (server_log server_2) to troubleshoot that type of problem.

To view the current domain controllers visible on the data mover:

.server_config server_2 -v "pdc dump"

Sample Output (Truncated):

1327006571: SMB: 6: Dump DC for dom='<domain_name>' OrdNum=0
1327006571: SMB: 6: Domain=<domain_name> Next trusted domains update in 476 seconds
1327006571: SMB: 6:  oldestDC:DomCnt=1,179531 Time=Sat Oct 15 15:32:14 2011
1327006571: SMB: 6:  Trusted domain info from DC='<Windows_DC_Servername>' (423 seconds ago)
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<Domain_Name>]
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-d1d612b1-87382668-9ba5ebc0
1327006571: SMB: 6:    DC='-'
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
1327006571: SMB: 6:   Trusted domain: <Domain_Name>
1327006571: SMB: 6:    Flags=0x22 Ix=0 Type=0x1 Attr=0x1000000
1327006571: SMB: 6:    SID=S-1-5-15-76854ac0-4c527104-321d5138
1327006571: SMB: 6:    DC='\\<Windows_DC_Servername>'
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x0,0
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<domain_name>]
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-88d60754-f3ed4f9d-b3f2cbc4
1327006571: SMB: 6:    DC='-'
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
DC=DC0x0067a82c18 <Windows_DC_Servername>[<domain_name>]( ref=2 time(getdc187)=0 ms LastUpdt=Thu Jan 19 20:45:14 2012
    Pid=1000 Tid=0000 Uid=0000
    Cnx=UNSUCCESSFUL,DC state Unknown
    logon=Unknown 0 SecureChannel(s):
    Capa=0x0 Nego=0x0000000000,L=0 Chal=0x0000000000,L=0,W2kFlags=0x0
    refCount=2 newElectedDC=0x0000000000 forceInvalid=0
    Discovered from: WINS

To enable or disable a domain controller on the data mover:

.server_config server_2 -v "pdc enable=<ip_address>"  Enable a domain controller

.server_config server_2 -v "pdc disable=<ip_address>"  Disable a domain controller


.server_config server_2 -v "meminfo"

Sample Output (truncated):

3552907011 calls to malloc, 3540029263 to free, 61954 to realloc
Size     In Use       Free      Total nallocs nfrees
16       3738        870       4608   161720370   161716632
32      18039      17289      35328   1698256206   1698238167
64       6128       3088       9216   559872733   559866605
128       6438      42138      48576   255263288   255256850
256       8682      19510      28192   286944797   286936115
512       1507       2221       3728   357926514   357925007
1024       2947       9813      12760   101064888   101061941
2048       1086        198       1284    5063873    5062787
4096         26        138        164    4854969    4854943
8192        820         11        831   19562870   19562050
16384         23         10         33       5676       5653
32768          6          1          7        101         95
65536         12          0         12         12          0
524288          1          0          1          1          0
Total Used     Total Free    Total Used + Free
all sizes   18797440   23596160   42393600


.server_config server_2 -v "help memowners"

memowners [dump | showmap | set ... ]

memowners [dump] - prints memory owner description table
memowners showmap - prints a memory usage map
memowners memfrag [chunksize=#] - counts free chunks of given size
memowners set priority=# tag=# - changes dump priority for a given tag
memowners set priority=# label='string' - changes dump priority for a given label
The priority value can be set to 0 (lowest) to 7 (highest).

Sample Output (truncated):

1408979513: KERNEL: 6: Memory_Owner dump.
nTotal Frames 1703936 Registered = 75,  maxOwners = 128
1408979513: KERNEL: 6:   0 (   0 frames) No owner, Dump priority 6
1408979513: KERNEL: 6:   1 (3386 frames) Free list, Dump priority 0
1408979513: KERNEL: 6:   2 (40244 frames) malloc heap, Dump priority 6
1408979513: KERNEL: 6:   3 (6656 frames) physMemOwner, Dump priority 7
1408979513: KERNEL: 6:   4 (36091 frames) Reserved Mem based on E820, Dump priority 0
1408979513: KERNEL: 6:   5 (96248 frames) Address gap based on E820, Dump priority 0
1408979513: KERNEL: 6:   6 (   0 frames) Rmode isr vectors, Dump priority 7

Celerra replication monitoring script

This script allows me to quickly monitor and verify the status of my replication jobs every morning.  It generates a csv file with six columns: file system name, interconnect, estimated completion time, current transfer size, current transfer size remaining, and current write speed.

I recently added two more remote offices to our replication topology and I like to keep a daily tab on how much longer they have to complete the initial seeding, and it will also alert me to any other jobs that are running too long and might need my attention.

Step 1:

Log in to your Celerra and create a directory for the script.  I created a subdirectory called “scripts” under /home/nasadmin.

Create a text file named ‘replfs.list’ that contains a list of your replicated file systems.  You can cut and paste the list out of Unisphere.

The file should contain one file system name per line.

Step 2:

Copy and paste all of the code into a text editor and modify it for your needs (the complete code is at the bottom of this post).  I’ll go through each section here with an explanation.

1: The first section will create a text file ($fs.dat) for each filesystem in the replfs.list file you made earlier.

for fs in `cat replfs.list`
do
         nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

The output will look like this:
Name                                        = Filesystem_01
Source Current Data Port            = 57471
Current Transfer Size (KB)          = 232173216
Current Transfer Remain (KB)     = 230877216
Estimated Completion Time        = Thu Nov 24 06:06:07 EST 2011
Current Transfer is Full Copy      = Yes
Current Transfer Rate (KB/s)       = 160
Current Read Rate (KB/s)           = 774
Current Write Rate (KB/s)           = 3120
 2: The second section will create a blank csv file with the appropriate column headers:
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

3: The third section will parse all of the output files created by the first section, pulling out only the data that we’re interested in.  It places it in columns in the csv file.

         for fs in `cat replfs.list`
         do
         echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
         done

If you’re not familiar with awk, I’ll give a brief explanation here.  awk splits each line into whitespace-separated fields, so after you grep for a certain line in the command output, awk lets you print only the field (word) you want from that line.

For example, if you want the output of “Yes” put into a column in the csv file, but the line in the command output looks like “Current Transfer is Full Copy      = Yes”, then you could pull out only the “Yes” by typing in the following:

 nas_replicate -info Filesystem01 | grep  Full | awk '{print $7}'

Because the word ‘Yes’ is the 7th item in the line, the output would only contain the word Yes.
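
As a small sketch of the same idea, the snippet below runs awk against two lines copied from the sample output earlier in this post instead of a live nas_replicate call.  The variable names are only for illustration.  It also shows two alternatives to counting fields by hand: $NF (the last field) and splitting on the equals sign, which is handy for multi-word values such as the completion time.

```shell
#!/bin/bash
# Two lines copied from the sample nas_replicate output shown earlier.
copy_line='Current Transfer is Full Copy      = Yes'
date_line='Estimated Completion Time        = Thu Nov 24 06:06:07 EST 2011'

# By position: "Yes" is the 7th whitespace-separated field.
by_position=$(echo "$copy_line" | awk '{print $7}')

# By last field: $NF works no matter how many words the label has.
by_last=$(echo "$copy_line" | awk '{print $NF}')

# Multi-word values: split on "=" (plus any spaces after it) and keep the right-hand side.
by_equals=$(echo "$date_line" | awk -F'= *' '{print $2}')

echo "$by_position | $by_last | $by_equals"
```

Splitting on the equals sign avoids having to list $5 through $10 individually, at the cost of being a little less obvious to read.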

4: The final section will send an email with the csv output file attached.

uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com

Step 3:

Copy and paste the modified code into a script file and save it.  I have mine saved in the /home/nasadmin/scripts folder.  Once the file is created, make it executable by typing in chmod +x scriptfile.sh (note the lowercase x), or set the full permissions explicitly with chmod 755 scriptfile.sh.

Step 4:

You can now add the script to crontab to run automatically.  Add it to cron by typing in crontab -e; to view your crontab entries, type crontab -l.  For details on how to write cron entries, do a Google search, as there is a wealth of info available on your options.
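
As a rough illustration, a crontab along these lines would run the report every morning at 6:00.  The directory and script name are simply the ones used in this post.  The NAS_DB and PATH lines are assumptions on my part: Celerra CLI commands generally expect NAS_DB to be set and live under /nas/bin, and cron does not load your login environment, so verify both on your own Control Station.  The cd matters because the script reads replfs.list from the current directory.

```
# minute hour day-of-month month day-of-week  command
NAS_DB=/nas
PATH=/nas/bin:/usr/bin:/bin
0 6 * * * cd /home/nasadmin/scripts && ./scriptfile.sh
```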

Script Code:

#!/bin/bash

# Section 1: save the replication status of each file system in $fs.dat
for fs in `cat replfs.list`
do
         nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

# Section 2: start the csv file with the column headers
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

# Section 3: append one row per file system
for fs in `cat replfs.list`
do
         echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
done

# Section 4: email the report as an attachment
uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com
The final output of the script generates a report that looks like the sample below.  Filesystems that have all zeros and no estimated completion time are caught up and not currently performing a data synchronization.

Name               System      Estimated Completion Time     Current Transfer Size (KB)  Current Transfer Remain (KB)  Write Speed (KB)
SA2Users_03        SA2VNX5500                                0                           0                             0
SA2Users_02        SA2VNX5500  Wed Dec 16 01:16:04 EST 2011  211708152                   41788152                      2982
SA2Users_01        SA2VNX5500  Wed Dec 16 18:53:32 EST 2011  229431488                   59655488                      3425
SA2CommonFiles_04  SA2VNX5500                                0                           0                             0
SA2CommonFiles_03  SA2VNX5500  Wed Dec 16 10:35:06 EST 2011  232173216                   53853216                      3105
SA2CommonFiles_02  SA2VNX5500  Mon Dec 14 15:46:33 EST 2011  56343592                    12807592                      2365
SA2commonFiles_01  SA2VNX5500                                0                           0                             0
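
If you only care about the jobs that are still transferring, a sketch like the one below can filter the csv file.  It assumes the six-column layout written by the script (column 5 is the remaining KB) and uses a hypothetical helper name, active_transfers.  The rows in the demo are taken from the sample report.

```shell
#!/bin/bash
# Print file systems that still have data left to transfer.
# Reads the csv report on stdin; row 1 is the header, column 5 is
# "Current Transfer Remain (KB)".
active_transfers() {
    awk -F, 'NR > 1 && $5 + 0 > 0 { printf "%s: %d KB remaining\n", $1, $5 }'
}

# Demo with rows from the sample report (idle jobs have an empty date column).
active_transfers <<'EOF'
Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)
SA2Users_03,SA2VNX5500,,0,0,0
SA2Users_02,SA2VNX5500,Wed Dec 16 01:16:04 EST 2011,211708152,41788152,2982
EOF
```

Against the real report you would run it as active_transfers < replreport.csv.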

VMware/ESX can’t write to a Celerra Read/Write NFS mounted datastore

I had just created several new Celerra NFS mounted datastores for our ESX administrator.  When he tried to create new VM hosts using the new datastores, he would get this error:   Call “FileManager.MakeDirectory” for object “FileManager” on vCenter Server “servername.company.com” failed.

Searching for that error message on Powerlink, the VMware forums, and general Google searches didn’t bring back any easy answers or solutions.  It looked like ESX was unable to write to the NFS mount for some reason, even though it was mounted as Read/Write.  I also had the ESX hosts added to the R/W access permissions for the NFS export.

After much digging and experimentation, I did resolve the problem.  Here’s what you have to check:

1. The VMKernel IP must be in the root hosts permissions on the NFS export.   I put in the IP of the ESX server along with the VMKernel IP.

2. The NFS export must be mounted with the no_root_squash option.  By default, the root user (UID 0) is not given access to an NFS volume; mounting the export with no_root_squash allows the root user access.  The VMkernel must be able to access the NFS volume with UID 0.

I first set up the exports and permissions settings in the GUI, then went to the CLI to add the mount options.
command:  server_mount server_2 -option rw,uncached,sync,no_root_squash <sharename> /<sharename>

3. From within the ESX Console/Virtual Center, the Firewall settings should be updated to add the NFS Client.   Go to ‘Configuration’ | ‘Security Profile’ | ‘Properties’ | Click the NFS Client checkbox.

4. One other important item to note when adding NFS mounted datastores is the default limit of 8 in ESX.  You can increase the limit by going to ‘Configuration’ | ‘Advanced Settings’ | ‘NFS’ in the left column | Scroll to ‘NFS.MaxVolumes’ and increase the number, up to a maximum of 64.  If you try to add a new datastore above the NFS.MaxVolumes limit, you will get the same error shown at the top of this post.

That’s it.  Adding the VMKernel IP to the root permissions, mounting with no_root_squash, and adding the NFS Client to ESX resolved the problem.

Use the CLI to quickly determine the size of your Celerra checkpoint filesystems

Need to quickly figure out which checkpoint filesystems are taking up all of your precious savvol space?  Run the CLI command below.  Filling up the savvol storage pool can cause all kinds of problems besides failing checkpoints.  It can also cause filesystem replication jobs to fail.

To view it on the screen:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size

To save it in a file:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size > checkpoints.txt

vi checkpoints.txt   (to view the file)

Here’s a sample of the output:

ckpt_ckpt_UserFilesystem_01_monthly_001 :   836 : 220000
ckpt_ckpt_UserFilesystem_01_monthly_002 :   649 : 220000

ckpt_ckpt_UserFilesystem_02_monthly_001 :   836 : 80000
ckpt_ckpt_UserFilesystem_02_monthly_002 :   649 : 80000

The numbers are in MB.
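
To get a quick total from that output, a sketch like this can add up the size column.  It assumes the "name : id : size" layout shown in the sample, and total_ckpt_mb is just a name I made up for the helper.  Lines that don't have exactly three fields (file system names, blank lines) are skipped.

```shell
#!/bin/bash
# Sum the third ':'-separated field (size in MB) of every checkpoint line
# read from stdin and print the total.
total_ckpt_mb() {
    awk -F' *: *' 'NF == 3 { total += $3 } END { printf "%d\n", total }'
}

# Demo with the sample lines from above.
total_ckpt_mb <<'EOF'
   ckpt_ckpt_UserFilesystem_01_monthly_001 :   836 : 220000
   ckpt_ckpt_UserFilesystem_01_monthly_002 :   649 : 220000

   ckpt_ckpt_UserFilesystem_02_monthly_001 :   836 : 80000
   ckpt_ckpt_UserFilesystem_02_monthly_002 :   649 : 80000
EOF
```

With the saved file you would run total_ckpt_mb < checkpoints.txt.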


Unable to provision Celerra storage?

This one really made no sense to me at first.  I was attempting to create a new file storage pool on our NS960 Celerra.  Upon launching the disk provisioning wizard, it would pause for a minute and then give the following error:

“ERROR: Unable to continue provisioning.  click for details”

It was strange because I have plenty of disks that could be used for provisioning.  Why wasn’t it working?

Here is the detailed error message:

Message Details:

Message: Unable to continue provisioning

Full Description:  Not able to fetch disk information. Command Failed, error code: 1, output: errormessage:string=”Timeout (60 seconds) waiting for state SS_DISKS_LOADED”

Recommended Action: No recommended action is available. Go to http://powerlink.EMC.com for more information.

Event Code: 15301214354

As a workaround and to test the issue, I used 8 spare SATA drives and created a new raid group, with one large LUN using all of the space.  I added it to the Celerra storage group and rescanned the SAN for storage.

The following error popped up:

Brief Description:  Invalid credentials for the storage array APM01034413494. 
Full Description:  The FLARE version on this storage array requires secure communication. Saved credentials are found, but authentication failed. The Control Station does not have valid credentials. 
Recommended Action:  Set the credentials by running the “nas_storage -modify <backend_name> -security” command. 
Message ID:  13422231564 

Well, how about that.  An error message that actually gives you the command to resolve the problem. 🙂  As it turns out, one of our other SAN administrators had changed the password for the system account.  Running nas_storage -modify id=<xx> -security resolved the problem.

(Note: You can get the ID number by running nas_storage -list)

DM Interconnect failure with Celerra Replicator

We just installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada.  The setup went fine with no errors, and replication of the VDM completed successfully every day until yesterday, when I noticed that the status on the main replications screen said “network communication has been lost”.   I was able to use the server_ping command to ping the data mover/replication interface from the UK to Canada, so network connectivity appeared to be OK.

I was attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create the replication jobs are stuck at “Establishing communication with secondary side for Create task” with a status of “Incomplete”.

I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”

So, why is the DM Interconnect failing?   It was working fine for several weeks!

My next trip was to the server log (>server_log server_2) where I spotted another issue.  Hundreds of entries that looked just like these:

2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16

Bad Authentication? Hmmm.  There is something amiss with the trusted relationship between the VNX and the NS960.  I did a quick read of EMC’s VNX replication manual (yep, rtfm!) and found the command to update the interconnect, nas_cel.

First, run nas_cel -list to view all of your interconnects, noting the ID number of the one you’re having difficulty with.

[nasadmin@<name> ~]$ nas_cel -list
id    name        owner mount_dev  channel    net_path     CMU
0     <name_1>    0                           10.x.x.x     APM007039002350000
2     <name_2>    0                           10.x.x.x     APM001052420000000
4     <name_3>    0                           10.x.x.x     APM009015016510000
5     <name_4>    0                           10.x.x.x     APM000827205690000

In this case, I was having trouble with <name_3>, which is ID 4.

Run this command next:  nas_cel -update id=4.   After that command completed, my interconnect immediately started working and I was able to create new replication jobs.
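
If you want to see how widespread the authentication failures are before and after running the update, a sketch like this counts them per destination address.  It assumes the log line format shown above ("Sending Cmd to <address> failed (16=Bad authentication)"), and bad_auth_summary is a hypothetical helper name.

```shell
#!/bin/bash
# Count "Bad authentication" log entries per destination address.
# Reads server log text on stdin, e.g.: server_log server_2 | bad_auth_summary
bad_auth_summary() {
    awk '/Bad authentication/ {
             # The address is the word that follows "to" in the log line.
             for (i = 1; i < NF; i++) if ($i == "to") count[$(i + 1)]++
         }
         END { for (addr in count) print addr, count[addr] }'
}

# Demo with lines from the log excerpt above.
bad_auth_summary <<'EOF'
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
EOF
```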

Celerra Health Check with CLI Commands

Here are the first commands I’ll type when I suspect there is a problem with the Celerra, or if I want to do a simple health check.

1. <watch> /nas/sbin/getreason.  This will quickly give you the current status of each data mover. 5=up, 0=down/rebooting.  Typing watch before the command will run the command with continuous updates so you can monitor a datamover if you are purposely rebooting it.

10 - slot_0 primary control station
5 - slot_2 contacted
5 - slot_3 contacted

2. nas_server -list.  This lists all of the datamovers and their current state.  It’s a good way to quickly tell which datamovers are active and which are standby.

Type values: 1=nas, 2=unused, 3=unused, 4=standby, 5=unused, 6=rdf

id      type  acl  slot groupID  state  name
1        1    0     2              0    server_2
2        4    0     3              0    server_3

3. server_sysstat server_2.  This will give you a quick overview of memory and CPU utilization for the specified data mover.

server_2 :
threads runnable = 6
threads blocked  = 4001
threads I/J/Z    = 1
memory  free(kB) = 2382807
cpu     idle_%   = 70

4. nas_checkup.   This runs a system health check.

Check Version:
Check Command:  /nas/bin/nas_checkup
Check Log    :  /nas/log/checkup-run.110608-143203.log

Control Station: Checking if file system usage is under limit………….. Pass
Control Station: Checking if NAS Storage API is installed correctly…….. Pass

5. server_log server_2.  This shows the current alert log.  Alert logs are also stored in /nas/log/webui.

6. vi /nas/jserver/logs/system_log.   This is the java system log.

7. vi /var/log/messages.  This displays system messages.
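
For the first check in the list, a sketch like this flags any data mover slot whose reason code isn't 5.  It assumes the "code - slot_N description" layout shown in the getreason sample, and check_getreason is a made-up helper name.  Control station lines are skipped because they report different codes.

```shell
#!/bin/bash
# Flag data mover slots whose reason code is not 5 (up).
# Reads getreason output on stdin, e.g.: /nas/sbin/getreason | check_getreason
# Returns non-zero if any slot is flagged.
check_getreason() {
    awk '$3 ~ /^slot_/ && $0 !~ /control station/ && $1 != 5 {
             print $3 " reason code " $1
             bad = 1
         }
         END { exit (bad ? 1 : 0) }'
}

# Demo with the sample output from step 1.
check_getreason <<'EOF' && echo "all data movers report reason code 5"
10 - slot_0 primary control station
 5 - slot_2 contacted
 5 - slot_3 contacted
EOF
```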

Easy File Extension filtering with EMC Celerra

Are your users filling up your CIFS fileserver with MP3 files?  Sick of sending out emails outlining IT policies, asking for their removal?  However you manage it now, the best way to avoid the problem in the first place is to set up filtering on your CIFS server file shares.

So, to use the same example, let’s say you don’t want your users to store MP3 files on your \\PRODFILES\Public share.

1. Navigate to the \\PRODFILES\C$ administrative share.

2. Open the folder in the root directory called .filefilter

3. Create an empty text file called mp3@public in the .filefilter folder.

4. Change the Windows security on the file to restrict access to certain Active Directory groups or individuals.

That’s it!  Once the file is created and security is set, users who are restricted by the file security will no longer be able to copy MP3 files to the public share.  Note that this will not remove any existing MP3 files from the share; it will only prevent new ones from being copied.

A guide for troubleshooting CIFS issues on the Celerra

In my experience, every CIFS issue you may have will fall into 8 basic areas, the first five being the most common.   Check all of these things and I can almost guarantee you will resolve your problem. 🙂

1. CIFS Service.  Check and make sure the CIFS Service is running.  To start it:  server_setup server_2 -Protocol cifs -option start

2. DNS.  Check and make sure that your DNS server entries on the Celerra are correct, that you’re configured to point to at least two, and that they are up and running with the DNS Service running.

3. NTP.  Make sure your NTP server entry is correct on the Celerra, and that the IP is reachable on the network and is actively providing NTP services.

4. User Mapping.

5. Default Gateway.  Double check your default gateway in the Celerra’s routing table.  Get the network team involved if you’re not sure.

6. Interfaces.  Make sure the interfaces are physically connected and properly configured.

7. Speed/Duplex.  Make sure the speed and duplex settings on the Celerra match those of the switch port that the interfaces are plugged in to.

8. VLAN.  Double check the VLAN settings on the interfaces and make sure they match what is configured on the connected switch.

Filesystem Alignment

You’re likely to have seen the filesystem alignment check fail on most, if not all, of the EMC HEAT reports that you run on your Windows 2003 servers.  The starting offset for partition 1 should optimally be a multiple of 128 sectors.  So, how do you fix this problem, and what does it mean?

If you align the partition to 128 blocks (or 64 KB, as each block is 512 bytes) then you don’t cross a track boundary and thereby issue the minimum number of IOs.   Issuing the minimum number of IOs sounds good, right? 🙂

Because NTFS reserves 31.5 KB of signature space, if a LUN has an element size of 64 KB with the default alignment offset of 0 (both are default Navisphere settings), a 64 KB write to that LUN would result in a disk crossing even though it would seem to fit perfectly on the disk.  A disk crossing can also be referred to as a split IO because the read or write must be split into two or more segments. In this case, 32.5 KB would be written to the first disk and 31.5 KB would be written to the following disk, because the beginning of the stripe is offset by 31.5 KB of signature space. This problem can be avoided by providing the correct alignment offset.  Each alignment offset value represents one block.  Therefore, EMC recommends setting the alignment offset value to 63, because 63 times 512 bytes is 31.5 KB.
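
The arithmetic behind those numbers can be checked with a few lines of shell.  This is only a worked example of the figures quoted above (63 blocks of 512 bytes, 64 KB element size); the variable names are mine.

```shell
#!/bin/bash
sector_bytes=512
signature_sectors=63              # signature space, in 512-byte blocks
element_bytes=$((64 * 1024))      # 64 KB element size

# 63 * 512 = 32256 bytes = 31.5 KB of signature space
signature_bytes=$((signature_sectors * sector_bytes))

# A 64 KB write starting right after the signature space fills the rest of
# the first element (32.5 KB) and spills the remainder onto the next disk.
first_disk_bytes=$((element_bytes - signature_bytes))
second_disk_bytes=$((element_bytes - first_disk_bytes))

echo "signature space: $signature_bytes bytes"
echo "first disk:      $first_disk_bytes bytes"
echo "second disk:     $second_disk_bytes bytes"
```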

Checking your offset:

1. Launch System Information in windows (msinfo32.exe)

2. Select Components -> Storage -> Disks.

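3. Scroll to the bottom and you will see the partition starting offset information.  At a minimum this number needs to be perfectly divisible by 4096; if it’s not, then your partition is not properly aligned.  To match the 128-sector alignment described above, it should be a multiple of 65,536 bytes (128 x 512).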
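
If you have a lot of offsets to check, the test is just a remainder calculation.  Here is a minimal sketch, assuming the 64 KB (128 sectors x 512 bytes = 65,536 bytes) target discussed earlier as the default boundary; is_aligned is a hypothetical helper, and you can pass a different boundary such as 4096 as the second argument.

```shell
#!/bin/bash
# Succeeds if the starting offset (in bytes) is a multiple of the boundary.
# The boundary defaults to 65536 bytes (64 KB).
is_aligned() {
    local offset=$1
    local boundary=${2:-65536}
    [ $((offset % boundary)) -eq 0 ]
}

# 32256 is the default 63-sector offset; the others are 64 KB and 1 MB.
for offset in 32256 65536 1048576; do
    if is_aligned "$offset"; then
        echo "$offset: aligned"
    else
        echo "$offset: NOT aligned"
    fi
done
```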

Correcting your starting offset:

Launch diskpart:


DISKPART> list disk

Two disks should be listed

DISKPART> select disk 1

This selects the second disk drive

DISKPART> list partition

This step should give a message “There are no partitions on this disk to show”.  This confirms a blank disk.

DISKPART> create partition primary align=64

That’s it.  You now have a perfectly aligned disk.