Category Archives: Celerra / VNX File specific

Using Cron with EMC VNX and Celerra

I’ve shared numerous shell scripts over the years on this blog, many of which benefit from being scheduled to run automatically on the Control Station.  I’ve received emails and comments asking how to schedule Unix or Linux crontab jobs to run at intervals like every five minutes, every ten minutes, every hour, and so on.  While it’s easy enough to simply type “man crontab” from the CLI to review the syntax, it can be helpful to see specific examples, so I’ll run through some next.

What is cron?

Cron is a time-based job scheduler used in most Unix operating systems, including the VNX File OS (or DART). It’s used to schedule jobs (either commands or shell scripts) to run periodically at fixed times, dates, or intervals. It’s primarily used to automate system maintenance and administration, and it can also be used for troubleshooting.

What is crontab?

Cron is driven by a crontab (cron table) file, a configuration file that specifies shell commands to run periodically on a given schedule. The crontab files hold the lists of jobs and other instructions for the cron daemon. Users can have their own individual crontab files, and often there is a system-wide crontab file (usually in /etc or a subdirectory of /etc) that only system administrators can edit.

On the VNX, the crontab files are located at /var/spool/cron, but they are not intended to be edited directly; use the “crontab -e” command instead.  Each NAS user has their own crontab, and commands in any given crontab are executed as the user who owns it.  For example, the crontab file for nasadmin is stored as /var/spool/cron/nasadmin.

Best Practices and Requirements

First, let’s review how to edit and list crontab and some of the requirements and best practices for using it.

1. Make sure you use “crontab -e” to edit your crontab file. As a best practice you shouldn’t edit the crontab files directly.  Use “crontab -l” to list your entries.

2. Blank lines and leading spaces and tabs are ignored. Lines whose first non-space character is a pound sign (#) are comments, and are ignored. Comments are not allowed on the same line as cron commands as they will be taken to be part of the command.

3. If the /etc/cron.allow file exists, then you must be listed therein in order to be allowed to use this command. If the /etc/cron.allow file does not exist but the /etc/cron.deny file does exist, then you must not be listed in the /etc/cron.deny file in order to use this command.

4. Don’t execute commands directly from within crontab; place your commands within a script file that’s called from cron. Cron has nowhere useful to send output written to stdout, which is one of several reasons you shouldn’t put commands directly in your crontab schedule.  Make sure to redirect stdout somewhere, either to a log file or to /dev/null, by adding “> /folder/log.file” or “> /dev/null” after the script path.  A complete example follows item 6 below.

5. For scripts that will run under cron, make sure you either define the PATH explicitly or use fully qualified paths for all commands that you use in the script.

6. I generally add these two lines to the beginning of my scripts as a best practice for using cron on the VNX.

export NAS_DB=/nas
export PATH=$PATH:/nas/bin
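
Putting items 4 through 6 together, here is a minimal sketch. The script name, log path, and the nas_fs call are hypothetical and only illustrate the pattern:

# Hypothetical crontab entry: run a report script nightly at 2:00 AM and
# redirect both stdout and stderr to a log file.
0 2 * * * /home/nasadmin/scripts/nightly_report.sh >> /home/nasadmin/logs/nightly_report.log 2>&1

#!/bin/bash
# /home/nasadmin/scripts/nightly_report.sh (hypothetical example)
export NAS_DB=/nas
export PATH=$PATH:/nas/bin
# Use fully qualified paths to the commands the script calls.
/nas/bin/nas_fs -list > /home/nasadmin/logs/fs_list.txt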

Descriptions of the crontab date/time fields

Commands are executed by cron when the minute, hour, and month of year fields match the current time, and when at least one of the two day fields (day of month, or day of week) match the current time.

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 7) (Sunday is 0 or 7)
# │ │ │ │ │
# * * * * *  command to execute

field #   meaning          allowed values
-------   ------------     --------------
   1      minute           0-59
   2      hour             0-23
   3      day of month     1-31
   4      month            1-12
   5      day of week      0-7 (0 or 7 is Sun)

Run a command every minute

While it’s not as common to want to run a command every minute, there can be specific use cases for it.  It would most likely be used when you’re in the middle of troubleshooting an issue and need data to be recorded more frequently.  For example, you may want to run a command every minute to check and see if a specific process is running.  To run a Unix/Linux crontab command every minute, use this syntax:

# Run "check.sh" every minute of every day
* * * * * /home/nasadmin/scripts/check.sh

Run a command every hour

The syntax is similar when running a cron job every hour of every day.  In my case I’ve used hourly scripts for performance monitoring, for example with the server_stats VNX script. Here’s a sample crontab entry that runs at 15 minutes past the hour, 24 hours a day.

# Performance stat collection
# This command will run at 15 minutes past every hour (12:15, 1:15, 2:15, etc.), 24 hours a day.
15 * * * * /home/nasadmin/scripts/stat_collect.sh

Run a command once a day

Here’s an example that shows how to run a command from the cron daemon once a day. In my case, I’ll usually run daily commands for report updates on our web page and for backups.  As an example, I run my Brocade Zone Backup script once daily.

# Run the Brocade backup script at 7:30am
30 7 * * * /home/nasadmin/scripts/brocade.sh

Run a command every 5 minutes

There are multiple methods to run a crontab entry every five minutes.  One option is to list each specific minute value, separated by commas.  While this method does work, it makes the crontab harder to read, and there is a shortcut that you can use.

0,5,10,15,20,25,30,35,40,45,50,55  * * * * /home/nasadmin/scripts/script.sh

The crontab “step value” syntax (using a forward slash) allows you to use a crontab entry in the format shown below.  It will run a command every five minutes and accomplish the same thing as the entry above.

# Run this script every 5 minutes
*/5 * * * * /home/nasadmin/scripts/test.sh

Ranges, Lists, and Step Values

I just demonstrated the use of a step value to specify a schedule of every five minutes, but you can actually get even more granular than that using ranges and lists.

Ranges.  Ranges of numbers are allowed (two numbers separated with a hyphen). The specified range is inclusive. For example, using 7-10 for an “hours” entry specifies execution at hours 7, 8, 9, & 10.

Lists. A list is a set of numbers (or ranges) separated by commas. Examples: “1,2,5,9”, “0-4,8-12”.

Step Values. Step values can be used in conjunction with ranges. Following a range with “/” specifies skips of the number’s value through the range. For example, “0-23/2” can be used in the hours field to specify command execution every other hour (the alternative being “0,2,4,6,8,10,12,14,16,18,20,22”). Steps are also permitted after an asterisk, so if you want to say “every two hours” you can use “*/2”.
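
For example, the following hypothetical entry combines a step value in the minute field with ranges in the hour and day-of-week fields:

# Run every 15 minutes during the 8 AM to 5 PM hours, Monday through Friday
*/15 8-17 * * 1-5 /home/nasadmin/scripts/business_hours_check.sh > /dev/null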

Special Strings

While I haven’t personally used these, there is a set of built-in special strings you can use, outlined below.

string         meaning
------         -------
@reboot        Run once, at startup.
@yearly        Run once a year, "0 0 1 1 *".
@annually      (same as @yearly)
@monthly       Run once a month, "0 0 1 * *".
@weekly        Run once a week, "0 0 * * 0".
@daily         Run once a day, "0 0 * * *".
@midnight      (same as @daily)
@hourly        Run once an hour, "0 * * * *".

Using a Template

Below is a template you can use in your crontab file as a reminder of the valid values for each column.

# Minute|Hour  |Day of Month|Month |WeekDay |Command
# (0-59)|(0-23)|(1-31)      |(1-12)|(0-7)             
  0      2      12           *      *        test.sh

Gotchas

Here’s a list of the known limitations of cron and some of the issues you may encounter.

1. When a cron job runs, it is executed as the user that created it. Verify the security requirements for the job.

2. Cron jobs do not source any files in the user’s home directory (like .cshrc or .bashrc). If your script depends on such a file, source it from within the script that cron calls. This includes setting paths, sourcing files, setting environment variables, etc.

3. If your cron jobs are not running, make sure the cron daemon is running. The cron daemon can be started or stopped with the following VNX Commands (run as root):

# /sbin/service crond stop
# /sbin/service crond start

4. If your job isn’t running properly, you should also check the /etc/cron.allow and /etc/cron.deny files.

5. Crontab entries are not parsed for environment variable substitution. You cannot use things like $PATH, $HOME, or ~/sbin.

6. Cron does not deal with seconds; a minute is the smallest interval it allows.

7. You cannot use an unescaped % in the command area. Percent signs need to be escaped, and if used with command substitution (like the date command) you can put it in backticks, e.g. `date +\%Y-\%m-\%d`, or use bash’s command substitution, $(). See the first example after this list.

8. Be cautious when using the day of month and day of week fields together.  When both fields are restricted (neither is *), the entry becomes an “or” condition, not an “and” condition: the job runs when either field matches.  The second example after this list illustrates this.
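
Here are two hypothetical entries illustrating items 7 and 8 (the script and log paths are made up):

# Item 7: escape % signs so cron passes them through to the date command.
# Runs check.sh hourly and writes to a date-stamped log file.
0 * * * * /home/nasadmin/scripts/check.sh >> /home/nasadmin/logs/check_`date +\%Y\%m\%d`.log 2>&1

# Item 8: because both day fields are restricted, this runs on the 1st of the
# month AND on every Monday (an "or" condition), not only on Mondays that fall on the 1st.
0 6 1 * 1 /home/nasadmin/scripts/check.sh > /dev/null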

 


Scripting a VNX/Celerra to Isilon Data Migration with EMCOPY and Perl


Below is a collection of Perl scripts that make data migration from VNX/Celerra file systems to an Isilon system much easier.  I’ve already outlined the process of using isi_vol_copy_vnx in a prior post; however, EMCOPY may be more appropriate in a specific use case, or simply more familiar to administrators who will support and manage the migration.  Note that while I have tested these scripts in my environment, they may need some modification for your use.  I recommend running them in a test environment prior to using them in production.

EMCOPY can be downloaded directly from DellEMC with the link below.  You will need to be a registered user in order to download it.

https://download.emc.com/downloads/DL14101_EMCOPY_File_migration_tool_4.17.exe

What is EMCOPY?

For those who haven’t used it before, EMCOPY is an application that allows you to copy files, directories, and subdirectories between NTFS partitions while maintaining security information, an improvement over the similar Robocopy tool that many veteran system administrators are familiar with. It allows you to back up the file and directory security ACLs, owner information, and audit information from a source directory to a destination directory.

Notes about using EMCOPY:

1) In my testing, EMCopy has shown up to a 25% performance improvement when copying CIFS data compared to Robocopy while using the same number of threads. I recommend using EMCopy over Robocopy as it has other feature improvements as well, for instance sidmapfile, which allows migrating local user data to Active Directory users; that option is available in version 4.17 or later.  Robocopy is also not an EMC-supported tool, while EMCOPY is.

2) Unlike isi_vol_copy_vnx, EMCOPY is a Windows application and must be run from a Windows host.  I highly recommend a dedicated server for any migration tasks.  The isi_vol_copy_vnx utility runs directly on the Isilon OneFS CLI, which eliminates any intermediary copy hosts, theoretically providing a much faster solution.

3) There are multiple methods to compare data sizes between the source and destination. I would recommend maintaining a log of each EMCopy session as that log indicates how much data was copied and if there were any errors.

4) If you are migrating over a WAN connection, I recommend first restoring from tape and then using an incremental data sync with EMCOPY.

Getting Started

I’ve divided this post up into a four step process.  Each step includes the relevant script and a description of the process.

  • Export File System information (export_fs.pl  Script)

Export file system information from the Celerra & generate the Isilon commands to re-create them.

  • Export SMB information (export_smb.pl Script)

Export SMB share information from the Celerra & generate the Isilon commands to re-create them.

  • Export NFS information (export_nfs.pl Script)

Export NFS information from the Celerra & generate the Isilon commands to re-create them.

  • Create the EMCOPY migration script (EMCOPY_create.pl Script)

Perform the data migration with EMCOPY using the output from this script.

Exporting information from the Celerra to run on the Isilon

These Perl scripts are designed to be run directly on the Control Station and will subsequently create shell scripts that will run on the Isilon to assist with the migration.  You will need to manually copy the output files from the VNX/Celerra to the Isilon. The first three steps I’ve outlined do not move the data or permissions; they simply run queries on the Celerra (nas_fs and server_export) to generate the Isilon script files that actually make the directories, create quotas, and create the NFS and SMB shares. They are “scripts that generate scripts”. 🙂

Before you run the scripts, make sure you edit them to correctly specify the appropriate Data Mover.  Once complete, you’ll end up with three .sh files created for you to move to your Isilon cluster.  They should be run in the same order as they were created.

Note that EMC occasionally changes the syntax of certain commands when they update OneFS.  Below is a sample of the Isilon-specific commands that are generated by the first three scripts.  I’d recommend verifying that the syntax is still correct with your version of OneFS, and then modifying the scripts if necessary with the new syntax.  I just ran a quick test with OneFS 8.0.0.2, and the base commands and switches appear to be compatible.

isi quota create --directory --path="/ifs/data1" --enforcement --hard-threshold="1032575M" --container=1
isi smb share create --name="Data01" --path="/ifs/Data01/data"
isi nfs exports create --path="/Data01/data" --roclient="Data" --rwclient="Data" --rootclient="Data"

 

Step 1 – Export File system information

This script will generate a list of the file system names from the Celerra and place the appropriate Isilon commands that create the directories and quotas into a file named “create_filesystems_xx.sh”.

#!/usr/bin/perl

# Export_fs.pl – Export File system information
# Export file system information from the Celerra & generate the Isilon commands to re-create them.

use strict;
my $nas_fs="nas_fs -query:inuse=y:type=uxfs:isroot=false -fields:ServersNumeric,Id,Name,SizeValues -format:'%s,%s,%s,%sQQQQQQ'";
my @data;

open (OUTPUT, ">> create_filesystems_$$.sh") || die "cannot open output: $!\n\n";
open (CMD, "$nas_fs |") || die "cannot open $nas_fs: $!\n\n";

while (<CMD>)
{
   chomp;
   # Each file system record in the query output is delimited by the QQQQQQ marker
   @data = split("QQQQQQ", $_);
}

close(CMD);
foreach (@data)

{
   my ($dm, $id, $dir,$size,$free,$used_per, $inodes) = split(",", $_);
   print OUTPUT "mkdir /ifs/$dir\n";
   print OUTPUT "chmod 755 /ifs/$dir\n";
   print OUTPUT "isi quota create --directory --path=\"/ifs/$dir\" --enforcement --hard-threshold=\"${size}M\" --container=1\n";
}

The output of the script looks like this (this is an excerpt from the create_filesystems_xx.sh file):

isi quota create --directory --path="/ifs/data1" --enforcement --hard-threshold="1032575M" --container=1
mkdir /ifs/data1
chmod 755 /ifs/data1
isi quota create --directory --path="/ifs/data2" --enforcement --hard-threshold="20104M" --container=1
mkdir /ifs/data2
chmod 755 /ifs/data2
isi quota create --directory --path="/ifs/data3" --enforcement --hard-threshold="100774M" --container=1
mkdir /ifs/data3
chmod 755 /ifs/data3

The output script can now be copied to and run from the Isilon.

Step 2 – Export SMB Information

This script will generate a list of the smb share names from the Celerra and place the appropriate Isilon commands into a file named “create_smb_exports_xx.sh”.

#!/usr/bin/perl

# Export_smb.pl – Export SMB/CIFS information
# Export SMB share information from the Celerra & generate the Isilon commands to re-create them.

use strict;

my $datamover = "server_8";
my $prot = "cifs";:wq!
my $nfs_cli = "server_export $datamover -list -P $prot -v |grep share";

open (OUTPUT, ">> create_smb_exports_$$.sh") || die "cannot open output: $!\n\n";
open (CMD, "$nfs_cli |") || die "cant open $nfs_cli: $!\n\n";

while (<CMD>)
{
   chomp;
   my (@vars) = split(" ", $_);
   my $path = $vars[2];
   my $name = $vars[1];

   $path =~ s/^"/\"\/ifs/;
   print  OUTPUT "isi smb share create --name=$name --path=$path\n";
}

close(CMD);

The output of the script looks like this (this is an excerpt from the create_smb_exports_xx.sh file):

isi smb share create --name="Data01" --path="/ifs/Data01/data"
isi smb share create --name="Data02" --path="/ifs/Data02/data"
isi smb share create --name="Data03" --path="/ifs/Data03/data"
isi smb share create --name="Data04" --path="/ifs/Data04/data"
isi smb share create --name="Data05" --path="/ifs/Data05/data"

The output script can now be copied to and run from the Isilon.

Step 3 – Export NFS Information

This script will generate a list of the NFS export names from the Celerra and place the appropriate Isilon commands into a file named “create_nfs_exports_xx.sh”.

#!/usr/bin/perl

# Export_nfs.pl – Export NFS information
# Export NFS information from the Celerra & generate the Isilon commands to re-create them.

use strict;

my $datamover = "server_8";
my $prot = "nfs";
my $nfs_cli = "server_export $datamover -list -P $prot -v |grep export";

open (OUTPUT, ">> create_nfs_exports_$$.sh") || die "cannot open output: $!\n\n";
open (CMD, "$nfs_cli |") || die "cant open $nfs_cli: $!\n\n";

while (<CMD>)
{
   chomp;
   my (@vars) = split(" ", $_);
   my $test = @vars;
   my $i=2;
   my ($ro, $rw, $root, $access, $name);
   my $path=$vars[1];

   for ($i; $i < $test; $i++)
   {
      my ($type, $value) = split("=", $vars[$i]);

      if ($type eq "ro") {
         my @tmp = split(":", $value);
         foreach(@tmp) { $ro .= " --roclient=\"$_\""; }
      }
      if ($type eq "rw") {
         my @tmp = split(":", $value);
         foreach(@tmp) { $rw .= " --rwclient=\"$_\""; }
      }

      if ($type eq "root") {
         my @tmp = split(":", $value);
         foreach(@tmp) { $root .= " --rootclient=\"$_\""; }
      }

      if ($type eq "access") {
         my @tmp = split(":", $value);
         foreach(@tmp) { $ro .= " --roclient=\"$_\""; }
      }

      if ($type eq "name") { $name=$value; }
   }
   print OUTPUT "isi nfs exports create --path=$path $ro $rw $root\n";
}

close(CMD);

The output of the script looks like this (this is an excerpt from the create_nfs_exports_xx.sh file):

isi nfs exports create --path="/Data01/data" --roclient="Data" --roclient="BACKUP" --rwclient="Data" --rwclient="BACKUP" --rootclient="Data" --rootclient="BACKUP"
isi nfs exports create --path="/Data02/data" --roclient="Data" --roclient="BACKUP" --rwclient="Data" --rwclient="BACKUP" --rootclient="Data" --rootclient="BACKUP"
isi nfs exports create --path="/Data03/data" --roclient="Backup" --roclient="Data" --rwclient="Backup" --rwclient="Data" --rootclient="Backup" --rootclient="Data"
isi nfs exports create --path="/Data04/data" --roclient="Backup" --roclient="ProdGroup" --rwclient="Backup" --rwclient="ProdGroup" --rootclient="Backup" --rootclient="ProdGroup"
isi nfs exports create --path="/" --roclient="127.0.0.1" --roclient="127.0.0.1" --roclient="127.0.0.1" -rootclient="127.0.0.1"

The output script can now be copied to and run from the Isilon.

Step 4 – Generate the EMCOPY commands

Now that the scripts have been generated and run on the Isilon, the next step is the actual data migration using EMCOPY.  This script will generate the commands for a migration script, which should be run from a Windows server that has access to both the source and destination locations. It should be run after the previous three scripts have successfully completed.

This script outputs the commands directly to the screen; they can then be cut and pasted into a Windows batch script on your migration server.

#!/usr/bin/perl

# EMCOPY_create.pl – Create the EMCOPY migration script
# Perform the data migration with EMCOPY using the output from this script.

use strict;

my $datamover = "server_4";
my $source = "\\\\celerra_path\\";
my $dest = "\\\\isilon_path\\";
my $prot = "cifs";
my $nfs_cli = "server_export $datamover -list -P $prot -v |grep share";

open (OUTPUT, ">> create_smb_exports_$$.sh") || die "cant open output: $!\n\n";
open (CMD, "$nfs_cli |") || die "cant open $nfs_cli: $!\n\n";

while (<CMD>)
{
   chomp;
   my (@vars) = split(" ", $_);
   my $path = $vars[2];
   my $name = $vars[1];

   $name =~ s/\"//g;
   $path =~ s/^/\/ifs/;

   my $log = "c:\\" . $name . "";
   $log =~ s/ //;
   my $src = $source . $name;
   my $dst = $dest . $name;

   print "emcopy \"$src\" \"$  dst\" /o /s /d /q /secfix /purge /stream /c /r:1 /w:1 /log:$log\n";
}

close(CMD);

The output of the script looks like this (this is an excerpt from the screen output):

emcopy "\\celerra_path\Data01" "\\isilon_path\billing_tmip_01" /o /s /d /q /secfix /purge /stream /c /r:1 /w:1 /log:c:\billing_tmip_01
emcopy "\\celerra_path\Data02" "\\isilon_path\billing_trxs_01" /o /s /d /q /secfix /purge /stream /c /r:1 /w:1 /log:c:\billing_trxs_01
emcopy "\\celerra_path\Data03" "\\isilon_path\billing_vru_01" /o /s /d /q /secfix /purge /stream /c /r:1 /w:1 /log:c:\billing_vru_01
emcopy "\\celerra_path\Data04" "\\isilon_path\billing_rpps_01" /o /s /d /q /secfix /purge /stream /c /r:1 /w:1 /log:c:\billing_rpps_01

That’s it.  Good luck with your data migration, and I hope this has been of some assistance.  Special thanks to Mark May and his virtualstoragezone blog; he published the original versions of these scripts here.

Multiprotocol VNX File Systems: Listing and counting Shares & Exports by file system

I’m in the early process of planning a NAS data migration from VNX to Isilon, and one of the first steps I wanted to accomplish was to identify which of our VNX file systems are multiprotocol (with both CIFS shares and NFS exports from the same file system). In the environment I support, which has over 10,000 CIFS shares, it’s not a trivial task to identify which shares are multiprotocol.  After some research, it doesn’t appear that there is a built-in method from EMC for determining this information from within the Unisphere GUI. From the CLI, however, the server_export command can be used to view the shares and exports.

Here’s an example of listing shares and exports with the server_export command:

[nasadmin@VNX1 ~]$ server_export ALL -Protocol cifs -list | grep filesystem01

share "share01$" "/filesystem01/data" umask=022 maxusr=4294967294 netbios=NASSERVER comment="Contact: John Doe"
 
[nasadmin@VNX1 ~]$ server_export ALL -Protocol nfs -list | grep filesystem01

export "/root_vdm_01/filesystem01/data01" rw=admins:powerusers:produsers:qausers root=storageadmins access=admins:powerusers:produsers:qausers:storageadmins
export "/root_vdm_01/filesystem01/data02" rw=admins:powerusers:produsers:qausers root=storageadmins access=admins:powerusers:produsers:qausers:storageadmins

The output above shows me that the file system named “filesystem01” has one CIFS share and two NFS exports.  That’s a good start, but I want to get a count of the number of shares and exports rather than a detailed list of them. I can accomplish that by adding ‘wc’ (word count) to the command:

[nasadmin@VNX1 ~]$ server_export ALL -Protocol cifs -list | grep filesystem01 | wc
 1 223 450

[nasadmin@VNX1 ~]$ server_export ALL -Protocol nfs -list | grep filesystem01 | wc
 2 15 135

That’s closer to what I want.  The output includes three numbers, and the first number is the line count.  I really only want that number, so I’ll just grab it with awk. Ultimately I want the output to go to a single file with each line containing the name of the file system, the number of CIFS shares, and the number of NFS exports.  This line of code will give me what I want:

[nasadmin@VNX1 ~]$ echo -n "filesystem01",`server_export ALL -Protocol cifs -list | grep filesystem01 | wc | awk '{print $1}'`, `server_export ALL -Protocol nfs -list | grep filesystem01 | wc | awk '{print $1}'` >> multiprotocol.txt ; echo " " >> multiprotocol.txt

The output is below.  It’s perfect, as it’s in the format of a comma delimited file and can be easily exported into Microsoft Excel for reporting purposes.

filesystem01, 1, 2

Here’s a more detailed explanation of the command:

echo -n "filesystem01", : Echo writes the name of the file system to the screen, or to a file if you’ve redirected it with “>” at the end of the command.  Adding “-n” suppresses the newline that is automatically added after the text is output, as I want each file system and its share & export count to be on the same line in the report.

`server_export ALL -Protocol cifs -list | grep filesystem01 | wc | awk '{print $1}'`, : The server_export command lists all of the CIFS shares for the file system that you’re grepping for.  The wc command is for the “word count”; we’re using it to count the number of output lines, which tells us how many shares exist for the specified file system.  The awk '{print $1}' command outputs only the first field of data and stops at the first blank space.  If the output is “1 23 34 32 43 1”, running '{print $1}' will only output the 1.

`server_export ALL -Protocol nfs -list | grep filesystem01 | wc | awk '{print $1}'` >> multiprotocol.txt : This is the same command as above, but we’re now counting the number of NFS exports rather than CIFS shares.

; echo " " >> multiprotocol.txt : After the count is complete and the data has been output, I run an echo command without the “-n” option to force a line break, in preparation for the next line of the script.  When redirecting output, “>” writes the results to a file and overwrites the file if it already exists, while “>>” appends the results to the file if it already exists.  In this case we want to append each line.  In an actual script you’d want to create a blank file first with “echo > filename.ext”. Also, the “;” prior to the command instructs the interpreter to start a new command regardless of the success or failure of the prior command.

At this point, all that needs to be done is to create a script that includes the line above for every file system on the VNX. I copied the line of code above into Excel in multiple columns, allowing me to copy and paste the file system list from Unisphere and then concatenate the results into a single script file. I’m including a screenshot of one of my script lines from Excel as an example.  The final column (AG) has the following formula:

=CONCATENATE(A4,B4,C4,D4,E4,F4,G4,H4,I4,J4,K4,L4,M4,N4,O4,P4,Q4,R4,S4,T4,U4,V4,W4,X4,Y4,Z4,AA4,AB4,AC4,AD4,AE4)

Spreadsheet example:

[Screenshot: multiprotocol count spreadsheet]
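
If you’d rather skip the spreadsheet step, a simple shell loop can build the same report. This is just a sketch; it assumes you’ve saved your file system names to a file called fs_list.txt, one name per line:

# Create (or truncate) the report file, then count shares and exports per file system
> multiprotocol.txt
while read FS
do
   CIFS=`server_export ALL -Protocol cifs -list | grep "$FS" | wc -l`
   NFS=`server_export ALL -Protocol nfs -list | grep "$FS" | wc -l`
   echo "$FS, $CIFS, $NFS" >> multiprotocol.txt
done < fs_list.txt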

 

VNX NAS Files incorrectly report as Locked for Editing

When opening a shared Microsoft Office file, you may see the error message “File in Use, file_name is locked for editing by user_name“, when in fact no other user is currently using the file.

We had users who would view files in the preview pane; doing so created a lock on the file, and when the Explorer window was closed the lock would remain.  The next time the file was accessed, it would report being locked even though the user didn’t have it open.  Below are some steps you can take to troubleshoot and resolve the issue.  Note that changing some of these parameters can have a performance impact, so make these changes at your own risk.  As background, oplocks (opportunistic locks) let clients lock files and locally cache information while preventing another user from changing the file, which increases performance for many file operations.

1. Disable Oplocks on the VNX

Disabling oplocks can affect client performance. It will increase the number of metadata requests that are sent to the server because when you use SMB with oplocks, the client caches the data that is locked to speed up access to frequently accessed files. When oplocks are disabled, the client does not cache data and all reads are made directly to the NAS server.

Syntax for disabling oplocks and verifying the change:

[nasadmin@VNX1 ~]$ server_mount vdm_file_system -o nooplock test_file_system01_fs /test_file_system01
vdm_file_system : done

[nasadmin@VNX1 ~]$ server_mount vdm_file_system | grep test_file_system01_fs
test_file_system01_fs on /test_file_system01 uxfs,perm,rw,noprefetch,nonotify,accesspolicy=NATIVE,nooplock

2. Disable caching on the Windows client

The Windows client setting controls the cache lifetime. As stated earlier, if caching is disabled on the Windows client, then all reads go directly to the NAS server. In order to disable caching on the Windows client rather than disabling oplocks on the VNX Data Mover, the following three registry changes would need to be made:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters\

- Directory cache: set DirectoryCacheLifetime to zero.
- File Not Found cache: set FileNotFoundCacheLifetime to zero.
- File information cache: set FileInfoCacheLifetime to zero.

3. Apply a Microsoft hotfix

The Microsoft KB article http://support.microsoft.com/kb/942146 describes the problem and the fix in detail.  It directly addresses the issue with files locking from the preview pane.  It applies to all versions of Windows Vista and 7 as well as Windows Server 2008.

Scripting an alert to verify the availability of individual VNX CIFS server shares

I was recently asked to come up with a method to alert on the availability of specific CIFS file shares in our environment.  This was due to a recent issue we had on our VNX where a Data Mover crashed and corrupted a single file system when it came back up.  We were unaware for several hours that the one file system was unavailable on our CIFS server.

This particular script would require maintenance whenever a new file system share is added to a CIFS server.  A unique line must be added for every file system share that you have configured.  If a file system is not mounted and the share is inaccessible, an email alert will be sent.  If the share is accessible the script does nothing when run from the scheduler.  If it’s run manually from the CLI, it will echo back to the screen that the path is active.

This is a bash shell script; I run it on a Windows server with Cygwin installed, using the ‘email’ package for SMTP.  It should also run fine from a Linux server, and you could substitute the ‘email’ syntax with sendmail or whatever other mail application you use.   I have it scheduled to check the availability of the CIFS shares every hour.  A more compact loop-based variant is sketched after the individual entries below.

DIR1=file_system_1; SRV1=cifs_servername;  echo -ne $DIR1 && echo -ne ": " && [ -d //$SRV1/$DIR1 ] && echo "Network Path is Active" || email -b -s "Network Path \\\\$SRV1\\$DIR1 is offline" emailaddress@email.com

DIR2=file_system_2; SRV1=cifs_servername;  echo -ne $DIR2 && echo -ne ": " && [ -d //$SRV1/$DIR2 ] && echo "Network Path is Active" || email -b -s "Network Path \\\\$SRV1\\$DIR2 is offline" emailaddress@email.com

DIR3=file_system_3; SRV1=cifs_servername;  echo -ne $DIR3 && echo -ne ": " && [ -d //$SRV1/$DIR3 ] && echo "Network Path is Active" || email -b -s "Network Path \\\\$SRV1\\$DIR3 is offline" emailaddress@email.com

DIR4=file_system_4; SRV1=cifs_servername;  echo -ne $DIR4 && echo -ne ": " && [ -d //$SRV1/$DIR4 ] && echo "Network Path is Active" || email -b -s "Network Path \\\\$SRV1\\$DIR4 is offline" emailaddress@email.com

DIR5=file_system_5; SRV1=cifs_servername;  echo -ne $DIR5 && echo -ne ": " && [ -d //$SRV1/$DIR5 ] && echo "Network Path is Active" || email -b -s "Network Path \\\\$SRV1\\$DIR5 is offline" emailaddress@email.com
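
To cut down on per-share maintenance, the same check can be written as a loop over a list of share names. This is a sketch using the same Cygwin ‘email’ utility; substitute your own share names and mail command:

#!/bin/bash
# Check each share on the CIFS server and send an alert if the path is unreachable.
SRV1=cifs_servername
for DIR in file_system_1 file_system_2 file_system_3 file_system_4 file_system_5
do
   if [ -d //$SRV1/$DIR ]; then
      echo "$DIR: Network Path is Active"
   else
      email -b -s "Network Path \\\\$SRV1\\$DIR is offline" emailaddress@email.com
   fi
done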

Rescan Storage System command on Celerra results in conflict:storageID-devID error

I was attempting to extend our main production NAS file pool on our NS-960 and ran into an issue.  I had recently freed up 8 SATA disks from a block pool and was attempting to re-use them and extend a Celerra file pool.  I created a new RAID Group and LUN that used the maximum capacity of the RAID Group.  I then added the LUN to the Celerra storage group, making sure to set the HLU to a number greater than 15.  I then changed the setting on our main production file pool to auto-extend, and clicked on the “Rescan Storage Systems” option.  Unfortunately rescanning produced an error every time it was run.  I have done this exact same procedure in the past and it’s worked fine.  Here is the error:

conflict:storageID-devID: disk=17 old:symm=APM00100600999,dev=001F new:symm=APM00100600999,dev=001F addr=c16t1l11

I checked the disks on the Celerra using the nas_disk -l command, and the new disk shows up as “in use” even though the rescan command didn’t properly complete.

[nasadmin@Celerra tools]$ nas_disk -l
id   inuse  sizeMB    storageID-devID      type   name  servers
17    y     7513381   APM00100600999-001F  CLATA  d17   <BLANK>

Once the dvol is presented to Celerra (assuming the rescan goes fine) it should not be inuse until it is assigned to a storage pool and a file system uses it.  In this case that didn’t happen.  If you run /nas/tools/whereisfs (depending on your DART version, it may be “.whereisfs” with the dot) it shows a listing of every file system and which disk and which LUN they reside on.  I verified that the disk was not in use using that command.

In order to be on the safe side, I opened an SR with EMC rather than simply deleting the disk.  They suggested that the NAS database has some corruption. I’m going to have EMC’s Recovery Team check the usage of the diskvol and then delete it and re-add it.  In order to engage the recovery team you need to sign a “Data Deletion Form” absolving EMC of any liability for data loss, which is standard practice when they delete volumes on a customer array.  If there are any further caveats or important things to note after EMC has taken care of this, I’ll update this post.

The steps for NFS exporting a file system on a VDM

I made a blog post back in January 2014 about creating an NFS export on a virtual data mover, but I didn’t give much detail on the commands you need to use to actually do it. As I pointed out back then, you can’t NFS export a VDM file system from within Unisphere; however, when a file system is mounted on a VDM, its path from the root of the physical Data Mover can be exported from the CLI.

The first thing that needs to be done is determining the physical Data Mover where the VDM resides.

Below is the command you’d use to make that determination:

[nasadmin@Celerra_hostname]$ nas_server -i -v name_of_your_vdm | grep server
server = server_4

That will show you just the physical data mover that it’s mounted on. Without the grep statement, you’d get the output below. If you have hundreds of filesystems it will cause the screen to scroll the info you’re looking for off the top of the screen. Using grep is more efficient.

[nasadmin@Celerra_hostname]$ nas_server -i -v name_of_your_vdm
id = 1
name = name_of_your_vdm
acl = 0
type = vdm
server = server_4
rootfs = root_fs_vdm_name_of_your_vdm
I18N mode = UNICODE
mountedfs = fs1,fs2,fs3,fs4,fs5,fs6,fs7,fs8,…
member_of =
status :
defined = enabled
actual = loaded, active
Interfaces to services mapping:
interface=10-3-20-167 :cifs
interface=10-3-20-130 :cifs
interface=10-3-20-131 :cifs

Next you need to determine the file system path from the root of the Data Mover. This can be done with the server_mount command. As in the prior step, it’s more efficient if you grep for the name of the file system. You can run it without the grep command, but it could generate multiple screens of output depending on the number of file systems you have.

[nasadmin@stlpemccs04a /]$ server_mount server_4 | grep Filesystem_03
Filesystem_03 on /root_vdm_3/Filesystem_03 uxfs,perm,rw

The final step is to actually export the file system using this path from the prior step. The file system must be exported from the root of the Data Mover rather than the VDM. Note that once you have exported the VDM file system from the CLI, you can then manage it from within Unisphere if you’d like to set server permissions. The “-option anon=0,access=server_name,root=server_name” portion of the CLI command below can be left off if you’d prefer to use the GUI for that.

[nasadmin@Celerra_hostname]$ server_export server_4 -Protocol nfs -option anon=0,access=server_name,root=server_name /root_vdm_3/Filesystem_03
server_4 : done

At this point the client can mount the path with NFS.

Dynamic allocation pool limit has been reached

We were having issues with our backup jobs failing on CIFS share backups using Symantec NetBackup.  The jobs died with a “status 24”, which means the backup lost communication with the source.  Our backup administrator provided me with the exact times and dates of the failures, and I noticed that immediately preceding his failures this error appeared in the server log on the Control Station:

2012-08-05 07:09:37: KERNEL: 4: 10: Dynamic allocation pool limit has been reached. Limit=0x30000 Current=0x50920 Max=0x0
 

A quick Google search came up with this description of the error:  “The maximum amount of memory (number of 8K pages) allowed for dynamic memory allocation has almost been reached. This indicates that a possible memory leak is in progress and the Data Mover may soon panic. If Max=0(zero) then the system forced panic option is disabled. If Max is not zero then the system will force a panic if dynamic memory allocation reaches this level.”

Based on the fact that the error shows up right before a backup failure, I saw the correlation.  To fix it, you’ll need to modify the Heap Limit from the default of 0x00030000 to a larger size.  Here are the commands to do that:

.server_config server_2 -v "param kernel mallocHeapLimit=0x40000" (to change the value)
.server_config server_2 -v "param kernel" (will list the kernel parameters)
 

Below is a list of all the kernel parameters:

Name                                                 Location        Current       Default
----                                                 ----------      ----------    ----------
kernel.AutoconfigDriverFirst                         0x0003b52d30    0x00000000    0x00000000
kernel.BufferCacheHitRatio                           0x0002093108    0x00000050    0x00000050
kernel.MSIXdebug                                     0x0002094714    0x00000001    0x00000001
kernel.MSIXenable                                    0x000209471c    0x00000001    0x00000001
kernel.MSI_NoStop                                    0x0002094710    0x00000001    0x00000001
kernel.MSIenable                                     0x0002094718    0x00000001    0x00000001
kernel.MsiRouting                                    0x0002094724    0x00000001    0x00000001
kernel.WatchDog                                      0x0003aeb4e0    0x00000001    0x00000001
kernel.autoreboot                                    0x0003a0aefc    0x00000258    0x00000258
kernel.bcmTimeoutFix                                 0x0002179920    0x00000002    0x00000002
kernel.buffersWatermarkPercentage                    0x0003ae964c    0x00000021    0x00000021
kernel.bufreclaim                                    0x0003ae9640    0x00000001    0x00000001
kernel.canRunRT                                      0x000208f7a0    0xffffffff    0xffffffff
kernel.dumpcompress                                  0x000208f794    0x00000001    0x00000001
kernel.enableFCFastInit                              0x00022c29d4    0x00000001    0x00000001
kernel.enableWarmReboot                              0x000217ee68    0x00000001    0x00000001
kernel.forceWholeTLBflush                            0x00039d0900    0x00000000    0x00000000
kernel.heapHighWater                                 0x00020930c8    0x00004000    0x00004000
kernel.heapLowWater                                  0x00020930c4    0x00000080    0x00000080
kernel.heapReserve                                   0x00020930c0    0x00022e98    0x00022e98
kernel.highwatermakpercentdirty                      0x00020930e0    0x00000064    0x00000064
kernel.lockstats                                     0x0002093128    0x00000001    0x00000001
kernel.longLivedChunkSize                            0x0003a23ed0    0x00002710    0x00002710
kernel.lowwatermakpercentdirty                       0x0003ae9654    0x00000000    0x00000000
kernel.mallocHeapLimit                               0x0003b5558c    0x00040000    0x00030000  (This is the parameter I changed)
kernel.mallocHeapMaxSize                             0x0003b55588    0x00000000    0x00000000
kernel.maskFcProc                                    0x0002094728    0x00000004    0x00000004
kernel.maxSizeToTryEMM                               0x0003a23f50    0x00000008    0x00000008
kernel.maxStrToBeProc                                0x0003b00f14    0x00000080    0x00000080
kernel.memSearchUsecs                                0x000208fa28    0x000186a0    0x000186a0
kernel.memThrottleMonitor                            0x0002091340    0x00000001    0x00000001
kernel.outerLoop                                     0x0003a0b508    0x00000001    0x00000001
kernel.panicOnClockStall                             0x0003a0cf30    0x00000000    0x00000000
kernel.pciePollingDefault                            0x00020948a0    0x00000001    0x00000001
kernel.percentOfFreeBufsToFreePerIter                0x00020930cc    0x0000000a    0x0000000a
kernel.periodicSyncInterval                          0x00020930e4    0x00000005    0x00000005
kernel.phTimeQuantum                                 0x0003b86e18    0x000003e8    0x000003e8
kernel.priBufCache.ReclaimPolicy                     0x00020930f4    0x00000001    0x00000001
kernel.priBufCache.UsageThreshold                    0x00020930f0    0x00000032    0x00000032
kernel.protect_zero                                  0x0003aeb4e8    0x00000001    0x00000001
kernel.remapChunkSize                                0x0003a23fd0    0x00000080    0x00000080
kernel.remapConfig                                   0x000208fe40    0x00000002    0x00000002
kernel.retryTLBflushIPI                              0x00020885b0    0x00000001    0x00000001
kernel.roundRobbin                                   0x0003a0b504    0x00000001    0x00000001
kernel.setMSRs                                       0x0002088610    0x00000001    0x00000001
kernel.shutdownWdInterval                            0x0002093238    0x0000000f    0x0000000f
kernel.startAP                                       0x0003aeb4e4    0x00000001    0x00000001
kernel.startIdleTime                                 0x0003aeb570    0x00000001    0x00000001
kernel.stream.assert                                 0x0003b00060    0x00000000    0x00000000
kernel.switchStackOnPanic                            0x000208f8e0    0x00000001    0x00000001
kernel.threads.alertOptions                          0x0003a22bf4    0x00000000    0x00000000
kernel.threads.maxBlockedTime                        0x000208f948    0x00000168    0x00000168
kernel.threads.minimumAlertBlockedTime               0x000208f94c    0x000000b4    0x000000b4
kernel.threads.panicIfHung                           0x0003a22bf0    0x00000000    0x00000000
kernel.timerCallbackHistory                          0x000208f780    0x00000001    0x00000001
kernel.timerCallbackTimeLimitMSec                    0x000208f784    0x00000003    0x00000003
kernel.trackIntrStats                                0x000209021c    0x00000001    0x00000001
kernel.usePhyDevName                                 0x0002094720    0x00000001    0x00000001

Using the server_stats command on Celerra / VNX File

server_stats is a CLI-based, real-time performance monitoring tool from EMC for the Celerra and VNX File.  This post is meant to give a quick overview of the server_stats command with some samples on using it in a scheduled cron job. If you’re looking to dive into the server_stats feature, I’d suggest using the online manual pages (man server_stats) to get a good idea of all the features and reviewing the “Managing Statistics for VNX” guide from EMC here:  http://corpusweb130.emc.com/upd_prod_VNX/UPDFinalPDF/en/Statistics.pdf.

I don’t personally use it so I can’t explain how to set it up, but there is an open-source tool called vnx2graphite that you can use to push server_stats data to Graphite. You can get it here: http://www.findbestopensource.com/product/fatz-vnx2graphite.   You can download Graphite here: http://graphite.wikidot.com/.

Here is the command line syntax:

server_stats <movername>
   -list
 | -info [-all | <statpath_name>[,...]]
 | -service { -start [-port <port_number>]
            | -stop
            | -delete
            | -status }
 | -monitor -action {status|enable|disable}
 | [
     [{ -monitor {<statpath_name>|<statgroup_name>}[,...]
      | -monitor {<statpath_name>|<statgroup_name>}
          [-sort <field_name>]
          [-order {asc|desc}]
          [-lines <lines_of_output>]
     }...]
     [-count <count>]
     [-interval <seconds>]
     [-terminationsummary {no|yes|only}]
     [-format {text [-titles {never|once|<repeat_frequency>}]|csv}]
     [-type {rate|diff|accu}]
     [-file <output_filepath> [-overwrite]]
     [-noresolve]
   ]

Here’s an explanation of a few of the useful table options and what to look for:

Syntax:  server_stats server_2 -i <interval in sec> -c <# of counts> -table <stat>

table cifs 

-Look at uSec/call. The output is in microseconds; divide by 1,000 to convert to milliseconds. This tells you how long it takes the Celerra to perform specific CIFS operations.

table dvol 

-This is for disk stats.  It shows the write distribution across all volumes.  Look for IO balance across resources.

table fsvol 

-Use this to check filesystem IO.  You’ll be able to monitor which file systems are getting all of the IO with this table.

Start with an interval of 1 first to look for spikes or bursts and then increase it incrementally (10 seconds, 30 seconds, 1 minute, 5 minutes, etc.). You can also use Celerra Monitor to get Clariion stats.  Look at queueing, cache flushes, etc.  Writes are acknowledged from cache on the Clariion, so unless your write cache is filling up they should be faster than reads.

Here are some sample commands and what they do:

server_stats server_2 -table fsvol -interval 1 -count 10

-This correlates the filesystem to the meta-volumes and shows the % contribution of write requests for each meta-volume (FS Write Reqs %).

server_stats server_2 -table net -interval 1 -count 10

-This shows Network in (KiB/s) / Network In (Pkts/s) to figure out the packet size.  Do this for in and for out to verify the standard MTU size.

server_stats server_2 -summary nfs,cifs -interval 1 -count 10

-This will give a summary of performance stats for nfs and cifs.

Here are some additional sample commands, and how you can add to your crontab to automatically collect performance data:

Collect CIFS and NFS data every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor cifs.smb1,cifs.smb2,nfs.v2,nfs.v3,nfs.v4,cifs.global,nfs.basic -format csv -terminationsummary no -i 5 -c 60 -type accu -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

In the command above, the -type accu option tells the command to accumulate statistics upon each capture rather than starting back at a baseline of zero. You can also use ‘diff’ to capture the difference from interval to interval.

Collect diskVol performance stats every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor diskVolumes-std -i 5 -c 60 -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

Collect top_talkers data every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor nfs.client -i 5 -c 60 -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

Below are some useful nfs and cifs stats that you can monitor (pulled from DART 8.1.2-51).  For a full list, run the command server_stats server_2 -i.

cifs.global.basic.totalCalls,cifs.global.basic.reads,cifs.global.basic.readBytes,cifs.global.basic.readAvgSize,cifs.global.basic.writes,cifs.global.basic.writeBytes,
cifs.global.basic.writeAvgSize,cifs.global.usage.currentConnections,cifs.global.usage.currentOpenFiles

nfs.basic,nfs.client,nfs.currentThreads,nfs.export,nfs.filesystem,nfs.group,nfs.maxUsedThreads,nfs.totalCalls,nfs.user,nfs.v2,nfs.v3,nfs.v4,nfs.vdm

 

Celerra Disk Provisioning Wizard incorrectly believes there are not enough drives available for provisioning

I recently had an issue attempting to extend our production NAS file pool on our NS-960.  We had just added six new 2TB SATA disks to the array, and when I launched the Disk Provisioning Wizard it gave me this error:

“The number of drives available for provisioning additional storage are insufficient”.

That of course wasn’t true, as a 4+2 RAID6 config is indeed supported on this platform and I had just added six drives. I did come up with a workaround to do it manually, thanks to some helpful advice from our local EMC technical rep.  I manually created a RAID6 RAID Group in a 4+2 config, and then created a single LUN using all of the available space in the RAID Group (about 7337GB).   Once the LUN is created, you can add it to the Celerra storage group; in my case it was named “Celerra_hostname”.

When adding the LUN to the storage group, there is a critical step that you must not skip.  The HLU number must be modified!  After you click on a LUN, click add, look for it in the list and notice that the far right column shows the HLU (Host LUN ID).  The LUN you just added will have a blank entry.  It doesn’t look like it’s an editable field, but it is – simply click on the blank area where the number should be and you’ll get a drop-down box.  The number you choose must be greater than 15.  Once you’ve modified the HLU for the new LUN, click OK to complete the process.

Next, you’ll want to switch back over to the Celerra Management interface, click on the ‘Storage’ tab, then click on the ‘Rescan Storage Systems’ link.  You will get a warning message that states:

“Rescan detects newly available storage and storage systems. Do not rescan unless all primary Data Movers are operating normally. The operation might take a few minutes to complete.”  

Heed the warning and make sure your data movers are up and functional.  You can monitor the progress in the background tasks area.   On my first attempt the Rescan failed.  I got this error message:

“Storage API code=3593: SYMAPI_C_CLARIION_LOAD_ERROR.  An error occurred while data was being loaded from a Clariion.” | “No additional information is available” | “No recommended action is available”.

At the point I got that error I was at the end of my work day and decided to get back to it the next day.  I had planned on opening an SR.  When I re-ran the same scan the next day it worked fine and my production pool auto-extended.  Problem solved.

How to reserve a Celerra / VNX NAS share for a single file type or group of file types

Several years ago I posted on Celerra/VNX NAS file extension filtering (see here), but didn’t write about file system reservations for specific file types, which is also possible.  You can set up NAS shares so that only the file types you want stored there can be written to the share.

In order to do this, first navigate to the \\NAS_Server\C$ administrative share and open the “.filefilter” folder.  You’ll then want to create the following filter files to complete the configuration:

allfiles[@<sharename>][@NetBIOS_name] – this filter file prohibits all file types from being created on the share.  File types that you want excluded from this blanket deny are identified by regular filter files.

noext[@<sharename>][@NetBIOS_name] – this filter file prohibits files with no extensions from being created on the share.  It will prevent a user from saving a file with no filename extension.

<extension_name>[@<sharename>][@NetBIOS_name] – this filter file identifies the types of files you want allowed on the share.  You will need to configure the ACLs on the filter file to identify which users and/or groups can create files of that type on the share.  File types specified by regular filter files like this one are the exceptions to the allfiles restriction.  If you wanted to reserve a share for only Outlook data files and message files, you could create two filter files, pst[@<sharename>][@NetBIOS_name] and msg[@<sharename>][@NetBIOS_name], then set appropriate permissions, as shown in the hypothetical example below.
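
As a hypothetical example, to reserve a share named “finance” on a CIFS server with the NetBIOS name NASSERVER for .pst and .msg files only, you would create these four empty filter files in the .filefilter folder and then set the ACLs on the pst and msg filter files to control who can write those file types:

allfiles@finance@NASSERVER   (prohibits all file types on the finance share)
noext@finance@NASSERVER      (prohibits files with no extension)
pst@finance@NASSERVER        (allows .pst files, per its ACL)
msg@finance@NASSERVER        (allows .msg files, per its ACL)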

 

How to create a clone of a file system on a Celerra or VNX using nas_copy

Below are the steps used to create a clone of a file system on a Celerra or VNX.  Cloning can be done to the same data mover, a different data mover on the same array, or to a completely different Celerra or VNX data mover.

  • The source file system needs to be mounted read-only, or alternately you can use a checkpointed copy of the file system.  Using a checkpoint is the least disruptive method if the source file system is being used in production.
  • The destination file system must be the same size or larger than the source file system and must also be mounted read-only.
  • If you created the new file system for the clone copy on the same storage pool as the source file system, the copy performance will suffer as you’re reading/writing to the same disks.
  • The ReplicatorV2 license must be enabled for nas_copy to work.

To check the status of your installed licenses, use nas_license -list. The output looks like this:

[nasadmin@celerra01 ~]$ nas_license -list
key status value
site_key online 51 56 2e 69
cifs online
nfs online
iscsi online
snapsure online
replicatorV2 online

  • To enable the replicator license, run this command:

nas_license -create replicatorV2

  • Once the source file system (or checkpoint, depending on which one you chose) and target file system are mounted correctly, verify the correct interconnect you want to use for the copy. You can review the configured interconnects with this command:

[nasadmin@celerra01 ~]$ nas_cel -interconnect -list

  • To begin the clone, run the nas_copy command, which has the following syntax:

[nasadmin@celerra01 ~]$ nas_copy

-name <sessionName>
     -source
     {-fs {<name> | id=<fsId>} | -ckpt {<ckptName> | id=<ckptId>}}
     -destination
     {-fs {id=<dstFsId>|<existing_dstFsName>}
     |-pool {id=<dstStoragePoolId> | <dstStoragePool>}}
     [-from_base {<ckpt_name>|id=<ckptId>}]
     -interconnect {<name> | id=<interConnectId>}
     [-source_interface {<nameServiceInterfaceName> | ip=<ipaddr>}]
     [-destination_interface {<nameServiceInterfaceName> | ip=<ipaddr>}]
     [-overwrite_destination]
     [-refresh]
     [-full_copy]
     [-background]

  • Below is an example of a valid nas_copy command. This command will create a copy of the source file system on the same Data Mover.

[nasadmin@celerra01 ~]$ nas_copy -name copy_session -source -ckpt checkpoint_of_source_filesystem -destination -fs clone_of_Source_filesystem -interconnect loopback -background

  • You can monitor the progress of the clone copy using nas_replicate -list.  The output of the command looks like this:
Name               Type        Local Mover  Interconnect      Celerra        Status
Site1_VDM01        vdm         server_2     <--Site2_VNX5700  Site1_NS960    Stopped
Site2_Filesystem2  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem3  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem4  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem5  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
.
.
.

Celerra / VNX File replication job creation errors when selecting destination

I recently had an issue with setting up a new Celerra file system replication job. As soon as I selected the destination system I received the three errors below.

1) Query VDMs All. Cannot access any Data Mover on the remote system, hostname

Severity: Error
Brief Description: Cannot access any Data Mover on the remote system, hostname
Full Description: No Data Movers are available on the specified remote system to perform this operation
Recommended Action: 1) Check if the Data Movers on the specified remote system are accessible. 2) Ensure that the difference in system times between the local and remote Celerra systems or VNX systems does not exceed 10 minutes. Use NTP on the Control Stations to synchronize the system clocks. 3) Ensure that the passphrase on the local Control Station matches with the passphrase on the remote Control Station. 4) Ensure that the same local users that manage VNX for file systems exist on both the source and the destination Control Station. 5) Ensure that the global account is mapped to the same local account on both local and remote VNX Control Stations. Primus emc263860 provides more details.
Message ID: 13690601568

2) Query storage pools All. Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

Severity: Error
Brief Description: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]
Full Description: Operation failed for the reason described in the accompanying message.
Recommended Action: Correct the cause of the problem and repeat the operation.
Message ID: 13421840573

3) There are no destination pools available.

Severity: Info
Brief Description: There are no destination pools available.
Full Description: Destination side storage pools are not available.
Recommended Action: Check whether the storage pools have enough space.
Message ID: 26845970450

I was unable to determine the cause of the problem so I opened an SR with EMC.

It turns out there was a user discrepancy between the /etc/passwd file and the /nas/site/user_db file. This was causing the following error when checking the interconnect:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
Error 2237: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

The output should look something like this:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
id    name          source_server destination_system destination_server
20001 loopback      server_2      DRSITE1            server_2
20003 SITE1VNX5500  server_2      DRSITE1            server_2
20004 SITE2NS960    server_2      DRSITE2            server_2
20007 SITE11NS960   server_2      DRSITE1            server_2
20005 SITE3VNX5500  server_2      DRSITE1            server_2
20006 SITE4VNX5500  server_2      DRSITE1            server_2
20008 SITE5NS120    server_2      DRSITE1            server_2
40001 loopback      server_4      DRSITE1            server_4
40003 SITE2NS40     server_4      DRSITE2            server_2

The problem was resolved by removing entries from /nas/site/user_db that were not in the /etc/passwd file. The mismatch was caused by a sysadmin manually modifying the passwd file: some old entries had been removed, and the matching changes were never made in the user_db file.

 

Using Microsoft’s BranchCache with the Celerra & VNX

We recently moved the data from several of our small offices to a central regional data center. In one case, a site that formerly had its own Celerra for file access was now accessing NAS through a WAN link.  As expected, access to files was much slower.  After doing some research on how to speed up file access for users at the branch office locations I came across Microsoft's BranchCache feature.

BranchCache is a WAN bandwidth optimization technology and is available on Windows 7 & 8, Server 2008 R2 and Server 2012. When BranchCache is enabled, it creates a cache of the content from the file server locally within the branch office. A client from the same network can then access the file very quickly from cache instead of having to download it across the WAN again. It's a great feature for optimizing local link utilization, increasing the responsiveness of applications, and reducing WAN bandwidth consumption.

BranchCache can operate in either Hosted Cache mode or Distributed Cache mode. In Hosted Cache mode, there is a Windows server configured to store the cached files. In Distributed Cache mode (appropriate for smaller sites), local clients in the office keep a copy of the content and make it available to other clients that access the same files.

Once your Windows administrator has BranchCache configured on a server (or it's been enabled on the local client PCs), enabling it on the Celerra/VNX side is very simple. Log in to the CLI and su to gain root credentials. Then type in the following command:

server_cifs server_2 -smbhash -service enable

If you'd like to enable BranchCache auditing so the Windows server administrators can see audit info in the Windows event logs, type in this command:

server_cifs server_2 -smbhash -audit enable

After that, you will need to restart the CIFS service on the data mover. Here are the commands to stop and start CIFS:

server_setup server_2 -P cifs -o stop
server_setup server_2 -P cifs -o start

To confirm that BranchCache was successfully enabled, type the following command:

server_cifs server_2 -smbhash -info

The output should look like this:

server_2:
Current smbhash parameters:
—————————
Enabled : Yes
Started : Yes

Finally, to enable hash support on each individual CIFS share that you want to use with BranchCache clients, use the command syntax below.  Note that EMC's "Configuring BranchCache" document I link to at the bottom of this post does not contain this next command (a major oversight); it was pulled from page 58 of EMC's Configuring CIFS guide.

# server_export <datamover_name> -name <fs_name> -option <netbios_name_of_CIFS_Server>, type=hash /<fs_name>
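
Filling in hypothetical values (a file system named FS01 shared from a CIFS server named CIFSSERVER1), that syntax would look something like this:

# server_export server_2 -name FS01 -option CIFSSERVER1,type=hash /FS01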

EMC has a detailed document on how to configure BranchCache, including all the steps you’d need to take to configure the server and PC clients. If you have an EMC support account you can download it here:

https://support.emc.com/docu42265_Configuring-BranchCache-V2-on-VNX.pdf?language=en_US

There is also a lengthy section on BranchCache in EMC’s CIFS Management for VNX guide here:

https://mydocs.emc.com/VNXDocs/CIFS.pdf

Here is the link to Microsoft’s TechNet guide for BranchCache:

http://technet.microsoft.com/en-us/library/hh831696.aspx

Creating an NFS export from a file system on a Virtual Data Mover (VDM)

When creating a new file system on a virtual data mover, you may have noticed that it can't be NFS exported from within Unisphere.  From the GUI, file systems associated with a VDM do not appear on the drop down list for NFS exports when you attempt to create one.  I had always assumed it simply couldn't be done, until we had a business need for it and I investigated a bit further.  As it turns out, NFS exports can be created on a VDM path from the command line interface, with a few restrictions.

1. You need to export the NFS mountpoint at the physical Data Mover level (server_2, server_3, etc.) and include the nested path of the VDM.  The path is generally /root_vdm_X/fs_name. You cannot export at the VDM level (see the example command after this list).

2. Since the NFS export is done at the physical Data Mover level it is not restricted to one specific VDM.

3. The nested NFS export can only be done from the command line; it is not an option from within Unisphere.  Once you have created the NFS export, however, you can use the GUI to make changes to it.

4. The client must map the NFS export by using celerra:/root_vdm_X/fs_name, or they need to set up NFS export aliasing to hide the nested path.

5. IP replication of the VDM will not replicate the source site NFS exports to the destination site. You will need to manually create those NFS exports on the destination side. Also, you cannot access those file systems unless the VDM has failed over.
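
Here's what restriction 1 looks like in practice; a minimal sketch where the client name, VDM root path, and file system name are all hypothetical:

server_export server_2 -Protocol nfs -option rw=client1,root=client1 /root_vdm_2/fs_name

The client would then mount it with something like 'mount celerra:/root_vdm_2/fs_name /mnt/fs_name' (restriction 4 above).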

What is EMC’s CAVA / Common Event Enabler?

I was recently asked to do a bit of research on EMC’s CAVA product, as we are looking for AntiVirus solutions for our CIFS based shares.  I found very little info with general google searches about exactly what CAVA is and what it does, so I thought I’d share some of the information that I did find after a bit of research and talking to my local EMC rep. 

Basically, CAVA is a service that runs on the Celerra (or VNX) data mover in conjunction with a Windows server running a 3rd party anti-virus engine (along with EMC's CAVA API agent) to handle the conversation.  It only facilitates the communication to an existing AV server, EMC doesn't provide the actual AV software.  It supports Symantec, McAfee, eTrust, Sophos, Kaspersky, and Trend Micro.  In a nutshell, CAVA employs three key components:  software on the data mover (VC Client), software on a Windows AV server (CAVA), and your 3rd party AV engine on a Windows server.

CAVA used to stand for “Celerra Anti Virus Agent”, but was changed to “Common AntiVirus Agent”.  Quite convenient that they could re-use the “C” without changing the acronym, right? The product is now officially known as “Common Event Enabler for Windows” by EMC and the package includes CEPA, or the EMC Common Event Publishing Agent, and CAVA, the aforementioned Common Antivirus Agent.  For this post I’m focusing on the Antivirus agent.

CAVA is a fairly straightforward install; however, if implemented incorrectly it can adversely affect your performance. It's important to know how it scans your files and essential to know how to troubleshoot it and do performance monitoring.  There is definitely a performance hit when using CAVA.

When are files scanned for a virus? 

Each time the Celerra receives a file, it will be locked for read access first, at which time a request is sent to the AV server (or servers) to scan the file.  The Celerra will send the UNC path name to the Windows server and wait for verification that the file is not infected.  Once that verification is complete, the file is made available for user access.

CAVA will scan a file in the following instances: 

  • The first time a file is read following the initial implementation of CAVA, and after any update to the virus definitions
  • When creating, modifying, or moving a file
  • When restoring a file (or files) from backup
  • When renaming a file with a different file extension
  • Whenever an administrator performs a full file system scan (with the server_viruschk command; see the sketch after this list)
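
For reference, a full scan can be started and checked from the CLI with server_viruschk; a hedged sketch, with the file system name hypothetical (verify the -fsscan options against your DART version's man page):

server_viruschk server_2                         # display current virus checker status
server_viruschk server_2 -fsscan fs01 -create    # start a full scan of fs01
server_viruschk server_2 -fsscan fs01 -list      # check scan progress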

What are the features of CAVA? 

  • Automatic virus definition updates; files opened after the update will be re-scanned
  • CAVA Calculator (a free sizing tool to assist in implementation)
  • User notifications on virus detection, configurable by administrators to be sent as notifications to the client, event log entries, or both
  • Scan-on-read can be enabled
  • Event reporting and configuration

What are some implementation considerations? 

  • EMC recommends that an MPFS client system not be configured as the AV server system.
  • CAVA doesn't support a data mover CIFS server using share level access.
  • Always update the viruschecker.conf file to avoid scanning temp files. It can be modified with the Celerra AV Management Snap-In.
  • It's CIFS only. There is no support for NFS or FTP. If those protocols are used to open, modify, or move files, the files will not be scanned.
  • You must check for compatibility with your installed 3rd party AV software.

How is it licensed, and how much does it cost?

CAVA is licensed per array; on the VNX series it is part of the Security and Compliance Suite.  Pricing will vary of course, but it's not very expensive relative to the cost of the array.  It should be in the range of thousands rather than tens of thousands of dollars.

 

Changing the Login banner on a Celerra / VNX File control station

If you have a security requirement to change the login screen on operating systems that are accessible via a terminal session, it's a fairly easy change on a Celerra or VNX File control station due to the DART OS being based on Linux.  Simply log in as root and edit the /etc/issue file with vi.  It contains the login banner that's displayed when you open a terminal session to a control station.

This is what the default /etc/issue file contains on a VNX control station, and what you’ll see by default when you log in:

A customized version of the Linux operating system is used on the
EMC(R) VNX(TM) Control Station.  The operating system is
copyrighted and licensed pursuant to the GNU General Public License
(“GPL”), a copy of which can be found in the accompanying
documentation.  Please read the GPL carefully, because by using the
Linux operating system on the EMC VNX you agree to the terms
and conditions listed therein.
 
EXCEPT FOR ANY WARRANTIES WHICH MAY BE PROVIDED UNDER THE TERMS AND
CONDITIONS OF THE APPLICABLE WRITTEN AGREEMENTS BETWEEN YOU AND EMC,
THE SOFTWARE PROGRAMS ARE PROVIDED AND LICENSED “AS IS” WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT
NOT LIMITED TO, THE IMPLIED MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.  In no event will EMC Corporation be liable to
you or any other person or entity for (a) incidental, indirect,
special, exemplary or consequential damages or (b) any damages
whatsoever resulting from the loss of use, data or profits,
arising out of or in connection with the agreements between you
and EMC, the GPL, or your use of this software, even if advised
of the possibility of such damages.
 
EMC, VNX, Celerra, and CLARiiON are registered trademarks or trademarks of
EMC Corporation in the United States and/or other countries. All
other trademarks used herein are the property of their respective
owners.
 
EMC VNX Control Station Linux release 3.0 (NAS 7.1.71)
 

I don't work for IBM, but below is an example of what you could change the /etc/issue file to.  I included the hostname of the control station, our company logo, and a legal warning regarding unauthorized logins and system monitoring.  Because the standard login banner from EMC includes the DART OS version on the last line, I suspect that the /etc/issue file is updated along with any OS upgrades or patches, although I can't confirm that right now.  I'm keeping a backup copy in the same directory to easily change it back after a DART upgrade.

You can also edit the /etc/motd file, which displays immediately after a successful login.  By default the file contains the message “EMC Celerra Control Station Linux” or “EMC VNX Control Station Linux”, along with what I assume to be the date of the original OS installation.

—————————————————————
EMC VNX Control Station <Host_Name>
—————————————————————
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBB    MMMMMMM     MMMMMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBBB   MMMMMMMM   MMMMMMMM
IIII         BBBB           BBBB  MMMM MMMM MMMM MMMM
IIII          BBBB           BBBB  MMMM MMMM MMMM MMMM
IIII          BBBBBBBBBBBBBBBBB    MMMM  MMMMMMM  MMMM
IIII          BBBBBBBBBBBBBBBBB    MMMM   MMMMM   MMMM
IIII          BBBB           BBBB  MMMM    MMM    MMMM
IIII          BBBB           BBBB  MMMM     M     MMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBBB   MMMM           MMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBB    MMMM           MMMM
—————————————————————
Warning: This system is restricted to IBM authorized users for
business purposes only. Unauthorized access or use is
a violation of company policy and the law.
—————————————————————
This system may be monitored for administrative and security
reasons. By logging in, you agree that you have read and
understand this notice.
—————————————————————
nasadmin@<Host_Name>’s password:
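
Since I keep a backup copy in the same directory, the edit itself is just the following (as root; the backup file name is my own convention):

cp -p /etc/issue /etc/issue.orig
vi /etc/issue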

Alerting on VNX File SP Failover

You may see a Celerra alert description that states “Disk dx has been trespassed” or “Storage Processor is ready to restore”.  This would mean that one or more Celerra LUNs aren’t being accessed through the default SP.  It can happen during a DART upgrade, some other maintenance activity, or simply a temporary loss of connectivity to the default SP for the LUN.  I wanted to set up an email alert to let the storage team know when an SP failover occurs, so I wrote a script to send an alert email when a failover is detected.  It can be run directly on the data mover and scheduled as frequently as you like via cron.
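
For reference, a crontab entry like the one below would run the check every ten minutes; the script path and name are just my assumption, use whatever you saved it as. stdout is redirected to /dev/null:

*/10 * * * * /scripts/health/sp_failover_check.sh > /dev/null 2>&1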

Here’s the script:

#!/bin/bash

TODAY=$(date) 
HOST=$(hostname) 

# Checks and displays status of NAS storage, will show SP/disk Failover info. 
# We will use this info to include in the alert email if needed. 

/nas/bin/nas_storage -check -all > /scripts/health/backendcheck.txt 

#  [nasadmin@celerra]$ nas_storage -check -all
#  Discovering storage (may take several minutes)
#  Error 5017: storage health check failed
#  CK900052700319 SPA is failed over
#  CK900052700319 d6 is failed over

# Shows detailed info, I'm only pulling out failover info. 

/nas/bin/nas_storage -info -all | grep failed_over > /scripts/health/failovercheck.txt 

# The command above results in this output: 
#   failed_over = <True/False> 
#   failed_over = <True/False> 
#   failed_over = <True/False> 

# The first entry is the value for the array, second is SPA, third is SPB. 

# The next line pulls the True/False value for the entire array (the third value on the first line of output) 

echo `cat /scripts/health/failovercheck.txt | awk '{if (NR<2) print $3}'` > /scripts/health/arrayFOstatus.txt

# Read the value back now that it's current; reading it at the top of the script would use the previous run's result.
STATUS=`cat /scripts/health/arrayFOstatus.txt`

# Now we check the value of STATUS; if it's 'True', we send an email notification that an SP is failed over.

# In addition to sending an email, you could also run the 'nas_storage -failback id=1' command to automatically fail it back.

if [ "$STATUS" == "False" ]; then  
   echo "Value is False" 
fi

if [ "$STATUS" == "True" ]; then  
   mail -s "SP Failover on $HOST" username@domain.com < /scripts/health/backendcheck.txt  

   #nas_storage -failback id=1 #Optionally fail it back, our team decided to alert only and fail back manually.

   echo "Value is True" 
fi

If a failover is detected, you can manually fail it back with the following commands:

Determine/Confirm the ID number:

[nasadmin@celerra]$ nas_storage -list
 id   acl    name                     serial_number
 1    0      CK900052700319 CK900052700319

Fail it back (will fail back Celerra/VNX File LUNs only):

[nasadmin@celerra]$ nas_storage -failback id=1
id  = 1  
serial_number   = CK900052700319
name  = CK900052700319
acl  = 0  
done

Reporting on Celerra / VNX NAS Pool capacity with a bash script

I recently created a script that I run on all of our Celerras and VNXs that reports on NAS pool size.   The output from each array is then converted to HTML and combined on a single intranet page to provide a quick at-a-glance view of our global NAS capacity and disk space consumption.  I made another post that shows how to create a block storage pool report as well:  http://emcsan.wordpress.com/2013/08/09/reporting-on-celerravnx-block-storage-pool-capacity-with-a-bash-script/

The default nas_pool -size command unfortunately outputs in megabytes, with no option to change to GB or TB.  This script performs the MB to GB conversion and adds a comma as the numerical separator (what we use in the USA) to make the output much more readable.

First, identify the ID number for each of your NAS pools.  You’ll need to insert the ID numbers into the script itself.

[nasadmin@celerra]$ nas_pool -list
 id      inuse   acl     name                      storage system
 10      y       0       NAS_Pool0_SPA             AKM00111000000
 18      y       0       NAS_Pool1_SPB             AKM00111000000

Note that the default output of the command that provides the size of each pool is in a very hard to read format.  I wanted to clean it up to make it easier to read on our reporting page.  Here's the default output:

[nasadmin@celerra]$ nas_pool -size -all
id           = 10
name         = NAS_Pool0_SPA
used_mb      = 3437536
avail_mb     = 658459
total_mb     = 4095995
potential_mb = 0
id           = 18
name         = NAS_Pool1_SPB
used_mb      = 2697600
avail_mb     = 374396
total_mb     = 3071996
potential_mb = 1023998

My script changes the output to look like the example below:

Name (Site)   ; Total GB ; Used GB  ; Avail GB
NAS_Pool0_SPA ; 4,000    ; 3,356.97 ; 643.03
NAS_Pool1_SPB ; 3,000    ; 2,634.38 ; 365.62

In this example there are two NAS pools and this script is set up to report on both.  It could easily be expanded or reduced depending on the number of pools on your array.  The variable names I used include the Pool ID number from the output above; those should be changed to match your IDs.  You'll also need to update the 'id=' portion of each command to match your Pool IDs.

Here’s the script:

#!/bin/bash

NAS_DB="/nas"
export NAS_DB

# Set the Locale to English/US, used for adding the comma as a separator in a cron job
export LC_NUMERIC="en_US.UTF-8"
TODAY=$(date)

 

# Gather Pool Name, Used MB, Available MB, and Total MB for First Pool

# Set variable to pull the Name of the pool from the output of 'nas_pool -size'.
name18=`/nas/bin/nas_pool -size id=18 | /bin/grep name | /bin/awk '{print $3}'`

# Set variable to pull the Used MB of the pool from the output of 'nas_pool -size'.
usedmb18=`/nas/bin/nas_pool -size id=18 | /bin/grep used_mb | /bin/awk '{print $3}'`

# Set variable to pull the Available MB of the pool from the output of 'nas_pool -size'.
availmb18=`/nas/bin/nas_pool -size id=18 | /bin/grep avail_mb | /bin/awk '{print $3}'`

# Set variable to pull the Total MB of the pool from the output of 'nas_pool -size'.
totalmb18=`/nas/bin/nas_pool -size id=18 | /bin/grep total_mb | /bin/awk '{print $3}'`

# Convert MB to GB, Add Comma as separator in output

# Remove '...b' variables if you don't want commas as a separator

# Convert Used MB to Used GB
usedgb18=`/bin/echo $usedmb18/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
usedgb18b=`/usr/bin/printf "%'.2f\n" "$usedgb18" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Convert Available MB to Available GB
availgb18=`/bin/echo $availmb18/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
availgb18b=`/usr/bin/printf "%'.2f\n" "$availgb18" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Convert Total MB to Total GB
totalgb18=`/bin/echo $totalmb18/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
totalgb18b=`/usr/bin/printf "%'.2f\n" "$totalgb18" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Gather Pool Name, Used MB, Available MB, and Total MB for Second Pool

# Set variable to pull the Name of the pool from the output of 'nas_pool -size'.
name10=`/nas/bin/nas_pool -size id=10 | /bin/grep name | /bin/awk '{print $3}'`

# Set variable to pull the Used MB of the pool from the output of 'nas_pool -size'.
usedmb10=`/nas/bin/nas_pool -size id=10 | /bin/grep used_mb | /bin/awk '{print $3}'`

# Set variable to pull the Available MB of the pool from the output of 'nas_pool -size'.
availmb10=`/nas/bin/nas_pool -size id=10 | /bin/grep avail_mb | /bin/awk '{print $3}'`

# Set variable to pull the Total MB of the pool from the output of 'nas_pool -size'.
totalmb10=`/nas/bin/nas_pool -size id=10 | /bin/grep total_mb | /bin/awk '{print $3}'`
 
# Convert MB to GB, Add Comma as separator in output

# Remove '...b' variables if you don't want commas as a separator
 
# Convert Used MB to Used GB
usedgb10=`/bin/echo $usedmb10/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
usedgb10b=`/usr/bin/printf "%'.2f\n" "$usedgb10" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Convert Available MB to Available GB
availgb10=`/bin/echo $availmb10/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
availgb10b=`/usr/bin/printf "%'.2f\n" "$availgb10" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Convert Total MB to Total GB
totalgb10=`/bin/echo $totalmb10/1024 | /usr/bin/bc -l | /bin/sed 's/^\./0./;s/0*$//;s/0*$//;s/\.$//'`

# Add comma separator
totalgb10b=`/usr/bin/printf "%'.2f\n" "$totalgb10" | /bin/sed 's/\.00$// ; s/\(\.[1-9]\)0$/\1/'`

# Create Output File

# If you don't want the comma separator in the output file, substitute the variable without the 'b' at the end.

# I use the semicolon rather than the comma as a separator due to the fact that I'm using the comma as a numerical separator.

# The comma could be substituted here if desired.

/bin/echo $TODAY > /scripts/NasPool.txt
/bin/echo "Name" ";" "Total GB" ";" "Used GB" ";" "Avail GB" >> /scripts/NasPool.txt
/bin/echo $name18 ";" $totalgb18b ";" $usedgb18b ";" $availgb18b >> /scripts/NasPool.txt
/bin/echo $name10 ";" $totalgb10b ";" $usedgb10b ";" $availgb10b >> /scripts/NasPool.txt

Here's what the output looks like:

Wed Jul 17 23:56:29 JST 2013
Name (Site)   ; Total GB ; Used GB  ; Avail GB
NAS_Pool0_SPA ; 4,000    ; 3,356.97 ; 643.03
NAS_Pool1_SPB ; 3,000    ; 2,634.38 ; 365.62

I use a cron job to schedule the report daily and copy it to our internal web server.  I then run the csv2html.pl perl script (from http://www.jpsdomain.org/source/perl.html) to convert it to an HTML output file to add to our intranet report page.

Note that I had to modify the csv2html.pl command to accommodate the use of a semicolon instead of the default comma in a csv file.  Here is the command I use to do the conversion:

./csv2htm.pl -e -T -D ";" -i /reports/NasPool.txt -o /reports/NasPool.html

Below is what the output looks like after running the HTML conversion tool.

(NASPool report screenshot)
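
As a side note, if you have more than two pools the per-pool blocks get unwieldy; the same report can be generated with a loop. A minimal sketch, assuming the same pool IDs and output path as above:

#!/bin/bash
export NAS_DB=/nas
export LC_NUMERIC="en_US.UTF-8"

OUT=/scripts/NasPool.txt
date > $OUT
echo "Name ; Total GB ; Used GB ; Avail GB" >> $OUT

# Pool IDs from 'nas_pool -list'; adjust the list for your array
for id in 10 18; do
    name=$(/nas/bin/nas_pool -size id=$id | grep name | awk '{print $3}')
    usedmb=$(/nas/bin/nas_pool -size id=$id | grep used_mb | awk '{print $3}')
    availmb=$(/nas/bin/nas_pool -size id=$id | grep avail_mb | awk '{print $3}')
    totalmb=$(/nas/bin/nas_pool -size id=$id | grep total_mb | awk '{print $3}')

    # Convert MB to GB, two decimals, comma as the numerical separator
    used=$(printf "%'.2f" $(echo "$usedmb/1024" | bc -l))
    avail=$(printf "%'.2f" $(echo "$availmb/1024" | bc -l))
    total=$(printf "%'.2f" $(echo "$totalmb/1024" | bc -l))

    echo "$name ; $total ; $used ; $avail" >> $OUT
done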

Using the database query option on Celerra & VNX File commands


EMC has added a hidden query option on some of the nas commands that allows you to directly query the NAS database. I've tested the '-query' option on the nas_fs, nas_server, nas_slice and nas_disk commands; it may be available on other commands as well. It's a powerful command with a lot of options, so I took some time to play around with it today. The EMC documentation on this option in their command line reference guide is very sparse and I haven't done much experimentation with it yet, but I thought I'd share what I've found so far.

You can view all of the possible query tags by typing in the commands below. There are dozens of tags available for query. The output list for all the commands is large enough that it’s not practical to add it into this blog post.

nas_fs -query:tags
nas_server -query:tags
nas_slice -query:tags
nas_disk -query:tags
 

Here’s a snippet of the tags available for nas_fs (the first seven):

Supported Query Tags for nas_fs:
Tag           Synopsis
———————————————————–
ACL         The Access Control List of the item
ACLCHKInProgress         Is ACLCHK running on this fs?
ATime         The UTC date of the item’s table last access
AccessPolicyTranslationErrorMessage         The error message of the MPD translation error, if any.
AccessPolicyTranslationPercentComplete         The percentage of translation that has been completed by the Access Policy translation thread for this item.
AccessPolicyTranslationState         The state of the Access Policy translation thread for this item.
AutoExtend         Auto extend True/False
 

The basic syntax for a query looks like this:

nas_fs -query:inuse==y:type=uxfs:IsRoot=False -Fields:Name,Id,StoragePoolName,Size,SizeValues -format:'%s,%s,%s,%d,%d\n'

In the above example, we are running the query based on three options. We want the file system to be in use, be of the type uxfs, and not be root. In the fields parameter, we want the file system's name, ID, storage pool name, size, and size values. Because our output has five fields, and we want each file system to have its own line in the output, we add five formatting options separated by commas (for a csv type output), followed by a '\n' to create a carriage return after each file system's information has been output.

Here are the valid query operators:

= pattern match
== exact match
=- not less than
=+ not more than
=* any
=^ not having pattern
=^= not an exact match
=^- is less than
=^+ is greater than
=^* not any (none)

The format option is not well documented. The only parameters I've used for the format option are q and s. From what I've seen from testing, the tag used in the '-fields' parameter is either simple or complex. A complex tag must be formatted with q, a simple tag must be formatted with s. Adding the '\n' to the format option adds a carriage return to the output. If you use the wrong format parameter, it will display an error like this: "Error 2110: Invalid query syntax: the tag ('[Tag Option]') corresponding to this format specifier ("%q") is not a complex tag".

-format:'%s' : Simple Formatting
-format:'%q' : Complex Formatting

Below are some examples of using the query option. I will add more useful examples in the future after I’ve had more time to dive in to it.

To get the file system ID:

nas_fs -query:Name==[file system name] -format:'%s\n' -fields:VolumeID

List all file systems with their sizes:

nas_fs -query:inuse==y:type=uxfs:IsRoot=False -Fields:Name,Id,StoragePoolName,Size,SizeValues -format:'%s,%s,%s,%d,%d\n'

List all file system quotas on the array:

nas_fs -query:\* -fields:ID,TreeQuotas -format:'%s:\n%q#\n' -query:\* -fields:FileSystem,Path,BlockHardLimit,BlockSoftLimit,BlockGracePeriod,BlockUsage -format:'%s : %s : %s : %s : %s : %s\n'

List all Checkpoint file systems:

nas_fs -query:inuse=y:type=ckpt:isroot=false -fields:ServersNumeric,Id,Name,SizeValues -format:'%s,%s,%s,%s\n'

List all the Server_params for a specific server_X:

nas_server -query:Name=server_2 -format:'%q' -fields:ParamTable -query:Name== -fields:ChangeEffective,ConfiguredDefaultValue,ConfiguredValue,Current,CurrentValue,Default,Facility,IsRebootRequired,IsVisible,Name,Type,Description -format:'%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n'

List the current Wins/DNSDomain Config:

nas_server -query:Name=server -fields:Name,DefaultWINS,DNSDomain -format:"%s,%s,%s\n"

For a detailed file system deduplication report:

nas_fs -query:inuse=y:isroot=false:type=uxfs -fields:Deduplications -format:'%q' -query:RdeState=On -fields:Name,FsCapacity,SpaceSaved,SpaceReducedDataSize,DedupeRate,UnreducedDataSize,TimeOfLastScan -format:'%s,%s,%s,%s,%s,%s,%s\n'

Disk space reporting on sub folders on a VNX File CIFS shared file system

I recently had a comment on a different post asking how to report on the size of multiple folders on a single file system, so I thought I'd share the method I use to create reports and alerts on the space that sub folders on file systems consume. There is a way to navigate to all of the file systems from the Control Station: simply go to /nas/quota/slot_<x>, where slot_<x> refers to the data mover. The file systems on server_2 would be in the slot_2 folder.  Because we have access to those folders, we can simply run the standard unix 'du' command to get the amount of used disk space for each sub folder on any file system.

Running this command:

sudo du -h /nas/quota/slot_2/File_System/Sub_Folder

Will give an output that looks like this:

24K     /nas/quota/slot_2/File_System/Sub_Folder/1
0       /nas/quota/slot_2/File_System/Sub_Folder/2
16K     /nas/quota/slot_2/File_System/Sub_Folder

Each sub folder of the file system named “File_System” is listed with the space each sub folder uses. You’ll notice that I used the sudo command to run the du command. Unfortunately you need root access in order to run the du command on the /nas/quota directory, so you’ll have to add whatever account you log in to the celerra with to the /etc/sudoers file. If you don’t, you’ll get the error “Sorry, user nasadmin is not allowed to execute ‘/usr/bin/du -h -s /nas/quota/slot_2/File_System’ as root on “. I generally log in to the celerra as nasadmin, so I added that account to sudoers. To modify the file, su to root first and then (using vi) add the following to the very bottom of the file of /etc/sudoers (substitute nasadmin for the username you will be using):

nasadmin ALL=/usr/bin/du, /usr/bin/, /sbin/, /usr/sbin, /nas/bin, /nas/bin/, /nas/sbin, /nas/sbin/, /bin, /bin/
nasadmin ALL=/nas/bin/server_df

Once that is complete, you’ll have access to run the command.  You can now write a script that will report on specific folders for you. I have two scripts created that I’m going to share, one I use to send out a daily email report and the other will send an email only if the folder sizes have crossed a certain threshold of space utilization. I have them both scheduled as a cron job on the Celerra itself.  It should be easy to modify these scripts for anyone’s environment.

The following script will generate a report of folder and sub folder sizes for any given file system. In this example, I am reporting on three specific subfolders on a file system (Production, Development, and Test). I also use the grep and awk options at the beginning to pull out the percent full from the df command for the entire file system and include that in the subject line of the email.

#!/bin/bash
NAS_DB="/nas"
export NAS_DB
TODAY=$(date)
HOST=$(hostname)

PERCENT=`sudo df -h /nas/quota/slot_2/File_System | grep server_2 | awk '{print $5}'`

echo "File_System Folder Size Report Date: $TODAY Host:$HOST" > /home/nasadmin/report/fs_report.txt
 echo " " >> /home/nasadmin/report/fs_report.txt
 echo "Production" >> /home/nasadmin/report/fs_report.txt
 echo " " >> /home/nasadmin/report/fs_report.txt

sudo du -h -S /nas/quota/slot_2/File_System/Prod >> /home/nasadmin/report/fs_report.txt

echo " " >> /home/nasadmin/report/fs_report.txt
 echo "Development" >> /home/nasadmin/report/fs_report.txt
 echo " " >> /home/nasadmin/report/fs_report.txt

sudo du -h -S /nas/quota/slot_2/File_System/Dev >> /home/nasadmin/report/fs_report.txt

echo " " >> /home/nasadmin/report/fs_report.txt
 echo "Test" >> /home/nasadmin/report/fs_report.txt
 echo " " >> /home/nasadmin/report/fs_report.txt

sudo du -h -S /nas/quota/slot_2/File_System/Test >> /home/nasadmin/report/fs_report.txt

echo " " >> /home/nasadmin/report/fs_report.txt
 echo "% Remaining on File_System Filesystem" >> /home/nasadmin/report/fs_report.txt
 echo " " >> /home/nasadmin/report/fs_report.txt

sudo df -h /nas/quota/slot_2/File_System/ >> /home/nasadmin/report/fs_report.txt

echo $PERCENT

mail -s "Folder Size Report ($PERCENT In Use)" youremail@domain.com < /home/nasadmin/report/fs_report.txt

Below is what the output of that script looks like. It is included in the body of the email as plain text.

File_System Folder Size Report
Date: Fri Jun 28 08:00:02 CDT 2013
Host:<celerra_name>

Production

 24K     /nas/quota/slot_2/File_System/Production/1
 0       /nas/quota/slot_2/File_System/Production/2
 16K     /nas/quota/slot_2/File_System/Production

Development

 8.0K    /nas/quota/slot_2/File_System/Development/1
 108G    /nas/quota/slot_2/File_System/Development/2
 0          /nas/quota/slot_2/File_System/Development/3
 16K     /nas/quota/slot_2/File_System/Development

Test

 0       /nas/quota/slot_2/File_System/Test/1
 422G    /nas/quota/slot_2/File_System/Test/2
 0       /nas/quota/slot_2/File_System/Test/3
 16K     /nas/quota/slot_2/File_System/Test

% Remaining on File_System Filesystem

Filesystem Size Used Avail Use% Mounted on
 server_2:/ 2.5T 529G 1.9T 22% /nasmcd/quota/slot_2

The following script will only send an email if the folder size has crossed a defined threshold. Simply change the path to the folder you want to report on and change the value of the ‘threshold’ variable in the script to whatever you want it to be, I have mine set at 95%.

#!/bin/bash

# Get the % disk utilization from the df command
percent=`sudo df -h /nas/quota/slot_2/File_System | grep server_2 | awk '{print $5}'`

# Strip the % sign from the output value
percentvalue=`echo $percent | awk -F"%" '{print $1}'`

# Set the critical threshold that will trigger an email
threshold=95

# Compare the threshold value to the reported value, send an email if needed
if [ $percentvalue -eq 0 ] && [ $threshold -eq 0 ]
then
  echo "Both are zero"
elif [ $percentvalue -eq $threshold ]
then
  echo "Both values are equal"
elif [ $percentvalue -gt $threshold ]
then
  mail -s "File_System Critical Disk Space Alert: $percentvalue% utilization is above the threshold of $threshold%" youremail@domain.com < /home/nasadmin/report/fs_report.txt
else
  echo "$percentvalue is less than $threshold"
fi

Relocating Celerra Checkpoints

On our NS-960 Celerra we have multiple storage pools defined, one of which is specifically designated for checkpoint storage. Out of the 100 or so file systems that we run a checkpoint schedule on, about 2/3 of them were incorrectly writing their checkpoints to the production file system pool rather than the designated checkpoint pool, and the production pool was starting to fill up. I started to research how to change where checkpoints are stored. Unfortunately you can't actually relocate existing checkpoints; you need to start over.

In order to change where checkpoints are stored, you need to stop and delete any running replications (which automatically store root replication checkpoints) and delete all current checkpoints for the specific file system you’re working on. Depending on your checkpoint schedule and how often it runs you may want to pause it temporarily. Having to stop and delete remote replications was painful for me as I compete for bandwidth with our data domain backups to our DR site. Because of that I’ve been working on these one at a time.

Once you’ve deleted the relevant checkpoints and replications, you can choose where to store the new checkpoints by creating a single checkpoint first before making a schedule. From the GUI, go to Data Protection | Snapshots | Checkpoints tab, and click create. If a checkpoint does not exist for the file system, it will give you a choice of which pool you’d like to store it in. Below is what you’ll see in the popup window.

——————————————————————————–
Choose Data Mover [Drop Down List ▼]
Production File System [Drop Down List ▼]
Writeable Checkpoint [Checkbox]:
Data Movers: server_2
Checkpoint Name: [Fill in the Blank]

Configure Checkpoint Storage:
There are no checkpoints currently on this file system. Please specify how to allocate
storage for this and future checkpoints of this file system.

Create from:
*Storage Pool [Option Box]
*Meta Volume [Option Box]

——————————————————————————–
Current Storage System: CLARiiON CX4-960 APM00900900999
Storage Pool: [Drop Down List ▼]    <—*Select the appropriate Storage Pool Here*
Storage Capacity (MB): [Fill in the Blank]

Auto Extend Configuration:
High Water Mark: [Drop Down List for Percentage ▼]

Maximum Storage Capacity (MB): [Fill in the Blank]

——————————————————————————–
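
For what it's worth, the same first-checkpoint placement can also be done from the CLI with fs_ckpt; a hedged sketch, with the file system, checkpoint, and pool names all hypothetical (check your version's fs_ckpt man page for the exact options):

fs_ckpt Production_FS -name Production_FS_ckpt1 -Create size=10G pool=clarata_archive

Once that first checkpoint exists in the right pool, subsequent checkpoints from the schedule will use the SavVol it created.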

Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead appliance would fail when attempting to optimize SMBV2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

Adding a grep for Maximum and using awk to grab only the last column, this command will output only the number of open files, rather than the large output above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'

The output of that command would simply be ‘4915’ based on the sample full output I used above.
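
The quick script I mentioned is essentially just that one-liner wrapped with a timestamp; a minimal sketch, where the output path is my own choice:

#!/bin/bash
export NAS_DB=/nas
export PATH=$PATH:/nas/bin

# Grab the current open file count and append a timestamped entry for publishing
OPENFILES=`/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'`
echo "$(date) : $OPENFILES open files" >> /home/nasadmin/report/openfiles.txt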

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least 6.0.60.2 or 7.0.52.  You will also see in your steelhead logs a message similar to the one below indicating that the close request has failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {10.0.0.72:59207 10.0.0.72:445} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Collecting info on active shares, clients, protocols, and authentication on the VNX

This is a reference guide for listing and counting shares, clients, protocol and authentication information, and virus scan information.  All of this information could be used for a full VNX audit report.

How to list and count shares/exports (by protocol):

You can use the server_export command to list and count how many shares you have.

server_export server_2 -Protocol cifs -list -all:  This command will give you a list of all CIFS Shares across all of your CIFS servers.  It will count a file system twice if it’s shared on more than one CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name>:  This will give you a list of all CIFS shares on a specific CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name> | wc:  This will give you the number of CIFS shares on a specific CIFS server.  The "wc" command will output three numbers; the first number listed is the number of shares.

server_export server_2 -Protocol nfs -list -all:  This command will give you a list of all NFS exports.  It’s just like the previous commands, you can add “| wc” at the end to get a count.

How to list client connections by OS type:

To obtain information for the number of client connections you have by OS type, you’d have to use the server_cifs audit command.

To get a full list of every active connection by client type, use this command:

server_cifs server_2 -option audit,full | grep "Client("

The output would look like this:

|||| AUDIT Ctx=0x022a6bdc08, ref=2, W2K Client(10.0.5.161) Port=49863/445
|||| AUDIT Ctx=0x01f18efc08, ref=2, XP Client(10.0.5.42) Port=3890/445
|||| AUDIT Ctx=0x02193a2408, ref=2, W2K Client(10.0.5.61) Port=59027/445
|||| AUDIT Ctx=0x01b89c2808, ref=2, Fedora6 Client(10.0.5.194) Port=17130/445
|||| AUDIT Ctx=0x0203ae3008, ref=2, Fedora6 Client(10.0.5.52) Port=55731/445
...

In this case, if I wanted to count only the number of Fedora6 clients, I'd use the command server_cifs server_2 -option audit,full | grep "Fedora6 Client".  I could then add "| wc" at the end to get a count.
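
If you want counts for several client types at once, you can capture the audit output once and count from the capture; a small sketch (the OS strings are the ones from the sample output above, and the temp file path is arbitrary):

# Capture the audit output once, then count connections per client OS
server_cifs server_2 -option audit,full | grep "Client(" > /tmp/cifs_clients.txt

for os in W2K W2K8 XP Fedora6; do
    echo "$os : $(grep -c "$os Client" /tmp/cifs_clients.txt)"
done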

To do a full audit report:

The command server_cifs server_2 -option audit,full will do a full, detailed audit report and should capture just about anything else you’d need.  Every connection will have a detailed audit included in the report. Based on that output, it would be easy to run the command with a grep statement to pull only the information out that you need to create custom reports.

Below is a subset of what the output looks like from that command:

|||| AUDIT Ctx=0x0177206407, ref=2, W2K8 Client(10.0.0.1) Port=65340/445
||| CIFSSERVER1[DOMAIN] on if=<interface_name>
||| CurrentDC 0x0169fee808=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=c2de9f99-1945-11e2-a512-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x0125fc9408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0230414c08), Name=FileSystem1, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem1
| NTFSExtInfo: shareFS:fsid=49, rc=830, listFS:nb=1 [fsid=49,rc=830]

|||| AUDIT Ctx=0x0210f89007, ref=2, W2K8 Client(10.0.0.1) Port=51607/445
||| CIFSSERVER1[DOMAIN] on <interface_name>
||| CurrentDC 0x01f79aa008=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K8, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=5b410977-bace-11e1-953b-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x01c2367408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0195d2a408), Name=Filesystem2, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem2
| NTFSExtInfo: shareFS:fsid=4214, rc=19, listFS:nb=0

|||| AUDIT Ctx=0x006aae8408, ref=2, XP Client(10.0.0.99) Port=1258/445
 ||| CIFSSERVER1[DOMAIN] on if=<interface_name>
 ||| CurrentDC 0x01f79aa008=<Domain_Controller>
 ||| Proto=NT1, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
 ||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
 ||| Uid=0x3f NTcred(0x01ccebc008 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
 || Cnxp(0x019edd7408), Name=Filesystem8, cUid=0x3f Tid=0x3f, Ref=1, Aborted=0
 | readOnly=0, umask=22, opened files/dirs=3
 | Absolute path of the share=\Filesystem8
 | NTFSExtInfo: shareFS:fsid=35, rc=43, listFS:nb=0
 | Fid=2901, FNN=0x0012d46b40(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1
 |    Notify commands received:
 |    Event=0x17, wt=1, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3bca, Uid=0x3f, size=0x20
 | Fid=3335, FNN=0x00193baaf0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0xe200, Uid=0x3f, size=0x20
 | Fid=3683, FNN=0x00290471c0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1\Subdirectory2
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3987, Uid=0x3f, size=0x20

Adding a Celerra to a Clariion storage domain from the CLI

If you're having trouble joining your Celerra to the storage domain from Unisphere, there is an EMC service workaround for joining it from the Navisphere CLI. When I attempted it from Unisphere, it appeared to work and allowed me to join, but the Celerra would never actually show up on the domain list.  Below is a workaround for the problem that worked for me. Run these commands from the Control Station.

Run this first:

curl -kv "https://<Celerra_CS_IP>/cgi-bin/set_incomingmaster?master=<Clariion_SPA_DomainMaster_IP>,"

Next, run the following navicli command to the domain master in order to add the Celerra Control Station to the storage domain:

/nas/sbin/naviseccli -h 10.3.215.73 -user <userid> -password <password> -scope 0 domain -add 10.32.12.10

After a successful Join the /nas/http/domain folder should be populated with the domain_list, domain_master, and domain_users files.

Run this command to verify:

ls -l /nas/http/domain

You should see this output:

-rw-r--r-- 1 apache apache 552 Aug  8  2011 domain_list
-rw-r--r-- 1 apache apache  78 Feb 15  2011 domain_master
-rw-r--r-- 1 apache apache 249 Oct  5  2011 domain_users
 

You can also check the domain list to make sure that an entry has been made.

Run this command to verify:

/nas/sbin/naviseccli -h <Clariion_SPA_DomainMaster_IP> domain -list

You should then see a list of all domain members.  The output will look like this:

Node:                     <DNS Name of Celerra>
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443

Making a case for file archiving

We’ve been investigating options for archiving unstructured (file based) data that resides on our Celerra for a while now. There are many options available, but before looking into a specific solution I was asked to generate a report that showed exactly how much of the data has been accessed by users for the last 60 days and for the last 12 months.  As I don’t have permissions to the shared folders from my workstation I started looking into ways to run the report directly from the Celerra control station.  The method I used will also work on VNX File.

After a little bit of digging I discovered that you can access all of the file systems from the control station by navigating to /nas/quota/slot_<x>.  The slot_2 folder would be for the server_2 data mover, slot_3 would be for server_3, etc.  With full access to all the file systems, I simply needed to write a script that scanned each folder and counted the number of files that had been modified within a certain time window.

I always use excel for scripts I know are going to be long.  I copy the file system list from Unisphere then put the necessary commands in different columns, and end it with a concatenate formula that pulls it all together.  If you put echo -n in A1, “Users_A,” in B1, and >/home/nasadmin/scripts/Users_A.dat in C1, you’d just need to type the formula “=CONCATENATE(A1,B1,C1)” into cell D1.  D1 would then contain echo -n “Users_A,” > /home/nasadmin/scripts/Users_A.dat. It’s a simple and efficient way to make long scripts very quickly.

In this case, the script needed four different sections.  All four of these sections I'm about to go over were copied into a single shell script and saved in my /home/nasadmin/scripts directory.  After creating the .sh file, I always do a chmod +x and chmod 777 on the file.  Be prepared for this to take a very long time to run.  It of course depends on the number of file systems on your array, but for me this script took about 23 hours to complete.

First, I create a text file for each file system that contains the name of the filesystem (and a comma) which is used later to populate the first column of the final csv output.  It’s of course repeated for each file system.

echo -n "Users_A," > home/nasadmin/scripts/Users_A.dat
echo -n "Users_B," > home/nasadmin/scripts/Users_B.dat

... <continued for each filesystem>

Second, I use the 'find' command to walk each directory tree and count the number of files that were last modified more than a year ago (this example uses -mtime +365 for the 12-month report; substitute +60 for the 60-day version).  The output is written to another text file that will be used in the csv output file later.

find /nas/quota/slot_2/Users_A -mtime +365 | wc -l > /home/nasadmin/scripts/Users_A_wc.dat

find /nas/quota/slot_2/Users_B -mtime +365 | wc -l > /home/nasadmin/scripts/Users_B_wc.dat

... <continued for each filesystem>

Third, I want to count the total number of files in each file system.  A third text file is written with that number, again for the final combined report that's generated at the end.

find /nas/quota/slot_2/Users_A | wc -l > /home/nasadmin/scripts/Users_A_total.dat

find /nas/quota/slot_2/Users_B | wc -l > /home/nasadmin/scripts/Users_B_total.dat

... <continued for each filesystem>

Finally, each file is combined into the final report (comma.dat just contains a single comma, used as the field separator).  The output will show each filesystem with two columns, Total Files & Files Accessed 60+ days ago.  You can then easily update the report in Excel and add columns that show files accessed in the last 60 days, the percentage of files accessed in the last 60 days, etc., with some simple math.

cat /home/nasadmin/scripts/Users_A.dat /home/nasadmin/scripts/Users_A_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_A_total.dat | tr -d "\n" > /home/nasadmin/scripts/fsoutput.csv
echo " " >> /home/nasadmin/scripts/fsoutput.csv

cat /home/nasadmin/scripts/Users_B.dat /home/nasadmin/scripts/Users_B_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_B_total.dat | tr -d "\n" >> /home/nasadmin/scripts/fsoutput.csv
echo " " >> /home/nasadmin/scripts/fsoutput.csv

... <continued for each filesystem>

My final output looks like this:

              Total Files   Accessed 60+ days ago   Accessed in last 60 days   % Accessed in last 60 days
Users_A       827,057       734,848                 92,209                     11.15
Users_B       61,975        54,727                  7,248                      11.70
Users_C       150,166       132,457                 17,709                     11.79

The three example filesystems above show that only about 11% of the files have been accessed in the last 60 days.   Most user data has a very short lifecycle, it’s ‘hot’ for a month or less then dramatically tapers off as the business value of the data drops.  These file systems would be prime candidates for archiving.

My final report definitely supported the need for archiving, but we've yet to start a project to complete it.  I like the possibility of using EMC's cloud tiering appliance, which can archive data directly to the cloud service of your choice.  I'll make another post in the future about archiving solutions once I've had more time to research it.
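
As a footnote, the Excel-generated script could be replaced with a loop that walks whatever file system folders exist under the slot directory; a minimal sketch under the same paths (slot_2 assumed, and -mtime +60 shown for the 60-day version):

#!/bin/bash

OUT=/home/nasadmin/scripts/fsoutput.csv
> $OUT

# Walk every file system folder under slot_2 (server_2); adjust the slot as needed
for dir in /nas/quota/slot_2/*; do
    fs=$(basename "$dir")
    old=$(find "$dir" -mtime +60 | wc -l)    # files last modified more than 60 days ago
    total=$(find "$dir" | wc -l)             # total files
    echo "$fs,$old,$total" >> $OUT
done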

Can’t join CIFS Server to domain – sasl protocol violation

I was running a live disaster recovery test of our Celerra CIFS Server environment last week and I was not able to get the CIFS servers to join the replica of the domain controller on the DR network.  I would get the error ‘Sasl protocol violation’ on every attempt to join the domain.

We have two interfaces configured on the data mover, one connects to production and one connects to the DR private network.  The default route on the Celerra points to the DR network and we have static routes configured for each of our remote sites in production to allow replication traffic to pass through.  Everything on the network side checked out, I could ping DC’s and DNS servers, and NTP was configured to a DR network time server and was working.

I was able to ping the DNS Server and the domain controller:

[nasadmin@datamover1 ~]$ server_ping server_2 10.12.0.5
server_2 : 10.12.0.5 is alive, time= 0 ms
 
[nasadmin@datamover1 ~]$ server_ping server_2 10.12.18.5
server_2 : 10.12.18.5 is alive, time= 3 ms
 

When I tried to join the CIFS Server to the domain I would get this error:

[nasadmin@datamover1 ~]$ server_cifs prod_vdm_01 -Join compname=fileserver01,domain=company.net,admin=myadminaccount -option reuse
prod_vdm_01 : Enter Password:*********
Error 13157007706: prod_vdm_01 : DomainJoin::connect:: Unable to connect to the LDAP service on Domain Controller 'domaincontroller.company.net' (@10.12.0.5) for compname 'fileserver01'. Result code is 'Sasl protocol violation'. Error message is Sasl protocol violation.
 

I also saw this error messages during earlier tests:

Error 13157007708: prod_vdm_01 : DomainJoin::setAccountPassword:: Unable to set account password on Domain Controller ‘domaincontroller.company.net’ for compname ‘fileserver01’. Kerberos gssError is ‘Miscellaneous failure. Cannot contact any KDC for requested realm. ‘. Error message is d0000,-1765328228.
 

I noticed these error messages in the server log:

2012-06-21 07:03:00: KERBEROS: 3: acquire_accept_cred: Failed to get keytab entry for principal host/fileserver01.company.net@COMPANY.NET - error No principal in keytab matches desired name (39756033)
2012-06-21 07:03:00: SMB: 3: SSXAK=LOGON_FAILURE Client=x.x.x.x origin=510 stat=0x0,39756033
2012-06-21 07:03:42: KERBEROS: 5: Warning: send_as_request: Realm COMPANY.NET - KDC X.X.X.X returned error: Clients credentials have been revoked (18)
 

The final resolution to the problem was to reboot the data mover. EMC determined that the issue was because the kerberos keytab entry for the CIFS server was no longer valid. It could be caused by corruption or because the machine account password expired. A reboot of the data mover causes the kerberos keytab and SPN credentials to be resubmitted, thus resolving the problem.

Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500’s and was receiving several errors when selecting our DR NS-960 as the ‘destination celerra network server’.

It was displaying the following errors at the top of the window:

– "Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>". The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn't exceed 10 min, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details on this message say to search powerlink, not a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn't think there would be any further issues.  Logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here's why:

The Clariion global user account is tied to a local user account on the Control Station in /etc/passwd. When nasadmin was recreated in the domain, it attempted to create the nasadmin account on the Control Station as well.  Because the account already existed as a local account on the Control Station, it created a local account named 'nasadmin1' instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so when logging in with the global nasadmin account you were no longer tied to the local nasadmin account on the Control Station.

Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solves the problem.  Because the issue was related only to the nasadmin account in this case, I could have also solved it by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.
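If you suspect the same kind of mismatch, a quick sanity check on the Control Station is to look for duplicate local accounts.  A minimal sketch; 'nasadmin' is just the account from my example:

# Run on the Control Station. A stray 'nasadmin1' entry next to 'nasadmin'
# is a sign the global and local accounts are out of sync.
grep '^nasadmin' /etc/passwd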

Undocumented Celerra / VNX File commands


The .server_config command is undocumented by EMC; I assume they don't want customers messing with it. Use these commands at your own risk. 🙂

Below is a list of some of those undocumented commands, most are meant for viewing performance stats. I’ve had EMC support use the fcp command during a support call in the past.   When using the command for fcp stats,  I believe you need to run the ‘reset’ command first as it enables the collection of statistics.

There are likely other parameters that can be used with .server_config but I haven’t discovered them yet.

TCP Stats:

To view TCP info:
.server_config server_x -v "printstats tcpstat"
.server_config server_x -v "printstats tcpstat full"
.server_config server_x -v "printstats tcpstat reset"

Sample Output (truncated):
TCP stats :
connections initiated 8698
connections accepted 1039308
connections established 1047987
connections dropped 524
embryonic connections dropped 3629
conn. closed (includes drops) 1051582
segs where we tried to get rtt 8759756
times we succeeded 11650825
delayed acks sent 537525
conn. dropped in rxmt timeout 0
retransmit timeouts 823

SCSI Stats:

To view SCSI IO info:
.server_config server_x -v "printstats scsi"
.server_config server_x -v "printstats scsi reset"

Sample Output:
This output needs to be viewed in a fixed-width font to line up properly.
Ctlr: IO-pending  Max-IO    IO-total    Idle(ms)   Busy(ms)  Busy(%)
0:             0      53    44925729   122348758   19159954     13%
1:             0       1           1   141508682          0      0%
2:             0       1           1   141508682          0      0%
3:             0       1           1   141508682          0      0%
4:             0       1           1   141508682          0      0%

File Stats:

.server_config server_x -v "printstats filewrite"
.server_config server_x -v "printstats filewrite full"
.server_config server_x -v "printstats filewrite reset"

Sample output (Full Output):
13108 writes of 1 blocks in 52105250 usec, ave 3975 usec
26 writes of 2 blocks in 256359 usec, ave 9859 usec
6 writes of 3 blocks in 18954 usec, ave 3159 usec
2 writes of 4 blocks in 2800 usec, ave 1400 usec
4 writes of 13 blocks in 6284 usec, ave 1571 usec
4 writes of 18 blocks in 7839 usec, ave 1959 usec
total 13310 blocks in 52397489 usec, ave 3936 usec

FCP Stats:

To view FCP stats, useful for checking SP balance:
.server_config server_x -v "printstats fcp"
.server_config server_x -v "printstats fcp full"
.server_config server_x -v "printstats fcp reset"

Sample Output (Truncated):
This output needs to be viewed in a fixed-width font to line up properly.
Total I/O Cmds: +0%------25%-------50%-------75%-----100%+ Total 0
FCP HBA 0 |                                           |  0%  0
FCP HBA 1 |                                           |  0%  0
FCP HBA 2 |                                           |  0%  0
FCP HBA 3 |                                           |  0%  0
# Read Cmds:    +0%------25%-------50%-------75%-----100%+ Total 0
FCP HBA 0 |                                           |  0%  0
FCP HBA 1 |                                           |  0%  0
FCP HBA 2 |                                           |  0%  0
FCP HBA 3 | XXXXXXXXXXX                               | 25%  0

Usage:

‘fcp’ options are:       bind …, flags, locate, nsshow, portreset=n, rediscover=n
rescan, reset, show, status=n, topology, version

‘fcp bind’ options are:  clear=n, read, rebind, restore=n, show
showbackup=n, write

Description:

Commands for ‘fcp’ operations:
fcp bind <cmd> ……… Further fibre channel binding commands
fcp flags ………….. Show online flags info
fcp locate …………. Show ScsiBus and port info
fcp nsshow …………. Show nameserver info
fcp portreset=n …….. Reset fibre port n
fcp rediscover=n ……. Force fabric discovery process on port n
Bounces the link, but does not reset the port
fcp rescan …………. Force a rescan of all LUNS
fcp reset ………….. Reset all fibre ports
fcp show …………… Show fibre info
fcp status=n ……….. Show link status for port n
fcp status=n clear ….. Clear link status for port n and then Show
fcp topology ……….. Show fabric topology info
fcp version ………… Show firmware, driver and BIOS version

Commands for ‘fcp bind’ operations:
fcp bind clear=n ……. Clear the binding table in slot n
fcp bind read ………. Read the binding table
fcp bind rebind …….. Force the binding thread to run
fcp bind restore=n ….. Restore the binding table in slot n
fcp bind show ………. Show binding table info
fcp bind showbackup=n .. Show Backup binding table info in slot n
fcp bind write ……… Write the binding table

NDMP Stats:

To Check NDMP Status:
.server_config server_x -v "printstats vbb show"

CIFS Stats:

This will output a CIFS report, including all servers, DCs, IPs, interfaces, MAC addresses, and more.

.server_config server_x -v "cifs"

Sample Output:

1327007227: SMB: 6: 256 Cifs threads started
1327007227: SMB: 6: Security mode = NT
1327007227: SMB: 6: Max protocol = SMB2
1327007227: SMB: 6: I18N mode = UNICODE
1327007227: SMB: 6: Home Directory Shares DISABLED
1327007227: SMB: 6: Usermapper auto broadcast enabled
1327007227: SMB: 6:
1327007227: SMB: 6: Usermapper[0] = [127.0.0.1] state:active (auto discovered)
1327007227: SMB: 6:
1327007227: SMB: 6: Default WINS servers = 172.168.1.5
1327007227: SMB: 6: Enabled interfaces: (All interfaces are enabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Disabled interfaces: (No interface disabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Unused Interface(s):
1327007227: SMB: 6:  if=172-168-1-84 l=172.168.1.84 b=172.168.1.255 mac=0:60:48:1c:46:96
1327007227: SMB: 6:  if=172-168-1-82 l=172.168.1.82 b=172.168.1.255 mac=0:60:48:1c:10:5d
1327007227: SMB: 6:  if=172-168-1-81 l=172.168.1.81 b=172.168.1.255 mac=0:60:48:1c:46:97
1327007227: SMB: 6:
1327007227: SMB: 6:
1327007227: SMB: 6: DOMAIN DOMAIN_NAME FQDN=DOMAIN_NAME.net SITE=STL-Colo RC=24
1327007227: SMB: 6:  SID=S-1-5-15-7c531fd3-6b6745cb-ff77ddb-ffffffff
1327007227: SMB: 6:  DC=DCAD01(172.168.1.5) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD02(172.168.29.8) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD03(172.168.30.8) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD04(172.168.28.15) ref=2 time=0 ms
1327007227: SMB: 6: >DC=SERVERDCAD01(172.168.1.122) ref=334 time=1 ms (Closest Site)
1327007227: SMB: 6: >DC=SERVERDCAD02(172.168.1.121) ref=273 time=1 ms (Closest Site)
1327007227: SMB: 6:
1327007227: SMB: 6: CIFS Server SERVERFILESEMC[DOMAIN_NAME] RC=603
1327007227: UFS: 7: inc ino blk cache count: nInoAllocs 361: inoBlk 0x0219f2a308
1327007227: SMB: 6:  Full computer name=SERVERFILESEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment=’EMC-SNAS:T6.0.41.3′
1327007227: SMB: 6:  if=172-168-1-161 l=172.168.1.161 b=172.168.1.255 mac=0:60:48:1c:46:9c
1327007227: SMB: 6:   FQDN=SERVERFILESEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Fri Jan  7 19:25:30 2011 GMT
1327007227: SMB: 6:  Password versions: 2, 2
1327007227: SMB: 6:
1327007227: SMB: 6: CIFS Server SERVERBKUPEMC[DOMAIN_NAME] RC=2 (local users supported)
1327007227: SMB: 6:  Full computer name=SERVERbkupEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment=’EMC-SNAS:T6.0.41.3′
1327007227: SMB: 6:  if=172-168-1-90 l=172.168.1.90 b=172.168.1.255 mac=0:60:48:1c:10:54
1327007227: SMB: 6:   FQDN=SERVERbkupEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Thu Sep 30 16:23:50 2010 GMT
1327007227: SMB: 6:  Password versions: 2
1327007227: SMB: 6:
 

Domain Controller Commands:

These commands are useful for troubleshooting a Windows domain controller connection issue from the Control Station.  Use them along with the normal server log (server_log server_2) to troubleshoot that type of problem.

To view the current domain controllers visible on the data mover:

.server_config server_2 -v "pdc dump"

Sample Output (Truncated):

1327006571: SMB: 6: Dump DC for dom='<domain_name>’ OrdNum=0
1327006571: SMB: 6: Domain=<domain_name> Next trusted domains update in 476 seconds
1327006571: SMB: 6:  oldestDC:DomCnt=1,179531 Time=Sat Oct 15 15:32:14 2011
1327006571: SMB: 6:  Trusted domain info from DC='<Windows_DC_Servername>’ (423 seconds ago)
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<Domain_Name>]
   GUID:00000000-0000-0000-0000-000000000000
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-d1d612b1-87382668-9ba5ebc0
1327006571: SMB: 6:    DC=’-‘
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
1327006571: SMB: 6:   Trusted domain: <Domain_Name>
1327006571: SMB: 6:    Flags=0x22 Ix=0 Type=0x1 Attr=0x1000000
1327006571: SMB: 6:    SID=S-1-5-15-76854ac0-4c527104-321d5138
1327006571: SMB: 6:    DC=’\\<Windows_DC_Servername>’
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x0,0
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<domain_name>]
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-88d60754-f3ed4f9d-b3f2cbc4
1327006571: SMB: 6:    DC=’-‘
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
DC=DC0x0067a82c18 <Windows_DC_Servername>[<domain_name>](10.3.0.5) ref=2 time(getdc187)=0 ms LastUpdt=Thu Jan 19 20:45:14 2012
    Pid=1000 Tid=0000 Uid=0000
    Cnx=UNSUCCESSFUL,DC state Unknown
    logon=Unknown 0 SecureChannel(s):
    Capa=0x0 Nego=0x0000000000,L=0 Chal=0x0000000000,L=0,W2kFlags=0x0
    refCount=2 newElectedDC=0x0000000000 forceInvalid=0
    Discovered from: WINS

To enable or disable a domain controller on the data mover:

.server_config server_2 -v "pdc enable=<ip_address>"   (enable a domain controller)

.server_config server_2 -v "pdc disable=<ip_address>"   (disable a domain controller)

MemInfo:

.server_config server_2 -v "meminfo"

Sample Output (truncated):

CPU=0
3552907011 calls to malloc, 3540029263 to free, 61954 to realloc
Size     In Use       Free      Total nallocs nfrees
16       3738        870       4608   161720370   161716632
32      18039      17289      35328   1698256206   1698238167
64       6128       3088       9216   559872733   559866605
128       6438      42138      48576   255263288   255256850
256       8682      19510      28192   286944797   286936115
512       1507       2221       3728   357926514   357925007
1024       2947       9813      12760   101064888   101061941
2048       1086        198       1284    5063873    5062787
4096         26        138        164    4854969    4854943
8192        820         11        831   19562870   19562050
16384         23         10         33       5676       5653
32768          6          1          7        101         95
65536         12          0         12         12          0
524288          1          0          1          1          0
Total Used     Total Free    Total Used + Free
all sizes   18797440   23596160   42393600

MemOwners:

.server_config server_2 -v "help memowners"

Usage:
memowners [dump | showmap | set … ]

Description:
memowners [dump] – prints memory owner description table
memowners showmap – prints a memory usage map
memowners memfrag [chunksize=#] – counts free chunks of given size
memowners set priority=# tag=# – changes dump priority for a given tag
memowners set priority=# label=’string’ – changes dump priority for a given label
The priority value can be set to 0 (lowest) to 7 (highest).

Sample Output (truncated):

1408979513: KERNEL: 6: Memory_Owner dump.
nTotal Frames 1703936 Registered = 75,  maxOwners = 128
1408979513: KERNEL: 6:   0 (   0 frames) No owner, Dump priority 6
1408979513: KERNEL: 6:   1 (3386 frames) Free list, Dump priority 0
1408979513: KERNEL: 6:   2 (40244 frames) malloc heap, Dump priority 6
1408979513: KERNEL: 6:   3 (6656 frames) physMemOwner, Dump priority 7
1408979513: KERNEL: 6:   4 (36091 frames) Reserved Mem based on E820, Dump priority 0
1408979513: KERNEL: 6:   5 (96248 frames) Address gap based on E820, Dump priority 0
1408979513: KERNEL: 6:   6 (   0 frames) Rmode isr vectors, Dump priority 7

Auto transferring reports from VNX to an IIS web server via FTP

I previously posted on how to create a script that monitors your Celerra replication jobs.  I have an intranet web page that is updated daily with many other reports (most of which I’ve posted about here), so I thought I’d add this one to the web page as well rather than having to search through my inbox for it every day.

Developing an easy and automated method of getting files from the Celerra to a Windows based web server was my challenge.  I figured out an easy way to do this with FTP.  As my internal Windows web server is also my internal FTP server, I can place the file directly in the public folder for easy web publishing.  Now that I've got the report working and updating on the intranet page every day, my next task will be to come up with a more secure method using SSH or SCP, but this works well for now.

The big challenge in creating a bash script using FTP is figuring out how to pass the user id and password.  I tried various methods unsuccessfully and finally settled on using the .netrc file.  Create an empty file named .netrc in your home directory (in my case I put it in /home/nasadmin) with the following syntax:

machine <ftp_server_name> login <ftp_login_id> password <ftp_password>

Once that is created, you need to do a chmod 600 on the .netrc file in order for it to work.  If the permissions are not set to 600 on that file the auto-login to the FTP server will fail.
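Putting those two steps together, here's a minimal sketch; the server name, login, and password are placeholders for your own environment:

# Create the .netrc file in the home directory of the account that runs the script
cat > /home/nasadmin/.netrc <<'EOF'
machine ftpserver01 login ftpuser password MyFtpPassword
EOF

# Lock the permissions down or the FTP auto-login will refuse to use the file
chmod 600 /home/nasadmin/.netrc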

My next step was to create the script that sends the replication status report to the IIS web server:

#!/bin/bash
cd /home/nasadmin/scripts
ftp <ftp_server_name> <<SCRIPT
put <filename>.csv
quit
SCRIPT
I always chmod the script with 755 after creating it in vi (755 already includes the execute bit, so a separate chmod +x isn't needed).  The script always ran fine manually, but I struggled for a while getting it to work properly when run from crontab.  I figured out that you must cd to the correct directory in the script before you call the ftp command; if you don't, you will get "file not found" errors when you run it.  I was always running it manually from within that directory, so I didn't immediately catch that problem. 🙂

I then added the above script to crontab on the Celerra.  I run it at 6AM every morning with the following entry:

0 6 * * * /scripts/repl_status.sh

For those not familiar with cron, you can add an entry using “crontab -e”, and list your current entries with “crontab -l”.  The first two entries in the line “0 6” represent minutes and the hour of each day, in this case it will run at 6:00AM every day.

I have a download link to the csv file on my web page, and I also have a script on my web server that converts the csv file to HTML output with a Perl script called csv2html.pl, so the data can be easily viewed without having to download the csv and open it in Excel.  You can find csv2html.pl easily with a Google search; I've blogged about it in previous posts as well.

That's it!  An easy way to automatically push your reports to another server from the Celerra.  Now that I have the transfer method down, I'll be adding more daily reports in the near future.  If anyone has experience doing this type of transfer from a Celerra (or Linux server) to a Windows server via SSH or SCP, please comment! 🙂
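For what it's worth, the more secure version I have in mind would look something like the untested sketch below: once SSH key authentication is set up between the Control Station and the web server (and the web server is running an SSH/SCP service), the transfer becomes a one-liner.  All names and paths here are placeholders:

#!/bin/bash
# Untested sketch - assumes passwordless SSH keys are already in place and the
# destination path exists on the web server.
cd /home/nasadmin/scripts
scp replreport.csv webuser@webserver01:/reports/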

Testing Disaster Recovery for VNX VDM’s and CIFS servers

After spending a few days working on a DR test recovery, I thought I’d describe the process along with a few roadblocks that I hit along the way.  We had some specific requirements that had to be met, so I thought I’d share my experiences.  Our host site has a VNX5500 and our DR site has an NS-960, and we have Celerra Replicator configured to replicate the VDM and all of the production filesystems from one site to the other.

Here were my business requirements for this test:

  1. Replicate the VDM, production CIFS server and production filesystems from the host site to DR site.
  2. Fail over (or bring up a copy of) the VDM from the host site to the DR site, mounting the replicated VDM at the DR site.
  3. Fail over (or bring up a copy of) the production CIFS server at the DR site.
  4. Create R/W checkpoints of all replicated filesystems at DR site to allow for appropriate user and application testing.
  5. Share the R/W checkpoints of the replicated filesystems on the CIFS server at the DR site rather than the original replicated filesystems, so original replicated data is not touched and does not need to be replicated again after the test.

I started off by setting up replication jobs for our VDM and all filesystems.  Once those were complete (after several weeks of data transfers) I was ready to test.

Step 1: Replicate VDM and production filesystems

This post isn’t meant to detail the process of actually setting up the initial replications, just how to get the replicated data working and accessible at your DR site.  Setting up replication is a well documented procedure which can be reviewed in EMC’s guide “Using Celerra Replicator (V2)”, P/N 300-009-989.  Once the VDMs and filesystems are replicated, you’re ready for the next step.

Step 2: Bring up the VDM at the DR site

The first step in my testing requirements is to bring up the VDM at the DR site.

Failed attempt 1:

I initially created a new replication session for the VDM because I didn't want to use the actual production VDM, since this is a DR test and not an actual disaster.

After replicating a new copy of the VDM, I attempted to load it in the CLI with the command below.  This must be done from the CLI as there is no option to do this step in Unisphere.

nas_server -vdm <VDMNAME> -setstate loaded

It failed with this error:

Error 12066: root_fs <VDMNAME> is the source or destination object of a file system and cannot be unmounted or is the source or destination object of a VDM replication session and cannot be unloaded.

It was pretty obvious here that you need to stop the replication first before you can load the VDM.  So, as a next step, I stopped the replication with a simple right click/stop on the source side and tried again.

It failed with this error:

Error 4038: <interface_name_1> <interface_name_2> : interfaces not available on server_2

So, it looks like the interface names need to be the same.  I didn’t really want to change the interface names if I didn’t have to, so I tried a different approach next.

Failed attempt 2:

I thought this time I’d create a blank VDM on the destination side first and replicate the host VDM to it, thinking it wouldn’t keep the interface name requirement, and I still wouldn’t have to stop replication on the actual prod VDM, as I didn’t really want to use that one in a test.

I did just that. I created a blank VDM on the DR side, then started a new replication session from the host side and chose it as the destination, making sure to choose the overwrite option when I replicated to it.  The replication was successful.  I stopped the replication on the source side after it was complete, and then attempted to load the new replicated VDM on the DR side.

Voila! It worked:

nas_server -vdm <VDMNAME> -setstate loaded
id      = 10
name    = vdm_replica
acl     = 0
type    = vdm
server  = server_2
rootfs  = root_fs_vdm_replica
I18N    = UNICODE
Status  :
  Defined = enabled
  Actual  = loaded,ready

Now that it was loaded up, it was time to move on to the next step and create the R/W checkpoints of the filesystems. This is where the process failed again.

After clicking on the drop down box for “Choose Data Mover”, I got this error:

 No file systems exist

 Query file systems vdm_replica: All. File system not found. 

I’m not sure why this failed, but since the VDM couldn’t find the filesystems it was time to try another approach again.

Successful attempt:

After my first two failures, it looked pretty obvious that I'd need to change the interface names and use the original replicated VDM.  Making a copy of the VDM to a blank VDM didn't work because it couldn't see the filesystems, and using the original requires the interface names to be the same.  The lesson learned here is to make sure you have matching ports on your host and DR Celerras, and use the same interface names.  If I had done that, my first attempt would have been successful.

If the original VDM has four CIFS servers (each with its own interface) and the DR Celerra only has one port configured on the network, you'd be out of luck.  You wouldn't have enough interfaces to rename them all to match, and you'd never be able to load your VDM.  The VDMs only look for the interface names to be the same, NOT the IPs.  The IPs can be different to match your DR network, and the IPs that are already assigned to the DR site interfaces will NOT change when you load the VDM.

In my case, the host Celerra has two CIFS servers, each with its own interface.  One is for production, one is for backups.

Here are the steps that worked for me:

  1. Stop the replication of the VDM (You will see it change to a ‘stopped’ state in Unisphere).
  2. Change the interface names on the DR side (changing IPs is not necessary) to match the host side; see the sketch after this list.
  3. Load the VDM with the command  nas_server -vdm <VDMNAME> -setstate loaded
  4. You will see the VDM status change from ‘unloaded’ to ‘OK’.
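Step 2 is the part that's easy to get wrong.  As far as I know there is no rename option for an interface, so changing the name amounts to deleting the DR-side interface and recreating it with the production-side name while keeping the DR-side addressing.  A hedged sketch with placeholder device, name, and addresses:

# Check the current interface names and devices on the DR data mover
server_ifconfig server_2 -all

# Delete the DR-side interface that has the wrong name...
server_ifconfig server_2 -delete <dr_interface_name>

# ...and recreate it with the exact name used on the production side,
# keeping the DR-side IP, netmask, and broadcast address.
server_ifconfig server_2 -create -Device cge0 -name <prod_interface_name> -protocol IP <dr_ip> <dr_netmask> <dr_broadcast>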

Step 3:  Bring up the CIFS server at the DR site

After you've completed the previous step, the VDM will be loaded using the exact same interfaces as production, and the CIFS servers will be automatically created as well.  If a CIFS server uses cge1-0 on server_2 on the host side, it will now be set up with the same name using cge1-0 on server_2 on the destination (DR) side.

This would be very useful in a real disaster, but for this test I wanted to create an alternate CIFS server with a different IP, as the domain controller, DNS servers, and IP range used at our DR site are different.  You could choose to use the same CIFS server that was replicated with the VDM, but for our test I decided to bring up an entirely new CIFS server.  We use DFS to access all of our shares in production, so the name of the CIFS server won't matter for our testing purposes.  We would just need to update DFS with the new name on the DR network.

Here are the steps I took to bring up the CIFS server for DR:

  1. Gather IP information from the DR team.  Will need a valid IP and subnet mask for the new CIFS server.
  2. Verify IP config on new DR network.
    1. Check that the default route matches the DR network
    2. Check that the DNS server entries match the DNS servers on the DR network
  3. Verify that the Domain controller in the DR network is up and available
  4. Modify the interface of your choice with the correct IP information for the CIFS server.
  5. Create the CIFS server and join it to the DR active directory domain.
    1. If you need to test an AD account, use this command: server_cifssupport <vdm_name> -cred -name -domain

That’s it for this step.  The CIFS server was successfully joined to the domain and I was able to ping it from one of our previously recovered windows servers on the DR network.

Step 4: Create Read/Write checkpoints of all replicated filesystems

One of my business requirements for this test was to allow read/write access to the replicated filesystems without having to actually change the production data.  The easy way to accomplish this is to create a single read/write checkpoint (snapshot) of each filesystem.  To do this, go to the checkpoint area in Unisphere, click create, and select the “Writeable Checkpoint” checkbox when you create the checkpoint.  You can also script the process and run it from the CLI on the control station.

First, create each checkpoint with this command:

nas_ckpt_schedule -create <ckpt_fs_name> -filesystem <fs_name> -recurrence once

Second, create a read/write copy of each checkpoint with this command:

fs_ckpt <ckpt_fs_name> -name <r/w_ckpt_fs_name> -Create -readonly n

I would recommend running these no more than two at a time and letting them finish.  I've had issues in the past running dozens of checkpoint jobs at once that hang and never complete, requiring a reboot of the data mover to correct.
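If you have a long list of filesystems, the two commands above are easy to wrap in a loop.  This is a minimal sketch, assuming a file named fslist.txt with one production filesystem name per line; the _dr_ckpt naming convention is just an example, and the loop runs the jobs one at a time, which keeps you inside the caution above:

#!/bin/bash
export NAS_DB=/nas
export PATH=$PATH:/nas/bin

# Create a one-time checkpoint of each filesystem, then a writeable copy of it
for fs in `cat fslist.txt`
do
        nas_ckpt_schedule -create ${fs}_dr_ckpt -filesystem $fs -recurrence once
        fs_ckpt ${fs}_dr_ckpt -name ${fs}_dr_ckpt_rw -Create -readonly n
done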

Step 5: Share the replicated filesystems on the DR CIFS server

Once all of the R/W checkpoints are created, they can be shared on the DR CIFS server with the same share names as the original production share names. This allows all of our recovered application and file servers to connect to the same names, simplifying the configuration of the test environment.

You can use a CLI command to export each r/w copy to share them on your CIFS Server:

server_export [vdm] -P cifs -name [filesystem]_ckpt1 -option netbios=[cifserver] [filesystem]_ckpt1_writeable1
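That command is also easy to loop if you used a consistent naming convention for the writeable checkpoints.  A sketch that matches the example loop from Step 4; the VDM and CIFS server names are placeholders:

#!/bin/bash
export NAS_DB=/nas
export PATH=$PATH:/nas/bin

# Share each writeable checkpoint under the original production share name
for fs in `cat fslist.txt`
do
        server_export <vdm_name> -P cifs -name $fs -option netbios=<dr_cifs_server> ${fs}_dr_ckpt_rw
done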

Step 6: Cleanup

That's it!  We had a successful DR test.  Once the test was complete, I performed the following cleanup steps:

  1. Remove CIFS server shares
  2. Remove CIFS server
  3. Change interfaces on the DR Celerra back to their original names and IPs.
  4. Unload the replicated VDM with this command:
    1. nas_server -vdm <VDMNAME> -setstate mounted
    2. Restart the VDM replication from the source

VNX replication monitoring script

This script allows me to quickly monitor and verify the status of my replication jobs every morning.  It will generate a csv file with six columns for file system name, interconnect, estimated completion time, current transfer size, current transfer size remaining, and current write speed.

I recently added two more remote offices to our replication topology and I like to keep a daily tab on how much longer they have to complete the initial seeding, and it will also alert me to any other jobs that are running too long and might need my attention.

Step 1:

Log in to your Celerra and create a directory for the script.  I created a subdirectory called “scripts” under /home/nasadmin.

Create a text file named ‘replfs.list’ that contains a list of your replicated file systems.  You can cut and paste the list out of Unisphere.

The contents of the file should look something like this:

Filesystem01
Filesystem02
Filesystem03
Filesystem04
Filesystem05
Step 2:

Copy and paste all of the code into a text editor and modify it for your needs (the complete code is at the bottom of this post).  I’ll go through each section here with an explanation.

1: The first section will create a text file ($fs.dat) for each filesystem in the replfs.list file you made earlier.

for fs in `cat replfs.list`
do
        nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

The output will look like this:
Name                                        = Filesystem_01
Source Current Data Port            = 57471
Current Transfer Size (KB)          = 232173216
Current Transfer Remain (KB)     = 230877216
Estimated Completion Time        = Thu Nov 24 06:06:07 EST 2011
Current Transfer is Full Copy      = Yes
Current Transfer Rate (KB/s)       = 160
Current Read Rate (KB/s)           = 774
Current Write Rate (KB/s)           = 3120
2: The second section will create a blank csv file with the appropriate column headers:
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

3: The third section will parse all of the output files created by the first section, pulling out only the data that we’re interested in.  It places it in columns in the csv file.

for fs in `cat replfs.list`
do
        echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
done

If you're not familiar with awk, I'll give a brief explanation here.  When you grep for a certain line in the output, awk allows you to print just a single field (word) from that line.

For example, if you want the output of "Yes" put into a column in the csv file, but the output line looks like "Current Transfer is Full Copy      = Yes", then you could pull out only the "Yes" by typing in the following:

nas_replicate -info Filesystem01 | grep Full | awk '{print $7}'

Because the word ‘Yes’ is the 7th item in the line, the output would only contain the word Yes.

4: The final section will send an email with the csv output file attached.

uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com

Step 3:

Copy and paste the modified code into a script file and save it.  I have mine saved in the /home/nasadmin/scripts folder. Once the file is created, make it executable with chmod +x scriptfile.sh, or simply set the permissions with chmod 755 scriptfile.sh (which already includes the execute bit).

Step 4:

You can now add the file to crontab to run automatically.  Add it to cron with crontab -e; to view your current crontab entries, type crontab -l.  For details on how to add cron entries, a Google search will turn up a wealth of info on your options.
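For this particular script, the crontab entry could look something like the line below: 6AM daily, with stdout redirected so cron stays quiet.  The script name and path are placeholders for wherever you saved your copy:

0 6 * * * /home/nasadmin/scripts/replreport.sh > /dev/null 2>&1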

Script Code:

#!/bin/bash
export NAS_DB=/nas
export PATH=$PATH:/nas/bin

# Section 1: dump the relevant nas_replicate fields for each filesystem
for fs in `cat replfs.list`
do
        nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

# Section 2: start the csv file with the column headers
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

# Section 3: parse each .dat file and append one csv row per filesystem
for fs in `cat replfs.list`
do
        echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
done

# Section 4: email the csv report
uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com

The final output of the script generates a report that looks like the sample below.  Filesystems that have all zeros and no estimated completion time are caught up and not currently performing a data synchronization.
Name System Estimated Completion Time Current Transfer Size (KB) Current Transfer Remain (KB) Write Speed (KB)
SA2Users_03 SA2VNX5500 0 0 0
SA2Users_02 SA2VNX5500 Wed Dec 16 01:16:04 EST 2011 211708152 41788152 2982
SA2Users_01 SA2VNX5500 Wed Dec 16 18:53:32 EST 2011 229431488 59655488 3425
SA2CommonFiles_04 SA2VNX5500 0 0 0
SA2CommonFiles_03 SA2VNX5500 Wed Dec 16 10:35:06 EST 2011 232173216 53853216 3105
SA2CommonFiles_02 SA2VNX5500 Mon Dec 14 15:46:33 EST 2011 56343592 12807592 2365
SA2commonFiles_01 SA2VNX5500 0 0 0

Celerra data mover performance and port configuration

I had a request to review my experience with data mover performance and port configuration on our production Celerras.  When I started supporting our Celerras I had no experience at all, so my current configuration is the result of trial and error troubleshooting and tackling performance problems as they appeared.

To keep this simple, I’ll review my configuration for a Celerra with only one primary data mover and one standby.  There really is no specific configuration needed on your standby data mover, just remember to perfectly match all active network ports on both primary and standby, so in the event of a failover the port configuration matches between the two.

Our primary data mover has two Ethernet modules with four ports each (for a total of eight ports).  I’ll map out how each port is configured and then explain why I did it that way.

Cge 1-0             Failsafe Config for Primary CIFS  (combined with cge1-1), assigned to ‘CIFS1’ prod file server.

Cge 1-1             Failsafe Config for Primary CIFS (combined with cge1-0), assigned to ‘CIFS1’ prod file server.

Cge 1-2             Interface configured for backup traffic, assigned to ‘CIFSBACKUP1’ server, VLAN 1.

Cge 1-3             Interface configured for backup traffic, assigned to ‘CIFSBACKUP2’ server. VLAN 1.

Cge 2-0             Interface configured for backup traffic, assigned to ‘CIFSBACKUP3’ server, VLAN 2.

Cge 2-1             Interface configured for backup traffic, assigned to ‘CIFSBACKUP4’ server, VLAN 2.

Cge 2-2             Interface configured for replication traffic, assigned to replication interconnect.

Cge 2-3             Interface configured for replication traffic, assigned to replication interconnect.

Primary CIFS Server – You do have a choice in this case to use either link aggregation or a fail safe network configuration.  Fail safe is an active/passive configuration.  If one port fails the other will take over.  I chose a fail safe configuration for several reasons, but there are good reasons to choose aggregation as well.  I chose fail safe primarily due to the ease of configuration, as there was no need for me to get the network team involved to make changes to our production switch (fail safe is configured only on the Celerra side), and our CIFS server performance requirements don’t necessitate two active links.  If you need the extra bandwidth, definitely go for aggregation.

I originally set up the fail safe network in an emergency situation, as the single interface to our prod CIFS server went down and could not be brought back online.  EMC’s answer was to reboot the data mover.  That fixed it, but it’s not such a good solution during the middle of a business day.

Backup Interfaces – We were having issues with our backups exceeding the time we had for our backup window.  In order to increase backup performance, I created four additional CIFS servers, all sharing the same file systems as production.  Our backup administrator splits the load on the four backup interfaces between multiple media servers and tape libraries (on different VLANs), and does not consume any bandwidth on the production interface that users need to access the CIFS shares.  This configuration definitely improved our backup performance.

Replication – All of our production file systems are replicated to another Celerra in a different country for disaster recovery purposes.   Because of the huge amount of data that needs to be replicated, I created two interfaces specifically for replication traffic.  Just like the backup interfaces, this separates replication traffic from the production CIFS server interface.  Even with the separate interfaces, I still have a bandwidth limitation imposed (no more than 50MB/s) in the interconnect configuration, as I need to share the same 100MB WAN link with our Data Domain for replication.

This configuration has proven to be very effective for me.  Our links never hit 100% utilization and I rarely get complaints about CIFS server performance.  The only real performance related troubleshooting I’ve had to do on our production CIFS servers has been related to file system deduplication, I’ve disabled it on certain file systems that see a high amount of activity.

Other thoughts about celerra configuration:

  1. We recently added a third data mover to the Celerra in our HQ data center because of the file system limitation on one data mover.  You can only have 2048 total filesystems on one data mover.  We hit that limitation due to the number of checkpoints that we keep for operational file restores.  If you make a checkpoint of one filesystem twice a day for a month, that would be 61 filesystems used against the 2048 total, which adds up quickly if you have a CIFS server filled with dozens of small shares.  I simply added another CIFS server and all new shares are now created on the new CIFS server.  The names and locations of the shares are transparent to all of our users as all file shares are presented to users with DFS links, so there were no major changes required for our Active Directory/Windows administrators.
  2. Use the Celerra monitor to keep an eye on CPU and Memory usage throughout the day.  Once you launch it from Unisphere, it runs independently of your Unisphere session (unisphere can be closed) and has a very small memory footprint on your laptop/PC.
  3. Always create your CIFS servers on VDMs, especially if you are replicating data for disaster recovery.   VDMs are designed specifically for Windows environments, allow for easy migration between data movers, and allow for easy recreation of a CIFS server and its shares in a replication/DR scenario.  They store all the information for local groups, shares, security credentials, audit logs, and home directory info.  If you need to recreate a CIFS server from scratch, you'll need to re-do all of those things from scratch as well.  Always use VDMs!
  4. Write scripts for monitoring purposes.  I have only one running on my Celerras now that emails me a report of the status of all replication jobs each morning.  Of course, you can put any valid command into a bash script (adding a mailx command to email you the results), stick it in crontab, and away you go.

VMWare/ESX can’t write to a Celerra Read/Write NFS mounted datastore

I had just created several new Celerra NFS mounted datastores for our ESX administrator.  When he tried to create new VM hosts using the new datastores, he would get this error:   Call "FileManager.MakeDirectory" for object "FileManager" on vCenter Server "servername.company.com" failed.

Searching for that error message on powerlink, the VMWare forums, and general google searches didn’t bring back any easy answers or solutions.  It looked like ESX was unable to write to the NFS mount for some reason, even though it was mounted as Read/Write.  I also had the ESX hosts added to the R/W access permissions for the NFS export.

After much digging and experimentation, I did resolve the problem.  Here’s what you have to check:

1. The VMKernel IP must be in the root hosts permissions on the NFS export.   I put in the IP of the ESX server along with the VMKernel IP.

2. The NFS export must be mounted with the no_root_squash option.  By default, the root user with UID 0 is not given access to an NFS volume; mounting the export with no_root_squash allows the root user access.  The VMkernel must be able to access the NFS volume with UID 0.

I first set up the exports and permissions settings in the GUI, then went to the CLI to add the mount options.
command:  server_mount server_2 -option rw,uncached,sync,no_root_squash <sharename> /<sharename>
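For reference, the export permissions from item 1 can also be set from the CLI rather than the GUI.  This is a hedged sketch with placeholder IPs and share name; the key point is that the VMkernel IP appears in the root= (and rw=) host lists:

# ESX service console IP: 10.0.1.50, VMkernel IP: 10.0.1.51 (placeholders)
server_export server_2 -Protocol nfs -option root=10.0.1.50:10.0.1.51,rw=10.0.1.50:10.0.1.51,access=10.0.1.50:10.0.1.51 /<sharename>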

3. From within the ESX Console/Virtual Center, the Firewall settings should be updated to add the NFS Client.   Go to ‘Configuration’ | ‘Security Profile’ | ‘Properties’ | Click the NFS Client checkbox.

4. One other important item to note when adding NFS mounted datastores is the default limit of 8 in ESX.  You can increase the limit by going to ‘Configuration’ | ‘Advanced Settings’ | ‘NFS’ in the left column | Scroll to ‘NFS.MaxVolumes’ on the left, increase the number up to 64.  If you try to add a new datastore above the NFS.MaxVolumes limit, you will get the same error in red at the top of this post.

That’s it.  Adding the VMKernel IP to the root permissions, mounting with no_root_squash, and adding the NFS Client to ESX resolved the problem.

Critical Celerra FCO for Mac OS X 10.7

Here’s the description of this issue from EMC:

EMC has become aware of a compatibility issue with the new version of MacOS 10.7 (Lion). If you install this new MacOS 10.7 (Lion) release, it will cause the data movers in your Celerra to reboot which may result in data unavailability. Please note that your Celerra may encounter this issue when internal or external users operating the MacOS 10.7 (Lion) release access your Celerra system.  In order to protect against this occurrence, EMC has issued the earlier ETAs and we are proactively upgrading the NAS code operating on your affected Celerra system. EMC strongly recommends that you allow for this important upgrade as soon as possible.

Wow.  A single Mac user running 10.7 can cause a kernel panic and a reboot of your data mover.  According to our rep, Apple changed the implementation of SMB in 10.7 and added a few commands that the Celerra doesn't recognize.  This is a big deal, folks; call your account rep for a DART upgrade if you haven't already.

Adding/Removing modules from a datamover

I recently had an issue where a brand new datamover installed by EMC would not allow me to make it a standby for the existing datamovers.  It turns out that the hardware (specifically the number of FC and Ethernet interfaces) must match PRECISELY; the number of ports and the slots the modules are installed in have to match across all datamovers.

The new datamover that was installed had an extra 4 port Ethernet module installed in it.  Below is the procedure I used to remove the module, including all the commands to take it down, reconfigure it, and bring it back up successfully.  Removing the extra module solved the problem; it then matched the config of the other datamovers and allowed me to configure it as a standby.

First, log in to the CLI on the Control Station with root privileges.  Next, just run the commands below in order.

Turn off connecthome and emails to avoid false alarms.
 /nas/sbin/nas_connecthome -service stop
 /nas/bin/nas_emailuser -modify -enabled no
 /nas/bin/nas_emailuser -info

Copy and paste this to save, it will list the current datamover config.
 nas_server -i -a

Run this to shut the datamover down.  Run getreason to verify when it’s down.
 server_cpu server_<x> -halt now
 /nasmcd/sbin/getreason

Remove/replace the module now.

Power the datamover back on.
 /nasmcd/sbin/t2reset pwron -s <slot number>

Watch getreason for status
 /nasmcd/sbin/getreason
(Wait for it to reboot and say ‘Hardware Misconfigured’)

Once it is in a ‘misconfigured’ state, run setup_slot to configure it:
 /nasmcd/sbin/setup_slot -i 4

Run this command to view the current hardware config, verify that your change was made:
 server_sysconfig server_4 -p

Restart connecthome and email services.
 /nas/sbin/nas_connecthome -service start -clear
 /nas/sbin/nas_connecthome -i
 /nas/bin/nas_emailuser -modify -enabled yes
 /nas/bin/nas_emailuser -info

That's it!  Your datamover has been updated and reconfigured.

VNX Root Replication Checkpoints

Where did all my savvol space go?  I noticed last week that some of my Celerra replication jobs had stalled and were not sending any new data to the replication partner.  I then noticed that the storage pool designated for checkpoints was at 100%.  Not good. Based on the number of file system checkpoints that we perform, it didn’t seem possible that the pool could be filled up already.  I opened a case with EMC to help out.

I learned something new after opening this call – every time you create a replication job, a new checkpoint is created for that job and stored in the savvol.  You can view these in Unisphere by changing the "select a type" filter to "all checkpoints including replication".  You'll notice checkpoints named something like root_rep_ckpt_483_72715_1 in the list; they all begin with root_rep.   After working with EMC on the case for a little while, the support engineer helped me determine that one of my replication jobs had a root_rep_ckpt that was 1.5TB in size.

Removing that checkpoint would immediately solve the problem, but there was one major drawback.  Deleting the root_rep checkpoint first requires that you delete the replication job entirely, requiring a complete re-do from scratch.  The entire filesystem would have to be copied over our WAN link and resynchronized with the replication partner Celerra.  That didn’t make me happy, but there was no choice.  At least the problem was solved.

Here are a couple of tips for you if you’re experiencing a similar issue.

You can verify the storage pool the root_rep checkpoints are using by running an info against the checkpoint from the command line and looking for the 'pool=' field.

nas_fs -list | grep root_rep   (the first column in the output is the ID# for the next command)

nas_fs -info id=<id from above>

 You can also see the replication checkpoints and IDs for a particular filesystem with this command:

fs_ckpt <production file system> -list -all

You can check the size of a root_rep checkpoint from the command line directly with this command:

/nas/sbin/rootnas_fs -size root_rep_ckpt_883_72715_1

 

Use the CLI to determine replication job throughput

This handy command will allow you to determine exactly how much bandwidth you are using for your Celerra replication jobs.

Run this command first, it will generate a file with the stats for all of your replication jobs:

nas_replicate -info -all > /tmp/rep.out

Run this command next:

grep "Current Transfer Rate" /tmp/rep.out |grep -v "= 0"

The output looks like this:

Current Transfer Rate (KB/s)   = 196
 Current Transfer Rate (KB/s)   = 104
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 90
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 88
 Current Transfer Rate (KB/s)   = 94
 Current Transfer Rate (KB/s)   = 89
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 108
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 118
 Current Transfer Rate (KB/s)   = 119
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 27
 Current Transfer Rate (KB/s)   = 136
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 242
 Current Transfer Rate (KB/s)   = 77
 Current Transfer Rate (KB/s)   = 218
 Current Transfer Rate (KB/s)   = 285
 Current Transfer Rate (KB/s)   = 287
 Current Transfer Rate (KB/s)   = 184
 Current Transfer Rate (KB/s)   = 224
 Current Transfer Rate (KB/s)   = 82
 Current Transfer Rate (KB/s)   = 324
 Current Transfer Rate (KB/s)   = 210
 Current Transfer Rate (KB/s)   = 328
 Current Transfer Rate (KB/s)   = 156
 Current Transfer Rate (KB/s)   = 156

Each line represents the throughput for one of your replication jobs.  Adding all of those numbers up will give you the amount of bandwidth you are consuming.  In this case, I’m using about 4.56MB/s on my 100MB link.
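Rather than adding the numbers up by hand, you can let awk do the math against the same output file:

grep "Current Transfer Rate" /tmp/rep.out | grep -v "= 0" | awk -F= '{sum+=$2} END {printf "Total: %d KB/s (%.2f MB/s)\n", sum, sum/1024}'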

This same technique can of course be applied to any part of the output file.  If you want to know the estimated completion date of each of your replication jobs, you’d run this command against the rep.out file:

grep "Estimated Completion Time" /tmp/rep.out

That will give you a list of dates, like this:

Estimated Completion Time      = Fri Jul 15 02:12:53 EDT 2011
 Estimated Completion Time      = Fri Jul 15 08:06:33 EDT 2011
 Estimated Completion Time      = Mon Jul 18 18:35:37 EDT 2011
 Estimated Completion Time      = Wed Jul 13 15:24:03 EDT 2011
 Estimated Completion Time      = Sun Jul 24 05:35:35 EDT 2011
 Estimated Completion Time      = Tue Jul 19 16:35:25 EDT 2011
 Estimated Completion Time      = Fri Jul 15 12:10:25 EDT 2011
 Estimated Completion Time      = Sun Jul 17 16:47:31 EDT 2011
 Estimated Completion Time      = Tue Aug 30 00:30:54 EDT 2011
 Estimated Completion Time      = Sun Jul 31 03:23:08 EDT 2011
 Estimated Completion Time      = Thu Jul 14 08:12:25 EDT 2011
 Estimated Completion Time      = Thu Jul 14 20:01:55 EDT 2011
 Estimated Completion Time      = Sun Jul 31 05:19:26 EDT 2011
 Estimated Completion Time      = Thu Jul 14 17:12:41 EDT 2011

Very useful stuff. 🙂

 

Use the CLI to quickly determine the size of your Celerra checkpoint filesystems

Need to quickly figure out which checkpoint filesystems are taking up all of your precious savvol space?  Run the CLI command below.  Filling up the savvol storage pool can cause all kinds of problems besides failing checkpoints.  It can also cause filesystem replication jobs to fail.

To view it on the screen:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size

To save it in a file:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size > checkpoints.txt

vi checkpoints.txt   (to view the file)

Here’s a sample of the output:

UserFilesystem_01
ckpt_ckpt_UserFilesystem_01_monthly_001 :   836 : 220000
ckpt_ckpt_UserFilesystem_01_monthly_002 :   649 : 220000

UserFilesystem_02
ckpt_ckpt_UserFilesystem_02_monthly_001 :   836 : 80000
ckpt_ckpt_UserFilesystem_02_monthly_002 :   649 : 80000

The numbers are in MB.

 

Unable to provision Celerra storage?

This one really made no sense to me at first.  I was attempting to create a new file storage pool on our NS960 Celerra.  When I launched the disk provisioning wizard, it would pause for a minute and then give the following error:

“ERROR: Unable to continue provisioning.  click for details”

It was strange because I have plenty of disks that could be used for provisioning.  Why wasn’t it working?

Here is the detailed error message:

Message Details:

Message: Unable to continue provisioning

Full Description:  Not able to fetch disk information. n Command Failed, error code: 1, output: errormessage:string=”Timeout (60 seconds) waiting for state SS_DISKS_LOADED”

Recommended Action: No recommended action is available. Go to http://powerlink.EMC.com for more information.

Event Code: 15301214354

As a workaround and to test the issue, I used 8 spare SATA drives and created a new raid group, with one large LUN using all of the space.  I added it to the Celerra storage group and rescanned the SAN for storage.

The following error popped up:

Brief Description:  Invalid credentials for the storage array APM01034413494. 
Full Description:  The FLARE version on this storage array requires secure communication. Saved credentials are found, but authentication failed. The Control Station does not have valid credentials. 
Recommended Action:  Set the credentials by running the “nas_storage -modify <backend_name> -security” command. 
Message ID:  13422231564 

Well, how about that.  An error message that actually gives you the command to resolve the problem. 🙂  As it turns out, one of our other SAN administrators had changed the password for the system account.  Running nas_storage -modify id=<xx> -security resolved the problem.

(Note: You can get the ID number by running nas_storage -list)