
Data Domain CLI Command Reference Guide

Other CLI Reference Guides:
Isilon CLI | EMC ECS CLI | VNX NAS CLI | ViPR Controller CLI | NetApp Clustered ONTAP CLI | Brocade FOS CLI | EMC XTremIO CLI

This is a Data Domain CLI Command Reference Guide for the commands that are more commonly used.

If you’re looking to automate reports for your Data Domain, see my post Easy Reporting on Data Domain using the Autosupport Log.

# alerts notify-list create <group-name> Creates a notification list and subscribes to events belonging to the specified list of classes and severity levels.
# alerts notify-list add <group-name> Adds to a notification list and subscribes to events belonging to the specified list of classes and severity levels.
# alerts notify-list del <group-name> Deletes members from a notification list, a list of classes, a list of email addresses.
# alerts notify-list destroy <group-name> Destroys a notification list
# alerts notify-list reset Resets all notification lists to factory default
# alerts notify-list show Shows notification lists’ configuration
# alerts notify-list test Sends a test notification to alerts notify-list
# cifs share create <share> path <path> {max-connections <max connections> | clients <clients> | users <users> | comment <comment>}
# cifs status Check CIFS Status
# cifs disable Disable CIFS Service
# cifs enable Enable CIFS Service
# nfs add path client-list [(option-list)] Add NFS clients to an Export
# nfs show active List clients active in the past 15 minutes and the mount path for each
# nfs show clients list NFS clients allowed to access the Data Domain system and the mount path and NFS options for each
# nfs show detailed-stats display NFS cache entries and status to facilitate troubleshooting
# nfs status Display NFS status
# nfs enable Enable NFS service
# nfs disable Disable NFS service
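
For example, to export an MTree to a client subnet with a typical option set (the path, network, and options below are illustrative; adjust them for your environment):

# nfs add /data/col1/backup 192.168.1.0/24 (rw,no_root_squash,no_all_squash,secure)
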
DD Boost
# ddboost enable Enable DDBoost
# ddboost status show DDBoost status
# ddboost set user-name <user-name> Set DD Boost user
# ddboost access add clients <client-list> Add clients to DD Boost access list
# ddboost storage-unit create <storage-unit-name> Create storage-unit, setting quota limits
# ddboost storage-unit delete <storage-unit-name> Delete storage-unit
# ddboost storage-unit show [compression] [<storage-unit-name>] List the storage-units and images in a storage-unit:
# ddboost storage-unit create <storage-unit> user <user-name> Create a storage unit, assign tenant, and set quota and stream limits
# ddboost storage-unit delete <storage-unit> Delete a specified storage unit, its contents, and any DD Boost associations
# ddboost storage-unit rename <storage-unit> <new-storage-unit> Rename a storage-unit
# ddboost storage-unit undelete <storage-unit> Recover a deleted storage unit
# ddboost option reset Reset DD Boost options
# ddboost option set distributed-segment-processing {enabled|disabled} Enable or disable distributed-segment-processing for DD Boost
# ddboost option set virtual-synthetics {enabled | disabled} Enable or disable virtual-synthetics for DD Boost
# ddboost option show Show DD Boost options
# ddboost option set fc {enabled | disabled} Enable or disable fibre-channel for DD Boost
# ddboost fc dfc-server-name set DDBoost Fibre-Channel set Server Name
# ddboost fc dfc-server-name show Show DDBoost Fibre-Channel Server Name
# ddboost fc status DDBoost Fibre Channel Status
# ddboost fc group show list [<group-spec>] [initiator<initiator-spec>] List configured DDBoost FC groups
# ddboost fc group create <group-name> Create a DDBoost FC group
# ddboost fc group add <group-name> initiator <initiator-spec> Add initiators to a DDBoost FC group
# ddboost fc group add <group-name> device-set Add DDBoost devices to a DDBoost FC group
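
As a quick end-to-end example, a basic DD Boost setup might look like the sequence below (the user, storage-unit, and client names are hypothetical):

# ddboost enable
# ddboost set user-name ddboost_user
# ddboost storage-unit create VEEAM_SU user ddboost_user
# ddboost access add clients backupserver01.example.com
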
Encryption and File system Locking
# filesys enable Enables the file system
# filesys disable Disables the file system
# filesys encryption enable Enables encryption. Enter a passphrase when prompted
# filesys encryption disable Disables encryption.
# filesys encryption show Checks the status of the encryption feature
# filesys encryption lock Locks the system by creating a new passphrase and destroying the cached copy of existing passphrase
# filesys encryption passphrase change Changes the passphrase for system encryption keys
# filesys encryption unlock Prepares the encrypted file system for use after it has arrived at its destination
# license add <license-code> [<license-code> …] Adds one or more licenses for features and storage capacity.
# license show [local] Displays license codes currently installed.
# license del <license-code> Deletes one or more licenses.
# license reset Removes all licenses and requires confirmation before deletion.
# net show settings Displays the interface’s network settings
# net show hardware Displays the interface’s hardware configuration
# net show config Displays the active network configuration
# net show domainname Displays the domain name associated with this device
# net show searchdomain Lists the domains that will be searched when only the host name is provided for a command
# net show dns Lists the domain name servers used by this device.
# net show stats Provides a number of different networking statistics
# net show all Combines the output of several other net show CLI commands
Replication, Throttling, LBO, Encryption
# replication enable {<destination> | all} Enables replication
# replication disable {<destination> | all} Disables replication
# replication add source <source> destination <destination> Creates a replication pair
# replication break {<destination> | all} Removes the source or destination DD system from a replication pair
# replication initialize <destination> Initialize replication on the source (configure both source and destination first)
# replication modify <destination> {source-host | destination-host} <new-host-name> Modifies connection host, hostname
# replication modify <destination> connection-host <new-host-name> [port <port>] Modifies port
# replication add … low-bw-optim enabled Adds LBO
# replication modify … low-bw-optim enabled Modify LBO
# replication modify … low-bw-optim disabled Disable
# replication add … encryption enabled Add encryption over wire
# replication modify … encryption enabled Enable encryption over wire
# replication modify … encryption disabled Disable encryption over wire
# replication option set listen-port <port> Modify listening port  [context must be disabled before the connection port can be modified]
# replication option reset listen-port Reset listening port  [context must be disabled before the connection port can be modified]
# replication throttle add <sched-spec> <rate> Add a throttle schedule
# replication throttle add destination <host> <sched-spec> <rate> Add a destination specific throttle
# replication throttle del <sched-spec> Delete a throttle schedule
# replication throttle reset {current | override | schedule | all} Reset throttle configuration
# replication throttle set current <rate> Set a current override
# replication throttle set override <rate> Set a permanent override
# replication throttle show [KiB] Show throttle configuration
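
Putting a few of these together, creating and seeding a new MTree replication context might look something like this (hostnames and MTree paths are hypothetical):

# replication add source mtree://dd-prod.example.com/data/col1/backup destination mtree://dd-dr.example.com/data/col1/backup
# replication initialize mtree://dd-dr.example.com/data/col1/backup
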
Retention Lock
# mtree retention-lock enable mtree_name Enables the retention-lock feature for the specified MTree
# mtree retention-lock disable mtree_name Disables the retention-lock feature for the specified MTree
# mtree retention-lock reset Resets the value of the retention period for the specified MTree to its default
# mtree retention-lock revert Reverts the retention lock for all files on a specified path
# mtree retention-lock set Sets the minimum or maximum retention period for the specified MTree
# mtree retention-lock show Shows the minimum or maximum retention period for the specified MTree
# mtree retention-lock status mtree_name Shows the retention-lock status for the specified MTree
# system sanitize abort Aborts the sanitization process
# system sanitize start Starts the sanitization process immediately
# system sanitize status Shows the current sanitization status
# system sanitize watch Monitors sanitization progress
SMT MTree stats
# mtree list List the MTrees on a Data Domain system
# mtree show stats Collect MTree real-time performance statistics
# mtree show performance Collect performance statistics for MTrees associated with a tenant-unit
# mtree show compression Collect compression statistics for MTrees associated with a tenant-unit
# quota capacity show List capacity quotas for MTrees and storage-units
# ddboost storage-unit modify Adjust or modify the quotas after the initial configuration
System Performance
# system show stats interval [interval in seconds] Shows system stats (Disk, IOs,…etc)
# system show performance [ {hr | min | sec} [ {hr | min | sec} ]] Show System Performance
# ndmpd enable Enable the NDMP daemon
# ndmpd show devicenames Verify that the NDMP daemon sees the devices created in the TapeServer access group
# ndmpd user add ndmp Add an NDMP user
# ndmpd option show all Check the options for the ndmpd daemon
# ndmpd option set authentication md5 Set the ndmpd service authentication to MD5
# ndmpd option show all Verify the service authentication

Diving in to Isilon SyncIQ and SnapshotIQ Management

In this post I’m going to review the most useful commands for managing SyncIQ replication jobs and SnapshotIQ snapshots on the Isilon.  While this will primarily be a CLI administration reference, I’ll look at some WebUI options when I get to snapshots, along with some additional notes and caveats regarding snapshot management.  I’d highly recommend reviewing EMC’s SnapshotIQ best practices page, as well as the SyncIQ best practices guide, if you’re just starting a new implementation.  For a complete Isilon command line reference, see this post.

Creating a Replication policy

# isi sync policies create sync --schedule "" --target-snapshot-archive on --target-snapshot-pattern "%{PolicyName}-%{SrcCluster}-%Y-%m-%d_%H-%M"
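
For reference, a complete invocation might look something like the example below. The policy name, paths, target host, and schedule string are hypothetical, and the positional argument order can vary by OneFS release, so verify it with isi sync policies create --help before using it:

# isi sync policies create DRPolicy1 sync /ifs/data/prod dr-cluster.example.com /ifs/data/prod-dr --schedule "every day at 22:00" --target-snapshot-archive on --target-snapshot-pattern "%{PolicyName}-%{SrcCluster}-%Y-%m-%d_%H-%M"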

Viewing active replication jobs

# isi sync jobs list

Policy Name ID State Action Duration
 Replica1 32375 running run 1M1W5D14H55m
 Total: 1

# isi sync jobs view

Policy Name: Replica1
 ID: 32375
 State: running
 Action: run
 Duration: 1M1W5D14H55m9s
 Start Time: 2017-10-27T17:00:25

# isi_classic sync job rep

Name | Act | St | Duration | Transfer | Throughput
 Replica1 | sync | Running | 42 days 14:59:23 | 3.0 TB | 6.8 Mb/s

# isi_classic sync job rep -v [Provides a more verbose report]

Creating a SyncIQ domain [Required for failback operations]

# isi job jobs start DomainMark --root <path> --dm-type SyncIQ

Reviewing a replication Job before starting it

Replication policy status can be reviewed with the ‘test’ option. It is useful for previewing the size of the data set that will be transferred if you run the policy.

# isi sync jobs start <policy-name> --test
# isi sync reports view 1

Replication policy Enable/Disable/Delete

# isi sync policies enable <policy-name>
# isi sync policies disable <policy-name>
# isi sync policies delete <policy-name>

Replication Job Management

# isi sync jobs start <policy-name>
# isi sync jobs pause <policy-name>
# isi sync jobs resume <policy-name>
# isi sync jobs cancel <policy-name>

Replication Policy Management

# isi sync policies list
# isi sync policies view <policy-name>

Viewing replication policies that target the local cluster

# isi sync target list
# isi sync target view <policy-name>

Managing replication performance rules

# isi sync rules create

Create network traffic rules that limit replication bandwidth

# isi sync rules create bandwidth 00:00-23:59 Sun-Sat 19200 [Limit bandwidth consumption to 19,200 kb per second, 24×7]
# isi sync rules create file_count 08:00-18:00 M-F 7 [Limit the file-send rate to 7 files per second, 8am-6pm on weekdays]

Managing replication performance rules

# isi sync rules list
# isi sync rules view --id bw-0
# isi sync rules modify bw-0 --enabled true
# isi sync rules modify bw-0 --enabled false

Managing replication reports

# isi sync reports list
# isi snapshots list | head -200 [list the first 200 snapshots]
# isi sync reports view 2
# isi sync reports subreports list 1 [view sub-reports]

Managing failed replication jobs

# isi sync policies resolve <policy-name> [Resolve a policy error]
# isi sync policies reset <policy-name> [If the issue can’t be resolved, reset the policy. Resetting a policy results in a full or differential replication the next time the policy is run]

Creating Snapshots

# isi snapshot snapshots create
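
A minimal example is shown below, assuming a snapshot of /ifs/data/media with an explicit name; the path, name, and flag shown are illustrative, so verify them against your OneFS version:

# isi snapshot snapshots create /ifs/data/media --name media-snap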

# isi snapshot snapshots delete {<snapshot> | --schedule <schedule> | --type {alias | real} | --all} [--force | -f] [--verbose | -v]

Modifying Snapshots

# isi snapshot snapshots modify

Listing Snapshots

# isi snapshot snapshots list --state {all | active | deleting}
# isi snapshot snapshots list --limit | -l [Number of snapshots to display]
# isi snapshot snapshots list --descending | -d [Sort data in descending order]

Viewing Snapshots

# isi snapshot snapshots view

Deleting Snapshots

Deleting a snapshot from OneFS is an all-or-nothing event; an existing snapshot cannot be partially deleted. Snapshots are created at the directory level rather than the volume level, which allows for a higher degree of granularity. Because they are a point-in-time copy of a specific subset of OneFS data, they can’t be changed, only fully deleted. When a snapshot is deleted, OneFS immediately modifies some of the tracking data and the snapshot disappears from view. Even though the snapshot is no longer visible, the behind-the-scenes cleanup of the snapshot is still pending; it is performed by the ‘SnapshotDelete’ job.

OneFS frees disk space occupied by deleted snapshots only when the snapshot delete job is run. If a snapshot is deleted that contains clones or cloned files, the data in a shadow store may no longer be referenced by files on the cluster. OneFS deletes unreferenced data in a shadow store when the shadow store delete job is run. OneFS automatically runs both the shadow store delete and snapshot delete jobs, but you can also run them manually any time. Follow the procedure below to force the snapshot delete job to more quickly reclaim array capacity.
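
If you prefer the CLI to the WebUI steps that follow, both cleanup jobs can also be started manually with the job engine command shown earlier (the job names below are as they appear in the WebUI job list):

# isi job jobs start SnapshotDelete
# isi job jobs start ShadowStoreDelete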

Deleting Snapshots from the WebUI

Go to Data Protection > SnapshotIQ > Snapshots and specify the snapshots that you want to delete.

• For each snapshot you want to delete, in the Saved File System Snapshots table, in the row of a snapshot, select the check box.
• From the Select an action list, select Delete.
• In the confirmation dialog box, click Delete.
• Note that you can select more than one snapshot at a time, and clicking the delete button on any of the snapshots will result in the entire checked list being deleted.
• If you have a large number of snapshots and want to delete them all, you can run a command from the CLI that will delete all of them at once: isi snapshot snapshots delete --all.

Increasing the Speed of Snapshot Deletion from the WebUI

It’s important to note that the SnapshotDelete job will only run if the cluster is in a fully available state; there can be no drives or nodes down, and the cluster cannot be in a degraded state. To increase the speed at which deleted snapshot data is freed on the cluster, run the snapshot delete job.

• Go to Cluster Management > Operations.
• In the Running Jobs area, click Start Job.
• From the Job list, select SnapshotDelete.
• Click Start.

Increasing the Speed of Cloned File deletion from the WebUI

Run the shadow store delete job only after you run the snapshot delete job.

• Go to Cluster Management > Operations.
• In the Running Jobs area, click Start Job.
• From the Job list, select ShadowStoreDelete.
• Click Start.

Reserved Space

There is no requirement for reserved space for snapshots in OneFS. Snapshots can use as much or as little of the available file system space as desired. The oldest snapshot can be deleted very quickly. An ordered deletion is the deletion of the oldest snapshot of a directory, and is a recommended best practice for snapshot management. An unordered deletion is the removal of a snapshot that is not the oldest in a directory, and can often take approximately twice as long to complete and consume more cluster resources than an ordered deletion.

The Delete Sequence Matters

As I just mentioned, avoid deleting snapshots from the middle of a time range whenever possible. Newer snapshots are mostly pointers to older snapshots, and they look like they are consuming more capacity than they actually are. Removing the newer snapshots will not free up much space, while deleting the oldest snapshot will ensure you are actually freeing up the space. You can determine snapshot order by using the isi snapshot list -l command.

Watch for SyncIQ Snaps

Avoid deleting SyncIQ snapshots if possible. They are easily identifiable, as they will all be prefixed with SIQ. It is ok to delete them if they are the only remaining snapshots on the cluster, and the only way to free up space is to delete them. Be aware that deleting SyncIQ snapshots resets the SyncIQ policy state, which requires a reset of the policy and may result in either a full sync or initial differential sync. A full sync or initial diff sync could take many times longer than a regular snapshot-based incremental sync.

Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500’s and was receiving several errors when selecting our DR NS-960 as the ‘destination celerra network server’.

It was displaying the following errors at the top of the window:

– “Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>”. The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn’t exceed 10 minutes, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details on this message say to search powerlink, not a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn’t think there would be any further issues.  Logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here’s why:

The Clariion global user account is tied to a local user account on the control station in /etc/passwd. When nasadmin was recreated on the domain, it attempted to create the nasadmin account on the control station as well.  Because the account already existed as a local account on the control station, it created a local account named ‘nasadmin1‘ instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so when logging in with the global nasadmin account you were no longer tied to the local nasadmin account on the control station.  Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solves the problem.  Because the issue was related only to the nasadmin account in this case, I could have also solved the problem by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.

Testing Disaster Recovery for VNX VDM’s and CIFS servers

After spending a few days working on a DR test recovery, I thought I’d describe the process along with a few roadblocks that I hit along the way.  We had some specific requirements that had to be met, so I thought I’d share my experiences.  Our host site has a VNX5500 and our DR site has an NS-960, and we have Celerra Replicator configured to replicate the VDM and all of the production filesystems from one site to the other.

Here were my business requirements for this test:

  1. Replicate the VDM, production CIFS server and production filesystems from the host site to DR site.
  2. Fail over (or bring up a copy of) the VDM from the host site to the DR site, mounting the replicated VDM at the DR site.
  3. Fail over (or bring up a copy of) the production CIFS server at the DR site.
  4. Create R/W checkpoints of all replicated filesystems at DR site to allow for appropriate user and application testing.
  5. Share the R/W checkpoints of the replicated filesystems on the CIFS server at the DR site rather than the original replicated filesystems, so original replicated data is not touched and does not need to be replicated again after the test.

I started off by setting up replication jobs for our VDM and all filesystems.  Once those were complete (after several weeks of data transfers) I was ready to test.

Step 1: Replicate VDM and production filesystems

This post isn’t meant to detail the process of actually setting up the initial replications, just how to get the replicated data working and accessible at your DR site.  Setting up replication is a well documented procedure which can be reviewed in EMC’s guide “Using Celerra Replicator (V2)”, P/N 300-009-989.  Once the VDMs and filesystems are replicated, you’re ready for the next step.

Step 2: Bring up the VDM at the DR site

The first step in my testing requirements is to bring up the VDM at the DR site.

Failed attempt 1:

I initially created a new replication session for the VDM because I didn’t want to use the actual production VDM, since this was a DR test and not an actual disaster.

After replicating a new copy of the VDM, I attempted to load it in the CLI with the command below.  This must be done from the CLI as there is no option to do this step in Unisphere.

nas_server -vdm <VDMNAME> -setstate loaded

It failed with this error:

Error 12066: root_fs <VDMNAME> is the source or destination object of a file system and cannot be unmounted or is the source or destination object of a VDM replication session and cannot be unloaded.

It was pretty obvious here that you need to stop the replication first before you can load the VDM.  So, as a next step, I stopped the replication with a simple right click/stop on the source side and tried again.

It failed with this error:

Error 4038: <interface_name_1> <interface_name_2> : interfaces not available on server_2

So, it looks like the interface names need to be the same.  I didn’t really want to change the interface names if I didn’t have to, so I tried a different approach next.

Failed attempt 2:

I thought this time I’d create a blank VDM on the destination side first and replicate the host VDM to it, thinking it wouldn’t keep the interface name requirement, and I still wouldn’t have to stop replication on the actual prod VDM, as I didn’t really want to use that one in a test.

I did just that. I created a blank VDM on the DR side, then started a new replication session from the host side and chose it as the destination, making sure to choose the overwrite option when I replicated to it.  The replication was successful.  I stopped the replication on the source side after it was complete, and then attempted to load the new replicated VDM on the DR side.

Voila! It worked:

nas_server -vdm <VDMNAME> -setstate loaded
            id          =          10
            name    =          vdm_replica
            acl        =          0
            type      =          vdm
            server   =          server_2
            rootfs    =          root_fs_vdm_replica
            I18N     =          UNICODE        
            Status   :
            Defined=          enabled
            Actual  =          loaded,ready

Now that it was loaded up, it was time to move on to the next step and create the R/W checkpoints of the filesystems. This is where the process failed again.

After clicking on the drop down box for “Choose Data Mover”, I got this error:

 No file systems exist

 Query file systems vdm_replica: All. File system not found. 

I’m not sure why this failed, but since the VDM couldn’t find the filesystems it was time to try another approach again.

Successful attempt:

After my first two failures, it looked pretty obvious that I’d need to change the interface names and use the original replicated VDM.  Making a copy of the VDM to a blank VDM didn’t work because it couldn’t see the filesystems, and using the original requires the interface names to be the same.  The lesson learned here is to make sure you have matching ports on your host and DR Celerras, and use the same interface names.  If I had done that, my first attempt would have been successful.

If the original VDM has four CIFS servers (each with its own interface) and the DR Celerra only has one port configured on the network, you’d be out of luck.  You wouldn’t have enough interfaces to rename them all to match, and you’d never be able to load your VDM.  The VDMs only look for the names to be the same, NOT the IPs.  The IPs can be different to match your DR network, and the IPs that are already assigned to the DR site interfaces will NOT change when you load the VDM.

In my case, the host Celerra has two CIFS servers, each with its own interface.  One is for production, one is for backups.
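
Before loading the VDM, you can compare the interface names on the source and DR data movers by listing them from each control station (assuming server_2 hosts the VDM on both sides):

server_ifconfig server_2 -all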

Here are the steps that worked for me:

  1. Stop the replication of the VDM (You will see it change to a ‘stopped’ state in Unisphere).
  2. Change the interface names on the DR side (changing IP’s is not necessary) to match the host side.
  3. Load the VDM with the command nas_server -vdm <VDMNAME> -setstate loaded
  4. You will see the VDM status change from ‘unloaded’ to ‘OK’.

Step 3:  Bring up the CIFS server at the DR site

After you’ve completed the previous step, the VDM will be loaded using the exact same interfaces as production, and the CIFS servers will be automatically created as well.  If a CIFS server uses cge1-0 on server_2 on the host side, it will now be set up with the same name using cge1-0 on server_2 on the destination (DR) side.

This would be very useful in a real disaster, but for this test I wanted to create an alternate CIFS server with a different IP, as the domain controller, DNS servers, and IP range used at our DR site are different.  You could choose to use the same CIFS server that was replicated with the VDM, but for our test I decided to bring up an entirely new CIFS server.  We use DFS to access all of our shares in production, so the name of the CIFS server won’t matter for our testing purposes; we would just need to update DFS with the new name on the DR network.

Here are the steps I took to bring up the CIFS server for DR:

  1. Gather IP information from the DR team.  Will need a valid IP and subnet mask for the new CIFS server.
  2. Verify IP config on new DR network.
    1. Check that the default route matches the DR network
    2. Check that the DNS server entries match the DNS servers on the DR network
  3. Verify that the Domain controller in the DR network is up and available
  4. Modify the interface of your choice with the correct IP information for the CIFS server.
  5. Create the CIFS server and join it to the DR active directory domain (a command sketch follows this list).
    1. If you need to test an AD account, use this command:
    2. server_cifssupport <vdm_name> -cred -name <username> -domain <domain_name>
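
For reference, creating and joining the new CIFS server from the CLI looks roughly like the commands below. The computer name, domain, and interface are hypothetical, and I performed this step through Unisphere, so treat this as a sketch and confirm the exact syntax with the server_cifs man page before using it:

server_cifs <vdm_name> -add compname=drcifs01,domain=dr.example.com,interface=cge1-0
server_cifs <vdm_name> -Join compname=drcifs01,domain=dr.example.com,admin=administrator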

That’s it for this step.  The CIFS server was successfully joined to the domain and I was able to ping it from one of our previously recovered windows servers on the DR network.

Step 4: Create Read/Write checkpoints of all replicated filesystems

One of my business requirements for this test was to allow read/write access to the replicated filesystems without having to actually change the production data.  The easy way to accomplish this is to create a single read/write checkpoint (snapshot) of each filesystem.  To do this, go to the checkpoint area in Unisphere, click create, and select the “Writeable Checkpoint” checkbox when you create the checkpoint.  You can also script the process and run it from the CLI on the control station.

First, create each checkpoint with this command:

nas_ckpt_schedule -create <ckpt_fs_name> -filesystem <fs_name> -recurrence once

Second, create a read/write copy of each checkpoint with this command:

fs_ckpt <ckpt_fs_name> -name <r/w_ckpt_fs_name> -Create -readonly n

I would recommend running these no more than two at a time and letting them finish.  I’ve had issues in the past running dozens of checkpoint jobs at once that hang and never complete, requiring a reboot of the data mover to correct.
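
If you have a long list of filesystems, the two commands above can be wrapped in a simple loop and run from the control station. This is just a sketch that assumes a file named fslist.txt containing one production filesystem name per line; it creates the checkpoints sequentially, which also avoids the hanging issue mentioned above:

for fs in `cat fslist.txt`
do
    nas_ckpt_schedule -create ${fs}_dr_ckpt -filesystem $fs -recurrence once
    fs_ckpt ${fs}_dr_ckpt -name ${fs}_dr_ckpt_rw -Create -readonly n
done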

Step 5: Share the replicated filesystems on the DR CIFS server

Once all of the R/W checkpoints are created, they can be shared on the DR CIFS server with the same share names as the original production share names. This allows all of our recovered application and file servers to connect to the same names, simplifying the configuration of the test environment.

You can use a CLI command to export each r/w copy to share them on your CIFS Server:

server_export [vdm] -P cifs -name [filesystem]_ckpt1 -option netbios=[cifserver] [filesystem]_ckpt1_writeable1

Step 6: Cleanup

That’s it!  We had a successful DR test.  Once the test was complete, I performed the following cleanup steps:

  1. Remove CIFS server shares
  2. Remove CIFS server
  3. Change interfaces on the DR Celerra back to their original names and IPs.
  4. Unload the replicated VDM with this command:
    1. nas_server -vdm <VDMNAME> -setstate mounted
    2. Restart the VDM replication from the source

VNX replication monitoring script

This script allows me to quickly monitor and verify the status of my replication jobs every morning.  It will generate a csv file with six columns: file system name, interconnect, estimated completion time, current transfer size, current transfer remaining, and current write speed.

I recently added two more remote offices to our replication topology and I like to keep a daily tab on how much longer they have to complete the initial seeding, and it will also alert me to any other jobs that are running too long and might need my attention.

Step 1:

Log in to your Celerra and create a directory for the script.  I created a subdirectory called “scripts” under /home/nasadmin.

Create a text file named ‘replfs.list’ that contains a list of your replicated file systems.  You can cut and paste the list out of Unisphere.

The contents of the file should look something like this:
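
For example, using the filesystem names from the sample report at the end of this post, replfs.list would simply contain one filesystem name per line:

SA2Users_01
SA2Users_02
SA2Users_03
SA2CommonFiles_02
SA2CommonFiles_03
SA2CommonFiles_04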

 Step 2:

Copy and paste all of the code into a text editor and modify it for your needs (the complete code is at the bottom of this post).  I’ll go through each section here with an explanation.

1: The first section will create a text file ($fs.dat) for each filesystem in the replfs.list file you made earlier.

for fs in `cat replfs.list`
do
         nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

The output will look like this:
Name                                        = Filesystem_01
Source Current Data Port            = 57471
Current Transfer Size (KB)          = 232173216
Current Transfer Remain (KB)     = 230877216
Estimated Completion Time        = Thu Nov 24 06:06:07 EST 2011
Current Transfer is Full Copy      = Yes
Current Transfer Rate (KB/s)       = 160
Current Read Rate (KB/s)           = 774
Current Write Rate (KB/s)           = 3120
 2: The second section will create a blank csv file with the appropriate column headers:
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

3: The third section will parse all of the output files created by the first section, pulling out only the data that we’re interested in.  It places it in columns in the csv file.

         for fs in `cat replfs.list`
         do
                 echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
         done

If you’re not familiar with awk, I’ll give a brief explanation here.  When you grep for a certain line in the output, awk allows you to print just one field from that line.

For example, if you want the output of “Yes” put into a column in the csv file, but the output code line looks like “Current Transfer is Full Copy      = Yes”, then you could pull out only the “Yes” by typing in the following:

 nas_replicate -info Filesystem01 | grep  Full | awk '{print $7}'

Because the word ‘Yes’ is the 7th item in the line, the output would only contain the word Yes.

4: The final section will send an email with the csv output file attached.

uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com

Step 3:

Copy and paste the modified code into a script file and save it.  I have mine saved in the /home/nasadmin/scripts folder. Once the file is created, make it executable by typing chmod +x scriptfile.sh (or set the permissions explicitly with chmod 755 scriptfile.sh).

Step 4:

You can now add the file to crontab to run automatically.  Add it to cron by typing crontab -e; to view your crontab entries, type crontab -l.  For details on how to add cron entries, do a Google search as there is a wealth of info available on your options.
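
For example, a crontab entry like the one below (the script path and schedule are just placeholders) would run the report every morning at 6am:

0 6 * * * /home/nasadmin/scripts/replreport.sh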

Script Code:

#!/bin/bash
# Build a .dat file with the replication stats for each filesystem in replfs.list
for fs in `cat replfs.list`
do
         nas_replicate -info $fs | egrep 'Celerra|Name|Current|Estimated' > $fs.dat
done

# Create the csv file with the column headers
echo 'Name,System,Estimated Completion Time,Current Transfer Size (KB),Current Transfer Remain (KB),Write Speed (KB)' > replreport.csv

# Parse each .dat file and append one row per filesystem to the csv
for fs in `cat replfs.list`
do
         echo $fs","`grep Celerra $fs.dat | awk '{print $5}'`","`grep -i Estimated $fs.dat |awk '{print $5,$6,$7,$8,$9,$10}'`","`grep -i Size $fs.dat |awk '{print $6}'`","`grep -i Remain $fs.dat |awk '{print $6}'`","`grep -i Write $fs.dat |awk '{print $6}'` >> replreport.csv
done

# Email the csv report as an attachment
uuencode replreport.csv replreport.csv | mail -s "Replication Status Report" user@domain.com
 The final output of the script generates a report that looks like the sample below.  Filesystems that have all zeros and no estimated completion time are caught up and not currently performing a data synchronization.
Name System Estimated Completion Time Current Transfer Size (KB) Current Transfer Remain (KB) Write Speed (KB)
SA2Users_03 SA2VNX5500 0 0 0
SA2Users_02 SA2VNX5500 Wed Dec 16 01:16:04 EST 2011 211708152 41788152 2982
SA2Users_01 SA2VNX5500 Wed Dec 16 18:53:32 EST 2011 229431488 59655488 3425
SA2CommonFiles_04 SA2VNX5500 0 0 0
SA2CommonFiles_03 SA2VNX5500 Wed Dec 16 10:35:06 EST 2011 232173216 53853216 3105
SA2CommonFiles_02 SA2VNX5500 Mon Dec 14 15:46:33 EST 2011 56343592 12807592 2365
SA2commonFiles_01 SA2VNX5500 0 0 0

VNX Root Replication Checkpoints

Where did all my savvol space go?  I noticed last week that some of my Celerra replication jobs had stalled and were not sending any new data to the replication partner.  I then noticed that the storage pool designated for checkpoints was at 100%.  Not good. Based on the number of file system checkpoints that we perform, it didn’t seem possible that the pool could be filled up already.  I opened a case with EMC to help out.

I learned something new after opening this call: every time you create a replication job, a new checkpoint is created for that job and stored in the savvol.  You can view these in Unisphere by changing the “select a type” filter to “all checkpoints including replication”.  You’ll notice checkpoints named something like root_rep_ckpt_483_72715_1 in the list; they all begin with root_rep.   After working the case with the EMC engineer for a little while, we determined that one of my replication jobs had a root_rep_ckpt that was 1.5TB in size.

Removing that checkpoint would immediately solve the problem, but there was one major drawback: deleting the root_rep checkpoint first requires deleting the replication job entirely, which means a complete re-do from scratch.  The entire filesystem would have to be copied over our WAN link and resynchronized with the replication partner Celerra.  That didn’t make me happy, but there was no choice.  At least the problem was solved.

Here are a couple of tips for you if you’re experiencing a similar issue.

You can verify which storage pool the root_rep checkpoints are using by running an info against the checkpoint from the command line and looking for the ‘pool=’ field.

nas_fs -list | grep root_rep  (the first column in the output is the ID# for the next command)

nas_fs -info id=<id from above>

 You can also see the replication checkpoints and IDs for a particular filesystem with this command:

fs_ckpt <production file system> -list -all

You can check the size of a root_rep checkpoint from the command line directly with this command:

/nas/sbin/rootnas_fs -size root_rep_ckpt_883_72715_1


Use the CLI to determine replication job throughput

This handy command will allow you to determine exactly how much bandwidth you are using for your Celerra replication jobs.

Run this command first, it will generate a file with the stats for all of your replication jobs:

nas_replicate -info -all > /tmp/rep.out

Run this command next:

grep "Current Transfer Rate" /tmp/rep.out |grep -v "= 0"

The output looks like this:

Current Transfer Rate (KB/s)   = 196
 Current Transfer Rate (KB/s)   = 104
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 90
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 88
 Current Transfer Rate (KB/s)   = 94
 Current Transfer Rate (KB/s)   = 89
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 108
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 118
 Current Transfer Rate (KB/s)   = 119
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 27
 Current Transfer Rate (KB/s)   = 136
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 242
 Current Transfer Rate (KB/s)   = 77
 Current Transfer Rate (KB/s)   = 218
 Current Transfer Rate (KB/s)   = 285
 Current Transfer Rate (KB/s)   = 287
 Current Transfer Rate (KB/s)   = 184
 Current Transfer Rate (KB/s)   = 224
 Current Transfer Rate (KB/s)   = 82
 Current Transfer Rate (KB/s)   = 324
 Current Transfer Rate (KB/s)   = 210
 Current Transfer Rate (KB/s)   = 328
 Current Transfer Rate (KB/s)   = 156
 Current Transfer Rate (KB/s)   = 156

Each line represents the throughput for one of your replication jobs.  Adding all of those numbers up will give you the amount of bandwidth you are consuming.  In this case, I’m using about 4.56MB/s on my 100MB link.
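
Rather than adding the numbers up by hand, you can let awk do the math; the rate is the 6th field on each line, so a one-liner like this prints the total:

grep "Current Transfer Rate" /tmp/rep.out | grep -v "= 0" | awk '{sum += $6} END {print sum " KB/s total"}'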

This same technique can of course be applied to any part of the output file.  If you want to know the estimated completion date of each of your replication jobs, you’d run this command against the rep.out file:

grep "Estimated Completion Time" /tmp/rep.out

That will give you a list of dates, like this:

Estimated Completion Time      = Fri Jul 15 02:12:53 EDT 2011
 Estimated Completion Time      = Fri Jul 15 08:06:33 EDT 2011
 Estimated Completion Time      = Mon Jul 18 18:35:37 EDT 2011
 Estimated Completion Time      = Wed Jul 13 15:24:03 EDT 2011
 Estimated Completion Time      = Sun Jul 24 05:35:35 EDT 2011
 Estimated Completion Time      = Tue Jul 19 16:35:25 EDT 2011
 Estimated Completion Time      = Fri Jul 15 12:10:25 EDT 2011
 Estimated Completion Time      = Sun Jul 17 16:47:31 EDT 2011
 Estimated Completion Time      = Tue Aug 30 00:30:54 EDT 2011
 Estimated Completion Time      = Sun Jul 31 03:23:08 EDT 2011
 Estimated Completion Time      = Thu Jul 14 08:12:25 EDT 2011
 Estimated Completion Time      = Thu Jul 14 20:01:55 EDT 2011
 Estimated Completion Time      = Sun Jul 31 05:19:26 EDT 2011
 Estimated Completion Time      = Thu Jul 14 17:12:41 EDT 2011

Very useful stuff. 🙂


DM Interconnect failure with Celerra Replicator

We just installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada.  The setup went fine with no errors, and replication of the VDM completed successfully every day up until yesterday, when I noticed that the status on the main replications screen said “network communication has been lost”.   I am able to use the server_ping command to ping the data mover/replication interface from the UK to Canada, so network connectivity appears to be ok.

I was attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create the replication jobs are stuck at “Establishing communication with secondary side for Create task” with a status of “Incomplete”.

I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”

So, why is the DM Interconnect failing?   It was working fine for several weeks!

My next trip was to the server log (>server_log server_2) where I spotted another issue.  Hundreds of entries that looked just like these:

2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16

Bad Authentication? Hmmm.  There is something amiss with the trusted relationship between the VNX and the NS960.  I did a quick read of EMC’s VNX replication manual (yep, rtfm!) and found the command to update the interconnect, nas_cel.

First, run nas_cel -list to view all of your interconnects, noting the ID number of the one you’re having difficulty with.

[nasadmin@<name> ~]$ nas_cel -list
id    name          owner mount_dev  channel    net_path                                      CMU
0     <name_1>  0                               10.x.x.x                                   APM007039002350000
2     <name_2>      0                           10.x.x.x                                   APM001052420000000
4     <name_3>      0                           10.x.x.x                                   APM009015016510000
5     <name_4>       0                           10.x.x.x                                  APM000827205690000

In this case, I was having trouble with <name_3>, which is ID 4.

Run this command next:  nas_cel -update id=4.   After that command completed, my interconnect immediately started working and I was able to create new replication jobs.