Adding/Removing modules from a datamover

I recently had an issue where a brand new datamover installed by EMC would not allow me to make it a standby for the existing datamovers.  It turns out that the hardware must match precisely: the number of FC and ethernet ports, and the slots the modules are installed in, have to match across all datamovers.

The new datamover that was installed had an extra 4 port ethernet module in it.  Below is the procedure I used to remove the module, including all the commands to take it down, reconfigure it, and bring it back up successfully.  Removing the extra module solved the problem; once its config matched the others I was able to configure it as a standby.

First, log in to the CLI on the control station with root privileges.  Next, run the commands below in order.

Turn off connecthome and emails to avoid false alarms.
 /nas/sbin/nas_connecthome -service stop
 /nas/bin/nas_emailuser -modify -enabled no
 /nas/bin/nas_emailuser -info

Run this and copy the output somewhere safe; it lists the current datamover config.
 nas_server -i -a

Run this to shut the datamover down.  Run getreason to verify when it’s down.
 server_cpu server_<x> -halt now
 /nasmcd/sbin/getreason

Remove/replace the module now.

Power the datamover back on.
 /nasmcd/sbin/t2reset pwron -s <slot number>

Watch getreason for status
 /nasmcd/sbin/getreason
(Wait for it to reboot and say ‘Hardware Misconfigured’)

Once it is in a ‘misconfigured’ state, run setup_slot to configure it:
 /nasmcd/sbin/setup_slot -i 4

Run this command to view the current hardware config and verify that your change was made:
 server_sysconfig server_4 -p

Restart connecthome and email services.
 /nas/sbin/nas_connecthome -service start -clear
 /nas/sbin/nas_connecthome -i
 /nas/bin/nas_emailuser -modify -enabled yes
 /nas/bin/nas_emailuser -info

That’s it!  Your datamover has been updated and reconfigured.
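If you find yourself doing this more than once, the whole sequence can be collected into one script.  This is just a minimal sketch of the steps above, assuming the datamover in slot 4 (server_4, since the server name matched the slot number here) is the one being reconfigured; adjust the slot number for your system, the output file path is arbitrary, and you should still watch getreason between steps rather than running it blindly.

 #!/bin/bash
 # Sketch only: halt, reconfigure, and restart the datamover in slot 4.
 SLOT=4                                # assumption -- change to your slot number

 # Silence callhome and email alerts so the halt doesn't trigger false alarms
 /nas/sbin/nas_connecthome -service stop
 /nas/bin/nas_emailuser -modify -enabled no

 # Save the current datamover configuration for reference
 nas_server -i -a > /home/nasadmin/dm_config_before.txt

 # Halt the datamover, then confirm it is down before touching hardware
 server_cpu server_${SLOT} -halt now
 /nasmcd/sbin/getreason

 # ---- physically remove or replace the module at this point ----

 # Power the slot back on, wait for 'Hardware Misconfigured', then configure it
 /nasmcd/sbin/t2reset pwron -s ${SLOT}
 /nasmcd/sbin/getreason
 /nasmcd/sbin/setup_slot -i ${SLOT}
 server_sysconfig server_${SLOT} -p

 # Re-enable callhome and email
 /nas/sbin/nas_connecthome -service start -clear
 /nas/bin/nas_emailuser -modify -enabled yes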


VNX Root Replication Checkpoints

Where did all my savvol space go?  I noticed last week that some of my Celerra replication jobs had stalled and were not sending any new data to the replication partner.  I then noticed that the storage pool designated for checkpoints was at 100%.  Not good. Based on the number of file system checkpoints that we perform, it didn’t seem possible that the pool could be filled up already.  I opened a case with EMC to help out.

I learned something new after opening this call: every time you create a replication job, a new checkpoint is created for that job and stored in the savvol.  You can view these in Unisphere by changing the “select a type” filter to “all checkpoints including replication”.  You’ll notice checkpoints named something like root_rep_ckpt_483_72715_1 in the list; they all begin with root_rep.  After working on the case with the EMC engineer for a little while, we determined that one of my replication jobs had a root_rep_ckpt that was 1.5TB in size.

Removing that checkpoint would immediately solve the problem, but there was one major drawback: deleting the root_rep checkpoint first requires deleting the replication job entirely, which means a complete re-do from scratch.  The entire filesystem would have to be copied over our WAN link and resynchronized with the replication partner Celerra.  That didn’t make me happy, but there was no choice.  At least the problem was solved.

Here are a couple of tips for you if you’re experiencing a similar issue.

You can verify the storage pool the root_rep checkpoints are using by running an info against the checkpoint from the command line and looking for the ‘pool=’ field.

nas_fs -list | grep root_rep  (the first column in the output is the ID# for the next command)

nas_fs -info id=<id from above>

 You can also see the replication checkpoints and IDs for a particular filesystem with this command:

fs_ckpt <production file system> -list -all

You can check the size of a root_rep checkpoint from the command line directly with this command:

/nas/sbin/rootnas_fs -size root_rep_ckpt_883_72715_1
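If you have a lot of replication jobs, you don’t have to check them one at a time.  Here’s a rough sketch of a loop that prints the name, pool, and size for every root_rep checkpoint; it assumes the ID is the first column of the nas_fs -list output (as above) and that rootnas_fs accepts id= the same way nas_fs does, so verify that on your system first.

 for id in $(nas_fs -list | grep root_rep | awk '{print $1}'); do
   echo "=== fs id=$id ==="
   nas_fs -info id=$id | grep -E '^name|pool'   # name and pool= fields
   /nas/sbin/rootnas_fs -size id=$id            # size of the checkpoint
 done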

 

Use the CLI to determine replication job throughput

This handy command will allow you to determine exactly how much bandwidth you are using for your Celerra replication jobs.

Run this command first, it will generate a file with the stats for all of your replication jobs:

nas_replicate -info -all > /tmp/rep.out

Run this command next:

grep "Current Transfer Rate" /tmp/rep.out |grep -v "= 0"

The output looks like this:

Current Transfer Rate (KB/s)   = 196
 Current Transfer Rate (KB/s)   = 104
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 90
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 88
 Current Transfer Rate (KB/s)   = 94
 Current Transfer Rate (KB/s)   = 89
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 108
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 118
 Current Transfer Rate (KB/s)   = 119
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 27
 Current Transfer Rate (KB/s)   = 136
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 242
 Current Transfer Rate (KB/s)   = 77
 Current Transfer Rate (KB/s)   = 218
 Current Transfer Rate (KB/s)   = 285
 Current Transfer Rate (KB/s)   = 287
 Current Transfer Rate (KB/s)   = 184
 Current Transfer Rate (KB/s)   = 224
 Current Transfer Rate (KB/s)   = 82
 Current Transfer Rate (KB/s)   = 324
 Current Transfer Rate (KB/s)   = 210
 Current Transfer Rate (KB/s)   = 328
 Current Transfer Rate (KB/s)   = 156
 Current Transfer Rate (KB/s)   = 156

Each line represents the throughput for one of your replication jobs.  Adding all of those numbers up will give you the total bandwidth you are consuming.  In this case, I’m using about 4.56MB/s (roughly 37Mb/s) on my 100Mb link.
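Rather than adding those numbers up by hand, you can let awk total them for you.  A quick sketch against the same rep.out file:

 grep "Current Transfer Rate" /tmp/rep.out | grep -v "= 0" | \
   awk -F= '{ total += $2 } END { printf "Total: %d KB/s (%.2f MB/s)\n", total, total/1024 }'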

This same technique can of course be applied to any part of the output file.  If you want to know the estimated completion date of each of your replication jobs, you’d run this command against the rep.out file:

grep "Estimated Completion Time" /tmp/rep.out

That will give you a list of dates, like this:

Estimated Completion Time      = Fri Jul 15 02:12:53 EDT 2011
 Estimated Completion Time      = Fri Jul 15 08:06:33 EDT 2011
 Estimated Completion Time      = Mon Jul 18 18:35:37 EDT 2011
 Estimated Completion Time      = Wed Jul 13 15:24:03 EDT 2011
 Estimated Completion Time      = Sun Jul 24 05:35:35 EDT 2011
 Estimated Completion Time      = Tue Jul 19 16:35:25 EDT 2011
 Estimated Completion Time      = Fri Jul 15 12:10:25 EDT 2011
 Estimated Completion Time      = Sun Jul 17 16:47:31 EDT 2011
 Estimated Completion Time      = Tue Aug 30 00:30:54 EDT 2011
 Estimated Completion Time      = Sun Jul 31 03:23:08 EDT 2011
 Estimated Completion Time      = Thu Jul 14 08:12:25 EDT 2011
 Estimated Completion Time      = Thu Jul 14 20:01:55 EDT 2011
 Estimated Completion Time      = Sun Jul 31 05:19:26 EDT 2011
 Estimated Completion Time      = Thu Jul 14 17:12:41 EDT 2011

Very useful stuff. 🙂
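The same trick works for matching each date to its job.  Assuming each job’s block in rep.out includes a “Name” line at the start of the block (which it did on my system), this pulls the names out alongside the completion times:

 grep -E "^Name|Estimated Completion Time" /tmp/rep.out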

 

Use the CLI to quickly determine the size of your Celerra checkpoint filesystems

Need to quickly figure out which checkpoint filesystems are taking up all of your precious savvol space?  Run the CLI command below.  Filling up the savvol storage pool can cause all kinds of problems besides failing checkpoints.  It can also cause filesystem replication jobs to fail.

To view it on the screen:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size

To save it in a file:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size > checkpoints.txt

vi checkpoints.txt   (to view the file)

Here’s a sample of the output:

UserFilesystem_01
ckpt_ckpt_UserFilesystem_01_monthly_001 :   836 : 220000
ckpt_ckpt_UserFilesystem_01_monthly_002 :   649 : 220000

UserFilesystem_02
ckpt_ckpt_UserFilesystem_02_monthly_001 :   836 : 80000
ckpt_ckpt_UserFilesystem_02_monthly_002 :   649 : 80000

The numbers are in MB.
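If the list is long, sorting the checkpoint lines by size makes the biggest offenders obvious.  A small sketch against the checkpoints.txt file saved above (the size in MB is the third colon-separated field on each checkpoint line):

 grep " : " checkpoints.txt | sort -t: -k3 -rn | head -20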

 

Unable to provision Celerra storage?

This one really made no sense to me at first.  I was attempting to create a new file storage pool on our NS960 Celerra.  Upon launching the disk provisioning wizard, it would pause for a minute and then give the following error:

“ERROR: Unable to continue provisioning.  click for details”

It was strange because I have plenty of disks that could be used for provisioning.  Why wasn’t it working?

Here is the detailed error message:

Message Details:

Message: Unable to continue provisioning

Full Description:  Not able to fetch disk information. n Command Failed, error code: 1, output: errormessage:string=”Timeout (60 seconds) waiting for state SS_DISKS_LOADED”

Recommended Action: No recommended action is available. Go to http://powerlink.EMC.com for more information.

Event Code: 15301214354

As a workaround and to test the issue, I used 8 spare SATA drives and created a new raid group, with one large LUN using all of the space.  I added it to the Celerra storage group and rescanned the SAN for storage.

The following error popped up:

Brief Description:  Invalid credentials for the storage array APM01034413494. 
Full Description:  The FLARE version on this storage array requires secure communication. Saved credentials are found, but authentication failed. The Control Station does not have valid credentials. 
Recommended Action:  Set the credentials by running the “nas_storage -modify <backend_name> -security” command. 
Message ID:  13422231564 

Well, how about that.  An error message that actually gives you the command to resolve the problem. 🙂  As it turns out, one of our other SAN administrators had changed the password for the system account.  Running nas_storage -modify id=<xx> -security resolved the problem.

(Note: You can get the ID number by running nas_storage -list)
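For reference, the fix boils down to two commands (the id will differ on your array, and you’ll be prompted for the backend system account credentials, so have the updated password handy):

 nas_storage -list                      # note the id of the affected array
 nas_storage -modify id=<xx> -security  # re-enter the storage system credentials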

DM Interconnect failure with Celerra Replicator

We installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada.  The setup went fine with no errors, and replication of the VDM completed successfully every day up until yesterday, when I noticed that the status on the main replications screen said “network communication has been lost”.  I am able to use the server_ping command to ping the data mover/replication interface from the UK to Canada, so network connectivity appears to be ok.

I was attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create the replication jobs are stuck at “Establishing communication with secondary side for Create task” with a status of “Incomplete”.

I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”

So, why is the DM Interconnect failing?  It was working fine for several weeks!

My next stop was the server log (server_log server_2), where I spotted another issue: hundreds of entries that looked just like these:

2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16

Bad Authentication? Hmmm.  There is something amiss with the trusted relationship between the VNX and the NS960.  I did a quick read of EMC’s VNX replication manual (yep, rtfm!) and found the command to update the interconnect, nas_cel.

First, run nas_cel -list to view all of your interconnects, noting the ID number of the one you’re having difficulty with.

[nasadmin@<name> ~]$ nas_cel -list
id    name        owner  mount_dev  channel  net_path     CMU
0     <name_1>    0                          10.x.x.x     APM007039002350000
2     <name_2>    0                          10.x.x.x     APM001052420000000
4     <name_3>    0                          10.x.x.x     APM009015016510000
5     <name_4>    0                          10.x.x.x     APM000827205690000

In this case, I was having trouble with <name_3>, which is ID 4.

Run this command next:  nas_cel -update id=4.   After that command completed, my interconnect immediately started working and I was able to create new replication jobs.
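To recap, the whole repair was only a couple of commands.  A short sketch (names and ids will differ on your interconnects; after the update, re-run the interconnect validation in Unisphere or retry the replication job to confirm):

 nas_cel -list                # find the id of the Celerra/VNX relationship that is failing
 nas_cel -update id=4         # refresh the trusted relationship for that id
 nas_cel -interconnect -list  # confirm the data mover interconnects are visible again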