We just installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada. The setup completed with no errors, and the VDM had replicated successfully every day until yesterday, when I noticed that the status on the main replications screen said "network communication has been lost". I can still use the server_ping command to ping the data mover/replication interface from the UK to Canada, so basic network connectivity appears to be fine.
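For reference, the connectivity check looked roughly like this. server_ping runs the ping from the Data Mover itself rather than from the Control Station, so it exercises the replication interface's own network path; the interface name and remote address below are placeholders, and you should confirm the exact flag syntax against the command reference for your DART version.

```shell
# Ping the remote (Canada) replication interface from the UK Data Mover.
# <rep_if> and 10.x.x.x are placeholders for the local interface name
# and the remote interface address.
server_ping server_2 -interface <rep_if> 10.x.x.x
```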
I was also attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create them are stuck at "Establishing communication with secondary side for Create task" with a status of "Incomplete".
I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”
So, why is the DM Interconnect failing? It was working fine for several weeks!
My next stop was the server log (server_log server_2), where I spotted another issue: hundreds of entries that looked just like these:
2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
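If you want to gauge how widespread the failures are, a quick grep over the log does the job. The snippet below uses a two-line excerpt as a stand-in for the live log; on the Control Station you would pipe server_log server_2 into grep instead.

```shell
# Two sample lines standing in for live server_log output; on the
# Control Station you would pipe "server_log server_2" into grep.
log_excerpt='2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16'

# Count the DIC authentication failures in the excerpt.
printf '%s\n' "$log_excerpt" | grep -c 'Bad authentication'   # prints 1 here
```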
Bad authentication? Hmmm. Something is amiss with the trusted relationship between the VNX and the NS-960. A quick read of EMC's VNX replication manual (yep, RTFM!) turned up the command to update the interconnect: nas_cel.
First, run nas_cel -list to view all of your interconnects, and note the ID of the one you're having trouble with.
[nasadmin@<name> ~]$ nas_cel -list
id name owner mount_dev channel net_path CMU
0 <name_1> 0 10.x.x.x APM007039002350000
2 <name_2> 0 10.x.x.x APM001052420000000
4 <name_3> 0 10.x.x.x APM009015016510000
5 <name_4> 0 10.x.x.x APM000827205690000
In this case, I was having trouble with <name_3>, which is ID 4.
Next, run nas_cel -update id=4. Once that command completed, my interconnect immediately started working again and I was able to create new replication jobs.
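Put together, the fix came down to the two commands below. nas_cel -update comes straight from the replication manual; listing the interconnects again afterwards (and re-running the interconnect validation, which I did from the GUI) confirms the trust is re-established. The id value is taken from the nas_cel -list output above.

```shell
# Refresh the trusted relationship for the broken interconnect
# (id 4 = <name_3> in the listing above).
nas_cel -update id=4

# Confirm it's back: list the interconnects again, then re-run the
# interconnect validation (I used the Unisphere GUI for that step).
nas_cel -list
```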