
Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500s and was receiving several errors when selecting our DR NS-960 as the 'destination Celerra network server'.

It was displaying the following errors at the top of the window:

– “Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>”. The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn’t exceed 10 minutes, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details for this message simply say to search Powerlink, which is not a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, so that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn’t think there would be any further issues.  Logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here’s why:

The Clariion global user account is tied to a local user account on the control station in /etc/passwd.  When nasadmin was recreated in the domain, it attempted to create a matching nasadmin account on the control station as well.  Because the account already existed as a local account on the control station, it created a local account named ‘nasadmin1‘ instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so logging in with the global nasadmin account no longer tied you to the local nasadmin account on the control station.

Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solved the problem.  Because the issue was limited to the nasadmin account in this case, I could also have solved it by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.
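As a quick sanity check for this failure mode, you can look for local accounts on the control station that differ only by a trailing digit. This awk one-liner is my own illustrative sketch, not EMC-provided tooling, and the sample output assumes the nasadmin/nasadmin1 pair described above:

```shell
# Print any /etc/passwd login names that differ only by a trailing digit
# (e.g. the out-of-sync nasadmin / nasadmin1 pair described above).
awk -F: '{ base = $1; sub(/[0-9]+$/, "", base); count[base]++; names[base] = names[base] " " $1 }
     END { for (b in count) if (count[b] > 1) print "duplicate account pair:" names[b] }' /etc/passwd
```

If this prints a pair, the global and local accounts have drifted out of sync and should be deleted and recreated as described above.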


Long Running FAST VP relocation job

I’ve noticed that our auto-tier data relocation job, which runs every evening, consistently shows 2+ days for the estimated time of completion. We have it set to run only 8 hours per day, so with our current configuration it’s likely the job will never reach a completed state. Based on that observation I started investigating what options I had to reduce the amount of time the relocation job runs.

Running this command will tell you the estimated time remaining to complete the relocation job’s data migrations and how much data is queued up to move:

naviseccli -h <clariion_ip> autotiering -info -opStatus

Auto-Tiering State: Enabled
Relocation Rate: Medium
Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA
Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 2854.11
Data to Move Down (GBs): 1909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 2 days, 9 hours, 12 minutes
Schedule Duration Remaining: None
 

I’ll review some possibilities based on research I’ve done in the past few days.  I’m still in the evaluation process and have not made any changes yet; I’ll update this blog post once I’ve implemented a change myself.  If you are having issues with your data relocation job not finishing, I would recommend opening an SR with EMC support for a detailed analysis before implementing any of these options.

1. Reduce the number of LUNs that use auto-tiering by disabling it on a LUN-by-LUN basis.

I would recommend monitoring which LUNs have the highest rate of change when the relocation job runs and then evaluate if any can be removed from auto-tiering altogether.  The goal of this would be to reduce the amount of data that needs to be moved.  The one caveat with this process is that when a LUN has auto-tiering disabled, the tier distribution of the LUN will remain exactly the same from the moment it is disabled.  If you disable it on a LUN that is using a large amount of EFD it will not change unless you force it to a different tier or re-enable auto-tiering later.

This would be an effective way to reduce the amount of data being relocated, but the process of determining which LUNs should have auto-tiering disabled is subjective and would require careful analysis.
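If you do go this route, the per-LUN change can be scripted from the CLI. A hedged sketch, assuming your VNX OE release supports the -tieringPolicy option of lun -modify (verify against the CLI reference for your release; the LUN number here is hypothetical):

```shell
# Set LUN 42 to "No Movement" so the relocation job skips it; its current
# tier distribution is frozen until the policy is changed back.
# -o suppresses the confirmation prompt.
naviseccli -h <clariion_ip> lun -modify -l 42 -tieringPolicy noMovement -o
```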

2. Reset all the counters on the relocation job.

Any incorrectly labeled “hot” data will be removed from the counters and all LUNs will be re-evaluated for data movement.  One of the potential problems with auto-tiering involves servers with IO-intensive batch jobs that run infrequently.  That data would be incorrectly labeled as “hot” and scheduled to move up even though the server is not normally busy.  This behavior is detailed in EMC knowledgebase article emc268245.

To reset the counters, use the command to stop and start autotiering:

naviseccli -h <clariion_ip> autotiering -relocation -<stop | start>

If you need to temporarily stop relocation and do not want to reset the counters, use the pause/resume command instead:

naviseccli -h <clariion_ip> autotiering -relocation -<pause | resume>

I wanted to point out that changing a specific LUN from “Auto-Tier” to “No Movement” also does not reset the counters; the LUN will maintain its tiering schedule.  It is the same as pausing auto-tiering just for that LUN.

3. Increase free space available on the storage pools.

If your storage pools are nearly 100% utilized there may not be enough space to effectively migrate the data between the tiers.  Add additional disks to the pool, or migrate LUNs to other RAID groups or storage pools.

4. Increase the relocation rate.

This could of course have a dramatic effect on IO performance, so the rate should only be increased during periods of measured low IO activity.

Run this command to change the data relocation rate:

naviseccli -h <clariion_ip> autotiering -setRate -rate <high | medium | low>

5. Use a batch or shell script to pause and restart the job with the goal of running it more frequently during periods of low IO activity.

There is no way to set the relocation schedule to run at different times on different days of the week, so a script is necessary to accomplish that.  I currently run the job only in the middle of the night during off-peak (non-business) hours, but I could run it all weekend as well.  I have done that manually in the past.

You would need an external Windows or Unix server to schedule the scripts.  Set the relocation schedule itself to run 24×7, then use the pause/resume commands to stop the job during the times you don’t want it to run.  To have it run overnight and on weekends, set up two separate scripts (one to pause and one to resume), then schedule each with Task Scheduler or cron to run throughout the week.

The cron schedule below pauses the job at 6AM and resumes it at 10PM on weekdays, which allows it to run from 10PM to 6AM on weeknights and continuously over the weekend (it is never paused between Friday 10PM and Monday 6AM).

pause.sh:       naviseccli -h <clariion_ip> autotiering -relocation -pause

resume.sh:   naviseccli -h <clariion_ip> autotiering -relocation -resume

# minute hour day-of-month month day-of-week (1=Monday ... 5=Friday)
0 6  * * 1-5   /scripts/pause.sh       # 6AM Monday-Friday - pause
0 22 * * 1-5   /scripts/resume.sh      # 10PM Monday-Friday - resume
# No weekend pause entry - the job runs from Friday 10PM until Monday 6AM
 

Reporting on the state of VNX auto-tiering

 

To go along with my previous post (reporting on LUN tier distribution), I also include information on the same intranet page about the current state of the auto-tiering job.  We run auto-tiering from 10PM to 6AM to avoid moving data during business hours or during our normal evening backup window.

Sometimes the auto-tiering job gets very backed up and would theoretically never finish in the time slot we have for data movement.  I like to keep tabs on the amount of data that needs to move up or down, and the amount of time the array estimates until completion.  If needed, I will sometimes modify the schedule to run 24 hours a day over the weekend and change it back early on Monday morning.  Unfortunately, EMC did not design the auto-tiering scheduler to allow different time windows on different days of the week, so it’s a manual process.

This is a relatively simple, one-line CLI command, but it provides very useful info and it’s convenient to add to a daily report to see at a glance.

I run this script at 6AM every day, immediately after the data movement window ends:

naviseccli -h clariion1_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion1_hostname.autotier.txt

naviseccli -h clariion2_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion2_hostname.autotier.txt

naviseccli -h clariion3_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion3_hostname.autotier.txt

naviseccli -h clariion4_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion4_hostname.autotier.txt

....

The output for each individual Clariion looks like this:

Auto-Tiering State: Enabled
Relocation Rate: Medium

Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA

Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 1854.11
Data to Move Down (GBs): 909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 9 hours, 12 minutes
Schedule Duration Remaining: None

Storage Pool Name: Clariion1_SPB
Storage Pool ID: 1
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 1757.11
Data to Move Down (GBs): 878.05
Data Movement Completed (GBs): 1726.00
Estimated Time to Complete: 11 hours, 42 minutes
Schedule Duration Remaining: None
 
 

Use the CLI to determine replication job throughput

This handy command will allow you to determine exactly how much bandwidth you are using for your Celerra replication jobs.

Run this command first, it will generate a file with the stats for all of your replication jobs:

nas_replicate -info -all > /tmp/rep.out

Run this command next:

grep "Current Transfer Rate" /tmp/rep.out |grep -v "= 0"

The output looks like this:

Current Transfer Rate (KB/s)   = 196
 Current Transfer Rate (KB/s)   = 104
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 90
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 88
 Current Transfer Rate (KB/s)   = 94
 Current Transfer Rate (KB/s)   = 89
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 108
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 118
 Current Transfer Rate (KB/s)   = 119
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 27
 Current Transfer Rate (KB/s)   = 136
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 242
 Current Transfer Rate (KB/s)   = 77
 Current Transfer Rate (KB/s)   = 218
 Current Transfer Rate (KB/s)   = 285
 Current Transfer Rate (KB/s)   = 287
 Current Transfer Rate (KB/s)   = 184
 Current Transfer Rate (KB/s)   = 224
 Current Transfer Rate (KB/s)   = 82
 Current Transfer Rate (KB/s)   = 324
 Current Transfer Rate (KB/s)   = 210
 Current Transfer Rate (KB/s)   = 328
 Current Transfer Rate (KB/s)   = 156
 Current Transfer Rate (KB/s)   = 156

Each line represents the throughput of one of your replication jobs.  Adding all of those numbers up gives you the total bandwidth you are consuming.  In this case, I’m using about 4.56MB/s on my 100MB link.
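Rather than adding the numbers up by hand, awk can total the column for you. A small sketch, assuming the same /tmp/rep.out file generated above:

```shell
# Total all non-zero transfer rates and convert the sum from KB/s to MB/s
grep "Current Transfer Rate" /tmp/rep.out | grep -v "= 0" \
    | awk -F= '{ sum += $2 } END { printf "Total: %d KB/s (%.2f MB/s)\n", sum, sum/1024 }'
```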

This same technique can of course be applied to any part of the output file.  If you want to know the estimated completion date of each of your replication jobs, you’d run this command against the rep.out file:

grep "Estimated Completion Time" /tmp/rep.out

That will give you a list of dates, like this:

Estimated Completion Time      = Fri Jul 15 02:12:53 EDT 2011
 Estimated Completion Time      = Fri Jul 15 08:06:33 EDT 2011
 Estimated Completion Time      = Mon Jul 18 18:35:37 EDT 2011
 Estimated Completion Time      = Wed Jul 13 15:24:03 EDT 2011
 Estimated Completion Time      = Sun Jul 24 05:35:35 EDT 2011
 Estimated Completion Time      = Tue Jul 19 16:35:25 EDT 2011
 Estimated Completion Time      = Fri Jul 15 12:10:25 EDT 2011
 Estimated Completion Time      = Sun Jul 17 16:47:31 EDT 2011
 Estimated Completion Time      = Tue Aug 30 00:30:54 EDT 2011
 Estimated Completion Time      = Sun Jul 31 03:23:08 EDT 2011
 Estimated Completion Time      = Thu Jul 14 08:12:25 EDT 2011
 Estimated Completion Time      = Thu Jul 14 20:01:55 EDT 2011
 Estimated Completion Time      = Sun Jul 31 05:19:26 EDT 2011
 Estimated Completion Time      = Thu Jul 14 17:12:41 EDT 2011
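To quickly see which job will finish last, the same list can be sorted chronologically. A sketch assuming GNU sort (for the -M month-name sort) and the /tmp/rep.out file from above; note it sorts by month, day, and time only, so it assumes all dates fall in the same year:

```shell
# Sort completion times chronologically: field 6 is the month name,
# field 7 the day, field 8 the HH:MM:SS timestamp
grep "Estimated Completion Time" /tmp/rep.out | sort -k6,6M -k7,7n -k8,8
```

The last line of the sorted output is the job that the array expects to finish last.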

Very useful stuff. 🙂