Long Running FAST VP relocation job

I’ve noticed that our auto-tier data relocation job that runs every evening consistently shows 2+days for the estimated time of completion. We have it set to run only 8 hours per day, so with our current configuration it’s likely the job will never reach a completed state. Based on that observation I started investigating what options I had to try and reduce the amount of time that the relocation jobs runs.

Running this command will tell you the current amount of time estimated to complete the relocation job data migrations and how much data is queued up to move:

Naviseccli –h <clarion_ip> autotiering –info –opStatus

Auto-Tiering State: Enabled
Relocation Rate: Medium
Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA
Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 2854.11
Data to Move Down (GBs): 1909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 2 days, 9 hours, 12 minutes
Schedule Duration Remaining: None
 

I’ll review some possibilities based on research I’ve done in the past few days.  I’m still in the evaluation process and have not made any changes yet, I’ll update this blog post once I’ve implemented a change myself.  If you are having issues with your data relocation job not finishing I would recommend opening an SR with EMC support for a detailed analysis before implementing any of these options.

1. Reduce the number of LUNs that use auto-tiering by disabling it on a LUN-by-LUN basis.

I would recommend monitoring which LUNs have the highest rate of change when the relocation job runs and then evaluate if any can be removed from auto-tiering altogether.  The goal of this would be to reduce the amount of data that needs to be moved.  The one caveat with this process is that when a LUN has auto-tiering disabled, the tier distribution of the LUN will remain exactly the same from the moment it is disabled.  If you disable it on a LUN that is using a large amount of EFD it will not change unless you force it to a different tier or re-enable auto-tiering later.

This would be an effective way to reduce the amount of data being relocated, but the process of determining which LUNs should have auto-tiering disabled is subjective and would require careful analysis.

2. Reset all the counters on the relocation job.

Any incorrectly labeled “hot” data will be removed from the counters and all LUNs would be re-evaluated for data movement.  One of the potential problems with auto-tiering is with servers that have IO intensive batch jobs that run infrequently.  That data would be incorrectly labeled as “hot” and scheduled to move up even though the server is not normally busy.  This information is detailed in emc268245.

To reset the counters, use the command to stop and start autotiering:

Naviseccli –h <clarion_ip> autotiering –relocation -<stop | start>

If you need to temporarily stop replication and do not want to reset the counters, use the pause/resume command instead:

Naviseccli –h <clarion_ip> autotiering –relocation -<pause | resume>

I wanted to point out that changing a specific LUN from “auto-tier” to “No Movement” also does not reset the counters, the LUN will maintain it’s tiering schedule. It is the same as pausing auto-tiering just for that LUN.

3. Increase free space available on the storage pools.

If your storage pools are nearly 100% utilized there may not be enough space to effectively migrate the data between the tiers.  Add additional disks to the pool, or migrate LUNs to other RAID groups or storage pools.

4. Increase the relocation rate.

This of course could have dramatic effects on IO performance if it’s increased and it should only be changed during periods of measured low IO activity.

Run this command to change the data relocation rate:

Naviseccli –h <clarion_ip> autotiering –setRate –rate <high | medium | low>

5. Use a batch or shell script to pause and restart the job with the goal of running it more frequently during periods of low IO activity.

There is no way to set the relocation schedule to run at different times on different days of the week, a script is necessary to accomplish that.  I currently run the job only in the middle of the night during off peak (non-business) hours, but I would be able to run it all weekend as well.  I have done that manually in the past.

You would need to use an external windows or unix server to schedule the scripts.  The relocation schedule should be set to run 24×7, then add the pause/resume command to have the job pause during the times you don’t want it to run.  To have it run on weekends and overnight, set up two separate scripts (one for pause and one for resume), then schedule each with task scheduler or cron to run throughout the week.

The cron schedule below would allow it to run from 10PM to 6AM on weeknights and from 10PM to 6AM on Monday over the weekend.

pause.sh:       Naviseccli –h <clarion_ip> autotiering –relocation –pause

resume.sh:   Naviseccli –h <clarion_ip> autotiering –relocation -resume

0 6 * * *  /scripts/pause.sh        @6AM on Monday – pause
0 10 * * * /scripts/resume.sh    @10PM on Monday – resume
0 6 * * *  /scripts/pause.sh        @6AM on Tuesay – pause
0 10 * * * /scripts/resume.sh    @10PM on Tuesday – resume
0 6 * * *  /scripts/pause.sh        @6AM on Wednesday – pause
0 10 * * * /scripts/resume.sh    @10PM on Wednesday – resume
0 6 * * *  /scripts/pause.sh        @6AM on Thursday – pause
0 10 * * * /scripts/resume.sh    @10PM on Thursday – resume
0 6 * * *  /scripts/pause.sh        @6AM on Friday – pause
0 10 * * * /scripts/resume.sh    @10PM on Friday – resume
<Do not pause again until Monday morning>
 
Advertisements

Undocumented Celerra / VNX File commands

vnx1

The .server_config command is undocumented from EMC, I assume they don’t want customers messing with it. Use these commands at your own risk. 🙂

Below is a list of some of those undocumented commands, most are meant for viewing performance stats. I’ve had EMC support use the fcp command during a support call in the past.   When using the command for fcp stats,  I believe you need to run the ‘reset’ command first as it enables the collection of statistics.

There are likely other parameters that can be used with .server_config but I haven’t discovered them yet.

TCP Stats:

To view TCP info:
.server_config server_x -v “printstats tcpstat”
.server_config server_x -v “printstats tcpstat full”
.server_config server_x -v “printstats tcpstat reset”

Sample Output (truncated):
TCP stats :
connections initiated 8698
connections accepted 1039308
connections established 1047987
connections dropped 524
embryonic connections dropped 3629
conn. closed (includes drops) 1051582
segs where we tried to get rtt 8759756
times we succeeded 11650825
delayed acks sent 537525
conn. dropped in rxmt timeout 0
retransmit timeouts 823

SCSI Stats:

To view SCSI IO info:
.server_config server_x -v “printstats scsi”
.server_config server_x -v “printstats scsi reset”

Sample Output:
This output needs to be in a fixed width font to view properly.  I can’t seem to adjust the font, so I’ve attempted to add spaces to align it.
Ctlr: IO-pending Max-IO IO-total Idle(ms) Busy(ms) Busy(%)
0:      0         53    44925729       122348758     19159954   13%
1:      0                                           1 1 141508682       0          0%
2:      0                                           1 1 141508682       0          0%
3:      0                                           1 1 141508682       0          0%
4:      0                                           1 1 141508682       0          0%

File Stats:

.server_config server_x -v “printstats filewrite”
.server_config server_x -v “printstats filewrite full”
.server_config server_x -v “printstats filewrite reset”

Sample output (Full Output):
13108 writes of 1 blocks in 52105250 usec, ave 3975 usec
26 writes of 2 blocks in 256359 usec, ave 9859 usec
6 writes of 3 blocks in 18954 usec, ave 3159 usec
2 writes of 4 blocks in 2800 usec, ave 1400 usec
4 writes of 13 blocks in 6284 usec, ave 1571 usec
4 writes of 18 blocks in 7839 usec, ave 1959 usec
total 13310 blocks in 52397489 usec, ave 3936 usec

FCP Stats:

To view FCP stats, useful for checking SP balance:
.server_config server_x -v “printstats fcp”
.server_config server_x -v “printstats fcp full”
.server_config server_x -v “printstats fcp reset”

Sample Output (Truncated):
This output needs to be in a fixed width font to view properly.  I can’t seem to adjust the font, so I’ve attempted to add spaces to align it.
Total I/O Cmds: +0%——25%——-50%——-75%—–100%+ Total 0
FCP HBA 0 |                                                                                            | 0%  0
FCP HBA 1 |                                                                                            | 0%  0
FCP HBA 2 |                                                                                            | 0%  0
FCP HBA 3 |                                                                                            | 0%  0
# Read Cmds: +0%——25%——-50%——-75%—–100%+ Total 0
FCP HBA 0 |                                                                                            | 0% 0
FCP HBA 1 |                                                                                            | 0% 0
FCP HBA 2 |                                                                                            | 0% 0
FCP HBA 3 |  XXXXXXXXXXX                                                          | 25% 0

Usage:

‘fcp’ options are:       bind …, flags, locate, nsshow, portreset=n, rediscover=n
rescan, reset, show, status=n, topology, version

‘fcp bind’ options are:  clear=n, read, rebind, restore=n, show
showbackup=n, write

Description:

Commands for ‘fcp’ operations:
fcp bind <cmd> ……… Further fibre channel binding commands
fcp flags ………….. Show online flags info
fcp locate …………. Show ScsiBus and port info
fcp nsshow …………. Show nameserver info
fcp portreset=n …….. Reset fibre port n
fcp rediscover=n ……. Force fabric discovery process on port n
Bounces the link, but does not reset the port
fcp rescan …………. Force a rescan of all LUNS
fcp reset ………….. Reset all fibre ports
fcp show …………… Show fibre info
fcp status=n ……….. Show link status for port n
fcp status=n clear ….. Clear link status for port n and then Show
fcp topology ……….. Show fabric topology info
fcp version ………… Show firmware, driver and BIOS version

Commands for ‘fcp bind’ operations:
fcp bind clear=n ……. Clear the binding table in slot n
fcp bind read ………. Read the binding table
fcp bind rebind …….. Force the binding thread to run
fcp bind restore=n ….. Restore the binding table in slot n
fcp bind show ………. Show binding table info
fcp bind showbackup=n .. Show Backup binding table info in slot n
fcp bind write ……… Write the binding table

NDMP Stats:

To Check NDMP Status:
.server_config server_x -v “printstats vbb show”

CIFS Stats:

This will output a CIFS report, including all servers, DC’s, IP’s, interfaces, Mac addresses, and more.

.server_config server_x -v “cifs”

Sample Output:

1327007227: SMB: 6: 256 Cifs threads started
1327007227: SMB: 6: Security mode = NT
1327007227: SMB: 6: Max protocol = SMB2
1327007227: SMB: 6: I18N mode = UNICODE
1327007227: SMB: 6: Home Directory Shares DISABLED
1327007227: SMB: 6: Usermapper auto broadcast enabled
1327007227: SMB: 6:
1327007227: SMB: 6: Usermapper[0] = [127.0.0.1] state:active (auto discovered)
1327007227: SMB: 6:
1327007227: SMB: 6: Default WINS servers = 172.168.1.5
1327007227: SMB: 6: Enabled interfaces: (All interfaces are enabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Disabled interfaces: (No interface disabled)
1327007227: SMB: 6:
1327007227: SMB: 6: Unused Interface(s):
1327007227: SMB: 6:  if=172-168-1-84 l=172.168.1.84 b=172.168.1.255 mac=0:60:48:1c:46:96
1327007227: SMB: 6:  if=172-168-1-82 l=172.168.1.82 b=172.168.1.255 mac=0:60:48:1c:10:5d
1327007227: SMB: 6:  if=172-168-1-81 l=172.168.1.81 b=172.168.1.255 mac=0:60:48:1c:46:97
1327007227: SMB: 6:
1327007227: SMB: 6:
1327007227: SMB: 6: DOMAIN DOMAIN_NAME FQDN=DOMAIN_NAME.net SITE=STL-Colo RC=24
1327007227: SMB: 6:  SID=S-1-5-15-7c531fd3-6b6745cb-ff77ddb-ffffffff
1327007227: SMB: 6:  DC=DCAD01(172.168.1.5) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD02(172.168.29.8) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD03(172.168.30.8) ref=2 time=0 ms
1327007227: SMB: 6:  DC=DCAD04(172.168.28.15) ref=2 time=0 ms
1327007227: SMB: 6: >DC=SERVERDCAD01(172.168.1.122) ref=334 time=1 ms (Closest Site)
1327007227: SMB: 6: >DC=SERVERDCAD02(172.168.1.121) ref=273 time=1 ms (Closest Site)
1327007227: SMB: 6:
1327007227: SMB: 6: CIFS Server SERVERFILESEMC[DOMAIN_NAME] RC=603
1327007227: UFS: 7: inc ino blk cache count: nInoAllocs 361: inoBlk 0x0219f2a308
1327007227: SMB: 6:  Full computer name=SERVERFILESEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment=’EMC-SNAS:T6.0.41.3′
1327007227: SMB: 6:  if=172-168-1-161 l=172.168.1.161 b=172.168.1.255 mac=0:60:48:1c:46:9c
1327007227: SMB: 6:   FQDN=SERVERFILESEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Fri Jan  7 19:25:30 2011 GMT
1327007227: SMB: 6:  Password versions: 2, 2
1327007227: SMB: 6:
1327007227: SMB: 6: CIFS Server SERVERBKUPEMC[DOMAIN_NAME] RC=2 (local users supported)
1327007227: SMB: 6:  Full computer name=SERVERbkupEMC.DOMAIN_NAME.net realm=DOMAIN_NAME.NET
1327007227: SMB: 6:  Comment=’EMC-SNAS:T6.0.41.3′
1327007227: SMB: 6:  if=172-168-1-90 l=172.168.1.90 b=172.168.1.255 mac=0:60:48:1c:10:54
1327007227: SMB: 6:   FQDN=SERVERbkupEMC.DOMAIN_NAME.net (Updated to DNS)
1327007227: SMB: 6:  Password change interval: 0 minutes
1327007227: SMB: 6:  Last password change: Thu Sep 30 16:23:50 2010 GMT
1327007227: SMB: 6:  Password versions: 2
1327007227: SMB: 6:
 

Domain Controller Commands:

These commands are useful for troubleshooting a windows domain controller connection issue on the control station.  Use these commands along with checking the normal server log (server_log server_2) to troubleshoot that type of problem.

To view the current domain controllers visible on the data mover:

.server_config server_2 -v “pdc dump”

Sample Output (Truncated):

1327006571: SMB: 6: Dump DC for dom='<domain_name>’ OrdNum=0
1327006571: SMB: 6: Domain=<domain_name> Next trusted domains update in 476 seconds1327006571: SMB: 6:  oldestDC:DomCnt=1,179531 Time=Sat Oct 15 15:32:14 2011
1327006571: SMB: 6:  Trusted domain info from DC='<Windows_DC_Servername>’ (423 seconds ago)
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<Domain_Name>]
   GUID:00000000-0000-0000-0000-000000000000
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-d1d612b1-87382668-9ba5ebc0
1327006571: SMB: 6:    DC=’-‘
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
1327006571: SMB: 6:   Trusted domain: <Domain_Name>
1327006571: SMB: 6:    Flags=0x22 Ix=0 Type=0x1 Attr=0x1000000
1327006571: SMB: 6:    SID=S-1-5-15-76854ac0-4c527104-321d5138
1327006571: SMB: 6:    DC=’\\<Windows_DC_Servername>’
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x0,0
1327006571: SMB: 6:   Trusted domain:<domain_name>.net [<domain_name>]
1327006571: SMB: 6:    Flags=0x20 Ix=0 Type=0x2 Attr=0x0
1327006571: SMB: 6:    SID=S-1-5-15-88d60754-f3ed4f9d-b3f2cbc4
1327006571: SMB: 6:    DC=’-‘
1327006571: SMB: 6:    Status Flags=0x0 DCStatus=0x547,1355
DC=DC0x0067a82c18 <Windows_DC_Servername>[<domain_name>](10.3.0.5) ref=2 time(getdc187)=0 ms LastUpdt=Thu Jan 19 20:45:14 2012
    Pid=1000 Tid=0000 Uid=0000
    Cnx=UNSUCCESSFUL,DC state Unknown
    logon=Unknown 0 SecureChannel(s):
    Capa=0x0 Nego=0x0000000000,L=0 Chal=0x0000000000,L=0,W2kFlags=0x0
    refCount=2 newElectedDC=0x0000000000 forceInvalid=0
    Discovered from: WINS

To enable or disable a domain controller on the data mover:

.server_config server_2 -v “pdc enable=<ip_address>”  Enable a domain controller

.server_config server_2 -v “pdc disable=<ip_address>”  Disable a domain controller

MemInfo:

 .server_config server_2 -v “meminfo”

Sample Output (truncated):

CPU=0
3552907011 calls to malloc, 3540029263 to free, 61954 to realloc
Size     In Use       Free      Total nallocs nfrees
16       3738        870       4608   161720370   161716632
32      18039      17289      35328   1698256206   1698238167
64       6128       3088       9216   559872733   559866605
128       6438      42138      48576   255263288   255256850
256       8682      19510      28192   286944797   286936115
512       1507       2221       3728   357926514   357925007
1024       2947       9813      12760   101064888   101061941
2048       1086        198       1284    5063873    5062787
4096         26        138        164    4854969    4854943
8192        820         11        831   19562870   19562050
16384         23         10         33       5676       5653
32768          6          1          7        101         95
65536         12          0         12         12          0
524288          1          0          1          1          0
Total Used     Total Free    Total Used + Free
all sizes   18797440   23596160   42393600

MemOwners:

.server_config server_2 -v “help memowners”

Usage:
memowners [dump | showmap | set … ]

Description:
memowners [dump] – prints memory owner description table
memowners showmap – prints a memory usage map
memowners memfrag [chunksize=#] – counts free chunks of given size
memowners set priority=# tag=# – changes dump priority for a given tag
memowners set priority=# label=’string’ – changes dump priority for a given label
The priority value can be set to 0 (lowest) to 7 (highest).

Sample Output (truncated):

1408979513: KERNEL: 6: Memory_Owner dump.
nTotal Frames 1703936 Registered = 75,  maxOwners = 128
1408979513: KERNEL: 6:   0 (   0 frames) No owner, Dump priority 6
1408979513: KERNEL: 6:   1 (3386 frames) Free list, Dump priority 0
1408979513: KERNEL: 6:   2 (40244 frames) malloc heap, Dump priority 6
1408979513: KERNEL: 6:   3 (6656 frames) physMemOwner, Dump priority 7
1408979513: KERNEL: 6:   4 (36091 frames) Reserved Mem based on E820, Dump priority 0
1408979513: KERNEL: 6:   5 (96248 frames) Address gap based on E820, Dump priority 0
1408979513: KERNEL: 6:   6 (   0 frames) Rmode isr vectors, Dump priority 7

Auto transferring reports from VNX to an IIS web server via FTP

I previously posted on how to create a script that monitors your Celerra replication jobs.  I have an intranet web page that is updated daily with many other reports (most of which I’ve posted about here), so I thought I’d add this one to the web page as well rather than having to search through my inbox for it every day.

Developing an easy and automated method of getting files from the Celerra to a windows based web server was my challenge.  I figured out an easy way to do this with FTP.  As my internal windows web server is also my internal FTP server, I can place the file directly in the public folder for easy web publishing.  Now that I’ve got the report working and updating on the intranet page every day my next task will be to come up with a more secure method using SSH or SCP, but this works well for now.

The big challenge in creating a bash script using FTP is figuring out how to pass the user id and password.  I tried various methods unsuccessfully and finally settled on using the .netrc file.  Create an empty file named .netrc in your home directory (in my case I put it in /home/nasadmin) with the following syntax:

machine <ftp_server_name> login <ftp_login_id> password <ftp_password>

Once that is created, you need to do a chmod 600 on the .netrc file in order for it to work.  If the permissions are not set to 600 on that file the auto-login to the FTP server will fail.

My next step was to create the script that sends the replication status report to the IIS web server:

#!/bin/bash
cd /home/nasadmin/scripts
ftp <ftp_server_name> <<SCRIPT
put <filename>.csv
quit
SCRIPT
 I always chmod the script with 755 and +X after creating it in vi.  The script always ran fine manually, but I struggled for a while getting it to work properly when run from crontab.  I figured out that you must cd to the correct directory in the script before you call the ftp command, if not you will get “file not found” errors when you run it.  I was always running it manually from within that directory, so I didn’t immediately catch that problem. 🙂

I then added the above script to crontab on the Celerra.  I run it at 6AM every morning with the following entry:

0 6 * * * /scripts/repl_status.sh

For those not familiar with cron, you can add an entry using “crontab -e”, and list your current entries with “crontab -l”.  The first two entries in the line “0 6” represent minutes and the hour of each day, in this case it will run at 6:00AM every day.

I have a download link to the csv file on my web page, and I also have a script on my web server that converts the csv file to HTML output with a perl script called csv2html.pl so the data can be easily viewed without having to download the csv and open it in excel.  You can find csv2html.pl easily with a google search, I’ve blogged about it in previous posts as well.

That’s it!  An easy way to automatically push your reports to another server from the Celerra.  Now that I have the transfer method down, I’ll be adding more daily reports in the near future.  If anyone has experience doing this type of transfer from a Celerra (or Linux server) to a windows server via SSH or SCP, please comment! 🙂