Tag Archives: tiering

Long Running FAST VP relocation job

I’ve noticed that our auto-tier data relocation job, which runs every evening, consistently shows 2+ days for the estimated time of completion. We have it set to run only 8 hours per day, so with our current configuration it’s likely the job will never reach a completed state. Based on that observation, I started investigating what options I had to try to reduce the amount of time the relocation job runs.

Running this command will show you the estimated time to complete the relocation job’s data migrations and how much data is queued up to move:

naviseccli -h <clarion_ip> autotiering -info -opStatus

Auto-Tiering State: Enabled
Relocation Rate: Medium
Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA
Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 2854.11
Data to Move Down (GBs): 1909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 2 days, 9 hours, 12 minutes
Schedule Duration Remaining: None
 

I’ll review some possibilities based on research I’ve done over the past few days.  I’m still in the evaluation process and have not made any changes yet; I’ll update this blog post once I’ve implemented a change myself.  If you are having issues with your data relocation job not finishing, I would recommend opening an SR with EMC support for a detailed analysis before implementing any of these options.

1. Reduce the number of LUNs that use auto-tiering by disabling it on a LUN-by-LUN basis.

I would recommend monitoring which LUNs have the highest rate of change when the relocation job runs and then evaluating whether any can be removed from auto-tiering altogether.  The goal is to reduce the amount of data that needs to be moved.  The one caveat with this process is that once auto-tiering is disabled on a LUN, the LUN’s tier distribution remains exactly as it was at the moment it was disabled.  If you disable it on a LUN that is using a large amount of EFD, that distribution will not change unless you force the LUN to a different tier or re-enable auto-tiering later.

This would be an effective way to reduce the amount of data being relocated, but the process of determining which LUNs should have auto-tiering disabled is subjective and would require careful analysis.
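
For reference, a LUN’s tiering policy can also be changed from the CLI.  I believe the syntax is along these lines (the LUN number is a placeholder, and the exact option names may vary by FLARE/VNX OE release, so verify against the CLI reference for your array before using it):

naviseccli -h <clarion_ip> lun -modify -l <lun_number> -tieringPolicy noMovement

Setting the policy back to its auto-tier value later would re-enable relocation for that LUN.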

2. Reset all the counters on the relocation job.

Any incorrectly labeled “hot” data will be removed from the counters, and all LUNs will be re-evaluated for data movement.  One of the potential problems with auto-tiering involves servers that run IO-intensive batch jobs infrequently: that data can be incorrectly labeled as “hot” and scheduled to move up even though the server is not normally busy.  This behavior is detailed in emc268245.

To reset the counters, use this command to stop and then restart auto-tiering:

naviseccli -h <clarion_ip> autotiering -relocation -<stop | start>

If you need to temporarily stop relocation and do not want to reset the counters, use the pause/resume command instead:

naviseccli -h <clarion_ip> autotiering -relocation -<pause | resume>

I also wanted to point out that changing a specific LUN from “auto-tier” to “No Movement” does not reset the counters either; the LUN maintains its tiering statistics.  It is the same as pausing auto-tiering just for that LUN.

3. Increase free space available on the storage pools.

If your storage pools are nearly 100% utilized, there may not be enough free space to effectively migrate data between the tiers.  Add additional disks to the pool, or migrate LUNs to other RAID groups or storage pools.
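
To see where each pool stands before deciding, the standard pool listing reports consumed and available capacity per pool (the same command with -tiers, shown in a later post, breaks it down per tier):

naviseccli -h <clarion_ip> storagepool -list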

4. Increase the relocation rate.

Increasing the rate could of course have a dramatic effect on IO performance, so it should only be changed during periods of measured low IO activity.

Run this command to change the data relocation rate:

naviseccli -h <clarion_ip> autotiering -setRate -rate <high | medium | low>

5. Use a batch or shell script to pause and restart the job with the goal of running it more frequently during periods of low IO activity.

There is no way to set the relocation schedule to run at different times on different days of the week, so a script is necessary to accomplish that.  I currently run the job only in the middle of the night during off-peak (non-business) hours, but I would be able to run it all weekend as well; I have done that manually in the past.

You would need an external Windows or Unix server to schedule the scripts.  Set the relocation schedule to run 24×7, then use the pause/resume commands to stop the job during the times you don’t want it to run.  To have it run overnight and on weekends, set up two separate scripts (one for pause and one for resume), then schedule each with Task Scheduler or cron to run throughout the week.

The cron schedule below would allow it to run from 10PM to 6AM on weeknights, and continuously from 10PM Friday until 6AM Monday over the weekend.

pause.sh:    naviseccli -h <clarion_ip> autotiering -relocation -pause

resume.sh:   naviseccli -h <clarion_ip> autotiering -relocation -resume

0 6 * * 1   /scripts/pause.sh     # 6AM Monday - pause
0 22 * * 1  /scripts/resume.sh    # 10PM Monday - resume
0 6 * * 2   /scripts/pause.sh     # 6AM Tuesday - pause
0 22 * * 2  /scripts/resume.sh    # 10PM Tuesday - resume
0 6 * * 3   /scripts/pause.sh     # 6AM Wednesday - pause
0 22 * * 3  /scripts/resume.sh    # 10PM Wednesday - resume
0 6 * * 4   /scripts/pause.sh     # 6AM Thursday - pause
0 22 * * 4  /scripts/resume.sh    # 10PM Thursday - resume
0 6 * * 5   /scripts/pause.sh     # 6AM Friday - pause
0 22 * * 5  /scripts/resume.sh    # 10PM Friday - resume
<No pause entry on Saturday or Sunday, so the job runs from 10PM Friday straight through to 6AM Monday>
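
The same schedule can also be written more compactly using cron day-of-week ranges (1 = Monday through 5 = Friday), which is easier to maintain than ten individual entries:

0 6 * * 1-5   /scripts/pause.sh     # pause at 6AM, Monday through Friday
0 22 * * 1-5  /scripts/resume.sh    # resume at 10PM, Monday through Friday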
 

Reporting on the state of VNX auto-tiering

 

To go along with my previous post (reporting on LUN tier distribution), I also include information on the same intranet page about the current state of the auto-tiering job.  We run auto-tiering from 10PM to 6AM to avoid moving data during business hours or during our normal evening backup window.

Sometimes the auto-tiering job gets very backed up and would theoretically never finish in the time slot we have for data movement.  I like to keep tabs on the amount of data that needs to move up or down and on the array’s estimated time to completion.  If needed, I will sometimes modify the schedule to run 24 hours a day over the weekend and change it back early on Monday morning.  Unfortunately, EMC did not design the auto-tiering scheduler to allow different time windows on different days, so it’s a manual process.

This is a relatively simple, one-line CLI command, but it provides very useful info, and it’s convenient to add it to a daily report so it can be seen at a glance.

I run this script at 6AM every day, immediately following the end of the window for data to move:

naviseccli -h clariion1_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion1_hostname.autotier.txt

naviseccli -h clariion2_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion2_hostname.autotier.txt

naviseccli -h clariion3_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion3_hostname.autotier.txt

naviseccli -h clariion4_hostname autotiering -info -state -rate -schedule -opStatus > c:\inetpub\wwwroot\clariion4_hostname.autotier.txt

....

The output for each individual CLARiiON looks like this:
Auto-Tiering State: Enabled
Relocation Rate: Medium

Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA

Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 1854.11
Data to Move Down (GBs): 909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 9 hours, 12 minutes
Schedule Duration Remaining: None

Storage Pool Name: Clariion1_SPB
Storage Pool ID: 1
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 1757.11
Data to Move Down (GBs): 878.05
Data Movement Completed (GBs): 1726.00
Estimated Time to Complete: 11 hours, 42 minutes
Schedule Duration Remaining: None
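
As a side note, if the commands above are saved into a single batch file, Windows Task Scheduler can kick the report off right as the relocation window closes.  Something along these lines should work (the batch file path is just a placeholder for this example):

schtasks /create /tn "AutoTierReport" /tr c:\scripts\autotier_report.bat /sc daily /st 06:00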
 
 

Strategies for implementing Multi-tiered FAST VP Storage Pools

After speaking to our local rep and attending many different classes at the most recent EMC World in Vegas, I came away with some good information and a very logical best practice for implementing multi-tiered FAST VP storage pools.

First and foremost, you have to use Flash.  High-RPM Fibre Channel drives are neither capacity efficient nor performance efficient; the highest-IO data needs to be hosted on Flash drives.  The most effective split of drives in a storage pool is 5% Flash, 20% Fibre Channel, and 75% SATA.

Using this example, if you have an existing SAN with 167 15,000 RPM 600GB Fibre Channel drives, you could replace them with 97 drives in the 5/20/75 blend to get the same capacity with much improved performance:

  • 25 200GB Flash Drives
  • 34 15K 600GB Fibre Channel Drives
  • 38 2TB SATA Drives
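
As a rough raw-capacity sanity check (ignoring RAID and hot spare overhead): 167 x 600GB comes to roughly 100TB, while 25 x 200GB + 34 x 600GB + 38 x 2TB comes to roughly 101TB, so the blended configuration delivers about the same raw capacity with far fewer spindles and DAEs.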

The ideal scenario is to implement FAST Cache along with FAST VP.  FAST Cache continuously ensures that the hottest data is served from Flash drives.  With FAST Cache, up to 80% of your data IO can come from cache (legacy DRAM cache served up only about 20%).

It can be a hard pill to swallow when you see how much the Flash drives cost, but that cost is offset by increased disk utilization and a reduction in the total number of drives and DAEs you need to buy.  With all FC drives, disk utilization is sacrificed to get the needed performance: very little of the capacity is used, and you end up buying tons of disks just to get more spindles in the RAID groups.  Flash drives can achieve much higher utilization, reducing the effective cost.

After implementing this at my company I’ve seen dramatic performance improvements.  It’s an effective strategy that really works in the real world.

In addition to this, I’ve also been implementing storage pools in identically sized pairs.  The first pool is designated only for SP A, the second only for SP B.  When I get a request for data storage, let’s say for 1TB, I will create a 500GB LUN in the first pool on SP A and a 500GB LUN in the second pool on SP B.  When the disks are presented to the host server, the server administrator then stripes the data across the two LUNs, as in the sketch below.  Using this method, I can better balance the load across the storage processors on the back end.
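
On the host side the striping itself is nothing exotic.  On a Linux server, for example, it could be done with an LVM striped volume across the two LUNs, roughly like this (the device names, volume names, and stripe size are placeholders for the example):

pvcreate /dev/sdb /dev/sdc
vgcreate vg_data /dev/sdb /dev/sdc
lvcreate -i 2 -I 64 -l 100%FREE -n lv_data vg_data     # stripe across both LUNs with a 64KB stripe size
mkfs.ext4 /dev/vg_data/lv_data

On a Windows host, a striped dynamic volume across the two disks accomplishes the same thing.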

Tiering reports for EMC’s FAST VP

Note: In a separate blog post, I shared a script to generate a report of the tiering status of all LUNs.

One of the items that EMC did not implement along with FAST VP is the ability to run a canned report on how your LUNs are allocated among the different tiers of storage.  While there is no canned report, it is possible to get this information from the CLI.

The naviseccli -h {SP IP or hostname} lun -list -tiers command fits the bill. It shows how a specific LUN is distributed across the different drive types.  I still need to come up with a script to pull out only the information that I want, but the info is definitely in the command’s output.

Here’s the sample output:

LOGICAL UNIT NUMBER 6
 Name:  LUN 6
 Tier Distribution:
 Flash:  13.83%
 FC:  86.17%
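
Until a proper script exists, even a simple filter gets most of the way there.  Something like this (a rough sketch; the patterns may need tweaking for your output) trims the listing down to the LUN names and their tier percentages:

naviseccli -h clariion1_hostname lun -list -tiers | egrep "Name:|Flash:|FC:|SATA:"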

The storagepool report gives some good info as well.  Here’s an excerpt of what you see with the naviseccli -h {SP IP or hostname} storagepool -list -tiers command:

SPA

Tier Name:  Flash
 Raid Type:  r_5
 User Capacity (GBs):  1096.07
 Consumed Capacity (GBs):  987.06
 Available Capacity (GBs):  109.01
 Percent Subscribed:  90.05%
 Data Targeted for Higher Tier (GBs):  0.00
 Data Targeted for Lower Tier (GBs):  11.00

Tier Name:  FC
 Raid Type:  r_5
 User Capacity (GBs):  28981.77
 Consumed Capacity (GBs):  10592.65
 Available Capacity (GBs):  18389.12
 Percent Subscribed:  36.55%

Tier Name:  SATA
 Raid Type:  r_5
 User Capacity (GBs):  11004.67
 Consumed Capacity (GBs):  260.02
 Available Capacity (GBs):  10744.66
 Percent Subscribed:  2.36%
 Data Targeted for Higher Tier (GBs):  3.00
 Data Targeted for Lower Tier (GBs):  0.00
 Disks (Type):

SPB

Tier Name:  Flash
 Raid Type:  r_5
 User Capacity (GBs):  1096.07
 Consumed Capacity (GBs):  987.06
 Available Capacity (GBs):  109.01
 Percent Subscribed:  90.05%
 Data Targeted for Higher Tier (GBs):  0.00
 Data Targeted for Lower Tier (GBs):  25.00

Tier Name:  FC
 Raid Type:  r_5
 User Capacity (GBs):  28981.77
 Consumed Capacity (GBs):  10013.61
 Available Capacity (GBs):  18968.16
 Percent Subscribed:  34.55%
 Data Targeted for Higher Tier (GBs):  25.00
 Data Targeted for Lower Tier (GBs):  0.00

Tier Name:  SATA
 Raid Type:  r_5
 User Capacity (GBs):  11004.67
 Consumed Capacity (GBs):  341.02
 Available Capacity (GBs):  10663.65
 Percent Subscribed:  3.10%
 Data Targeted for Higher Tier (GBs):  20.00
 Data Targeted for Lower Tier (GBs):  0.00

Good stuff in there.   It’s on my to-do list to run these commands periodically, and then parse the output to filter out only what I want to see.  Once I get that done I’ll post the script here too.

Note: I did create and post a script to generate a report of the tiering status of all LUNs.