On our NS-960 Celerra we have multiple storage pools defined, one of which is dedicated to checkpoint storage. Of the 100 or so file systems we run a checkpoint schedule on, about two-thirds were incorrectly writing their checkpoints to the production file system pool rather than the designated checkpoint pool, and the production pool was starting to fill up. I began researching how to change where checkpoints are stored. Unfortunately, you can't relocate existing checkpoints; you have to start over.
In order to change where checkpoints are stored, you need to stop and delete any running replications (which automatically create root replication checkpoints) and delete all current checkpoints for the specific file system you're working on. Depending on your checkpoint schedule and how often it runs, you may want to pause it temporarily. Having to stop and delete remote replications was painful for me, as the re-sync competes for bandwidth with our Data Domain backups to our DR site. Because of that, I've been working through these one file system at a time.
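The teardown step above can be scripted as a dry run first. This is a sketch of a helper that, given replication session names on stdin, prints (rather than runs) the nas_replicate stop/delete commands so you can review them before pasting them into the control station; the session name rep_fs01 is a hypothetical example, and -mode both tears down both sides of the session.

```shell
# Dry-run generator: print the teardown commands for each replication
# session name read from stdin, one stop and one delete per session.
# Session names here are hypothetical examples.
gen_teardown() {
  while read -r sess; do
    echo "nas_replicate -stop $sess -mode both"
    echo "nas_replicate -delete $sess -mode both"
  done
}

printf 'rep_fs01\n' | gen_teardown
```

Reviewing the generated commands before running them is worth the extra step, since deleting a session commits you to a full re-sync over the WAN.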
Once you've deleted the relevant checkpoints and replications, you can choose where new checkpoints are stored by creating a single checkpoint manually before re-creating the schedule. From the GUI, go to Data Protection | Snapshots | Checkpoints tab and click Create. If no checkpoint exists for the file system, you'll be given a choice of which pool to store it in. Below is what you'll see in the popup window.
Choose Data Mover: [Drop Down List ▼]
Production File System: [Drop Down List ▼]
Writeable Checkpoint: [Checkbox]
Data Movers: server_2
Checkpoint Name: [Fill in the Blank]
Configure Checkpoint Storage:
There are no checkpoints currently on this file system. Please specify how to allocate
storage for this and future checkpoints of this file system.
*Storage Pool [Option Box]
*Meta Volume [Option Box]
Current Storage System: CLARiiON CX4-960 APM00900900999
Storage Pool: [Drop Down List ▼] <—*Select the appropriate Storage Pool Here*
Storage Capacity (MB): [Fill in the Blank]
Auto Extend Configuration:
High Water Mark: [Drop Down List for Percentage ▼]
Maximum Storage Capacity (MB): [Fill in the Blank]
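The same first-checkpoint creation can be done from the command line with fs_ckpt, which accepts a pool= option on -Create. This is a hedged sketch: the file system, checkpoint, and pool names below are hypothetical, and the size is illustrative, so the command is printed for review rather than executed.

```shell
# Build the CLI equivalent of the GUI dialog above and print it for
# review. All three names below are hypothetical placeholders.
FS=ProdFS01              # production file system
CKPT=ckpt_ProdFS01_001   # name for the new checkpoint
POOL=ckpt_pool           # pool the checkpoint SavVol should land in
echo "fs_ckpt $FS -name $CKPT -Create size=100G pool=$POOL"
```

As in the GUI, the pool choice only takes effect when no checkpoints currently exist for the file system; subsequent checkpoints reuse the same SavVol.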
Where did all my savvol space go?

I noticed last week that some of my Celerra replication jobs had stalled and were no longer sending data to the replication partner. I then noticed that the storage pool designated for checkpoints was at 100%. Not good. Based on the number of file system checkpoints we perform, it didn't seem possible that the pool could already be full, so I opened a case with EMC to help out.
I learned something new after opening this call: every time you create a replication job, a new checkpoint is created for that job and stored in the savvol. You can view these in Unisphere by changing the "select a type" filter to "all checkpoints including replication". You'll notice checkpoints named something like root_rep_ckpt_483_72715_1 in the list; they all begin with root_rep. After working the case for a little while, the EMC engineer helped me determine that one of my replication jobs had a root_rep_ckpt that had grown to 1.5TB.
Removing that checkpoint would immediately solve the problem, but there was one major drawback: deleting a root_rep checkpoint first requires deleting the replication job entirely, which means starting over from scratch. The entire file system would have to be copied over our WAN link and resynchronized with the replication partner Celerra. That didn't make me happy, but there was no other choice. At least the problem was solved.
Here are a couple of tips for you if you’re experiencing a similar issue.
You can verify which storage pool the root_rep checkpoints are using by running an info query against the checkpoint from the command line and looking for the 'pool=' field.
nas_fs -list | grep root_rep (the first column in the output is the ID for the next command)
nas_fs -info id=<id from above>
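The two commands above can be chained so you don't have to copy IDs by hand. This is a sketch only: the list output below is a fabricated example of the first (ID) column followed by other fields, since the exact column layout of nas_fs -list can vary by DART version, and the commands are printed for review rather than executed.

```shell
# Hypothetical sample of `nas_fs -list` output: ID in the first column,
# checkpoint name in a later column. Real column layout may differ.
nas_fs_list_sample='483  y  7  0  1234  root_rep_ckpt_483_72715_1  2
484  y  7  0  1235  root_rep_ckpt_484_72715_2  2'

# Emit one `nas_fs -info` command per root_rep checkpoint found.
echo "$nas_fs_list_sample" | awk '/root_rep/ {print "nas_fs -info id=" $1}'
```

On a live control station you would replace the sample variable with the real `nas_fs -list` output and pipe the generated commands through `sh` once you've sanity-checked them.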
You can also see the replication checkpoints and IDs for a particular filesystem with this command:
fs_ckpt <production file system> -list -all
You can check the size of a root_rep checkpoint from the command line directly with this command:
/nas/sbin/rootnas_fs -size root_rep_ckpt_883_72715_1
Need to quickly figure out which checkpoint filesystems are taking up all of your precious savvol space? Run the CLI command below. Filling up the savvol storage pool can cause all kinds of problems besides failing checkpoints. It can also cause filesystem replication jobs to fail.
To view it on the screen:
nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:' %40s : %5d : %s\n' -fields:Name,ID,Size
To save it in a file:
nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:' %40s : %5d : %s\n' -fields:Name,ID,Size > checkpoints.txt
vi checkpoints.txt (to view the file)
Here’s a sample of the output:
ckpt_ckpt_UserFilesystem_01_monthly_001 : 836 : 220000
ckpt_ckpt_UserFilesystem_01_monthly_002 : 649 : 220000
ckpt_ckpt_UserFilesystem_02_monthly_001 : 836 : 80000
ckpt_ckpt_UserFilesystem_02_monthly_002 : 649 : 80000
The middle column is the checkpoint ID; the sizes in the last column are in MB.
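The report above can be boiled down to a per-file-system total with a short awk pass. This sketch assumes the layout the nas_fs query produces: each file system name on its own unindented line, followed by its indented "name : id : size" checkpoint lines. The sample data below is hypothetical; point awk at checkpoints.txt instead of the $report variable to run it for real.

```shell
# Hypothetical sample in the report layout described above
# (file system name line, then indented checkpoint lines; sizes in MB).
report='UserFilesystem_01
 ckpt_ckpt_UserFilesystem_01_monthly_001 : 836 : 220000
 ckpt_ckpt_UserFilesystem_01_monthly_002 : 649 : 220000
UserFilesystem_02
 ckpt_ckpt_UserFilesystem_02_monthly_001 : 836 : 80000
 ckpt_ckpt_UserFilesystem_02_monthly_002 : 649 : 80000'

# Sum the size field (third ':'-separated field) per file system.
totals=$(printf '%s\n' "$report" | awk -F':' '
  /^[^ ]/ { fs = $0; next }              # unindented line = file system name
  NF == 3 { total[fs] += $3 }            # indented "name : id : size" line
  END { for (f in total) printf "%s : %d MB\n", f, total[f] }
')
printf '%s\n' "$totals"
```

Sorting that output numerically on the total makes the savvol hogs jump straight to the bottom of the list.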