Making a case for file archiving

We’ve been investigating options for archiving unstructured (file based) data that resides on our Celerra for a while now. There are many options available, but before looking into a specific solution I was asked to generate a report that showed exactly how much of the data has been accessed by users for the last 60 days and for the last 12 months.  As I don’t have permissions to the shared folders from my workstation I started looking into ways to run the report directly from the Celerra control station.  The method I used will also work on VNX File.

After a little bit of digging I discovered that you can access all of the file systems from the control station by navigating to /nas/quota/slot_.  The slot_2 folder would be for the server_2 data mover, slot_3 would be for server_3, etc.  With full access to all the file systems, I simply needed to write a script that scanned each folder and counted the number of files that had been modified within a certain time window.

I always use excel for scripts I know are going to be long.  I copy the file system list from Unisphere then put the necessary commands in different columns, and end it with a concatenate formula that pulls it all together.  If you put echo -n in A1, “Users_A,” in B1, and >/home/nasadmin/scripts/Users_A.dat in C1, you’d just need to type the formula “=CONCATENATE(A1,B1,C1)” into cell D1.  D1 would then contain echo -n “Users_A,” > /home/nasadmin/scripts/Users_A.dat. It’s a simple and efficient way to make long scripts very quickly.

In this case, the script needed four different sections.  All four of these sections I’m about to go over were copied into a single shell script and saved in my /home/nasadmin/scripts directory.  After creating the .sh file, I always do a chmod +X and chmod 777 on the file.  Be prepared for this to take a very long time to run.  It of course depends on the number of file systems on your array, but for me this script took about 23 hours to complete.

First, I create a text file for each file system that contains the name of the filesystem (and a comma) which is used later to populate the first column of the final csv output.  It’s of course repeated for each file system.

echo -n "Users_A," > home/nasadmin/scripts/Users_A.dat
echo -n "Users_B," > home/nasadmin/scripts/Users_B.dat

... <continued for each filesystem>
 Second, I use the ‘find’ command to walk each directory tree and count the number of files that were accessed over 60 days ago.  The output is written to another text file that will be used in the csv output file later.
find /nas/quota/slot_2/ Users_A -mtime +365 | wc -l > /home/nasadmin/scripts/ Users_A_wc.dat

find /nas/quota/slot_2/ Users_B -mtime +365 | wc -l > /home/nasadmin/scripts/ Users_B_wc.dat

... <continued for each filesystem>
 Third, I want to count the total number of files in each file system.  A third text file is written with that number, again for the final combined report that’s generated at the end.
find /nas/quota/slot_2/Users_B | wc -l > /home/nasadmin/scripts/Users_B_total.dat

find /nas/quota/slot_2/Users_B | wc -l > /home/nasadmin/scripts/Users_B_total.dat

... <continued for each filesystem>
 Finally, each file is combined into the final report.  The output will show each filesystem with two columns, Total Files & Files Accessed 60 days ago.  You can then easily update the report in Excel and add columns that show files accessed in the last 60 days, the percentage of files accessed in the last 60 days, etc., with some simple math.
cat /home/nasadmin/scripts/Users_A.dat /home/nasadmin/scripts/Users_A_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_A_total.dat | tr -d "\n" > /home/nasadmin/scripts/fsoutput.csv | echo " " > /home/nasadmin/scripts/fsoutput.csv

cat /home/nasadmin/scripts/Users_B.dat /home/nasadmin/scripts/Users_B_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_B_total.dat | tr -d "\n" >> /home/nasadmin/scripts/fsoutput.csv | echo " " > /home/nasadmin/scripts/fsoutput.csv

... <continued for each filesystem>

My final output looks like this:

Total Files Accessed 60+ days ago Accessed in Last 60 days % Accessed in last 60 days
Users_A            827,057                734,848               92,209                                                   11.15
Users_B              61,975                  54,727                 7,248                                                   11.70
Users_C            150,166                132,457               17,709                                                   11.79

The three example filesystems above show that only about 11% of the files have been accessed in the last 60 days.   Most user data has a very short lifecycle, it’s ‘hot’ for a month or less then dramatically tapers off as the business value of the data drops.  These file systems would be prime candidates for archiving.

My final report definitely supported the need for archving, but we’ve yet to start a project to complete it.  I like the possibility of using EMC’s cloud tiering appliance which can archive data directly to the cloud service of your choice.  I’ll make another post in the future about archiving solutions once I’ve had more time to research it.


Leave a Reply