
Celerra data mover performance and port configuration

I had a request to review my experience with data mover performance and port configuration on our production Celerras. When I started supporting our Celerras I had no experience at all, so my current configuration is the result of trial-and-error troubleshooting and tackling performance problems as they appeared.

To keep this simple, I’ll review my configuration for a Celerra with only one primary data mover and one standby. There is no specific configuration needed on the standby data mover; just remember to match all active network ports on the primary and standby exactly, so that in the event of a failover the port configuration lines up between the two.

Our primary data mover has two Ethernet modules with four ports each (for a total of eight ports).  I’ll map out how each port is configured and then explain why I did it that way.

  • cge1-0 – Fail-safe config for primary CIFS (paired with cge1-1), assigned to the ‘CIFS1’ prod file server.
  • cge1-1 – Fail-safe config for primary CIFS (paired with cge1-0), assigned to the ‘CIFS1’ prod file server.
  • cge1-2 – Interface configured for backup traffic, assigned to the ‘CIFSBACKUP1’ server, VLAN 1.
  • cge1-3 – Interface configured for backup traffic, assigned to the ‘CIFSBACKUP2’ server, VLAN 1.
  • cge2-0 – Interface configured for backup traffic, assigned to the ‘CIFSBACKUP3’ server, VLAN 2.
  • cge2-1 – Interface configured for backup traffic, assigned to the ‘CIFSBACKUP4’ server, VLAN 2.
  • cge2-2 – Interface configured for replication traffic, assigned to the replication interconnect.
  • cge2-3 – Interface configured for replication traffic, assigned to the replication interconnect.

Primary CIFS Server – You have a choice here between link aggregation and a fail-safe network configuration. Fail-safe is an active/passive setup: if one port fails, the other takes over. There are good reasons to choose aggregation, but I went with fail-safe, primarily for ease of configuration. Fail-safe is configured only on the Celerra side, so there was no need to get the network team involved to change our production switch, and our CIFS server’s performance requirements don’t call for two active links. If you need the extra bandwidth, definitely go for aggregation.

I originally set up the fail-safe network in an emergency, after the single interface to our prod CIFS server went down and could not be brought back online. EMC’s answer was to reboot the data mover. That fixed it, but it’s not a great solution in the middle of a business day.
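For reference, this is roughly what the fail-safe setup looks like from the control station. The device and interface names below match my layout, but the IP addressing is made up and the server_sysconfig option syntax is from memory, so verify it against your DART documentation before using it:

# Build a fail-safe network device from cge1-0 and cge1-1 (cge1-0 as the primary port)
server_sysconfig server_2 -virtual -name fsn0 -create fsn -option "device=cge1-0,cge1-1 primary=cge1-0"

# Put an IP interface on the fail-safe device; the 'CIFS1' server is then bound to this interface
server_ifconfig server_2 -create -Device fsn0 -name cifs1 -protocol IP 10.0.1.50 255.255.255.0 10.0.1.255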

Backup Interfaces – We were having issues with our backups overrunning the backup window. To increase backup performance, I created four additional CIFS servers, all sharing the same file systems as production. Our backup administrator splits the load on the four backup interfaces between multiple media servers and tape libraries (on different VLANs), so backup traffic does not consume any bandwidth on the production interface that users need to access the CIFS shares. This configuration definitely improved our backup performance.
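As a rough sketch (the addresses and domain are placeholders, and the server_cifs syntax should be double-checked for your DART release), each backup interface and its CIFS server were created along these lines:

# Dedicated backup interface on cge1-2
server_ifconfig server_2 -create -Device cge1-2 -name cifsbackup1 -protocol IP 10.0.2.51 255.255.255.0 10.0.2.255

# Additional CIFS server bound only to that interface; it shares the same file systems as production
server_cifs server_2 -add compname=CIFSBACKUP1,domain=example.com,interface=cifsbackup1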

Replication – All of our production file systems are replicated to another Celerra in a different country for disaster recovery purposes. Because of the huge amount of data that needs to be replicated, I created two interfaces specifically for replication traffic. Just like the backup interfaces, this separates replication traffic from the production CIFS server interface. Even with the separate interfaces, I still impose a bandwidth limitation (no more than 50MB/s) in the interconnect configuration, as I need to share the same 100MB WAN link with our Data Domain for replication.
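The replication interconnect and its bandwidth cap live in the nas_cel configuration on the control station. Below is a hedged example; the interconnect name is a placeholder and the -bandwidth schedule format (values in KB/s) is from memory, so check it against your DART documentation:

# List the configured Data Mover interconnects
nas_cel -interconnect -list

# Cap replication at roughly 50 MB/s (51200 KB/s) on an interconnect; schedule syntax may vary by DART release
nas_cel -interconnect -modify <interconnect_name> -bandwidth MoTuWeThFrSaSu00:00-23:59/51200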

This configuration has proven to be very effective for me. Our links never hit 100% utilization and I rarely get complaints about CIFS server performance. The only real performance-related troubleshooting I’ve had to do on our production CIFS servers has been related to file system deduplication; I’ve disabled it on certain file systems that see a high amount of activity.
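Turning deduplication off for a specific file system is done with fs_dedupe on the control station. The file system name below is a placeholder and the flags are from memory, so treat this as a sketch and confirm the syntax for your DART version:

# Disable deduplication on one busy file system (placeholder name)
fs_dedupe -modify busy_fs01 -state off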

Other thoughts about Celerra configuration:

  1. We recently added a third data mover to the Celerra in our HQ data center because of the file system limit on a single data mover. You can only have 2048 total file systems on one data mover. We hit that limit due to the number of checkpoints we keep for operational file restores: if you make a checkpoint of one file system twice a day for a month, that’s 61 file systems counted against the 2048 total, which adds up quickly if you have a CIFS server filled with dozens of small shares. I simply added another CIFS server, and all new shares are now created on it. The names and locations of the shares are transparent to our users because all file shares are presented through DFS links, so no major changes were required of our Active Directory/Windows administrators.
  2. Use the Celerra monitor to keep an eye on CPU and memory usage throughout the day. Once you launch it from Unisphere, it runs independently of your Unisphere session (Unisphere can be closed) and has a very small memory footprint on your laptop/PC.
  3. Always create your CIFS servers on VDMs, especially if you are replicating data for disaster recovery. VDMs are designed specifically for Windows environments; they allow easy migration between data movers and easy re-creation of a CIFS server and its shares in a replication/DR scenario. They store all the information for local groups, shares, security credentials, audit logs, and home directories. If a CIFS server isn’t on a VDM and you ever need to recreate it, you’ll have to redo all of those things from scratch. Always use VDMs!
  4. Write scripts for monitoring purposes. I have only one running on my Celerras right now, which emails me a morning report on the status of all replication jobs. Of course, you can put any valid command into a bash script (adding a mailx command to email you the results), stick it in crontab, and away you go. A minimal sketch of the replication report script follows below.
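Here is roughly what that script looks like. The recipient address and file paths are placeholders, and nas_replicate output formatting varies a little between DART releases:

#!/bin/bash
# Morning replication status report, run from the control station via cron
REPORT=/tmp/replication_report.txt

# Dump the status of all replication sessions to a temp file
nas_replicate -list > "$REPORT" 2>&1

# Mail the report (placeholder recipient)
mailx -s "Celerra replication status - $(hostname)" storage-admin@example.com < "$REPORT"

# Example crontab entry to run it at 6:00 AM every day:
# 0 6 * * * /home/nasadmin/scripts/replication_report.sh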

PowerPath commands in AIX causing unexpected errors / initialization errors

We recently had a problem with one of our AIX VIO servers not being able to run any PowerPath commands. Any attempt to run a command would result in an unexpected error or initialization error. After speaking with EMC, we learned the root cause is usually either running out of space on the root filesystem or having the data and stack ulimit parameters set too low after adding a large number of new LUNs. We are running AIX 6.1 on an IBM pSeries 550 with PowerPath 5.3 HF1.

Here are the errors that were popping up:

root@vioserver1:/script # powermt config
Unexpected error occured.

root@vioserver1:/script # powermt display dev=all
Initialization error.

root@vioserver1:/script # naviseccli -h <san_dns_name> lun -list -all
evp_enc.c(282): OpenSSL internal error, assertion failed: inl > 0
ksh: 503926 IOT/Abort trap(coredump)

Having too many LUNs caused the issue; we had recently added an additional 35 for a total of 70. Increasing the data and stack parameters to ‘unlimited’ resolved the problem.

root@vioserver1:/script # ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user)  unlimited
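To make those limits persistent on AIX, the per-user values live in /etc/security/limits, where -1 means unlimited. A minimal sketch, assuming you want the change for the root user (you can also edit the file directly):

# Persistently set data and stack segments to unlimited for root (-1 = unlimited in /etc/security/limits)
chuser data=-1 stack=-1 root

# Or raise them for the current shell only, before running PowerPath commands
ulimit -d unlimited
ulimit -s unlimited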

Strategies for implementing Multi-tiered FAST VP Storage Pools

After speaking to our local rep and attending many different classes at the most recent EMC World in Vegas, I came away with some good information and a very logical best practice for implementing multi-tiered FAST VP storage pools.

First and foremost, you have to use Flash. High-RPM Fibre Channel drives are neither capacity efficient nor performance efficient; the highest-IO data needs to be hosted on Flash drives. The most effective split of drives in a storage pool is 5% Flash, 20% Fibre Channel, and 75% SATA.

As an example, if you have an existing SAN with 167 15,000 RPM 600GB Fibre Channel drives, you could replace them with 97 drives in the 5/20/75 blend to get roughly the same capacity with much improved performance (a quick raw-capacity check follows the list):

  • 25 200GB Flash drives
  • 34 15K 600GB Fibre Channel drives
  • 38 2TB SATA drives
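The raw numbers work out as follows (treating 2TB as 2000GB, and ignoring RAID overhead and hot spares):

# Raw capacity of each layout, in GB
echo "All Fibre Channel: $((167 * 600)) GB"                  # 100200 GB
echo "5/20/75 blend    : $((25*200 + 34*600 + 38*2000)) GB"  # 101400 GB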

The ideal scenario is to implement FAST Cache along with FAST VP. FAST Cache continuously ensures that the hottest data is served from Flash drives. With FAST Cache, up to 80% of your data IO can come from cache (legacy DRAM cache served up only about 20%).

It can be a hard pill to swallow when you see how much the Flash drives cost, but their cost is offset by increased disk utilization and a reduction in the total number of drives and DAEs that you need to buy. With all FC drives, disk utilization is sacrificed to get the needed performance – very little of the capacity is used, and you end up buying tons of disks just to get more spindles in the RAID groups. Flash drives can achieve much higher utilization, reducing the effective cost.

After implementing this at my company I’ve seen dramatic performance improvements.  It’s an effective strategy that really works in the real world.

In addition to this, I’ve also been implementing storage pools in identically sized pairs. The first pool is designated only for SP A, the second only for SP B. When I get a request for storage, let’s say for 1 TB, I create a 500GB LUN in the first pool owned by SP A and a 500GB LUN in the second pool owned by SP B. When the disks are presented to the host server, the server administrator stripes the data across the two LUNs. Using this method, I can better balance the load across the storage processors on the back end.
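A hedged example of carving that 1 TB request out of the pool pair with naviseccli is below. The pool names, LUN IDs, and LUN names are placeholders, and the lun -create options are from memory, so confirm them against your FLARE/Unisphere CLI documentation:

# 500GB pool LUN owned by SP A, from the pool dedicated to SP A (placeholder names/IDs)
naviseccli -h <san_dns_name> lun -create -type NonThin -capacity 500 -sq gb -poolName Pool_SPA -sp a -l 101 -name host1_data_a

# Matching 500GB pool LUN owned by SP B, from the pool dedicated to SP B
naviseccli -h <san_dns_name> lun -create -type NonThin -capacity 500 -sq gb -poolName Pool_SPB -sp b -l 102 -name host1_data_b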