Isilon Port Usage

Below is a table of Isilon port usage and the OneFS services that use them.   Additional detail is available in the Isilon Security Configuration guide on Dell EMC’s support site.

Affected Services | Port | Service | Protocol | Connection Type
------------------|------|---------|----------|----------------
FTP | 20 | ftp-data | TCP, IPv4, IPv6 | External, Outbound
FTP | 21 | ftp | TCP, IPv4, IPv6 | External, Inbound
SSH | 22 | ssh | TCP, IPv4, IPv6 | External, Inbound
Telnet | 23 | telnet | TCP | External, Inbound
SMTP | 25 | smtp | TCP, IPv4 | External, Outbound
SmartConnect | 53 | domain | TCP, UDP, IPv4 | External, Outbound
SmartConnect | 53 | domain | UDP, IPv4 | External, Inbound
HTTP | 80 | http | TCP, IPv4, IPv6 | External, Inbound
Kerberos | 88 | kerberos | TCP, UDP, IPv4, IPv6 | External, Outbound
Portmapper | 111 | sunrpc | TCP, UDP, IPv4, IPv6 | External, Inbound
Time Service | 123 | ntp | UDP, IPv4, IPv6 | External, Inbound
NetBIOS | 137 | netbios-ns | IPv4 | External, Inbound
NetBIOS | 138 | netbios-dgm | IPv4 | External, Inbound
NetBIOS | 139 | netbios-ssn | TCP, IPv4 | External, Inbound
SNMP | 161 | snmp | UDP, IPv4 | External, Inbound
SNMP Traps | 162 | snmptrap | UDP, IPv4 | External, Inbound
NFS | 2049 | nfsd | TCP, UDP, IPv4, IPv6 | External, Inbound
NFSv3 Mount | 300 | nfsmountd | TCP, UDP, IPv4, IPv6 | External, Inbound
NFSv3 Notifications | 302 | nfsstatd | TCP, UDP, IPv4, IPv6 | External, Inbound
NFSv3 Locking | 304 | nfslockd | TCP, UDP, IPv4, IPv6 | External, Inbound
DNS Caching | 307 | isi_cbind_d | UDP, IPv4 | External, Inbound
LDAP | 389 | ldap | TCP, IPv4, IPv6 | External, Outbound
LDAP | 636 | ldap | TCP, IPv4, IPv6 | External, Outbound
HTTPS | 443 | https | TCP, IPv4, IPv6 | External, Inbound
SMB1/2 Services | 445 | microsoft-ds | TCP, IPv4 | External, Outbound
Syslog | 514 | syslog | TCP, IPv4 | Internal, Inbound
MSDP | 639 | msdp | UDP, IPv4 | Internal
Entrust SPS | 640 | entrust-sps | UDP, IPv4 | Internal
Secure FTPS | 989 | ftps-data | TCP, IPv4, IPv6 | External, Outbound
Secure FTPS | 990 | ftps | TCP, IPv4, IPv6 | External, Inbound
SyncIQ | 2098 | isi_repl_pworker | TCP, IPv4, IPv6 | External, Inbound
SyncIQ | 3148 | isi_repl_bandwidth | TCP, IPv4, IPv6 | External, Inbound
SyncIQ | 3149 | isi_repl_bandwidth | TCP, IPv4, IPv6 | External, Inbound
SyncIQ | 5667 | isi_migr_sworker | TCP, IPv4, IPv6 | External, Inbound
iSCSI | 3260 | iscsi-target | TCP, IPv4, IPv6 | External, Inbound
MS AD Global Catalog | 3268 | n/a | TCP, IPv4 | External, Outbound
ISI Stats | 6116 | isi_stats_d | | External, Inbound
ISI Stats | 7117 | isi_stats_d | | External, Inbound
HDFS (Hadoop) | 8020 | hdfs | TCP | External, Inbound
HDFS (Hadoop) | 8021 | hdfs | TCP, IPv4, IPv6 | External, Inbound
Isilon WebGUI (https) | 8080 | n/a | TCP, IPv4, IPv6 | External, Inbound
REST API (https) | 8080 | n/a | TCP, IPv4, IPv6 | External, Inbound
VASA (vCenter) | 8081 | vasa | TCP | External, Inbound

Diving into Isilon SyncIQ and SnapshotIQ Management

In this post I’m going to review the most useful commands for managing SyncIQ replication jobs and SnapshotIQ snapshots on the Isilon.  While this will primarily be a CLI administration reference, I’ll also look at some WebUI options when I get to snapshots, along with some additional notes and caveats regarding snapshot management.  I’d highly recommend reviewing EMC’s SnapshotIQ best practices page, as well as the SyncIQ best practices guide, if you’re just starting a new implementation.  For a complete Isilon command line reference, see this post.

Creating a Replication policy

# isi sync policies create sync --schedule "" --target-snapshot-archive on --target-snapshot-pattern "%{PolicyName}-%{SrcCluster}-%Y-%m-%d_%H-%M"
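
As a rough, hypothetical example with a placeholder policy name, source path, target host, and target path (verify the exact argument order against isi sync policies create --help on your OneFS version):

# isi sync policies create Replica1 sync /ifs/data/source target-cluster.example.com /ifs/data/target --schedule "Every day at 22:00" --target-snapshot-archive on --target-snapshot-pattern "%{PolicyName}-%{SrcCluster}-%Y-%m-%d_%H-%M"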

Viewing active replication jobs

# isi sync jobs list

Policy Name ID State Action Duration
 -----------------------------------------------
 Replica1 32375 running run 1M1W5D14H55m
 ------------------------------------------------
 Total: 1

# isi sync jobs view

Policy Name: Replica1
 ID: 32375
 State: running
 Action: run
 Duration: 1M1W5D14H55m9s
 Start Time: 2017-10-27T17:00:25

# isi_classic sync job rep

Name | Act | St | Duration | Transfer | Throughput
----------+------+---------+------------------+--------+-----------
 Replica1 | sync | Running | 42 days 14:59:23 | 3.0 TB | 6.8 Mb/s

# isi_classic sync job rep -v [Provides a more verbose report]

Creating a SyncIQ domain [Required for failback operations]

# isi job jobs start --root --dm-type SyncIQ
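
A hypothetical invocation, assuming the DomainMark job type and a placeholder root path (confirm the job name against the job engine on your own cluster):

# isi job jobs start DomainMark --root /ifs/data/target_dir --dm-type SyncIQ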

Reviewing a replication Job before starting it

Replication policy status can be reviewed with the ‘test’ option. It is useful for previewing the size of the data set that will be transferred if you run the policy.

# isi sync jobs start --test
# isi sync reports view 1

Replication policy Enable/Disable/Delete

# isi sync policies enable
# isi sync policies disable
# isi sync policies delete

Replication Job Management

# isi sync jobs start
# isi sync jobs pause
# isi sync jobs resume
# isi sync jobs cancel

Replication Policy Management

# isi sync policies list
# isi sync policies view

Viewing replication policies that target the local cluster

# isi sync target list
# isi sync target view

Managing replication performance rules

# isi sync rules create

Create network traffic rules that limit replication bandwidth

# isi sync rules create bandwidth 00:00-23:59 Sun-Sat 19200 [Limit bandwidth consumption to 19,200 kbps, 24×7]
# isi sync rules create file_count 08:00-18:00 M-F 7 [Limit the file-send rate to 7 files per second, 8 AM to 6 PM on weekdays]

Managing replication performance rules

# isi sync rules list
# isi sync rules view --id bw-0
# isi sync rules modify bw-0 --enabled true
# isi sync rules modify bw-0 --enabled false

Managing replication reports

# isi sync reports list
# isi snapshots list | head -200 [list the first 200 snapshots]
# isi sync reports view 2
# isi sync reports subreports list 1 [view sub-reports]

Managing failed replication jobs

# isi sync policies resolve [Resolve a policy error]
# isi sync policies reset [Reset a policy whose error can't be resolved]

If the issue can’t be resolved, the job can be reset. Resetting a policy results in a full or differential replication the next time the policy is run.

Creating Snapshots

# isi snapshot snapshots create

# isi snapshot snapshots delete {<snapshot> | --schedule <schedule> | --type {alias|real} | --all} [{--force|-f}] [{--verbose|-v}]

Modifying Snapshots

# isi snapshot snapshots modify

Listing Snapshots

# isi snapshot snapshots list --state {all | active | deleting}
# isi snapshot snapshots list --limit | -l [Number of snapshots to display]
# isi snapshot snapshots list --descending | -d [Sort data in descending order]

Viewing Snapshots

# isi snapshot snapshots view

Deleting Snapshots

Deleting a snapshot from OneFS is an all-or-nothing event; an existing snapshot cannot be partially deleted. Snapshots are created at the directory level, not at the volume level, which allows for a higher degree of granularity. Because they are a point-in-time copy of a specific subset of OneFS data, they can’t be changed, only fully deleted. When a snapshot is deleted, OneFS immediately modifies some of the tracking data and the snapshot disappears from view. Even though the snap is no longer visible, the behind-the-scenes cleanup of the snapshot will still be pending; it is performed by the ‘SnapshotDelete’ job.

OneFS frees disk space occupied by deleted snapshots only when the snapshot delete job is run. If a snapshot is deleted that contains clones or cloned files, the data in a shadow store may no longer be referenced by files on the cluster. OneFS deletes unreferenced data in a shadow store when the shadow store delete job is run. OneFS automatically runs both the shadow store delete and snapshot delete jobs, but you can also run them manually any time. Follow the procedure below to force the snapshot delete job to more quickly reclaim array capacity.
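
If you prefer the CLI to the WebUI procedures below, the same cleanup jobs can be started from the shell; the job names match the ones selected in the WebUI steps that follow:

# isi job jobs start SnapshotDelete
# isi job jobs start ShadowStoreDelete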

Deleting Snapshots from the WebUI

Go to Data Protection > SnapshotIQ > Snapshots and specify the snapshots that you want to delete.

• For each snapshot you want to delete, in the Saved File System Snapshots table, in the row of a snapshot, select the check box.
• From the Select an action list, select Delete.
• In the confirmation dialog box, click Delete.
• Note that you can select more than one snapshot at a time, and clicking the delete button on any of the snapshots will result in the entire checked list being deleted.
• If you have a large number of snapshots and want to delete them all, you can run a command from the CLI that will delete all of them at once: isi snapshot snapshots delete --all.

Increasing the Speed of Snapshot Deletion from the WebUI

It’s important to note that the SnapshotDelete job will only run if the cluster is in a fully available state; there can be no drives or nodes down, and the cluster cannot be in a degraded state. To increase the speed at which deleted snapshot data is freed on the cluster, run the snapshot delete job.

• Go to Cluster Management > Operations.
• In the Running Jobs area, click Start Job.
• From the Job list, select SnapshotDelete.
• Click Start.

Increasing the Speed of Cloned File deletion from the WebUI

Run the shadow store delete job only after you run the snapshot delete job.

• Go to Cluster Management > Operations.
• In the Running Jobs area, click Start Job.
• From the Job list, select ShadowStoreDelete.
• Click Start.

Reserved Space

There is no requirement for reserved snapshot space in OneFS. Snapshots can use as much or as little of the available file system space as desired. The oldest snapshot can be deleted very quickly. An ordered deletion is the deletion of the oldest snapshot of a directory, and it is a recommended best practice for snapshot management. An unordered deletion is the removal of a snapshot that is not the oldest in a directory; it can often take approximately twice as long to complete and consumes more cluster resources than an ordered deletion.

The Delete Sequence Matters

As I just mentioned, avoid deleting snapshots from the middle of a time range whenever possible. Newer snapshots are mostly pointers to older snapshots, and they look like they are consuming more capacity than they actually are. Removing the newer snapshots will not free up much space, while deleting the oldest snapshot will ensure you are actually freeing up the space. You can determine snapshot order by using the isi snapshot list -l command.

Watch for SyncIQ Snaps

Avoid deleting SyncIQ snapshots if possible. They are easily identifiable, as they will all be prefixed with SIQ. It is ok to delete them if they are the only remaining snapshots on the cluster, and the only way to free up space is to delete them. Be aware that deleting SyncIQ snapshots resets the SyncIQ policy state, which requires a reset of the policy and may result in either a full sync or initial differential sync. A full sync or initial diff sync could take many times longer than a regular snapshot-based incremental sync.

Using the InsightIQ iiq_data_export Utility

InsightIQ includes a very useful data export tool:  iiq_data_export. It can be used with any version of OneFS beginning with 7.x.  While the tool is compatible with older versions of the operating system, if you’re running OneFS v8.0 or higher it offers a much needed performance improvement.  The improvements allow this to be a much more functional tool that can be used daily, and for quick reports it’s much faster than relying on the web interface.

Applications of this tool could include daily reports for application teams to monitor their data consumption, charge-back reporting processes,  or administrative trending reports. The output is in csv format, so there are plenty of options for data manipulation and reporting in your favorite spreadsheet application.

The utility is a command line tool, so you will need to log in to the CLI with an ssh session to the Linux InsightIQ server.  I generally use putty for that purpose.  The utility works with either root or non-root users, so you won’t need elevated privileges – I log in with the standard administrator user account. The utility can be used to export both performance stats and file system analytics [fsa] data, but I’ll review some uses of iiq_data_export for file system analytics first, more specifically the directory data-module export option.

The command line options for file system analytics include list, describe, and export:

iiq_data_export fsa [-h] {list,describe,export} ...

Options:
 -h, --help Show this help message and exit.

Sub-Commands:
 {list,describe,export}
 FSA Sub-Commands
 list List valid arguments for the different options.
 describe Describes the specified option.
 export Export FSA data to a specified .csv file.

Listing FSA results for a specific Cluster

First we’ll need to review the reports that are available on the server. Below is the command to list the available FSA results for the cluster:

iiq_data_export fsa list --reports IsilonCluster1

Here are the results of running that command on my InsightIQ Server:

[administrator@corporate_iq1 ~]$ iiq_data_export fsa list --reports IsilonCluster1

Available Reports for: IsilonCluster1 Time Zone: PST
 ====================================================================
 | ID    | FSA Job Start         | FSA Job End           | Size     |
 ====================================================================
 | 57430 | Jan 01 2018, 10:01 PM | Jan 01 2018, 10:03 PM | 115.49M  |
 --------------------------------------------------------------------
 | 57435 | Jan 02 2018, 10:01 PM | Jan 02 2018, 10:03 PM | 115.53M  |
 --------------------------------------------------------------------
 | 57440 | Jan 03 2018, 10:01 PM | Jan 03 2018, 10:03 PM | 114.99M  |
 --------------------------------------------------------------------
 | 57445 | Jan 04 2018, 10:01 PM | Jan 04 2018, 10:03 PM | 116.38M  |
 --------------------------------------------------------------------
 | 57450 | Jan 05 2018, 10:00 PM | Jan 05 2018, 10:03 PM | 115.74M  |
 --------------------------------------------------------------------
 | 57456 | Jan 06 2018, 10:00 PM | Jan 06 2018, 10:03 PM | 114.98M  |
 --------------------------------------------------------------------
 | 57462 | Jan 07 2018, 10:01 PM | Jan 07 2018, 10:03 PM | 113.34M  |
 --------------------------------------------------------------------
 | 57467 | Jan 08 2018, 10:00 PM | Jan 08 2018, 10:03 PM | 114.81M  |
 ====================================================================

The ID column is the job number that is associated with that particular FS Analyze job engine job.  We’ll use that ID number when we run the iiq_data_export to extract the capacity information.

Using iiq_data_export

Below is the command to export the first-level directories under /ifs from a specified cluster for a specific FSA job:

iiq_data_export fsa export -c <cluster_name> --data-module directories -o <jobID>

If I want to view the /ifs subdirectories from job 57467, here’s the command syntax and its output:

[administrator@corporate_iq1 ~]$ iiq_data_export fsa export -c IsilonCluster1 --data-module directories -o 57467

Successfully exported data to: directories_IsilonCluster1_57467_1515522398.csv

Below is the resulting file. The output shows the directory count, file count, and logical and physical capacity consumption.

[administrator@corporate_iq1 ~]$ cat directories_IsilonCluster1_57467_1515522398.csv

path[directory:/ifs/],dir_cnt (count),file_cnt (count),ads_cnt,other_cnt (count),log_size_sum (bytes),phys_size_sum (bytes),log_size_sum_overflow,report_date: 1515470445
 /ifs/NFS_exports,138420,16067265,0,1659,335841902399477,383999799732224,0
 /ifs/data,95,2189,0,0,13303199652,15264802304,0
 /ifs/.isilon,3,22,0,0,647236,2284544,0
 /ifs/netlog,2,5,0,0,37615,208384,0
 /ifs/home,9,31,0,0,30070,950784,0
 /ifs/SITE,10,0,0,0,244,53248,0
 /ifs/PRODUCTION-CIFS,2,0,0,0,23,4096,0
 /ifs/WAREHOUSE,1,0,0,0,0,2048,0
 /ifs/upgrade_error_logs,1,0,0,0,0,2048,0

While that is a useful top level report, we may want to dive a bit deeper and report on 2nd or 3rd level directories as well. To gather that info, use the directory filter option, which is “-r”:

iiq_data_export fsa export -c <cluster_name> --data-module directories -o <jobID> -r directory:<directory_path_in_ifs>

As an example, if we wanted more detail on the subfolders under the /NFS_exports/warehouse/ directory, we’d run the following command:

[administrator@corporate_iq1 ~]$ iiq_data_export fsa export -c IsilonCluster1 --data-module directories -o 57467 -r directory:/NFS_exports/warehouse/warehouse_dec2017

Successfully exported data to: directories_IsilonCluster1_57467_1515524307.csv

Below is the output from the csv file that I generated:

[administrator@corporate_iq1 ~]$ cat directories_IsilonCluster1_57467_1515524307.csv

path[directory:/ifs/NFS_exports/warehouse/warehouse_dec2017/],dir_cnt (count),file_cnt (count),ads_cnt,other_cnt (count),log_size_sum (bytes),phys_size_sum (bytes),log_size_sum_overflow,report_date: 1515470445
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_t01,44,458283,0,0,27298994838926,31275791237632,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_cat,45,106854,0,0,14222018137340,16285929507840,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_set,24,261564,0,0,11221057700000,12847989286912,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_auth,17,96099,0,0,7402828037356,8471138941440,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_mds,41,457984,0,0,5718188746729,6576121923584,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_hsh,17,101969,0,0,4396244719797,5035400875520,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_hop,17,115257,0,0,3148118026139,3608613813760,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_brm,24,3434,0,0,2964319382819,3381774883840,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_exe,9,22851,0,0,2917582971428,3317971597824,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_com,21,33286,0,0,2548672643701,2907729505280,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_mig,2,30,0,0,2255138307994,2586591986688,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_cls,7,4994,0,0,1795466785597,2035911001088,0
 /ifs/NFS_exports/warehouse/warehouse_dec2017/dir_enc,45,106713,0,0,1768636398516,2032634691072,0
 <...truncated>

Diving Deeper into subdirectories

Note that how deep you can go down the /ifs subdirectory tree depends on the FSA configuration in InsightIQ. By default, InsightIQ sets the “directory filter maximum depth” option to 5, allowing directory information as deep as /ifs/dir1/dir2/dir3/dir4/dir5. If you need to dive deeper, the FSA configuration will need to be updated. To do so, go to the Configuration page, then FSA Configuration, and adjust the “Directory Filter (path_squash) maximum depth” setting. Note that the larger the maximum depth, the more storage space an individual FSA result will use.

Scripting Reports

For specific subdirectory reports it’s fairly easy to script the output.

First, let’s create a text file with a list of the subdirectories under /ifs that we want to report on. I’ll create a file named “directories.txt” in the /home/administrator folder on the InsightIQ server. You can use vi to create and save the file.

[administrator@corporate_iq1 ~]$ vi directories.txt

[add the following in the vi editor...]

NFS_exports/warehouse/warehouse_dec2017/dir_t01
 NFS_exports/warehouse/warehouse_dec2017/dir_cat
 NFS_exports/warehouse/warehouse_dec2017/dir_set

I’ll then use vi again to create the script itself.   You will need to substitute the cluster name and the job ID to match your environment.

[administrator@corporate_iq1 ~]$ vi direxport.sh

[add the following in the vi editor...]

for i in `cat directories.txt`
do
  echo "Processing Directory $i..."
  # Use the last path component as the base name for the output file
  j=`basename $i`
  echo "Base Folder Name is $j"
  date_time="`date +%Y_%m_%d_%H%M%S_`"
  # Export the FSA directory data for this subdirectory to a uniquely named csv
  iiq_data_export fsa export -c IsilonCluster1 --data-module directories -o 57467 -r directory:$i -n direxport_$date_time$j.csv
done

We can now change the permissions to make the file executable, then run the script.  An output example is below.

[administrator@corporate_iq1 ~]$ chmod 777 direxport.sh
[administrator@corporate_iq1 ~]$ chmod +x direxport.sh
[administrator@corporate_iq1 ~]$ ./direxport.sh

Processing NFS_exports/warehouse/warehouse_dec2017/dir_t01...
 Base Folder Name is dir_t01

Successfully exported data to: direxport_2017_01_19_085528_dir_t01.csv

Processing NFS_exports/warehouse/warehouse_dec2017/dir_cat...
 Base Folder Name is dir_cat

Successfully exported data to: direxport_2017_01_19_0855430_dir_cat.csv

Processing NFS_exports/warehouse/warehouse_dec2017/dir_set...
 Base Folder Name is dir_set

Successfully exported data to: direxport_2017_01_19_085532_dir_set.csv

Performance Reporting

As I mentioned at the beginning of this post, this command can also provide performance-related information. Below are the default command line options.

usage: iiq_data_export perf list [-h] [--breakouts] [--clusters] [--data-modules]

Options:
 -h, --help Show this help message and exit.

Mutually Exclusive Options:
 --breakouts Displays the names of all breakouts that InsightIQ supports for
 performance data modules. Each data module supports a subset of
 breakouts.
 --clusters Displays the names of all clusters that InsightIQ is monitoring.
 --data-modules Displays the names of all available performance data modules.
 iiq_data_export perf list: error: One of the mutually exclusive arguments are
 required.

Here are the data modules you can export:

 iiq_data_export perf list --data-modules
 ====================================================================
 | Data Module Label                       | Key 
 ====================================================================
 | Active Clients                          | client_active 
 --------------------------------------------------------------------
 | Average Cached Data Age                 | cache_oldest_page_age 
 --------------------------------------------------------------------
 | Average Disk Hardware Latency           | disk_adv_access_latency 
 --------------------------------------------------------------------
 | Average Disk Operation Size             | disk_adv_op_size 
 --------------------------------------------------------------------
 | Average Pending Disk Operations Count   | disk_adv_io_queue 
 --------------------------------------------------------------------
 | Blocking File System Events Rate        | ifs_blocked
 --------------------------------------------------------------------
 | CPU % Use                               | cpu_use 
 --------------------------------------------------------------------
 | CPU Usage Rate                          | cpu_usage_rate 
 --------------------------------------------------------------------
 | Cache Hits                              | cache_hits 
 --------------------------------------------------------------------
 | Cluster Capacity                        | ifs_cluster_capacity 
 --------------------------------------------------------------------
 | Connected Clients                       | client_connected 
 --------------------------------------------------------------------
 | Contended File System Events Rate       | ifs_contended 
 --------------------------------------------------------------------
 | Deadlocked File System Events Rate      | ifs_deadlocked 
 --------------------------------------------------------------------
 | Deduplication Summary (Logical)         | dedupe_logical 
 --------------------------------------------------------------------
 | Deduplication Summary (Physical)        | dedupe_physical 
 --------------------------------------------------------------------
 | Disk Activity                           | disk_adv_busy 
 --------------------------------------------------------------------
 | Disk IOPS                               | disk_iops 
 --------------------------------------------------------------------
 | Disk Operations Rate                    | disk_adv_op_rate 
 --------------------------------------------------------------------
 | Disk Throughput Rate                    | disk_adv_bytes 
 --------------------------------------------------------------------
 | External Network Errors                 | ext_error 
 --------------------------------------------------------------------
 | External Network Packets Rate           | ext_packet 
 --------------------------------------------------------------------
 | External Network Throughput Rate        | ext_net_bytes 
 --------------------------------------------------------------------
 | File System Events Rate                 | ifs_heat 
 --------------------------------------------------------------------
 | File System Throughput Rate             | ifs_total_rate 
 --------------------------------------------------------------------
 | Job Workers                             | worker 
 --------------------------------------------------------------------
 | Jobs                                    | job 
 --------------------------------------------------------------------
 | L1 Cache Throughput Rate                | cache_l1_read 
 --------------------------------------------------------------------
 | L1 and L2 Cache Prefetch Throughput Rate| cache_all_prefetch 
 --------------------------------------------------------------------
 | L2 Cache Throughput Rate                | cache_l2_read 
 --------------------------------------------------------------------
 | L3 Cache Throughput Rate                | cache_l3_read 
 --------------------------------------------------------------------
 | Locked File System Events Rate          | ifs_lock 
 --------------------------------------------------------------------
 | Overall Cache Hit Rate                  | cache_all_read_hitrate 
 --------------------------------------------------------------------
 | Overall Cache Throughput Rate           | cache_all_read 
 --------------------------------------------------------------------
 | Pending Disk Operations Latency         | disk_adv_io_latency
 --------------------------------------------------------------------
 | Protocol Operations Average Latency     | proto_latency 
 --------------------------------------------------------------------
 | Protocol Operations Rate                | proto_op_rate 
 --------------------------------------------------------------------
 | Slow Disk Access Rate                   | disk_adv_access_slow 
 ====================================================================

As an example, if I want to review the CPU utilization for the cluster, I’d type in the command below.  It will show all of the CPU performance information for the specified cluster name.  Once I’ve had more time to dive into the performance reporting aspect of InsightIQ I’ll revisit and add to this post.

[administrator@corporate_iq1~]$ iiq_data_export perf export -c IsilonCluster1 -d cpu_use

Successfully exported data to: cpu_IsilonCluster1_1515527709.csv

Below is what the output looks like:

[administrator@corporate_iq1 ~]$ cat cpu_IsilonCluster1_1515527709.csv
 Time (Unix) (America/Chicago),cpu (percent)
 1515524100.0,3.77435898780823
 1515524130.0,4.13846158981323
 1515524160.0,3.27435898780823
 1515524190.0,2.34871792793274
 1515524220.0,2.68974351882935
 1515524250.0,3.33333349227905
 1515524280.0,3.02051281929016
 1515524310.0,2.78974366188049
 1515524340.0,2.98717951774597
 <...truncated>
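
Since the export is plain CSV, it’s easy to post-process. As a quick illustration (assuming the file name from the example above), this one-liner averages the cpu (percent) column, skipping the header row:

[administrator@corporate_iq1 ~]$ awk -F, 'NR>1 {sum+=$2; n++} END {if (n) printf "average cpu: %.2f%%\n", sum/n}' cpu_IsilonCluster1_1515527709.csv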

Best Practices for FAST Cache

I recently received a comment asking for more information on EMC’s FAST Cache, specifically about why increased CPU utilization was observed after a FAST Cache expansion. It’s likely due to the rebuilding of the cache after the expansion, and possibly to having it enabled on LUNs that shouldn’t have it, like those with highly sequential I/O. It’s hard to pinpoint the exact cause of an issue like that without a thorough analysis of the array itself, however.  I thought I’d do a quick write-up of EMC’s best practices for implementing FAST Cache and the caveats to consider when implementing it.

What is FAST Cache?

First, a quick overview of what it is.  EMC’s FAST Cache uses a RAID set of EFD drives that sits between DRAM Cache and the disks themselves. It holds a large percentage of the most often used data in high performance EFD drives.  It hits a price/performance sweet spot between DRAM and traditional spinning disks for cache, and can greatly increase array performance.

The theory behind FAST Cache is simple:  we divide the array’s storage up into 64KB blocks, we count the number of hits on those blocks, and then we create a cache page on the FAST Cache EFDs if there have been three read (or write) hits on a block.  If FAST Cache fills up, the array will start to seek out pages in the EFDs that can make a full stripe write to the spinning disks in the array, and then force flush them out to those disks.

FAST Cache uses a “three strikes” algorithm.  If you are moving large amounts of data, the FAST Cache algorithm does not activate, which is by design, as cache does not help at all in large copy transactions.  Random hits on active blocks, however, will ultimately cache those blocks into FAST Cache.  This is where the 64KB granularity makes a difference.  Typical workload I/Os are 64KB or less, and there is a significant chance that even if a workload is performing 4KB reads and writes to different blocks, they will still hit the same 64KB FAST Cache block, resulting in the promotion of that data into FAST Cache.  Cool, right?  It works very well in practice.  With all that said, there are still plenty of implementation considerations for an ideal FAST Cache configuration.  Below is an overview of EMC’s best practices.

Best Practices for LUNs and Pools

  • Only use it where you need it. The FAST Cache driver has to track every I/O to calculate whether a block needs promotion to FAST Cache, which adds to SP CPU utilization.  As a best practice, you should disable FAST Cache for LUNs that won’t need it.  This cuts that overhead and can improve overall performance levels.  Having a separate storage pool for LUNs that don’t need FAST Cache would be ideal.

Disable FAST Cache for the following LUN types:

– Secondary Mirror and Clone destination LUNs
– LUNs with small, highly sequential I/O, such as Oracle database logs and SnapSure dvols
– LUNs in the reserved LUN pool.
– Recoverpoint Journal LUNs
– SnapView Clones and MirrorView Secondary Mirrors

  • Analyze where you need it most.  Based on a workload analysis, I’d consider restricting the use of FAST Cache to the LUNs or Pools that need it the most.  For every new block that is added into FAST Cache, old blocks that are the oldest in terms of the most recent access are removed.  If your FAST Cache capacity is limited, even frequently accessed blocks may be removed before they’re accessed again.
  • Upgrade to the latest OS Release. On the VNX platform, upgrading to the latest FLARE or MCx release can greatly improve the performance of FAST Cache.  It’s been a few years now, but as an example r32 recovers performance much faster after a FAST Cache drive failure compared to r31, as well as automatically avoiding the promotion of small sequential block I/O to FAST Cache.  It’s always a good idea to run a current version of the code.

Best Practices For VNX arrays with MCx:

  • Spread it out. Spread the drives as evenly as possible across the available backend busses.  Be careful, though, as you shouldn’t add more than 8 FAST Cache flash drives per bus, including any unused flash drives reserved as hot spares.
  • Always use DAE 0. Try and use DAE 0 on each bus for flash drives as it provides for the lowest latency.

Best Practices for VNX and CX4 arrays with FLARE 30-32: 

  • CX4? No more than 4 per bus. If you’re still using an older CX4 series array, don’t use more than 4 FAST Cache drives per bus, and don’t put all of them on bus 0. If they are all on the same bus, they could completely saturate this bus with I/O.
  • Spread it out. Spread the FAST Cache drives over as many buses as possible. This would especially be an issue if the drives were all on bus 0, because it is used to access the vault drives.  Note that the VNX has six times the back-end bandwidth per bus compared to a CX, so it’s less of a concern.
  • Match the drive sizes. All the drives in FAST Cache must be of the same capacity; otherwise, the workload on each drive would rise proportionately with its capacity.  In other words, a 200GB drive would have double the workload of a 100GB drive.
  • VNX? Use enclosure 0. Put the EFD drives in the first DAE on any bus (i.e. Enclosure 0).  The I/O has to pass through the LCC of each DAE between the drive and the SP, and each extra LCC it passes through will add a small amount of latency. The latency would normally be negligible, but is significant for flash drives.  Note that on the CX4, all I/O has to pass through every LCC anyway.
  • Mind the order the disks are added.  The order the drives are added dictates which drives are primary & secondary. The first drive added is the primary for the first mirror, the next drive added is its secondary for the first mirror, the third drive is the primary for the second mirror, etc.
  • Location, Location, Location. It’s a more advanced configuration and requires the use of the CLI, but for the highest availability, place the primary and secondary drives for each FAST Cache RAID 1 pair on different buses.


Using Cron with EMC VNX and Celerra

I’ve shared numerous shell scripts over the years on this blog, many of which benefit from being scheduled to run automatically on the Control Station.  I’ve received emails and comments asking “How do I schedule Unix or Linux crontab jobs to run at intervals like every five minutes, every ten minutes, Every hour, etc.”?  I’ll run through some specific examples next. While it’s easy enough to simply type “man crontab” from the CLI to review the syntax, it can be helpful to see specific examples.

What is cron?

Cron is a time-based job scheduler used in most Unix operating systems, including the VNX File OS (or DART). It’s used to schedule jobs (either commands or shell scripts) to run periodically at fixed times, dates, or intervals. It’s primarily used to automate system maintenance and administration, and it can also be used for troubleshooting.

What is crontab?

Cron is driven by a crontab (cron table) file, a configuration file that specifies shell commands to run periodically on a given schedule. The crontab files hold the lists of jobs and other instructions for the cron daemon. Users can have their own individual crontab files, and there is often a system-wide crontab file (usually in /etc or a subdirectory of /etc) that only system administrators can edit.

On the VNX, the crontab files are located at /var/spool/cron, but they are not intended to be edited directly; you should use the “crontab -e” command.  Each NAS user has their own crontab, and commands in any given crontab will be executed as the user who owns the crontab.  For example, the crontab file for nasadmin is stored as /var/spool/cron/nasadmin.

Best Practices and Requirements

First, let’s review how to edit and list crontab and some of the requirements and best practices for using it.

1. Make sure you use “crontab -e” to edit your crontab file. As a best practice you shouldn’t edit the crontab files directly.  Use “crontab -l” to list your entries.

2. Blank lines and leading spaces and tabs are ignored. Lines whose first non-space character is a pound sign (#) are comments, and are ignored. Comments are not allowed on the same line as cron commands as they will be taken to be part of the command.

3. If the /etc/cron.allow file exists, then you must be listed therein in order to be allowed to use this command. If the /etc/cron.allow file does not exist but the /etc/cron.deny file does exist, then you must not be listed in the /etc/cron.deny file in order to use this command.

4. Don’t execute commands directly from within crontab; place your commands within a script file that’s called from cron. Cron can’t do anything useful with output sent to stdout, which is one of several reasons you shouldn’t put commands directly in your crontab schedule.  Make sure to redirect stdout somewhere, either to a log file or to /dev/null.  This is accomplished by adding “> /folder/log.file” or “> /dev/null” after the script path.

5. For scripts that will run under cron, make sure you either define actual paths or use fully qualified paths to all commands that you use in the script.

6. I generally add these two lines to the beginning of my scripts as a best practice for using cron on the VNX.

export NAS_DB=/nas
export PATH=$PATH:/nas/bin
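
Putting those recommendations together, below is a minimal sketch of a wrapper script and its matching crontab entry; the script name, log path, command, and schedule are placeholders you’d adjust for your own environment.

#!/bin/sh
# /home/nasadmin/scripts/fs_report.sh - hypothetical cron wrapper script
export NAS_DB=/nas
export PATH=$PATH:/nas/bin
# Use fully qualified paths and redirect all output to a log file
/nas/bin/nas_fs -list > /home/nasadmin/logs/fs_report.log 2>&1

# Matching crontab entry: run the wrapper daily at 6:00 AM and discard stdout
0 6 * * * /home/nasadmin/scripts/fs_report.sh > /dev/null 2>&1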

Descriptions of the crontab date/time fields

Commands are executed by cron when the minute, hour, and month of year fields match the current time, and when at least one of the two day fields (day of month, or day of week) match the current time.

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 7) 
# │ │ │ │ │                          
# │ │ │ │ │
# │ │ │ │ │
# * * * * *  command to execute
# 1 2 3 4 5  6

field #   meaning          allowed values
-------   ------------     --------------
   1      minute           0-59
   2      hour             0-23
   3      day of month     1-31
   4      month            1-12
   5      day of week      0-7 (0 or 7 is Sun)

Run a command every minute

While it’s not as common to want to run a command every minute, there can be specific use cases for it.  It would most likely be used when you’re in the middle of troubleshooting an issue and need data to be recorded more frequently.  For example, you may want to run a command every minute to check and see if a specific process is running.  To run a Unix/Linux crontab command every minute, use this syntax:

# Run “check.sh” every minute of every day
* * * * * /home/nasadmin/scripts/check.sh

Run a command every hour

The syntax is similar when running a cron job every hour of every day.  In my case I’ve used hourly scripts for performance monitoring, for example with the server_stats VNX script. Here’s a sample crontab entry that runs at 15 minutes past the hour, 24 hours a day.

# Hourly stats collection
# This command will run at 12:15, 1:15, 2:15, etc., 24 hours a day.
15 * * * * /home/nasadmin/scripts/stat_collect.sh

Run a command once a day

Here’s an example that shows how to run a command from the cron daemon once a day. In my case, I’ll usually run daily commands for report updates on our web page and for backups.  As an example, I run my Brocade Zone Backup script once daily.

# Run the Brocade backup script at 7:30am
30 7 * * * /home/nasadmin/scripts/brocade.sh

Run a command every 5 minutes

There are multiple methods to run a crontab entry every five minutes.  It is possible to enter a single, specific minute value multiple times, separated by commas.  While this method does work, it makes the crontab list a bit harder to read and there is a shortcut that you can use.

0,5,10,15,20,25,30,35,40,45,50,55  * * * * /home/nasadmin/scripts/script.sh

The crontab “step value” syntax (using a forward slash) allows you use a crontab entry in the format sample below.  It will run a command every five minutes and accomplish the same thing as the command above.

# Run this script every 5 minutes
*/5 * * * * /home/nasadmin/scripts/test.sh

Ranges, Lists, and Step Values

I just demonstrated the use of a step value to specify a schedule of every five minutes, but you can actually get even more granular than that using ranges and lists.

Ranges.  Ranges of numbers are allowed (two numbers separated with a hyphen). The specified range is inclusive. For example, using 7-10 for an “hours” entry specifies execution at hours 7, 8, 9, & 10.

Lists. A list is a set of numbers (or ranges) separated by commas. Examples: “1,2,5,9”, “0-4,8-12”.

Step Values. Step values can be used in conjunction with ranges. Following a range with “/” specifies skips of the number’s value through the range. For example, “0-23/2” can be used in the hours field to specify command execution every other hour (the alternative being “0,2,4,6,8,10,12,14,16,18,20,22”). Steps are also permitted after an asterisk, so if you want to say “every two hours” you can use “*/2”.

Special Strings

While I haven’t personally used these, there is a set of built in special strings you can use, outlined below.

string         meaning
------         -------
@reboot        Run once, at startup.
@yearly        Run once a year, "0 0 1 1 *".
@annually      (same as @yearly)
@monthly       Run once a month, "0 0 1 * *".
@weekly        Run once a week, "0 0 * * 0".
@daily         Run once a day, "0 0 * * *".
@midnight      (same as @daily)
@hourly        Run once an hour, "0 * * * *".

Using a Template

Below is a template you can use in your crontab file to assist with the valid values that can be used in each column.

# Minute|Hour  |Day of Month|Month |WeekDay |Command
# (0-59)|(0-23)|(1-31)      |(1-12)|(0-7)             
  0      2      12           *      *        test.sh

Gotchas

Here’s a list of the known limitations of cron and some of the issues you may encounter.

1. When a cron job is run, it is executed as the user that created it. Verify the security requirements for the job.

2. Cron jobs do not use any files in the user’s home directory (like .cshrc or .bashrc). If your script needs anything from those files, you will need to source or set it from within the script cron is calling. This includes setting paths, sourcing files, setting environment variables, etc.

3. If your cron jobs are not running, make sure the cron daemon is running. The cron daemon can be started or stopped with the following VNX Commands (run as root):

# /sbin/service crond stop
 # /sbin/service crond start

4.  If your job isn’t running properly you should also check the /etc/cron.allow and /etc/cron.deny files.

5. Crontab is not parsed for environment variable substitutions. You cannot use things like $PATH, $HOME, or ~/sbin.

6. Cron does not deal with seconds; one minute is the most granular interval it allows.

7. You cannot use % in the command area; it needs to be escaped. If used with command substitution, such as the date command, you can put it in backticks, e.g. `date +\%Y-\%m-\%d`, or use bash’s command substitution $().

8. Be cautious when using the day of month and day of week fields together.  When both fields are restricted (no *), they act as an “or” condition, not an “and” condition; the job runs when either field matches the current time.

XtremIO Manual Log File Collection Procedure

If you have a need to gather XtremIO logs for EMC to analyze and they are unable to connect via ESRS, there is a method to gather them manually.  Below are the steps on how to do it.

1. Log in to the XtremIO Management System (XMS) GUI interface with the ‘admin‘ user account.

2. Click on the ‘Administration‘ tab, which is on the top of the XtremIO Management System (XMS) GUI banner bar.

3. On the left side of the Administration window, choose the ‘CLI Terminal‘ option.

4. Once you have the CLI terminal up, enter the following CLI command at the ‘xmcli (admin)>‘ prompt.  This command will generate a set of XtremIO dossier log files: create-debug-info.  Note that it may take a little while to complete.  Once the command completes and returns you to the ‘xmcli (admin)>’ prompt, a complete package of XtremIO dossier log files will be available for you to download.

Example:

xmcli (admin)> create-debug-info
The process may take a while. Please do not interrupt.
Debug info collected and could be accessed via http:// <XMS IP Address> /XtremApp/DebugInfo/104dd1a0b9f56adf7f0921d2f154329a.tar.xz

Important Note: If you have more than one cluster managed by the XMS server, you will need to select the specific cluster.

xmcli (e012345)> show-clusters

Cluster-Name Index State  Gates-Open Conn-State Num-of-Vols Num-of-Internal-Volumes Vol-Size UD-SSD-Space Logical-Space-In-Use UD-SSD-Space-In-Use Total-Writes Total-Reads Stop-Reason Size-and-Capacity

XIO-0881     1     active True       connected  253         0                       60.550T  90.959T      19.990T              9.944T              442.703T     150.288T    none        4X20TB
XIO-0782     2     active True       connected  225         0                       63.115T  90.959T      20.993T              9.944T              207.608T     763.359T    none        4X20TB
XIO-0355     3     active True       connected  6           0                       2.395T   41.111T      1.175T               253.995G            6.251T       1.744T      none        2X40TB

xmcli (e012345)> create-debug-info cluster-id=3

5. Once the ‘create-debug-info‘ command completes, you can use a web browser to navigate to the HTTP address link that’s provided in the terminal session window.  After navigating to the link, you’ll be presented with a pop-up window asking you to save the log file package to your local machine.  Save the log file package to your local machine for later upload.

6. Attach the XtremIO dossier log file package you downloaded to the EMC Service Request (SR) you currently have open or are in the process of opening.  Use the ‘Attachments’ (the paperclip button) area located on the Service Request page for upload.

7. You also have the ability to view a historical listing of all XtremIO dossier log file packages that are currently available on your system. To view them, issue the following XtremIO CLI command: show-debug-info. A series of log file packages will be listed.  It’s possible EMC may request a historical log file package for baseline information when troubleshooting.  To download, simply reference the HTTP links listed under the ‘Output-Path‘ header and input the address into your web browser’s address bar to start the download.

Example:

xmcli (tech)> show-debug-info
 Name  Index  System-Name   Index   Debug-Level   Start-Time                 Create-Time               Output-Path
 1      XtremIO-SVT   1       medium        Mon Aug 14 15:55:10 2017   Mon Aug 14 16:09:40 2017  http://<XMS IP Address>/XtremApp/DebugInfo/1aaf4b1acd88433e9aca5b022b5bc43f.tar.xz
 2      XtremIO-SVT   1       medium        Mon Aug 14 15:55:10 2017   Mon Aug 14 16:09:40 2017  http://<XMS IP Address>/XtremApp/DebugInfo/af5001f0f9e75fdd9c0784c3d742531f.tar.xz
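
If you’d rather pull a log package from the command line than through a browser, a standard wget against one of the HTTP links shown above will also work. The URL below simply reuses the example file name from step 4; substitute your own XMS address and package name:

wget "http://<XMS IP Address>/XtremApp/DebugInfo/104dd1a0b9f56adf7f0921d2f154329a.tar.xz"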

That’s it! It’s a fairly straightforward process.


Isilon Mitrend Data Gathering Procedure

Mitrend is an extremely useful IT infrastructure analysis service. They provide excellent health, growth, and workload profiling assessments.  The service can process input source data from EMC and many non-EMC arrays, from host operating systems, and also from some applications.  In order to use the service, certain support files must be gathered before submitting your analysis request.  I had previously run the reports myself as an EMC customer, but sometime in the recent past that ability was removed for customers and is now restricted to EMC employees and partners. You can, of course, simply send the files to your local EMC support team and they will be able to submit the files for a report on your behalf.  The reports are very detailed and extremely helpful for a general health check of your array; the data is well organized into a PowerPoint slide presentation, and the raw data is also made available in Excel format.

My most recent analysis request was for Isilon, and below are the steps you’ll need to take to gather the appropriate information to receive your Isilon Mitrend report.  The performance impact of running the data gather is expected to be minimal, but in situations where the performance impact may be a concern, you should consider the timing of the run. I have never personally had an issue with performance when running the data gather, and the performance data is much more useful if it’s gathered during peak periods. The script is compatible with the virtual OneFS Simulator and can be tested there prior to running it on any production cluster. If you notice performance concerns while the script is running, pressing Control + C in the console window will terminate it.

Obtain & Verify isi_gather_perf file

You will need to obtain a copy of the isi_gather_perf.tgz file from your local EMC team if you don’t already have a copy.  Verify that the file you receive is 166 KB in size. To verify that the isi_gather_perf.tgz is not corrupted or truncated, you can run the following command once the file is on the Isilon cluster.

Isilon-01# file /ifs/isi_gather_perf.tgz

Example of a good file:

Isilon-01# file /ifs/isi_gather_perf.tgz /ifs/isi_gather_perf.tgz:
gzip compressed data, from Unix, last modified: Tue Nov 18 08:33:49 2014
data file is ready to be executed

Example of a corrupt file:

Isilon-01# file /ifs/isi_gather_perf.tgz /ifs/isi_gather_perf.tgz:
data file is corrupt

Once you’ve verified that the file is valid, you must manually run a Cluster Diagnostics gather. On the OneFS web interface, navigate to Cluster Management > Diagnostics > Gather Info and click the “Start Gather” button. Depending on the size of the cluster, it will take about 15 minutes. This process will automatically create a folder on the cluster called “Isilon_Support” under “/ifs/data/”.

Gather Performance Info

Below is the process that I used.  Different methods of transferring files can of course be used, but I use WinSCP to copy files directly to the cluster from my Windows laptop, and I use putty for CLI management of the cluster via ssh.

1. Copy the isi_gather_perf.tgz to the Isilon cluster via SCP.

2.  Log into the cluster via ssh.

3. Copy the isi_gather_perf.tgz to /ifs/data/Isilon_Support, if it’s not there already.

4. Change to the Isilon Support Directory

 Isilon-01# cd /ifs/data/Isilon_Support

5. Extract the compressed file

 Isilon-01# tar zxvf /ifs/data/Isilon_Support/isi_gather_perf.tgz

After extraction, a new directory will be automatically created within the “Isilon_Support” directory named “isi_gather_perf”.

6. Start ‘Screen’

 Isilon-01# screen

7.  Execute the performance gather.  All output data is written to /ifs/data/Isilon_Support/isi_gather_perf/.  Extracting the file creates a new directory named “isi_gather_perf” which contains the script “isi_gather_perf”.  The default option gathers 24 hours of performance data and then creates a bundle with the gathered data.

Isilon-01# nohup python /ifs/data/Isilon_Support/isi_gather_perf/isi_gather_perf

8. At the end of the run, the script will create a .tar.gz archive of the capture data to /ifs/data/Isilon_Support/isi_gather_perf/. Gather the output files and send them to EMC.  Once EMC submits the files to Mitrend, it can take up to 24 hours for them to be processed.

Notes:

Below is a list of the command options available.  You may want to change the frequency at which the command samples data and the length of time it runs with the -i and -r options.

 Usage: isi_gather_perf [options]

 Options:
 -h, --help Show this help message and exit
 -v, --version Print Version
 -d, --debug Enable debug log output Logs: /tmp/isi_gather_perf.log
 -i INTERVAL, --interval=INTERVAL
 Interval in seconds to sample performance data
 -r REPEAT, --repeat=REPEAT
 Number of times to repeat specified interval.
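
For example, to take a shorter sample than the 24-hour default, the interval and repeat options can be combined. The values below are only an illustration (30-second samples repeated 120 times, roughly an hour of data):

Isilon-01# nohup python /ifs/data/Isilon_Support/isi_gather_perf/isi_gather_perf -i 30 -r 120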

Logs:

The logs are located in /ifs/data/Isilon_Support/isi_gather_perf/gathers/ and by default are set to debug level, so they are extremely verbose.

Output:

The output from isi_gather_info will go to /ifs/data/Isilon_Support/pkg/
The output from isi_gather_perf will be /ifs/data/Isilon_Support/isi_gather_perf/gathers/


Multiprotocol VNX File Systems: Listing and counting Shares & Exports by file system

I’m in the early process of planning a NAS data migration from VNX to Isilon, and one of the first steps I wanted to accomplish was to identify which of our VNX file systems are multiprotocol (with both CIFS shares and NFS exports from the same file system). In the environment I support, which has over 10,000 CIFS shares, it’s not a trivial task to identify which shares are multiprotocol.  After some research, it doesn’t appear that there is a built-in method from EMC for determining this information from within the Unisphere GUI. From the CLI, however, the server_export command can be used to view the shares and exports.

Here’s an example of listing shares and exports with the server_export command:

[nasadmin@VNX1 ~]$ server_export ALL -Protocol cifs -list | grep filesystem01

share "share01$" "/filesystem01/data" umask=022 maxusr=4294967294 netbios=NASSERVER comment="Contact: John Doe"
 
[nasadmin@VNX1 ~]$ server_export ALL -Protocol nfs -list | grep filesystem01

export "/root_vdm_01/filesystem01/data01" rw=admins:powerusers:produsers:qausers root=storageadmins access=admins:powerusers:produsers:qausers:storageadmins
export "/root_vdm_01/filesystem01/data02" rw=admins:powerusers:produsers:qausers root=storageadmins access=admins:powerusers:produsers:qausers:storageadmins

The output above shows me that the file system named “filesystem01” has one cifs share and two NFS exports.  That’s a good start, but I want to get a count of the number of shares and exports rather than a detailed list of them. I can accomplish that by adding ‘wc’ [word count] to the command:

[nasadmin@VNX1 ~]$ server_export ALL -Protocol cifs -list | grep filesystem01 | wc
 1 223 450

[nasadmin@VNX1 ~]$ server_export ALL -Protocol nfs -list | grep filesystem01 | wc
 2 15 135

That’s closer to what I want.  The output includes three numbers and the first number is the line count.  I really only want that number, so I’ll just grab it with awk. Ultimately I want the output to go to a single file with each line containing the name of the file system, the number of CIFS shares, and the number of NFS exports.  This line of code will give me what I want:

[nasadmin@VNX1 ~]$ echo -n "filesystem01",`server_export ALL -Protocol cifs -list | grep filesystem01 | wc | awk '{print $1}'`, `server_export ALL -Protocol nfs -list | grep filesystem01 | wc | awk '{print $1}'` >> multiprotocol.txt ; echo " " >> multiprotocol.txt

The output is below.  It’s perfect, as it’s in the format of a comma delimited file and can be easily exported into Microsoft Excel for reporting purposes.

filesystem01, 1, 2

Here’s a more detailed explanation of the command:

echo -n “filesystem01”, : Echo will write the name of the file system to the screen, or to a file if you’ve redirected it with “>” at the end of the command.  Adding the “-n” suppresses the “new line” that is automatically created after text is output, as I want each file system and its share & export count to be on the same line in the report.

`server_export ALL -Protocol cifs -list | grep filesystem01 | wc | awk ‘{print $1}’`,: The server_export command lists all of the CIFS shares for the file system that you’re grepping for.  The wc command is for the “word count”; we’re using it to count the number of output lines to determine how many shares exist for the specified file system.  The awk ‘{print $1}’ command outputs only the first item of data; it stops when it hits a blank space.  If the output is “1 23 34 32 43 1”, running ‘{print $1}’ will only output the 1.

`server_export ALL -Protocol nfs -list | grep filesystem01 | wc | awk ‘{print $1}’` >> multiprotocol.txt: This is the same command as above, but we’re now counting the number of NFS exports rather than CIFS shares.

; echo ” ” >> multiprotocol.txt:  After the count is complete and the data has been outputted, I want to run an echo command without the “-n” option to force a line break to the next line, in preparation for the next line of the script.  When exporting, using “>” will output the results to a file and overwrite the file if it already exists, if you use “>>”, it will append the results to the file if it already exists.  In this case we want to append each line.  In an actual script you’d want to create a blank file first with “echo > filename.ext”. Also, the “;” prior to the command instructs the interpreter to start a new command regardless of the success or failure of the prior command.

At this point, all that needs to be done is to create a script that includes the line above for every file system on the VNX. I copied the line of code above into Excel across multiple columns, allowing me to copy and paste the file system list from Unisphere and then concatenate the results into a single script file. I’m including a screenshot of one of my script lines from Excel as an example.  The final column (AG) has the following formula:

=CONCATENATE(A4,B4,C4,D4,E4,F4,G4,H4,I4,J4,K4,L4,M4,N4,O4,P4,Q4,R4,S4,T4,U4,V4,W4,X4,Y4,Z4,AA4,AB4,AC4,AD4,AE4)

Spreadsheet example:

[Image: multiprotocol count spreadsheet example]
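
If you’d rather skip the spreadsheet step entirely, a small shell loop on the Control Station can build the same report from a plain text list of file system names. This is just a sketch; "filesystems.txt" is a hypothetical file with one file system name per line, and wc -l is used here to return the line count directly:

echo > multiprotocol.txt
for fs in `cat filesystems.txt`
do
  cifs=`server_export ALL -Protocol cifs -list | grep $fs | wc -l`
  nfs=`server_export ALL -Protocol nfs -list | grep $fs | wc -l`
  echo "$fs, $cifs, $nfs" >> multiprotocol.txt
done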

 

Generating and installing SSL requests, keys, and certificates on EMC ECS


In this post I’ve outlined the procedure for generating SSL requests, keys, and certificates for ECS, as well as the process for uploading them to ECS and verifying the installed certificates afterwards.   This was a new process for me, so I created very detailed documentation along the way; hopefully it will help someone else out.

I mention using the ECS CLI a few times in this document.  If you’d like to use the ECS CLI, I have another blog post here that reviews the details of its implementation.  It requires Python.

Part 1: Generating SSL requests, Keys, and Certificates.

The procedure for generating SSL requests, keys, and certificates is unnecessary if you will be given the certificate and key files from a trusted source within your organization.  If you’ve already been provided the certificate and key file, you can skip to Part 2, which details how to upload and import the keys and certificates to ECS.  This is of course a sample of how I did it in my organization; specific details may have to be altered depending on the use case.

a.       Prepare for Creating (or editing) the SSL Request file

  • The first step in this process is to generate an SSL request file.  As OpenSSL does not allow you to pass Subject Alternative Names (SANs) through the command line, they must be added to a configuration file first.
  • On ECS, the OpenSSL configuration file is located at /etc/ssl/openssl.cnf by default.  Copy that file to a temporary directory where you will be generating your certificates.
  • Run this command to copy the request file for editing:
admin@ecs-node1:~# cp /etc/ssl/openssl.cnf /tmp/request.conf

b.      Make changes to the request.conf file.  Edit it with vi and make the edits outlined below.  Each bullet reviews a specific section of the file where changes are required.

  • [ alternate_names ] Edit the [ alternate_names ] section.  In a typical request file these are included at the very end of the configuration file.  Note that this request example includes the wildcard as the first entry (which is required by S3).

Sample:

DNS.1 = *.prod.os.example.com
DNS.2 = atmos.example.com
DNS.3 = swift.example.com
  • [ v3_ca ]  Edit the [ v3_ca ] section.

This line should be added directly below the [ v3_ca ] header:

subjectAltName = @alternate_names

Search for “basicConstraints” in the [ v3_ca ] section.  You may see “basicConstraints = CA:true”.  Make sure it is commented out – add the # to the beginning of the line.

# basicConstraints = CA:true

Search for “keyUsage = cRLSign, keyCertSign” in the [ v3_ca ] section.  You may see “# keyUsage = cRLSign, keyCertSign”.  Make sure it is commented out.

# keyUsage = cRLSign, keyCertSign
  • [ v3_req ] Verify the configuration in the [ v3_req ] section.  The line below must exist.
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
  • [ usr_cert ] Verify the configuration in the [ usr_cert ] section.

Search for the entry below and uncomment it (add it if it isn’t present).

extendedKeyUsage = serverAuth

The following line likely already exists.  The authorityKeyIdentifier line appears in multiple locations in the config file; in the [ v3_ca ] section it must have “always,issuer” as its option.

# authorityKeyIdentifier=keyid:always,issuer
  • [ req ] Verify the configuration in the [ req ] section.

For our dev environment, in the testing phase with a self-signed certificate, the following entry was made six lines below the [ req ] header:

x509_extensions = v3_ca         # The extensions to add to the self-signed cert

The x509_extensions line also exists in the [ CA_default ] section.  This was left untouched in my configuration.

x509_extensions = usr_cert      # The extensions to add to the cert

Change based on certificate type.  Note that this will change if you’re not using a self-signed certificate, which I did not test.  The req_extensions line exists in the default configuration file and is commented out.

x509_extensions = v3_ca           #  for a self-signed cert
req_extensions = v3_ca              # for cert signing req

Change the default_bits entry.

Search for default_bits = 1024 and change it to default_bits = 2048
  • [ CA_default ]  In the CA_default section, uncomment or add the line below.  The line exists in the default configuration file and simply needs to be uncommented.
copy_extensions = copy

The following additional changes were made in my configuration:

Search for dir = ./demoCA, change to dir = /etc/pki/CA
Search for default_md = default, change to default_md = sha256
  • [ req_distinguished_name ] Verify the configuration in the [ req_distinguished_name ] section.

The following changes were made in my configuration:

countryName_default = AU, change to countryName_default = XX
stateOrProvinceName_default = SomeState, change to stateOrProvinceName_default = Default Province
localityName_default doesn’t exist in the default file, added as localityName_default = Default City
0.organizationName_default = Internet Widgits Pty Ltd, change to 0.organizationName_default = Default Company
commonName = Common Name (e.g. server FQDN or YOUR name) was changed to commonName = Common Name (eg, your name or your server's hostname)
  • [ tsa_config1 ] Verify the configuration in the [ tsa_config1 ] section.

The following additional change was made in my configuration:

digests = md5, sha1, change to digests = sha1, sha256, sha384, sha512
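
For reference, below is a condensed sketch of how the relevant request.conf entries looked after all of the edits above.  Only the changed or verified lines are shown; everything else in the copied openssl.cnf was left as-is.

[ req ]
default_bits        = 2048
x509_extensions     = v3_ca               # self-signed cert

[ CA_default ]
dir                 = /etc/pki/CA
default_md          = sha256
copy_extensions     = copy

[ req_distinguished_name ]
countryName_default             = XX
stateOrProvinceName_default     = Default Province
localityName_default            = Default City
0.organizationName_default      = Default Company

[ usr_cert ]
extendedKeyUsage = serverAuth

[ v3_req ]
keyUsage = nonRepudiation, digitalSignature, keyEncipherment

[ v3_ca ]
subjectAltName = @alternate_names
# basicConstraints = CA:true
# keyUsage = cRLSign, keyCertSign

[ alternate_names ]
DNS.1 = *.prod.os.example.com
DNS.2 = atmos.example.com
DNS.3 = swift.example.com

[ tsa_config1 ]
digests = sha1, sha256, sha384, sha512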

c.       Generate the Private Key.  Save the key file in a secure location; the security of your certificate depends on the private key.

  • Run this command to generate the private key:
admin@ecs-node1:~# openssl genrsa -des3 -out server.key 2048
Generating RSA private key, 2048 bit long modulus
............................................................+++
Enter pass phrase for server.key: <enter a password>
Verifying - Enter pass phrase for server.key: <enter a password>
  • Modify the permissions of the server key:
admin@ecs-node1:~# chmod 0400 server.key
  • Now that the private key is generated, you can either create a certificate request (the .csr file) to request a certificate from a CA or generate a self-signed certificate.  In the samples below, I’m setting the Common Name (CN) on the certificate to *.os.example.com.

d.      Generate the Certificate Request.  Next we will look at the steps used to generate a certificate request.

  • Run the command below to generate the request.
admin@ecs-node1:~# openssl req -new -key server.key -config request.conf -out server.csr
Enter pass phrase for server.key: <your passphrase from above>
  • Running the command above will prompt for additional information that will be incorporated into the final certificate request (the Distinguished Name, or DN). Some fields may be left blank and some will have default values; if you enter ‘.’ the field will be left blank.
Country Name (2 letter code) [US]: <Enter value>
State or Province Name (full name) [Province]: <Enter value>
Locality Name (eg, city) []: <Enter value>
Organization Name (eg, company) [Default Company Ltd]: <Enter value>
Organizational Unit Name (eg, section) []: <Enter value>
Common Name (e.g. server FQDN or YOUR name) []: <*.os.example.com>
Email Address []: <admin email>
  • Enter the following extra attributes to be sent with the certificate request:
A challenge password []: <optional>
An optional company name []: <optional>
  • Check the request contents.  Use OpenSSL to verify the contents of the request and confirm that the SANs are set correctly.
admin@ecs-node1:~# openssl req -in server.csr -text -noout
Certificate Request:
Data:
Version: 0 (0x0)
Subject: C=US, ST=North Dakota, L=Fargo, O=EMC, OU=ASD,
CN=*.os.example.com/emailAddress=admin@example.com
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
a7:5a:dc:ca:ff:73:53:6b:ab:a7:ff:7a:20:c1:ff:
   … <removed a portion of the output for this example> ..
ff:9e:66:ff:43:0a:fd:31:3d:69:b1:03:20:51:ff:
Exponent: 65537 (0x10001)
Requested Extensions:
X509v3 Subject Alternative Name:
DNS:os.example.com, DNS:atmos.example.com, DNS:swift.example.com
X509v3 Basic Constraints:
CA:FALSE
X509v3 Key Usage:
Digital Signature, Non Repudiation, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication
Attributes:
Signature Algorithm: sha256WithRSAEncryption
ff:7a:f3:7d:8e:8d:37:8f:66:c8:91:16:c0:00:39:df:03:c1:
… <removed a portion of the output for this example> ..
ff:d9:68:ff:be:e4:4e:e1:78:16:67:47:14:01:31:32:0e:a2:
  • Now that the certificate request is completed it may be submitted to the CA who will then return a signed certificate file.

e.      Generate a Self-Signed Certificate.  Generating a self-signed certificate is almost identical to generating the certificate request. The main difference is that instead of generating a request file, you add an -x509 argument to the openssl req command to generate a certificate file instead.

admin@ecs-node1:~#  openssl req -x509 -new -key server.key -config request.conf -out server.crt
Enter pass phrase for server.key: <your passphrase from above>
  • Running that command will prompt for additional information that will be incorporated into the certificate, known as the Distinguished Name (DN). Some fields may be left blank and some will have default values; if you enter ‘.’ the field will be left blank.
Country Name (2 letter code) [US]: <Enter value>
State or Province Name (full name) [Province]: <Enter value>
Locality Name (eg, city) []: <Enter value>
Organization Name (eg, company) [Default Company Ltd]: <Enter value>
Organizational Unit Name (eg, section) []: <Enter value>
Common Name (e.g. server FQDN or YOUR name) []: <*.os.example.com>
Email Address []: <admin email>
  • Enter the following extra attributes to be sent with the certificate request:
A challenge password []: <optional>
An optional company name []: <optional>
  • Check the certificate contents.  Use OpenSSL to verify the contents of the certificate and confirm that the SANs are set correctly.
admin@ecs-node1:~# openssl x509 -in server.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 9999999999990453326 (0x11fc66cf7c09d762)
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, ST=North Dakota, L=Fargo, O=EMC, OU=ASD, CN=*.os.example.com/
emailAddress=admin@example.com
Validity
Not Before: Oct 14 16:47:40 2014 GMT
Not After : Nov 13 16:47:40 2014 GMT
Subject: C=US, ST=Minnesota, L=Minneapolis, O=EMC, OU=ASD,
CN=*.os.example.com/emailAddress=admin@example.com
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
ff:bc:8f:83:7b:57:72:3d:70:ef:ff:d0:f9:97:ff:
   … <removed a portion of the output for this example> ..
ff:9e:66:86:43:0a:fd:ff:3d:69:b1:03:20:51:ff:
db:77
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Extended Key Usage:
TLS Web Server Authentication
X509v3 Subject Alternative Name:
DNS:os.example.com, DNS:atmos.example.com, DNS:swift.example.com
X509v3 Basic Constraints:
CA:FALSE
X509v3 Key Usage:
Digital Signature, Non Repudiation, Key Encipherment
Signature Algorithm: sha256WithRSAEncryption
ff:bc:8f:83:7b:57:72:ff:70:ef:b9:d0:f9:97:ff:
   … <removed a portion of the output for this example> ..
ff:9e:66:ff:43:0a:fd:31:3d:69:ff:03:20:51:39:
db:77

f.        Chain File. In either a self-signed or a CA signed use case, you now have a certificate file.  In the case of a self-signed certificate, the certificate is the chain file.  If your certificate was signed by a CA, you’ll need to append the intermediate CA cert(s) to your certificate.  I used a self-signed certificate in my implementation and did not perform this step.

  • Append the CA cert if it was signed by a CA.  Do not append the root CA certificate:
admin@ecs-node1:~# cp server.crt serverCertChain.crt
admin@ecs-node1:~# cat intermediateCert.crt >> serverCertChain.crt

 

Part 2: Upload the Keys and Certificates.  This section outlines the process for installing the key and certificate pair on ECS.

a.       First log in to the management API to get a session token.  You will need the root password for the ECS node.

  • Run this command (change IP and password as needed): (ctrl+c to break)
admin@ecs-node1:/> curl -L --location-trusted -k https://10.10.10.10:4443/login -u "root:password" -v
  • Note that the prior command will leave the root password in the shell history.  You can run it without the password and have it prompt you instead:
curl -L --location-trusted -k https://10.10.10.10:4443/login -v -u root
Enter host password for user 'root': <enter password>
  • From the output of the command above, set an environment variable to hold the token for later use.
admin@ecs-node1:/> export TOKEN=x-sds-auth-token-value
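
  • If you’d rather not copy the token by hand, it can be captured directly from the response headers.  This is just a sketch; it assumes the token is returned in the X-SDS-AUTH-TOKEN response header, which is what the verbose output above shows:
admin@ecs-node1:/> export TOKEN=`curl -s -L --location-trusted -k -D - -o /dev/null -u root https://10.10.10.10:4443/login | grep -i 'x-sds-auth-token' | awk '{print $2}' | tr -d '\r'`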

b.      Commands used for installing a key & certificate pair for Management requests/users:

  • Use ECSCLI to run it from a client PC:
admin@ecs-node1:/> python ecscli.py vdc_keystore update --hostname <ecs host ip> -port 4443 --cf <cookiefile> --privateKey <privateKey> -certificateChain <certificateChainFile>
  • Use CURL to run it directly from the ECS management console.  Note that this command uses the TOKEN environment variable that was set earlier.

Sample Command:

admin@ecs-node1:/> curl -svk -H "X-SDS-AUTH-TOKEN: $TOKEN" -H "Content-type: application/xml" -H "X-EMC-REST-CLIENT: TRUE" -X PUT -d "<rotate_keycertchain><key_and_certificate><private_key>`cat privateKeyFile`</private_key><certificate_chain>`cat certChainFile`</certificate_chain></key_and_certificate></rotate_keycertchain>" https://localhost:4443/vdc/keystore

c.       Commands used for installing a key & certificate pair for Object requests/users.  Use the actual private key and certificate chain files here, and a successful response code should be an HTTP 200.

  • Use ECSCLI to run it from a client PC:
admin@ecs-node1:/> python ecscli.py keystore update --hostname <ecs host ip> -port 4443 --cf <cookiefile> -pkvf <privateKey> -cvf <certificateChainFile>
  • Use CURL to run it directly from the ECS management console.  If curl is used, the xml format is required so that carriage returns and the like will be handled via the `cat` command.

Sample Command:

admin@ecs-node1:/> curl -svk -H "X-SDS-AUTH-TOKEN: $TOKEN" -H "Content-type: application/xml" -H "X-EMC-REST-CLIENT: TRUE" -X PUT -d "<rotate_keycertchain><key_and_certificate><private_key>`cat privateFile.key`</private_key><certificate_chain>`cat certChainFile.pem`</certificate_chain></key_and_certificate></rotate_keycertchain>" https://localhost:4443/object-cert/keystore

d.      Important Notes:

  • Though this is the object certificate to be used for object requests sent on port 9021, the upload command is a management command which is sent on port 4443.
  • Once this is done it can take up to 2 hours for the certificate to be distributed to all of the nodes.
  • The certificate is immediately distributed upon the service restart of the node where the certificate was uploaded.

e.      Restart management services to propagate the management certificate.  Using viprexec will run the command on all of the nodes in the cluster.

admin@ecs-node1:/> sudo -i viprexec -i -c '/etc/init.d/nginx restart;sleep 10;/etc/init.d/nginx status'

Output from host : 192.168.1.1
Stopping nginx service ..done
Starting nginx service
..done
nginx service is running (pid=75447)

Output from host : 192.168.1.2
Stopping nginx service ..done
Starting nginx service
..done
nginx service is running (pid=85534)

Output from host : 192.168.1.3

Stopping nginx service ..done
Starting nginx service
..done
nginx service is running (pid=87325)

Output from host : 192.168.1.4
Stopping nginx service ..done
Starting nginx service
..done
nginx service is running (pid=59112)

Output from host : 192.168.1.5
Stopping nginx service ..done
Starting nginx service
..done
nginx service is running (pid=77312)

f.        Verify that the certificate was propagated to each node.  The output will show the certificate; scroll up and verify that all of the information is correct.  At a minimum, the first and last node should be checked.

admin@ecs-node1:/> openssl s_client -connect 10.10.10.1:4443 | openssl x509 -noout -text 
admin@ecs-node1:/> openssl s_client -connect 10.10.10.2:4443 | openssl x509 -noout -text 
admin@ecs-node1:/> openssl s_client -connect 10.10.10.3:4443 | openssl x509 -noout -text 
admin@ecs-node1:/> openssl s_client -connect 10.10.10.4:4443 | openssl x509 -noout -text 
admin@ecs-node1:/> openssl s_client -connect 10.10.10.5:4443 | openssl x509 -noout -text
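
With more than a couple of nodes it’s easier to check them in a loop.  A quick bash sketch using the same node addresses:

for i in 1 2 3 4 5; do
  echo "--- 10.10.10.$i ---"
  openssl s_client -connect 10.10.10.$i:4443 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates
done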

g.       Wait at least 2 minutes and then restart the object head services to propagate the object head certificate:

admin@ecs-node1:/> sudo -i viprexec -i -c 'kill \`pidof dataheadsvc\`'
  • Wait for the service to come back up, which you can verify with the next few commands.
  • Run netstat to verify the datahead service is listening.
admin@ecs-node1:/tmp> netstat -an | grep LIST | grep 9021
tcp        0      0 10.10.10.1:9021     :::*    LISTEN
admin@ecs-node1:/tmp> sudo netstat -anp | grep 9021
tcp  0  0 10.10.10.1:9021 :::* LISTEN 67064/dataheadsvc
  • You can run the ps command to verify the start time of the datahead service compared to the current time on the node.
admin@ecs-node1:/tmp> ps -ef | grep dataheadsvc
storage+  29052  11163  0 May19 ? 00:00:00 /opt/storageos/bin/monitor -u 444 -g 444 -c / -l /opt/storageos/logs/dataheadsvc.out -p /var/run/dataheadsvc.pid /opt/storageos/bin/dataheadsvc file:/opt/storageos/conf/datahead-conf.xml
storage+  57064  29052 88 20:27 ? 00:00:51 /opt/storageos/bin/dataheadsvc -ea -server -d64 -Xmx9216m -Dproduct.home=/opt/storageos -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/storageos/logs/dataheadsvc-78517.hprof -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Dlog4j.configurationFile=file:/opt/storageos/conf/dataheadsvc-log4j2.xml -Xmn2560m -Dsun.net.inetaddr.ttl=0 -Demc.storageos.useFastMD5=1 -Dcom.twmacinta.util.MD5.NATIVE_LIB_FILE=/opt/storageos/lib/MD5.so -Dsun.security.jgss.native=true -Dsun.security.jgss.lib=libgssglue.so.1 -Djavax.security.auth.useSubjectCredsOnly=false -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps -Xloggc:/opt/storageos/logs/dataheadsvc-gc-9.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=50M com.emc.storageos.data.head.Main file:/opt/storageos/conf/datahead-conf.xml

admin@ecs-node1:/tmp> date
Wed Jun  7 20:28:41 UTC 2017

Part 3: Verify the Installed Certificates.  The object certificate and management certificate each have their own GET request to retrieve the installed certificate.  Note that these commands are management requests.

a.       Verify the installed/Active Management Certificate

An alternative method to this one, which I used personally, is the OpenSSL s_client command.  The details in step 3a below aren’t necessary if you are going to use s_client for verification; I’ve simply included them here for completeness.   You can skip to step 3b for the s_client method.

  • Use ECSCLI to run it from a client PC:
python ecscli.py vdc_keystore get --hostname <ecs host ip> -port 4443 --cf <cookiefile>
  • Use CURL to run it directly from the ECS management console:
curl -svk -H "X-SDS-AUTH-TOKEN: $TOKEN" https://10.10.10.1:4443/vdc/keystore

Verify the installed (active) Object Certificate.  This can be done using a variety of methods, outlined below.

  • Use ECSCLI to run it from a client PC:
python ecscli.py keystore show --hostname <ecs host ip> -port 4443 --cf <cookiefile>
  • Use CURL to run it directly from the ECS management console:
curl -svk -H "X-SDS-AUTH-TOKEN: $TOKEN" https://10.10.10.1:4443/object-cert/keystore

b.      The certificate presented by a port can also be verified using OpenSSL’s s_client tool.  If you used the method in step 3a, this is unnecessary as it will give you the same information.

  • Sample command syntax:
openssl s_client -connect host:port -showcerts
  • The command syntax I used and some sample output from my ECS environment are below.  Verify the certificate on the last node as well as the expected SAN entries.
openssl s_client -connect 10.10.10.1:9021 | openssl x509 -noout -text
openssl s_client -connect 10.10.10.1:9021 -showcerts
CONNECTED(00000003)
depth=0 C = US, ST = North Dakota, L = Fargo, OU = server, O = CompanyName Worldwide, CN = *.nd.dev.ecs.CompanyName.int
verify error:num=18:self-signed certificate
verify return:1
depth=0 C = US, ST = North Dakota, L = Fargo, OU = server, O = CompanyName Worldwide, CN = *.nd.dev.ecs.CompanyName.int
verify return:1
---
Certificate chain
0 s:/C=US/ST=North Dakota/L=Fargo/OU=server/O=CompanyName Worldwide/CN=*.nd.dev.ecs.CompanyName.int
   i:/C=US/ST=North Dakota/L=Fargo/OU=server/O=CompanyName Worldwide/CN=*.nd.dev.ecs.CompanyName.int
-----BEGIN CERTIFICATE-----
MIIFQDCCBCigAwIBAgIJANzBojR+ij2xMA0GCSqGSIb3DQEBBQUAMIGRMQswCQYD
VQQGEwJVUzERMA8GA1UECBMITWlzc291cmkxFDASBgNVBAcTC1NhaW50IExvdWlz
   … <output truncated for this example> …

c.       The process is now complete.  You can have your application team test SSL access to ensure everything is working properly.

Installing the EMC ECS CLI Package

Below is an outline on installing EMC’s ECS CLI package.  Once it’s installed, I have another blog post that outlines all of the ECSCLI commands here.

Getting Started

Prerequisites:

Install Python Requests Package:

  • Versions of ECSCLI prior to 3.x may require a manual install of the python requests package.  When I installed v3.1.9, the PIP install process appears to have taken care of installing the python requests package for me, but I saw reports of this issue while reading other documentation.   Either way, you can manually install the requests package either by using “pip install requests” or downloading the code from GitHub and running “python setup.py install”.

Install ECSCLI using Python PIP:

  • There are frequent updates and fixes being made to the ECSCLI package. The latest version of ECSCLI can always be downloaded and installed via pip using “pip install ecscli” from a Windows command prompt.  Pip will be in your system path once you’ve installed Python, so it can be run from any directory.  If you want to archive a copy, use “pip download ecscli” rather than “pip install ecscli”.  As an alternative, you can also find the ECSCLI install package available for download at EMC’s support site (v2 is available here).

ECS CLI PIP Installation and Configuration

You will need to set up a configuration profile once ECSCLI is installed.  Configuration profiles address issues with older versions of the ECSCLI regarding authentication and python dependencies.  A profile simply contains the hostname and port along with an existing management user who will be authenticating to that host.  Several profiles can be created but only one can be active.  Once the active profile is set, ECSCLI will then use that profile for authenticating and sending commands.

To install the ecscli via pip:

pip install ecscli

Collecting ecscli
Downloading ecscli-2.2.0a5.tar.gz (241kB)
100% |████████████████████████████████| 245kB 568kB/s
Requirement already satisfied (use --upgrade to upgrade): requests in ./anaconda/envs/ecscli_demoenv/lib/python2.7/site-packages (from ecscli)
Building wheels for collected packages: ecscli
Running setup.py bdist_wheel for ecscli ... done
Stored in directory: /Users/conerj/Library/Caches/pip/wheels/92/7f/c3/129ffe5cd1b3b20506264398078bdd886c27fefe89b062b711
Successfully built ecscli
Installing collected packages: ecscli
Successfully installed ecscli-2.2.0a5

To see a list of profiles:

ecscli config list

Running without an acive config profile
list of existing configuration profiles:

Since the ecscli was just installed, no profiles exist yet.

Once you have an active profile, the output will look like this:

Running with config profile: C:\python\ecscli/ecscliconfig_demouser_.json
user: root host:port: 10.10.10.1:4443
list of existing configuration profiles:
ACTIVE  |PROFILE   |HOSTNAME   |PORT   |MGMT USER   |ECS VERSION
----------------------------------------------------------------
        |demouser  |10.10.10.1 |4443   |root        |3.0

To create a profile:

ecscli config -pf demoprofile

Running without an acive config profile
Please enter the default ECS hostname or ip (127.0.0.1):10.10.10.11
Please enter the default command port (4443):
Please enter the default user for the profile (root):
Entered saveConfig profileName = demoprofile
will be saved to base path: /Users/demo_user/ecscliconfig_
Saving profile config to: /Users/demo_user/ecscliconfig_demoprofile_.json
list of existing configuration profiles:
     * demoprofile2 - hostname:10.10.10.11:4443       user:root

Normally one profile will always be active.  Because this is the first time a profile is being created, ECSCLI will run without an active profile. The CLI will prompt the user to enter the hostname, IP, port and management user for the profile. The “*” shows the active profile that will be used. Several profiles can be configured, however only one profile can be active at a time. The profiles are stored in .json files in the home directory with the name prefix “ecscliconfig_”.
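
Because the profiles are just .json files, you can also view them directly from the home directory (assuming the default location and name prefix described above):

ls -l ~/ecscliconfig_*.json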

To see a list of profiles and the active profile:

ecscli config list

Running with config profile: demoprofile
user: demo_user    host:port: 10.10.10.10:4443
list of existing configuration profiles:
    * demoprofile2 - hostname:10.10.10.11:4443 user:demouser
      demoprofile  - hostname:10.10.10.10:4443 user:root

The currently active profile is denoted by “*” before the profile name.

To change the active profile:

ecscli config set -pf mydemoprofile

Running with config profile: demoprofile2
user: demo_user    host:port: 10.10.10.11:4443
list of existing configuration profiles:
   demo_profile2 - hostname:10.10.10.11:4443 user:demouser
   demo_profile  - hostname:10.10.10.10:4443 user:root

To delete a profile:

ecscli config delete -pf mydemoprofile

Running with config profile: demoprofile
user: root  host:port: 10.10.10.10:4443
list of existing configuration profiles:
* demoprofile2 - hostname:10.10.10.11:4443 user:demouser

Since the currently active profile was deleted in this example, the ecscli chose another profile to set as the active profile.

The ecscli configuration handles the “--hostname” and “--port” arguments and manages the tokens for subsequent management requests.  Authentication is still required.  This and all other requests are simplified since cookie-related arguments are no longer required.

To Authenticate:

ecscli authenticate

Running with config profile: demoprofile2
user: root  host:port: 10.10.10.10:4443
Password :
authentication result: root : Authenticated Successfully
/Users/demo_user/demo_profile/rootcookie : Cookie saved successfully

Another sample command:

This command example will list the storage pools:

ecscli objectvpool list

Running with config profile: demo_profile
user: root    host:port: 10.10.10.10:4443
{'global_data_vpool': [{'isAllowAllNamespaces': True, 'remote': None, 'name': 'lab_env', 'enable_rebalancing': True, 'global': None, 'creation_time': 1033186012844, 'isFullRep': False, 'vdc': None, 'inactive': False, 'varrayMappings': [{'name': 'urn:storageos:VirtualDataCenterData:823c6f4c-bda2-6ca2-69d7-110df3e9f022', 'value': 'urn:storageos:VirtualArray:19f03490-3f30-25dd-5f5c-8b208f64e3f0'}], 'id': 'urn:storageos:ReplicationGroupInfo:8066234b-bdc2-6234-f066-81f0aa61e7bf:global', 'description': ''}]}

Using isi_vol_copy_vnx for VNX to Isilon data migration

For most data migrations from VNX to Isilon, EMC recommends that you use the OneFS migration tool isi_vol_copy_vnx. It can often be more efficient than host-based tools (such as EMCopy and Robocopy) because the performance of a host-based tool depends on the network connectivity of the host, while isi_vol_copy_vnx depends only on the network connection between the source device and the Isilon cluster. Below is a basic outline of the syntax, the steps required, and a few troubleshooting tips.

You might consider migrating data with a host-based tool if one or more of the following conditions apply to your migration:

  • The source device and Isilon cluster are on separate networks.
  • Security restrictions prevent the source device and Isilon cluster from communicating directly.

Command Syntax:

isi_vol_copy_vnx --help

The source must contain a source host, a colon, and then the absolute source path name.

 isi_vol_copy_vnx <src_filer>:<src_dir> <dest_dir>
                 [-sa user: | user:password]
                 [-sport ndmp_src_port]
                 [-dport ndmp_data_port]
                 [-full | -incr] [-level_based]
                 [-dhost dest_ip_addr]
                 [-no_acl]

 isi_vol_copy_vnx -list [migration-id] | [[-detail] [-state=<state>] [-destination=<pathname>]]
 isi_vol_copy_vnx -cleanup <migration-id> [-everything [-noprompt]]
 isi_vol_copy_vnx -get_config
 isi_vol_copy_vnx -set_config <name>=<value>
 isi_vol_copy_vnx -h | -help
 Defaults:
   src_auth_user       = root
   src_auth_password   =
   ndmp_src_port       = 0  (0 means NDMP default, usually 10000)
   ndmp_data_port      = any
   dest_ip_addr        = none

Note: This tool uses NDMP to transfer the data from the source VNX to the Isilon.

Migration Steps:

  1. Configure NDMP User

Create a new NDMP user on the source VNX. Log in to the control station and run the following command:

/nas/sbin/server_user -add <new_username> -ndmp_md5 -passwd

Select the defaults when prompted and be sure to make note of the password.

  2. Determine the absolute path of your filesystems and shares

If you’re using virtual data movers it changes the root path of your filesystem.  Issue the following command to review your file systems and mount paths:

server_mount ALL

Note the specific path for the file system that is targeted for migration. The path when using a VDM will be similar to this:

FILESYSTEM1 on /root_vdm_1/FILESYSTEM1 uxfs,perm,rw

In this case the path will be /root_vdm_1/FILESYSTEM1, which will be used for the source path in the isi_vol_copy_vnx command.

    3. Determine the target Isilon Data Location

Determine the destination location on the Isilon in the /ifs/data folder where the data will be migrated.  If the destination folder doesn’t exist on the Isilon, isi_vol_copy_vnx will create it with the exact same NTFS permissions as the source.  Create the command with the following syntax:

isi_vol_copy_vnx <datamoverIP>:<source_path> <target Isilon path> -sa <username>: <-full or -incr>

isi_vol_copy_vnx 10.10.10.10:/root_vdm_1/FILESYSTEM1 /ifs/data/FILESYSTEM1 -sa ndmpuser1: -full
  4. Migrate the Data

The command outlined above will run a full copy using the ndmpuser1 account and will prompt for a password, so the password does not have to be shown in plain text. The password can also be specified in the command itself using the -sa user:password syntax, however you will still be required to follow the username with a colon.

If successful, the message “msg SnapSure file system creation succeeds” will appear, which means the NDMP session created a checkpoint successfully and is starting to copy data from that checkpoint.

Note that this does not migrate shares, just data.  Sharedupe can be utilized for that, or the CIFS shares/NFS exports can be manually re-created.  It is recommended that any other data migrations on the source VNX be disabled prior to the copy so that you don’t run into performance issues.

Caveats:

  • There is no bandwidth throttling option with this command; it will consume all available bandwidth.
  • Isilon does not support historical SIDs in versions 8.0.0 or earlier, which may result in permission issues post-migration due to being unable to resolve historical SIDs from other platforms (see KB468562).  If SID history is in use on the source, then this is not the proper tool.  From the comments section, please note that OneFS does support Security Identifier (SID) history beginning in 8.0.1 and later releases, and 8.0.0.3 and later releases (see the latest docu44518_Isilon-Supportability-and-Compatibility-Guide).
  • If fs_dedupe is enabled on the Celerra or VNX, you will need to change the backup threshold to zero for each filesystem.  This means that when sending the data over NDMP, the full file is sent rather than the compressed or de-duplicated version.  Note that there is a risk here of inflating existing backup sets if they are being done over NDMP.
  • On the source, the account performing the copy needs local administrator or backup operator permissions on the source CIFS server, and full control over the source share.
  • Standard NDMP backups and isi_vol_copy_vnx can affect each other and the data backed up by the two NDMP clients.  See KB187008 for a workaround.

Best Practices:

  • isi_vol_copy target data use

Do not touch the data on the target Isilon until after the isi_vol_copy has completed.

Why:  This will create problems and you may have to re-do a full copy.

  • Simultaneous isi_vol_copy use

Do not execute multiple isi_vol_copy commands going to the same target, i.e. don’t have all your isi_vol_copy migrations going to the same target directory.

For example:
filer1:/vol/sourcedir -> isilon:/ifs/data
filer2:/vol/sourcedir2 -> isilon:/ifs/data

Why:  Creates problems for the copy process and may require remediation after migration.

Instead: Use an additional directory level:
filer1:/vol/sourcedir -> isilon:/ifs/data/filer1/sourcedir
filer2:/vol/sourcedir2-> isilon:/ifs/data/filer2/sourcedir

If consolidation is required, this can happen after the data is migrated and any potential merging of identically named subdirectories can be addressed.

  • isi_vol_copy use

isi_vol_copy is optimized to stream as much data as possible across the network; always monitor load on the source and target systems for potential impact.

Why:  Since isi_vol_copy is optimized to stream as much data as possible, take care not to overwhelm older source systems and create potential link saturation or disk problems, especially if there are users connected and attempting to access files.

  • isi_vol_copy limits

Recommend less than 40 million files per volume transfer when using isi_vol_copy.

Why:  All programs have limits and this is the recommended maximum when using isi_vol_copy for each individual transfer.  Larger source volumes should be broken up into smaller chunks (i.e. use a separate isi_vol_copy stream for multiple subdirectories instead of one large transfer of an entire volume).

  • Optimize the network for the migration traffic

Optimize the migration network path; look to limit other production traffic on this network and limit the number of network devices the traffic traverses (firewalls, IDS, etc.). Ideally, create a dedicated private migration network that can be optimized for only the migration traffic.

Why:  Separating the migration traffic from other network traffic will allow for maximum throughput and reduce potential impact to existing production traffic by limiting network saturation.

  • Use a dedicated migration account

Use a specific migration account, or an account with group membership, that has the required access to all source and target data (i.e. root).

Why:  Using a dedicated account will allow for oversight and management of the migration data access.  It will also allow for the separation of migration tasks and users from other production accounts.

  • Watch out for root_squash

On the source, exports sometimes restrict access by using root_squash to prevent remote root users from having root privileges, but root access is exactly what we need for migrating data.  Use the “no_root_squash” option to turn off root squashing.

Why:  Must have root access (or equivalent) to migrate all files and directories.

  • NFS Exports

Create the new Isilon NFS exports and permissions prior to the data migration.

Why:  This allows the creation and setup of the exports and export permissions prior to data migration and cutover, for initial testing and access validation.

Troubleshooting Tips:

  1. Checkpoint Creation on the VNX

The most common issue when running the isi_vol_copy_vnx command is with checkpoint creation on the source VNX.

If you are receiving a message similar to “msg SnapSure file system creation failed” during a copy session, the command is failing to create a snapshot of the source file system. It could be due to many reasons, including a lack of available disk space. Try manually creating a snapshot on the source VNX file system to see if it fails; below is the syntax to do so:

# fs_ckpt Test_FS -name Test_FS-ckpt01 -Create
  2. Permission or Connection Issues.

In general the error message itself will be self-explanatory. Make sure you are using the correct credentials for the NDMP user in the migration command. The user should have sufficient rights on the source and target systems, and should be able to create and modify directories and the files contained within.  As an example, in the case below, NDMP port 53169 was blocked between the VNX and the Isilon.  Opening the port on the firewall resolved the issue.

ISILON568-1# isi_vol_copy_vnx 10.10.10.10:/Volcopytest /ifs/data/Volcopytest -sa Volcopytest:Volcopytest -sport 53169 -full -dhost 10.10.10.11
system call (connect): Connection refused
Could not open NDMP connection to host 10.10.10.10
isi_vol_copy_vnx did not run properly

3.  32-bit Unix Application Issues.

If your application is 32-bit, 32-bit file IDs must be enabled on the new NFS export.

ISILON# isi nfs exports modify EXPID --zone=NFSZone --return-32bit-file-ids=yes

Replace EXPID with the ID of the target export, then verify the setting by viewing the export.

ISILON# isi nfs exports view EXPID --zone=NFSZone | grep -i return

ISILON# _

4. Snapshot creation on the target Isilon array.

Snapshots can fail for many reasons, but most often it’s due to lack of available space. In the example below the snapshot creation failed because there was an existing snapshot with the same name.

ISILON568# isi_vol_copy_vnx VNX-SERVER3:/Test_FS/NFS01 /ifs/data/Test_NFS01 -sa ndmp:NDMPpassword -incr
Snapshot with conflicting name ‘isi_vol_copy.011.1.snap’ found. Remove/Rename the conflicting snapshot to continue with further migration runs.
snapshot already exists
ISILON568-1#

Either delete or rename the existing snap to resolve the issue.  In the example below the snapshot was deleted.

ISILON568-1# isi snapshot snapshots list | grep isi_vol_copy.011
134 isi_vol_copy.011.0.snap /ifs/.ifsvar/modules/isi_vol_copy/011/persistent_store
136 isi_vol_copy.011.1.snap /ifs/.ifsvar/modules/isi_vol_copy/011/persistent_store
ISILON568-1#

ISILON568-1# isi snapshot snapshots delete --snapshot=134
Are you sure? (yes/[no]): y
ISILON568-1#

VNX Block and File Password Change Procedure

Below is the procedure for changing the passwords on a Unified VNX on both the block and file sides.

Please note that changing the global VNX administrator password can cause communication failures between the Control Station and the array. The issue is documented in emc261195 and is the reason I’m adding this post; I was researching how to avoid it. The article notes that in DART OS versions newer than v7.0.14 the synchronization is automated and the cached credentials are updated automatically; in DART OS v7.0.14 and prior you must do it manually on the active control station.

Whenever a change is made to the active Control Station, always verify that the standby control station configuration matches on takeover.  Takeover is initiated by the standby control station; failover is initiated by the active control station. As an example, if the time zone is changed on the active control station, it is not part of the synchronization during the failover process. Time zone changes need to be configured separately on each one, and the setting requires a reboot to take effect. Unisphere will prompt you to do so as a reminder; however, on takeover/failover the newly promoted control station never reboots.

Block Side: Change the sysadmin global domain account

A) Updating global domain account password

1) Log into Unisphere with the sysadmin global account, using the control station IP

2) From the “All Systems” page select “Domains”

3) Select “Manage Global Users”

4) Highlight sysadmin and click on “Modify”

5) Change the password

B) Update Security on Control Station

1) Open a putty session to the primary control station and run the commands below. They should work as-is; a possible exception is that the first two might require user ID/password info before being accepted (add -user sysadmin -password "pswd" -scope 0 to the commands below).

/nas/sbin/naviseccli -h spa -AddUserSecurity -user sysadmin -scope 0

/nas/sbin/naviseccli -h spb -AddUserSecurity -user sysadmin -scope 0

nas_storage -modify id=1 -security

 C) Verify the updated sysadmin password in the following locations:

1) Via Unisphere (Log in with the newly changed password)

2) Verify communication between the control station and storage processors on each array:

* Log in to the active control station via putty using the nasadmin local account

* Run the following commands:

/nas/sbin/navicli -h SPA getagent

nas_storage -check -all

The NAS storage check command should respond with “done”.

File Side: Change the nasadmin and root local accounts

A) Local accounts need to be modified on each array individually

1) Log into Unisphere with the sysadmin global account

2) Select the desired array

3) Click on “Settings” -> “Security” -> “Local Users for File”

4) Highlight nasadmin click on “Properties”

5) Change the password

6) Highlight root, click on Properties

7) Change the password

Note that the password for the local nasadmin and root accounts can also be changed from the CLI:

[root@fakevnxprompt ~]# passwd nasadmin
Changing password for user nasadmin. 
New UNIX password: <enter a password>
BAD PASSWORD: it is based on a dictionary word 
Retype new UNIX password: 
passwd: all authentication tokens updated successfully. 
[root@fakevnxprompt ~]#

B) Verify both passwords

1) Log in to the active control station via putty as nasadmin, verify your newly changed password

2) run the su command and verify your newly changed root password

C) Propagate changes to the standby control station

At this point the standby control station local account passwords have not yet been updated. It’s now time to test control station failover.  You can review one of my related prior blog posts on control station failover here.

1) While logged in to the active control station with root privileges, run this command:

/nas/sbin/cs_standby -failover

This will synchronize the control stations, reboot the active control station, and then make the standby control station active.

Caveats: Please note the expected limitations listed below while there is no online, active control station available for out-of-band communication:

* In-band production data will not be disrupted

* Data mover failover cannot occur

* Auto-extension of filesystems will not occur

* Scheduled checkpoints will not occur

* Replication sessions may be disrupted

2) Log in to the active control station (the previous standby control station)

3) Verify the new nasadmin password

4) su and verify the root password

5) Fail back to the original primary control station:

/nas/sbin/cs_standby -failover

Matching LUNs and UIDs when presenting VPLEX LUNs to Unix hosts

Our naming convention for LUNs includes the pool ID, LUN number, server name, filesystem/drive letter, last four digits of the array’s serial number, and size (in GB). Having all of this information in the LUN name makes for very easy reporting and identification of LUNs on a server.  This is what our LUN names look like: P1_LUN100_SPA_0000_servername_filesystem_150G

Typically, when presenting a new LUN to our AIX administration team for a new server build, they would assign the LUNs to specific volume groups based on the LUN names. The command ‘powermt display dev=hdiskpower#’ always includes the name & intended volume group for the LUN, making it easy for our admins to identify a LUN’s purpose.  Now that we are presenting LUNs through our VPlex, when they run a powermt display on the server the UID for the LUN is shown, not the name.  Below is a sample output of what is displayed.

root@VIOserver1:/ # powermt display dev=all
Pseudo name=hdiskpower0
VPLEX ID=FNM00141800023
Logical device ID=6000144000000010704759ADDF2487A6 (this would usually be displayed as a LUN name)
state=alive; policy=ADaptive; queued-IOs=0
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --   -- Stats ---
###  HW Path       I/O Paths    Interf.   Mode     State    Q-IOs   Errors
==============================================================================
1 fscsi1 hdisk8 CL1-0B active alive 0 0
1 fscsi1 hdisk6 CL1-0F active alive 0 0
0 fscsi0 hdisk4 CL1-0D active alive 0 0
0 fscsi0 hdisk2 CL1-07 active alive 0 0

Pseudo name=hdiskpower1
VPLEX ID=FNM00141800023
Logical device ID=6000144000000010704759ADDF2487A1 (this would usually be displayed as a LUN name)
state=alive; policy=ADaptive; queued-IOs=0
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --   -- Stats ---
###  HW Path       I/O Paths    Interf.   Mode     State    Q-IOs   Errors
==============================================================================
1 fscsi1 hdisk9 CL1-0B active alive 0 0
1 fscsi1 hdisk7 CL1-0F active alive 0 0
0 fscsi0 hdisk5 CL1-0D active alive 0 0
0 fscsi0 hdisk3 CL1-07 active alive 0 0

In order to easily match up the UIDs with the LUN names on the server, an extra step needs to be taken on the VPlex CLI. Log in to the VPlex using a terminal emulator, and once you’re logged in use the ‘vplexcli’ command. That will take you to a shell that allows for additional commands to be entered.

login as: admin
Using keyboard-interactive authentication.
Password:
Last login: Fri Sep 19 13:35:28 2014 from 10.16.4.128
admin@service:~> vplexcli
Trying ::1…
Connected to localhost.
Escape character is ‘^]’.

Enter User Name: admin

Password:

VPlexcli:/>

Once you’re in, run the ls -t command with the additional options listed below. You will need to substitute the STORAGE_VIEW_NAME with the actual name of the storage view that you want a list of LUNs from.

VPlexcli:/> ls -t /clusters/cluster-1/exports/storage-views/STORAGE_VIEW_NAME::virtual-volumes

The output looks like this:

/clusters/cluster-1/exports/storage-views/st1pvio12a-b:
Name Value
————— ————————————————————————————————–
virtual-volumes [(0,P1_LUN411_7872_SPB_VIOServer1_VIO_10G,VPD83T3:6000144000000010704759addf2487a6,10G),
(1,P0_LUN111_7872_SPA_VIOServer1_VIO_10G,VPD83T3:6000144000000010704759addf2487a1,10G)]

Now you can easily see which disk UID is tied to which LUN name.

If you would like to get a list of every storage view and every LUN:UID mapping, you can substitute the storage view name with an asterisk (*).

VPlexcli:/> ls -t /clusters/cluster-1/exports/storage-views/*::virtual-volumes

The resulting report will show a complete list of LUNs, grouped by storage view:

/clusters/cluster-1/exports/storage-views/VIOServer1:
Name Value
————— ————————————————————————————————–
virtual-volumes [(0,P1_LUN421_9322_SPB_ … <output truncated for this example> …

/clusters/cluster-1/exports/storage-views/VIOServer2:
Name Value
————— ————————————————————————————————–
virtual-volumes [(0,P1_LUN421_9322_SPB_VIOServer2_root_75G,VPD83T3:6000144000000010704759addf248ad9,75G),
(1,R2_LUN125_9322_SPB_VIOServer2_redo2_12G,VPD83T3:6000144000000010704759addf248b09,12G),
(2,R2_LUN124_9322_SPA_VIOServer2_redo1_12G,VPD83T3:6000144000000010704759addf248b04,12G),
(3,P3_LUN906_9322_SPB_VIOServer2_oraarc_250G,VPD83T3:6000144000000010704759addf248aff,250G),
(4,P2_LUN706_9322_SPA_VIOServer2_oraarc_250G,VPD83T3:6000144000000010704759addf248afa,250G)]

/clusters/cluster-1/exports/storage-views/VIOServer2:
Name Value
————— ————————————————————————————————
virtual-volumes [(1,R2_LUN1025_9322_SPB_VIOServer2_redo2_12G,VPD83T3:6000144000000010704759addf248b09,12G),
(2,R2_LUN1024_9322_SPA_VIOServer2_redo1_12G,VPD83T3:6000144000000010704759addf248b04,12G),
(3,P3_LUN906_9322_SPB_VIOServer2_ora1_250G,VPD83T3:6000144000000010704759addf248aff,250G),
(4,P2_LUN706_9322_SPA_VIOServer2_ora2_250G,VPD83T3:6000144000000010704759addf248afa,250G)]

/clusters/cluster-1/exports/storage-views/VIOServer3:
Name Value
————— ————————————————————————————————
virtual-volumes [(0,P0_LUN101_3432_SPA_VIOServer3_root_75G,VPD83T3:6000144000000010704759addf248a0a,75G),
(1,P0_LUN130_3432_SPA_VIOServer3_redo1_25G,VPD83T3:6000144000000010704759addf248a0f,25G),

Our VPlex has only been installed for a few months and our team is still learning.  There may be a better way to do this, but it’s all I’ve been able to figure out so far.

The steps for NFS exporting a file system on a VDM

I made a blog post back in January 2014 about creating an NFS export on a virtual data mover, but I didn’t give much detail on the commands you need to use to actually do it. As I pointed out back then, you can’t NFS export a VDM file system from within Unisphere; however, when a file system is mounted on a VDM, its path from the root of the physical Data Mover can be exported from the CLI.

The first thing that needs to be done is determining the physical Data Mover where the VDM resides.

Below is the command you’d use to make that determination:

[nasadmin@Celerra_hostname]$ nas_server -i -v name_of_your_vdm | grep server
server = server_4

That will show you just the physical data mover that it’s mounted on. Without the grep statement, you’d get the output below. If you have hundreds of filesystems it will cause the screen to scroll the info you’re looking for off the top of the screen. Using grep is more efficient.

[nasadmin@Celerra_hostname]$ nas_server -i -v name_of_your_vdm
id = 1
name = name_of_your_vdm
acl = 0
type = vdm
server = server_4
rootfs = root_fs_vdm_name_of_your_vdm
I18N mode = UNICODE
mountedfs = fs1,fs2,fs3,fs4,fs5,fs6,fs7,fs8,…
member_of =
status :
defined = enabled
actual = loaded, active
Interfaces to services mapping:
interface=10-3-20-167 :cifs
interface=10-3-20-130 :cifs
interface=10-3-20-131 :cifs

Next you need to determine the file system path from the root of the Data Mover. This can be done with the server_mount command. As in the prior step, it’s more efficient if you grep for the name of the file system. You can run it without the grep command, but it could generate multiple screens of output depending on the number of file systems you have.

[nasadmin@stlpemccs04a /]$ server_mount server_4 | grep Filesystem_03
Filesystem_03 on /root_vdm_3/Filesystem_03 uxfs,perm,rw

The final step is to actually export the file system using this path from the prior step. The file system must be exported from the root of the Data Mover rather than the VDM. Note that once you have exported the VDM file system from the CLI, you can then manage it from within Unisphere if you’d like to set server permissions. The “-option anon=0,access=server_name,root=server_name” portion of the CLI command below can be left off if you’d prefer to use the GUI for that.

[nasadmin@Celerra_hostname]$ server_export server_4 -Protocol nfs -option anon=0,access=server_name,root=server_name /root_vdm_3/Filesystem_03
server_4 : done

At this point the client can mount the path with NFS.
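
For example, from a Linux client (a sketch; replace <datamover_interface_ip> with an interface on the physical Data Mover that serves NFS):

mount -t nfs <datamover_interface_ip>:/root_vdm_3/Filesystem_03 /mnt/Filesystem_03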

Dynamic allocation pool limit has been reached

We were having issues with our backup jobs failing on CIFS share backups using Symantec NetBackup.  The jobs died with a “status 24”, which means they were losing communication with the source.  Our backup administrator provided me with the exact times & dates of the failures, and I noticed that immediately preceding the failures this error appeared in the server log on the control station:

2012-08-05 07:09:37: KERNEL: 4: 10: Dynamic allocation pool limit has been reached. Limit=0x30000 Current=0x50920 Max=0x0
 

A quick Google search came up with this description of the error:  “The maximum amount of memory (number of 8K pages) allowed for dynamic memory allocation has almost been reached. This indicates that a possible memory leak is in progress and the Data Mover may soon panic. If Max=0(zero) then the system forced panic option is disabled. If Max is not zero then the system will force a panic if dynamic memory allocation reaches this level.”

Based on the fact that the error shows up right before a backup failure, I saw the correlation.  To fix it, you’ll need to modify the heap limit from the default of 0x00030000 to a larger size.  Here is the command to do that:

.server_config server_2 -v "param kernel mallocHeapLimit=0x40000" (to change the value)
.server_config server_2 -v "param kernel" (will list the kernel parameters).
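
To confirm the new value took effect, you can filter the full parameter listing for the heap limit (this simply pipes the second command above through grep on the control station):

.server_config server_2 -v "param kernel" | grep mallocHeapLimit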
 

Below is a list of all the kernel parameters:

Name                                                 Location        Current       Default
----                                                 ----------      ----------    ----------
kernel.AutoconfigDriverFirst                         0x0003b52d30    0x00000000    0x00000000
kernel.BufferCacheHitRatio                           0x0002093108    0x00000050    0x00000050
kernel.MSIXdebug                                     0x0002094714    0x00000001    0x00000001
kernel.MSIXenable                                    0x000209471c    0x00000001    0x00000001
kernel.MSI_NoStop                                    0x0002094710    0x00000001    0x00000001
kernel.MSIenable                                     0x0002094718    0x00000001    0x00000001
kernel.MsiRouting                                    0x0002094724    0x00000001    0x00000001
kernel.WatchDog                                      0x0003aeb4e0    0x00000001    0x00000001
kernel.autoreboot                                    0x0003a0aefc    0x00000258    0x00000258
kernel.bcmTimeoutFix                                 0x0002179920    0x00000002    0x00000002
kernel.buffersWatermarkPercentage                    0x0003ae964c    0x00000021    0x00000021
kernel.bufreclaim                                    0x0003ae9640    0x00000001    0x00000001
kernel.canRunRT                                      0x000208f7a0    0xffffffff    0xffffffff
kernel.dumpcompress                                  0x000208f794    0x00000001    0x00000001
kernel.enableFCFastInit                              0x00022c29d4    0x00000001    0x00000001
kernel.enableWarmReboot                              0x000217ee68    0x00000001    0x00000001
kernel.forceWholeTLBflush                            0x00039d0900    0x00000000    0x00000000
kernel.heapHighWater                                 0x00020930c8    0x00004000    0x00004000
kernel.heapLowWater                                  0x00020930c4    0x00000080    0x00000080
kernel.heapReserve                                   0x00020930c0    0x00022e98    0x00022e98
kernel.highwatermakpercentdirty                      0x00020930e0    0x00000064    0x00000064
kernel.lockstats                                     0x0002093128    0x00000001    0x00000001
kernel.longLivedChunkSize                            0x0003a23ed0    0x00002710    0x00002710
kernel.lowwatermakpercentdirty                       0x0003ae9654    0x00000000    0x00000000
kernel.mallocHeapLimit                               0x0003b5558c    0x00040000    0x00030000  (This is the parameter I changed)
kernel.mallocHeapMaxSize                             0x0003b55588    0x00000000    0x00000000
kernel.maskFcProc                                    0x0002094728    0x00000004    0x00000004
kernel.maxSizeToTryEMM                               0x0003a23f50    0x00000008    0x00000008
kernel.maxStrToBeProc                                0x0003b00f14    0x00000080    0x00000080
kernel.memSearchUsecs                                0x000208fa28    0x000186a0    0x000186a0
kernel.memThrottleMonitor                            0x0002091340    0x00000001    0x00000001
kernel.outerLoop                                     0x0003a0b508    0x00000001    0x00000001
kernel.panicOnClockStall                             0x0003a0cf30    0x00000000    0x00000000
kernel.pciePollingDefault                            0x00020948a0    0x00000001    0x00000001
kernel.percentOfFreeBufsToFreePerIter                0x00020930cc    0x0000000a    0x0000000a
kernel.periodicSyncInterval                          0x00020930e4    0x00000005    0x00000005
kernel.phTimeQuantum                                 0x0003b86e18    0x000003e8    0x000003e8
kernel.priBufCache.ReclaimPolicy                     0x00020930f4    0x00000001    0x00000001
kernel.priBufCache.UsageThreshold                    0x00020930f0    0x00000032    0x00000032
kernel.protect_zero                                  0x0003aeb4e8    0x00000001    0x00000001
kernel.remapChunkSize                                0x0003a23fd0    0x00000080    0x00000080
kernel.remapConfig                                   0x000208fe40    0x00000002    0x00000002
kernel.retryTLBflushIPI                              0x00020885b0    0x00000001    0x00000001
kernel.roundRobbin                                   0x0003a0b504    0x00000001    0x00000001
kernel.setMSRs                                       0x0002088610    0x00000001    0x00000001
kernel.shutdownWdInterval                            0x0002093238    0x0000000f    0x0000000f
kernel.startAP                                       0x0003aeb4e4    0x00000001    0x00000001
kernel.startIdleTime                                 0x0003aeb570    0x00000001    0x00000001
kernel.stream.assert                                 0x0003b00060    0x00000000    0x00000000
kernel.switchStackOnPanic                            0x000208f8e0    0x00000001    0x00000001
kernel.threads.alertOptions                          0x0003a22bf4    0x00000000    0x00000000
kernel.threads.maxBlockedTime                        0x000208f948    0x00000168    0x00000168
kernel.threads.minimumAlertBlockedTime               0x000208f94c    0x000000b4    0x000000b4
kernel.threads.panicIfHung                           0x0003a22bf0    0x00000000    0x00000000
kernel.timerCallbackHistory                          0x000208f780    0x00000001    0x00000001
kernel.timerCallbackTimeLimitMSec                    0x000208f784    0x00000003    0x00000003
kernel.trackIntrStats                                0x000209021c    0x00000001    0x00000001
kernel.usePhyDevName                                 0x0002094720    0x00000001    0x00000001

Using the server_stats command on Celerra / VNX File

Server_stats is a CLI-based real-time performance monitoring tool from EMC for the Celerra and VNX File.  This post is meant to give a quick overview of the server_stats command with some samples of using it in a scheduled cron job. If you’re looking to dive into the server_stats feature, I’d suggest reading the online manual pages (man server_stats) to get a good idea of all the features and reviewing the “Managing Statistics for VNX” guide from EMC here:  http://corpusweb130.emc.com/upd_prod_VNX/UPDFinalPDF/en/Statistics.pdf.

I don’t personally use it so I can’t explain how to set it up, but there is an open source tool called vnx2graphite that you can use to push server_stats data to Graphite. You can get it here: http://www.findbestopensource.com/product/fatz-vnx2graphite.   You can download Graphite here: http://graphite.wikidot.com/.

Here is the command line syntax:

server_stats <movername>
    -list
  | -info [-all|<statpath_name>[,...]]
  | -service { -start [-port <port_number>]
             | -stop
             | -delete
             | -status }
  | -monitor -action {status|enable|disable}
  | [
      [{ -monitor {statpath_name|statgroup_name}[,...]
       | -monitor {statpath_name|statgroup_name}
           [-sort <field_name>]
           [-order {asc|desc}]
           [-lines <lines_of_output>]
       }...]
      [-count <count>]
      [-interval <seconds>]
      [-terminationsummary {no|yes|only}]
      [-format {text [-titles {never|once|<repeat_frequency>}]|csv}]
      [-type {rate|diff|accu}]
      [-file <output_filepath> [-overwrite]]
      [-noresolve]
    ]

Here’s an explanation of a few of the useful table options and what to look for:

Syntax:  server_stats server_2 -i <interval in sec> -c <# of counts> -table <stat>

table cifs 

-Look at uSec/call. The output is in microseconds; divide by 1,000 to convert to milliseconds. This tells you how long it takes the Celerra to perform specific CIFS operations.

table dvol 

-This is for disk stats.  It shows the write distribution across all volumes.  Look for IO balance across resources.

table fsvol 

-Use this to check filesystem IO.  You’ll be able to monitor which file systems are getting all of the IO with this table.

Start with an interval of 1 second to look for spikes or bursts, then increase it incrementally (10 seconds, 30 seconds, 1 minute, 5 minutes, etc.). You can also use Celerra Monitor to get Clariion stats; look at queuing, cache flushes, etc.  Writes go straight to cache on the Clariion, so unless your write cache is filling up they should be faster than reads.

Here are some sample commands and what they do:

server_stats server_2 -table fsvol -interval 1 -count 10

-This correlates the filesystem to the meta-volumes and shows the % contribution of write requests for each meta-volume (FS Write Reqs %).

server_stats server_2 -table net -interval 1 -count 10

-This shows Network in (KiB/s) / Network In (Pkts/s) to figure out the packet size.  Do this for in and for out to verify the standard MTU size.

server_stats server_2 -summary nfs,cifs -interval 1 -count 10

-This will give a summary of performance stats for nfs and cifs.

Here are some additional sample commands, and how you can add to your crontab to automatically collect performance data:

Collect CIFS and NFS data every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor cifs.smb1,cifs.smb2,nfs.v2,nfs.v3,nfs.v4,cifs.global,nfs.basic -format csv -terminationsummary no -i 5 -c 60 -type accu -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

In the command above, the -type accu option tells the command to accumulate statistics with each capture rather than starting back at a baseline of zero. You can also use ‘diff’ to capture the difference from interval to interval.

Collect diskVol performance stats every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor diskVolumes-std -i 5 -c 60 -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

Collect top_talkers data every 5 minutes:

*/5 * * * * /nas/bin/server_stats server_2 -monitor nfs.client -i 5 -c 60 -file "/nas/quota/slot_2/perfstats/data/server_2/server_2_`date '+\%FT\%T'|sed s/://g`" > /dev/null

Below are some useful nfs and cifs stats that you can monitor (pulled from DART 8.1.2-51).  For a full list, run the command server_stats server_2 -info.

cifs.global.basic.totalCalls,cifs.global.basic.reads,cifs.global.basic.readBytes,cifs.global.basic.readAvgSize,cifs.global.basic.writes,cifs.global.basic.writeBytes,
cifs.global.basic.writeAvgSize,cifs.global.usage.currentConnections,cifs.global.usage.currentOpenFiles

nfs.basic,nfs.client,nfs.currentThreads,nfs.export,nfs.filesystem,nfs.group,nfs.maxUsedThreads,nfs.totalCalls,nfs.user,nfs.v2,nfs.v3,nfs.v4,nfs.vdm
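
As an illustration, below is a hedged one-off command (not from the original post) that samples a few of the CIFS counters above interactively. The stat paths are taken from the list above, but verify them against your own DART release with server_stats server_2 -info before relying on them.

server_stats server_2 -monitor cifs.global.basic.totalCalls,cifs.global.basic.reads,cifs.global.basic.writes,cifs.global.usage.currentConnections -interval 5 -count 12 -format csv -terminationsummary no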

 

How to reserve a Celerra / VNX NAS share for a single file type or group of file types

Several years ago I posted on Celerra/VNX NAS file extension filtering (see here), but didn’t write about file system reservations for specific file types, which is also possible.  You can set up NAS shares so that only the file types you want stored there can be written to the share.

In order to do this, first navigate to the \\NAS_Server\C$ administrative share and open the “.filefilter” folder.  You’ll then want to create the following filter files to complete the configuration:

allfiles[@<sharename>][@NetBIOS_name] – this filter file prohibits all file types from being created on the share.  File types that you want excluded from this blanket deny are identified by regular filter files.

noext[@<sharename>][@NetBIOS_name] – this filter file prohibits files with no extensions from being created on the share.  It will prevent a user from saving a file with no filename extension.

<extension_name>[@<sharename>][@NetBIOS_name] – this filter file identifies the types of files you want allowed on the share.  You will need to configure the ACLs to identify which users and/or groups can create files on the share.  File types specified by regular filter files like this one are the exceptions to the allfiles restriction.  If you wanted to reserve a share for only Outlook data and message files, you could create two filter files, pst[@<sharename>][@NetBIOS_name] and msg[@<sharename>][@NetBIOS_name], then set the appropriate permissions.
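
As a minimal sketch (the drive letter, share name ‘userdocs’, and CIFS server name ‘FILER01’ below are hypothetical), the filter files for a share reserved for Outlook archives could be created like this from a Windows admin workstation:

net use X: \\FILER01\C$
type nul > X:\.filefilter\allfiles@userdocs@FILER01
type nul > X:\.filefilter\noext@userdocs@FILER01
type nul > X:\.filefilter\pst@userdocs@FILER01
type nul > X:\.filefilter\msg@userdocs@FILER01

Remember to set the ACLs on the pst and msg filter files afterwards to control which users can create those file types on the share.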

 

How to create a clone of a file system on a Celerra or VNX using nas_copy

Below are the steps used to create a clone of a file system on a Celerra or VNX.  Cloning can be done to the same data mover, a different data mover on the same array, or to a completely different Celerra or VNX data mover.

  • The source file system needs to be mounted read-only, or alternately you can use a checkpointed copy of the file system.  Using a checkpoint is the least disruptive method if the source file system is being used in production.
  • The destination file system must be the same size or larger than the source file system and must also be mounted read-only.
  • If you created the new file system for the clone copy on the same storage pool as the source file system, the copy performance will suffer as you’re reading/writing to the same disks.
  • The ReplicatorV2 license must be enabled for nas_copy to work.

To check the status of your installed licenses, use nas_license -list. The output looks like this:

[nasadmin@celerra01 ~]$ nas_license -list
key status value
site_key online 51 56 2e 69
cifs online
nfs online
iscsi online
snapsure online
replicatorV2 online

  • To enable the replicator license, run this command:

nas_license -create replicatorV2

  • Once the source file system (or checkpoint, depending on which one you chose) and target file system are mounted correctly, verify the correct interconnect you want to use for the copy. You can review the configured interconnects with this command:

[nasadmin@celerra01 ~]$ nas_cel -interconnect -list

  • To begin the clone, run the nas_copy command, which has the following syntax:

[nasadmin@celerra01 ~]$ nas_copy

-name <sessionName>
     -source
     {-fs {<name> | id=<fsId>} | -ckpt {<ckptName> | id=<ckptId>}
     -destination
     {-fs {id=<dstFsId>|<existing_dstFsName>}
     |-pool {id=<dstStoragePoolId>}|<dstStoragePool>}}
     [-from_base {<ckpt_name>|id=<ckptId>}]
     -interconnect {<name> | id=<interConnectId>}
     [-source_interface {<nameServiceInterfaceName> | ip=<ipaddr>}]
     [-destination_interface {<nameServiceInterfaceName> | ip=<ipaddr>}]
     [-overwrite_destination]
     [-refresh]
     [-full_copy]
     [-background]

  • Below is an example of a valid nas_copy command. This command will create a copy of the source file system on the same Data Mover.

[nasadmin@celerra01 ~]$ nas_copy -name copy_session -source -ckpt checkpoint_of_source_filesystem -destination -fs clone_of_Source_filesystem -interconnect loopback -background

  • You can monitor the progress of the clone copy using nas_replicate -list.  The output of the command looks like this:
Name               Type        Local Mover  Interconnect      Celerra        Status
Site1_VDM01        vdm         server_2     <--Site2_VNX5700  Site1_NS960    Stopped
Site2_Filesystem2  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem3  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem4  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
Site2_Filesystem5  filesystem  server_2     <--Site1_NS960    Site2_VNX5700  OK
.
.
.

Using Microsoft’s BranchCache with the Celerra & VNX

We recently moved the data from several of our small offices to being hosted at a central regional data center. In one case, a site that formerly had its own Celerra for file access was now accessing NAS through a WAN link.  As expected, access to files was much slower.  After doing some research on how to speed up file access for users at the branch office locations I came across Microsoft’s BranchCache feature.

BranchCache is a WAN bandwidth optimization technology and is available on Windows 7 & 8, Server 2008 R2 and Server 2012. When BranchCache is enabled, it creates a cache of the content from the file server locally within the branch office. A client on the same network can then access the file very quickly from cache instead of having to download it across the WAN again. It’s a great feature for optimizing local link utilization, increasing the responsiveness of applications, and reducing WAN bandwidth consumption.

BranchCache can operate in either Hosted Cache Mode or Distributed Cache mode. In Hosted Cache mode, there is a Windows server configured to store the cached files. In distributed cache mode (appropriate for smaller sites) local clients in the office keep a copy of the content and make it available to other clients that access the same files.

Once your Windows administrator has BranchCache configured on a server (or it’s been enabled on the local client PCs), enabling it on the Celerra/VNX side is very simple. Log in to the CLI and su to gain root credentials, then type in the following command:

server_cifs server_2 -smbhash -service enable

If you’d like to enable BranchCache auditing so the Windows server administrators can see audit info in the Windows event logs, type in this command:

server_cifs server_2 -smbhash -audit enable

After that, you will need to restart the CIFS service on the data mover. Here are the commands to stop and start CIFS:

server_setup server_2 -P cifs -o stop
server_setup server_2 -P cifs -o start

To confirm that BranchCache was successfully enabled, type the following command:

server_cifs server_2 -smbhash -info

The output should look like this:

server_2:
Current smbhash parameters:
—————————
Enabled : Yes
Started : Yes

Finally, to enable hash support on each individual CIFS share that you want to use for BranchCache clients, use the command syntax below.  Note that EMC’s “Configuring BranchCache” document I link to at the bottom of this post does not contain this next command (a major oversight); it was pulled from page 58 of EMC’s Configuring CIFS guide.

# server_export <datamover_name> -P cifs -name <fs_name> -option netbios=<netbios_name_of_CIFS_Server>,type=hash /<fs_name>
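
For example, a hedged sketch (the file system, share, and CIFS server names here are made up) that exports a file system named fs01 with hash support enabled on CIFS server CIFSSERVER1:

# server_export server_2 -P cifs -name fs01 -option netbios=CIFSSERVER1,type=hash /fs01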

EMC has a detailed document on how to configure BranchCache, including all the steps you’d need to take to configure the server and PC clients. If you have an EMC support account you can download it here:

https://support.emc.com/docu42265_Configuring-BranchCache-V2-on-VNX.pdf?language=en_US

There is also a lengthy section on BranchCache in EMC’s CIFS Management for VNX guide here:

https://mydocs.emc.com/VNXDocs/CIFS.pdf.

Here is the link to Microsoft’s TechNet guide for BranchCache:

http://technet.microsoft.com/en-us/library/hh831696.aspx.

Creating an NFS export from a file system on a Virtual Data Mover (VDM)

When creating a new file system on a virtual data mover, you may have noticed that it can’t be NFS exported from within Unisphere.  From the GUI, file systems associated with a VDM do not appear in the drop down list for NFS exports when you attempt to create one.  I had always assumed that it simply couldn’t be done until we had a business need for it and I investigated it a bit further.  As it turns out, NFS exports can be created on a VDM path from the command line interface, with a few restrictions.

1. You need to create the NFS export at the physical Data Mover level (server_2, server_3, etc.) and include the nested path of the VDM, which is generally /root_vdm_X/fs_name.  You cannot create the NFS export at the VDM level.  (See the example command after this list.)

2. Since the NFS export is done at the physical Data Mover level it is not restricted to one specific VDM.

3. The nested NFS export can only be done from command line, it is not an option from within Unisphere.  Once you have created the NFS export, however, you can use the GUI to make changes to it.

4. The client must mount the NFS export using celerra:/root_vdm_X/fs_name, or you need to set up an NFS export alias to hide the nested path.

5. IP replication of the VDM will not replicate the source site NFS exports to the destination site. You will need to manually create those NFS exports on the destination side. Also, you cannot access those file systems unless the VDM has failed over.
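
As a minimal sketch (the VDM root path, file system name, and client hostname below are hypothetical), the export from the physical Data Mover would look something like this:

server_export server_2 -Protocol nfs -option rw=nfsclient01,root=nfsclient01 /root_vdm_1/fs01

The client would then mount it as celerra:/root_vdm_1/fs01.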

Changing the Login banner on a Celerra / VNX File control station

If you have a security requirement to change the login screen on operating systems that are accessible via a terminal session, it’s a fairly easy change on a Celerra or VNX File control station because the control station OS is based on Linux.  Simply log in as root and edit the /etc/issue file with vi.  It contains the login banner that is displayed at the login prompt when you connect to a control station.

This is what the default /etc/issue file contains on a VNX control station, and what you’ll see by default when you log in:

A customized version of the Linux operating system is used on the
EMC(R) VNX(TM) Control Station.  The operating system is
copyrighted and licensed pursuant to the GNU General Public License
(“GPL”), a copy of which can be found in the accompanying
documentation.  Please read the GPL carefully, because by using the
Linux operating system on the EMC VNX you agree to the terms
and conditions listed therein.
 
EXCEPT FOR ANY WARRANTIES WHICH MAY BE PROVIDED UNDER THE TERMS AND
CONDITIONS OF THE APPLICABLE WRITTEN AGREEMENTS BETWEEN YOU AND EMC,
THE SOFTWARE PROGRAMS ARE PROVIDED AND LICENSED “AS IS” WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT
NOT LIMITED TO, THE IMPLIED MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.  In no event will EMC Corporation be liable to
you or any other person or entity for (a) incidental, indirect,
special, exemplary or consequential damages or (b) any damages
whatsoever resulting from the loss of use, data or profits,
arising out of or in connection with the agreements between you
and EMC, the GPL, or your use of this software, even if advised
of the possibility of such damages.
 
EMC, VNX, Celerra, and CLARiiON are registered trademarks or trademarks of
EMC Corporation in the United States and/or other countries. All
other trademarks used herein are the property of their respective
owners.
 
EMC VNX Control Station Linux release 3.0 (NAS 7.1.71)
 

I don’t work for IBM, but below is an example of what you could change the /etc/issue file to.  I included the hostname of the control station, our company logo, and a legal warning regarding unauthorized logins and system monitoring.  Because the standard login banner from EMC includes the OS version on the last line, I suspect that the /etc/issue file is updated along with any OS upgrades or patches, although I can’t confirm that right now.  I’m keeping a backup copy in the same directory to easily change it back after a DART upgrade.

You can also edit the /etc/motd file, which displays immediately after a successful login.  By default the file contains the message “EMC Celerra Control Station Linux” or “EMC VNX Control Station Linux”, along with what I assume to be the date of the original OS installation.
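
As a quick sketch of the workflow (the backup file names are my own choice), I back up both files before editing them:

cp -p /etc/issue /etc/issue.default
cp -p /etc/motd /etc/motd.default
vi /etc/issue
vi /etc/motd

Open a new terminal session afterwards to verify that both banners display the way you expect.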

[sourcecode language=”css”]
—————————————————————
EMC VNX Control Station <Host_Name>
—————————————————————
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBB    MMMMMMM     MMMMMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBBB   MMMMMMMM   MMMMMMMM
IIII         BBBB           BBBB  MMMM MMMM MMMM MMMM
IIII          BBBB           BBBB  MMMM MMMM MMMM MMMM
IIII          BBBBBBBBBBBBBBBBB    MMMM  MMMMMMM  MMMM
IIII          BBBBBBBBBBBBBBBBB    MMMM   MMMMM   MMMM
IIII          BBBB           BBBB  MMMM    MMM    MMMM
IIII          BBBB           BBBB  MMMM     M     MMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBBB   MMMM           MMMM
IIIIIIIIIIIIIIIIIII  BBBBBBBBBBBBBBBBB    MMMM           MMMM
—————————————————————
Warning: This system is restricted to IBM authorized users for
business purposes only. Unauthorized access or use is
a violation of of company policy and the law.
—————————————————————
This system may be monitored for administrative and security
reasons. By logging in, you agree that you have read and
understand this notice.
—————————————————————
nasadmin@<Host_Name>’s password:
[/sourcecode]

Using the database query option on Celerra & VNX File commands


EMC has added a hidden query option to some of the nas commands that allows you to directly query the NAS database. I’ve tested the ‘-query’ option on the nas_fs, nas_server, nas_slice and nas_disk commands, however it may be available on other commands as well. It’s a powerful command with a lot of options, so I took some time to play around with it today. The EMC documentation on this option in their command line reference guide is very sparse and I haven’t done much experimentation with it yet, but I thought I’d share what I’ve found so far.

You can view all of the possible query tags by typing in the commands below. There are dozens of tags available for query. The output list for all the commands is large enough that it’s not practical to add it into this blog post.

nas_fs -query:tags
nas_server -query:tags
nas_slice -query:tags
nas_disk -query:tags
 

Here’s a snippet of the tags available for nas_fs (the first seven):

Supported Query Tags for nas_fs:
Tag           Synopsis
———————————————————–
ACL         The Access Control List of the item
ACLCHKInProgress         Is ACLCHK running on this fs?
ATime         The UTC date of the item’s table last access
AccessPolicyTranslationErrorMessage         The error message of the MPD translation error, if any.
AccessPolicyTranslationPercentComplete         The percentage of translation that has been completed by the Access Policy translation thread for this item.
AccessPolicyTranslationState         The state of the Access Policy translation thread for this item.
AutoExtend         Auto extend True/False
 

The basic syntax for a query looks like this:

nas_fs -query:inuse==y:type=uxfs:IsRoot=False -Fields:Name,Id,StoragePoolName,Size,SizeValues -format:'%s,%s,%s,%d,%d\n'

In the above example, we are running the query based on three options: we want the file system to be in use, be of the type uxfs, and not be root. In the fields parameter, we want the file system’s name, ID, storage pool name, size, and size values. Because our output has five fields, and we want each file system to have its own line in the output, we add five formatting options separated by commas (for a csv-type output), followed by a '\n' to add a line break after each file system’s information is output.

Here are the valid query operators:

= pattern match
== exact match
=- not less than
=+ not more than
=* any
=^ not having pattern
=^= not an exact match
=^- is less than
=^+ is greater than
=^* not any (none)

The format option is not well documented. The only parameters I’ve used for the format option are q and s. From what I’ve seen from testing, the tag used in the ‘-fields’ parameter is either simple or complex: a complex tag must be formatted with %q, and a simple tag must be formatted with %s. Adding a '\n' to the format option adds a line break to the output. If you use the wrong format specifier, it will display an error like this: “Error 2110: Invalid query syntax: the tag (‘[Tag Option]’) corresponding to this format specifier (“%q”) is not a complex tag”.

-format:'%s' : Simple formatting
-format:'%q' : Complex formatting

Below are some examples of using the query option. I will add more useful examples in the future after I’ve had more time to dive in to it.

To get the file system ID:

nas_fs -query:Name==[file system name] -format:'%s\n' -fields:VolumeID

List all file systems with their sizes:

nas_fs -query:inuse==y:type=uxfs:IsRoot=False -Fields:Name,Id,StoragePoolName,Size,SizeValues -format:'%s,%s,%s,%d,%d\n'

List all file system quotas on the array:

nas_fs -query:\* -fields:ID,TreeQuotas -format:'%s:\n%q#\n' -query:\* -fields:FileSystem,Path,BlockHardLimit,BlockSoftLimit,BlockGracePeriod,BlockUsage -format:'%s : %s : %s : %s : %s : %s\n'

List all Checkpoint file systems:

nas_fs -query:inuse=y:type=ckpt:isroot=false -fields:ServersNumeric,Id,Name,SizeValues -format:'%s,%s,%s,%s\n'

List all the Server_params for a specific server_X:

nas_server -query:Name=server_2 -format:'%q' -fields:ParamTable -query:Name== -fields:ChangeEffective,ConfiguredDefaultValue,ConfiguredValue,Current,CurrentValue,Default,Facility,IsRebootRequired,IsVisible,Name,Type,Description -format:'%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n'

List the current Wins/DNSDomain Config:

nas_server -query:Name=server -fields:Name,DefaultWINS,DNSDomain -format:"%s,%s,%s\n"

For a detailed file system deduplication report:

nas_fs -query:inuse=y:isroot=false:type=uxfs -fields:Deduplications -format:'%q' -query:RdeState=On -fields:Name,FsCapacity,SpaceSaved,SpaceReducedDataSize,DedupeRate,UnreducedDataSize,TimeOfLastScan -format:'%s,%s,%s,%s,%s,%s,%s\\n'

Relocating Celerra Checkpoints

On our NS-960 Celerra we have multiple storage pools defined, one of which is specifically designated for checkpoint storage. Out of the 100 or so file systems that we run a checkpoint schedule on, about 2/3 of them were incorrectly writing their checkpoints to the production file system pool rather than the designated checkpoint pool, and the production pool was starting to fill up. I started to research how you could change where checkpoints are stored. Unfortunately, you can’t actually relocate existing checkpoints; you need to start over.

In order to change where checkpoints are stored, you need to stop and delete any running replications (which automatically store root replication checkpoints) and delete all current checkpoints for the specific file system you’re working on. Depending on your checkpoint schedule and how often it runs you may want to pause it temporarily. Having to stop and delete remote replications was painful for me as I compete for bandwidth with our data domain backups to our DR site. Because of that I’ve been working on these one at a time.

Once you’ve deleted the relevant checkpoints and replications, you can choose where to store the new checkpoints by creating a single checkpoint first before making a schedule. From the GUI, go to Data Protection | Snapshots | Checkpoints tab, and click create. If a checkpoint does not exist for the file system, it will give you a choice of which pool you’d like to store it in. Below is what you’ll see in the popup window.

——————————————————————————–
Choose Data Mover [Drop Down List ▼]
Production File System [Drop Down List ▼]
Writeable Checkpoint [Checkbox]:
Data Movers: server_2
Checkpoint Name: [Fill in the Blank]

Configure Checkpoint Storage:
There are no checkpoints currently on this file system. Please specify how to allocate
storage for this and future checkpoints of this file system.

Create from:
*Storage Pool [Option Box]
*Meta Volume [Option Box]

——————————————————————————–
Current Storage System: CLARiiON CX4-960 APM00900900999
Storage Pool: [Drop Down List ▼]    <—*Select the appropriate Storage Pool Here*
Storage Capacity (MB): [Fill in the Blank]

Auto Extend Configuration:
High Water Mark: [Drop Down List for Percentage ▼]

Maximum Storage Capacity (MB): [Fill in the Blank]

——————————————————————————–
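
If you prefer the CLI, a rough sketch of the same idea is below. I haven’t verified the option names on every DART release, so treat the size= and pool= options as assumptions and check the fs_ckpt man page on your system first. The idea is to create the first checkpoint manually with its SavVol placed in the checkpoint pool:

fs_ckpt <fs_name> -name <ckpt_name> -Create size=<size> pool=<checkpoint_pool_name>

Once that first checkpoint exists in the right pool, the checkpoint schedule can be recreated and subsequent checkpoints of that file system will use the same SavVol.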

Collecting info on active shares, clients, protocols, and authentication on the VNX

This is a reference guide for listing and counting shares, clients, protocol and authentication information, and virus scan information.  All of this information could be used for a full VNX audit report.

How to list and count shares/exports (by protocol):

You can use the server_export command to list and count how many shares you have.

server_export server_2 -Protocol cifs -list -all:  This command will give you a list of all CIFS Shares across all of your CIFS servers.  It will count a file system twice if it’s shared on more than one CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name>:  This will give you a list of all CIFS shares on a specific CIFS server

server_export server_2 -Protocol cifs -list -all | grep <cifs_server_name> | wc:  This will give you the number of CIFS shares on a specific CIFS server.  The “wc” command will output three numbers; the first number listed is the number of shares.

server_export server_2 -Protocol nfs -list -all:  This command will give you a list of all NFS exports.  It’s just like the previous commands, you can add “| wc” at the end to get a count.

How to list client connections by OS type:

To obtain information for the number of client connections you have by OS type, you’d have to use the server_cifs audit command.

To get a full list of every active connection by client type, use this command:

server_cifs server_2 -option audit,full | grep "Client("

The output would look like this:

|||| AUDIT Ctx=0x022a6bdc08, ref=2, W2K Client(10.0.5.161) Port=49863/445
|||| AUDIT Ctx=0x01f18efc08, ref=2, XP Client(10.0.5.42) Port=3890/445
|||| AUDIT Ctx=0x02193a2408, ref=2, W2K Client(10.0.5.61) Port=59027/445
|||| AUDIT Ctx=0x01b89c2808, ref=2, Fedora6 Client(10.0.5.194) Port=17130/445
|||| AUDIT Ctx=0x0203ae3008, ref=2, Fedora6 Client(10.0.5.52) Port=55731/445
...

In this case, if I wanted to count only the number of Fedora6 clients, I’d use the command server_cifs server_2 -option audit,full | grep "Fedora6 Client".  I could then add "| wc" at the end to get a count.
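
To count every client OS type in one pass, a quick sketch like the one below should work. It assumes the OS type is always the fifth whitespace-delimited field of the audit line (which held true in the output above), so verify it against your own DART version:

server_cifs server_2 -option audit,full | grep "Client(" | awk '{print $5}' | sort | uniq -c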

To do a full audit report:

The command server_cifs server_2 -option audit,full will do a full, detailed audit report and should capture just about anything else you’d need.  Every connection will have a detailed audit included in the report. Based on that output, it would be easy to run the command with a grep statement to pull only the information out that you need to create custom reports.

Below is a subset of what the output looks like from that command:

|||| AUDIT Ctx=0x0177206407, ref=2, W2K8 Client(10.0.0.1) Port=65340/445
||| CIFSSERVER1[DOMAIN] on if=<interface_name>
||| CurrentDC 0x0169fee808=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=c2de9f99-1945-11e2-a512-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x0125fc9408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0230414c08), Name=FileSystem1, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem1
| NTFSExtInfo: shareFS:fsid=49, rc=830, listFS:nb=1 [fsid=49,rc=830]

|||| AUDIT Ctx=0x0210f89007, ref=2, W2K8 Client(10.0.0.1) Port=51607/445
||| CIFSSERVER1[DOMAIN] on <interface_name>
||| CurrentDC 0x01f79aa008=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K8, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=5b410977-bace-11e1-953b-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x1 NTcred(0x01c2367408 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
|| Cnxp(0x0195d2a408), Name=Filesystem2, cUid=0x1 Tid=0x1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem2
| NTFSExtInfo: shareFS:fsid=4214, rc=19, listFS:nb=0

|||| AUDIT Ctx=0x006aae8408, ref=2, XP Client(10.0.0.99) Port=1258/445
 ||| CIFSSERVER1[DOMAIN] on if=<interface_name>
 ||| CurrentDC 0x01f79aa008=<Domain_Controller>
 ||| Proto=NT1, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
 ||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
 ||| Uid=0x3f NTcred(0x01ccebc008 RC=2 KERBEROS Capa=0x2) 'DOMAIN\Username'
 || Cnxp(0x019edd7408), Name=Filesystem8, cUid=0x3f Tid=0x3f, Ref=1, Aborted=0
 | readOnly=0, umask=22, opened files/dirs=3
 | Absolute path of the share=\Filesystem8
 | NTFSExtInfo: shareFS:fsid=35, rc=43, listFS:nb=0
 | Fid=2901, FNN=0x0012d46b40(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1
 |    Notify commands received:
 |    Event=0x17, wt=1, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3bca, Uid=0x3f, size=0x20
 | Fid=3335, FNN=0x00193baaf0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0xe200, Uid=0x3f, size=0x20
 | Fid=3683, FNN=0x00290471c0(FREE,0x0000000000,0), FOF=0x0000000000  DIR=\Directory1\Subdirectory1\Subdirectory2
 |    Notify commands received:
 |    Event=0x17, wt=0, curSize=0x0, maxSize=0x20, buffer=0x0000000000
 |    Tid=0x3f, Pid=0x2310, Mid=0x3987, Uid=0x3f, size=0x20

Adding a Celerra to a Clariion storage domain from the CLI

If you’re having trouble joining your Celerra to the storage domain from Unisphere, there is an EMC service workaround for joining it from the Navisphere CLI. When I attempted it from Unisphere, it appeared to work and allowed me to join, but the Celerra never actually showed up on the domain list.  Below is the workaround that worked for me; run these commands from the Control Station.

Run this first:

curl -kv "https://<Celerra_CS_IP>/cgi-bin/set_incomingmaster?master=<Clariion_SPA_DomainMaster_IP>,"

Next, run the following navicli command to the domain master in order to add the Celerra Control Station to the storage domain:

/nas/sbin/naviseccli -h 10.3.215.73 -user <userid> -password <password> -scope 0 domain -add 10.32.12.10

After a successful Join the /nas/http/domain folder should be populated with the domain_list, domain_master, and domain_users files.

Run this command to verify:

ls -l /nas/http/domain

You should see this output:

-rw-r--r-- 1 apache apache 552 Aug  8  2011 domain_list
-rw-r--r-- 1 apache apache  78 Feb 15  2011 domain_master
-rw-r--r-- 1 apache apache 249 Oct  5  2011 domain_users
 

You can also check the domain list to make sure that an entry has been made.

Run this command to verify:

/nas/sbin/naviseccli -h <Clariion_SPA_DomainMaster_IP> domain -list

You should then see a list of all domain members.  The output will look like this:

Node:                     <DNS Name of Celerra>
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443

Auto transferring reports from VNX to an IIS web server via FTP

I previously posted on how to create a script that monitors your Celerra replication jobs.  I have an intranet web page that is updated daily with many other reports (most of which I’ve posted about here), so I thought I’d add this one to the web page as well rather than having to search through my inbox for it every day.

Developing an easy and automated method of getting files from the Celerra to a windows based web server was my challenge.  I figured out an easy way to do this with FTP.  As my internal windows web server is also my internal FTP server, I can place the file directly in the public folder for easy web publishing.  Now that I’ve got the report working and updating on the intranet page every day my next task will be to come up with a more secure method using SSH or SCP, but this works well for now.

The big challenge in creating a bash script using FTP is figuring out how to pass the user id and password.  I tried various methods unsuccessfully and finally settled on using the .netrc file.  Create an empty file named .netrc in your home directory (in my case I put it in /home/nasadmin) with the following syntax:

machine <ftp_server_name> login <ftp_login_id> password <ftp_password>

Once that is created, you need to do a chmod 600 on the .netrc file in order for it to work.  If the permissions are not set to 600 on that file the auto-login to the FTP server will fail.

My next step was to create the script that sends the replication status report to the IIS web server:

#!/bin/bash
cd /home/nasadmin/scripts
ftp <ftp_server_name> <<SCRIPT
put <filename>.csv
quit
SCRIPT
I always chmod the script to 755 after creating it in vi so it’s executable.  The script always ran fine manually, but I struggled for a while getting it to work properly when run from crontab.  I figured out that you must cd to the correct directory in the script before you call the ftp command; if not, you will get “file not found” errors when you run it.  I was always running it manually from within that directory, so I didn’t immediately catch that problem. 🙂

I then added the above script to crontab on the Celerra.  I run it at 6AM every morning with the following entry:

0 6 * * * /scripts/repl_status.sh

For those not familiar with cron, you can add an entry using “crontab -e”, and list your current entries with “crontab -l”.  The first two entries in the line “0 6” represent minutes and the hour of each day, in this case it will run at 6:00AM every day.

I have a download link to the csv file on my web page, and I also have a script on my web server that converts the csv file to HTML output with a perl script called csv2html.pl so the data can be easily viewed without having to download the csv and open it in excel.  You can find csv2html.pl easily with a google search, I’ve blogged about it in previous posts as well.

That’s it!  An easy way to automatically push your reports to another server from the Celerra.  Now that I have the transfer method down, I’ll be adding more daily reports in the near future.  If anyone has experience doing this type of transfer from a Celerra (or Linux server) to a windows server via SSH or SCP, please comment! 🙂

Testing Disaster Recovery for VNX VDM’s and CIFS servers

After spending a few days working on a DR test recovery, I thought I’d describe the process along with a few roadblocks that I hit along the way.  We had some specific requirements that had to be met, so I thought I’d share my experiences.  Our host site has a VNX5500 and our DR site has an NS-960, and we have Celerra Replicator configured to replicate the VDM and all of the production filesystems from one site to the other.

Here were my business requirements for this test:

  1. Replicate the VDM, production CIFS server and production filesystems from the host site to DR site.
  2. Fail over (or bring up a copy of) the VDM from the host site to the DR site, mounting the replicated VDM at the DR site.
  3. Fail over (or bring up a copy of) the production CIFS server at the DR site.
  4. Create R/W checkpoints of all replicated filesystems at DR site to allow for appropriate user and application testing.
  5. Share the R/W checkpoints of the replicated filesystems on the CIFS server at the DR site rather than the original replicated filesystems, so original replicated data is not touched and does not need to be replicated again after the test.

I started off by setting up replication jobs for our VDM and all filesystems.  Once those were complete (after several weeks of data transfers) I was ready to test.

Step 1: Replicate VDM and production filesystems

This post isn’t meant to detail the process of actually setting up the initial replications, just how to get the replicated data working and accessible at your DR site.  Setting up replication is a well documented procedure which can be reviewed in EMC’s guide “Using Celerra Replicator (V2)”, P/N 300-009-989.  Once the VDMs and filesystems are replicated, you’re ready for the next step.

Step 2: Bring up the VDM at the DR site

The first step in my testing requirements is to bring up the VDM at the DR site.

Failed attempt 1:

I initially created a new replication session for the VDM because I didn’t want to use the actual production VDM; this was a DR test, not an actual disaster.

After replicating a new copy of the VDM, I attempted to load it in the CLI with the command below.  This must be done from the CLI as there is no option to do this step in Unisphere.

nas_server -vdm <VDMNAME> -setstate loaded

It failed with this error:

Error 12066: root_fs <VDMNAME> is the source or destination object of a file system and cannot be unmounted or is the source or destination object of a VDM replication session and cannot be unloaded.

It was pretty obvious here that you need to stop the replication first before you can load the VDM.  So, as a next step, I stopped the replication with a simple right click/stop on the source side and tried again.

It failed with this error:

Error 4038: <interface_name_1> <interface_name_2> : interfaces not available on server_2

So, it looks like the interface names need to be the same.  I didn’t really want to change the interface names if I didn’t have to, so I tried a different approach next.

Failed attempt 2:

I thought this time I’d create a blank VDM on the destination side first and replicate the host VDM to it, thinking it wouldn’t keep the interface name requirement, and I still wouldn’t have to stop replication on the actual prod VDM, as I didn’t really want to use that one in a test.

I did just that. I created a blank VDM on the DR side, then started a new replication session from the host side and chose it as the destination, making sure to choose the overwrite option when I replicated to it.  The replication was successful.  I stopped the replication on the source side after it was complete, and then attempted to load the new replicated VDM on the DR side.

Voila! It worked:

nas_server -vdm <VDMNAME> -setstate loaded
            id          =          10
            name    =          vdm_replica
            acl        =          0
            type      =          vdm
            server   =          server_2
            rootfs    =          root_fs_vdm_replica
            I18N     =          UNICODE        
            Status   :
            Defined=          enabled
            Actual  =          loaded,ready

Now that it was loaded up, it was time to move on to the next step and create the R/W checkpoints of the filesystems. This is where the process failed again.

After clicking on the drop down box for “Choose Data Mover”, I got this error:

 No file systems exist

 Query file systems vdm_replica: All. File system not found. 

I’m not sure why this failed, but since the VDM couldn’t find the filesystems it was time to try another approach again.

Successful attempt:

After my first two failures, it looked pretty obvious that I’d need to change the interface names and use the original replicated VDM.  Making a copy of the VDM to a blank VDM didn’t work because it couldn’t see the filesystems, and using the original requires the interface names to be the same.  The lesson learned here is to make sure you have matching ports on your host and DR Celerras, and use the same interface names.  If I had done that, my first attempt would have been successful.

If the original VDM has four CIFS servers (each with its own interface) and the DR Celerra only has one port configured on the network, you’d be out of luck.  You wouldn’t have enough interfaces to rename them all to match, and you’d never be able to load your VDM.  The VDMs only look for the names to be the same, NOT the IPs.  The IPs can be different to match your DR network, and the IPs that are already assigned to the DR site interfaces will NOT change when you load the VDM.

In my case, the host Celerra has two CIFS servers, each with its own interface.  One is for production, one is for backups.

Here are the steps that worked for me:

  1. Stop the replication of the VDM (You will see it change to a ‘stopped’ state in Unisphere).
  2. Change the interface names on the DR side (changing IPs is not necessary) to match the host side (see the sketch after this list).
  3. Load the VDM with the command  nas_server -vdm <VDMNAME> -setstate loaded
  4. You will see the VDM status change from ‘unloaded’ to ‘OK’.
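
There’s no rename option for an interface, so in practice step 2 means deleting the DR-side interface and recreating it with the same IP under the host-side name. A rough sketch (the device, interface names, and addresses below are hypothetical; double-check the server_ifconfig syntax on your DART release):

server_ifconfig server_2 -delete <dr_side_interface_name>
server_ifconfig server_2 -create -Device cge1-0 -name <host_side_interface_name> -protocol IP <ip_address> <netmask> <broadcast_address>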

Step 3:  Bring up the CIFS server at the DR site

After you’ve completed the previous step, the VDM will be loaded using the exact same interfaces as production, and the CIFS servers will be automatically created as well.  If a CIFS server uses cge1-0 on server_2 on the host side, it will now be set up with the same name using cge1-0 on server_2 on the destination (DR) side.

This would be very useful in a real disaster, but for this test I wanted to create an alternate CIFS server with a different IP, as the domain controller, DNS servers, and IP range used at our DR site are different.  You could choose to use the same CIFS server that was replicated with the VDM, but for our test I decided to bring up an entirely new CIFS server.  We use DFS to access all of our shares in production, so the name of the CIFS server won’t matter for our testing purposes.  We would just need to update DFS with the new name on the DR network.

Here are the steps I took to bring up the CIFS server for DR:

  1. Gather IP information from the DR team.  Will need a valid IP and subnet mask for the new CIFS server.
  2. Verify IP config on new DR network.
    1. Check that the default route matches the DR network
    2. Check that the DNS server entries match the DNS servers on the DR network
  3. Verify that the Domain controller in the DR network is up and available
  4. Modify the interface of your choice with the correct IP information for the CIFS server.
  5. Create the CIFS server and join it to the DR active directory domain.
    1. If you need to test an AD account, use this command:
    2. server_cifssupport <vdm_name> -cred -name <username> -domain <domain_name>

That’s it for this step.  The CIFS server was successfully joined to the domain and I was able to ping it from one of our previously recovered windows servers on the DR network.

Step 4: Create Read/Write checkpoints of all replicated filesystems

One of my business requirements for this test was to allow read/write access to the replicated filesystems without having to actually change the production data.  The easy way to accomplish this is to create a single read/write checkpoint (snapshot) of each filesystem.  To do this, go to the checkpoint area in Unisphere, click create, and select the “Writeable Checkpoint” checkbox when you create the checkpoint.  You can also script the process and run it from the CLI on the control station.

First, create each checkpoint with this command:

nas_ckpt_schedule -create <ckpt_fs_name> -filesystem <fs_name> -recurrence once

Second, create a read/write copy of each checkpoint with this command:

fs_ckpt <ckpt_fs_name> -name <r/w_ckpt_fs_name> -Create -readonly n

I would recommend running these no more than two at a time and letting them finish.  I’ve had issues in the past running dozens of checkpoint jobs at once that hang and never complete, requiring a reboot of the data mover to correct.

Step 5: Share the replicated filesystems on the DR CIFS server

Once all of the R/W checkpoints are created, they can be shared on the DR CIFS server with the same share names as the original production share names. This allows all of our recovered application and file servers to connect to the same names, simplifying the configuration of the test environment.

You can use a CLI command to export each r/w copy to share them on your CIFS Server:

server_export [vdm] -P cifs -name [filesystem]_ckpt1 -option netbios=[cifserver] [filesystem]_ckpt1_writeable1

Step 6: Cleanup

That’s it!  We had a successful DR test.  Once the test was complete, I performed the following cleanup steps:

  1. Remove CIFS server shares
  2. Remove CIFS server
  3. Change the interfaces on the DR Celerra back to their original names and IPs.
  4. Unload the replicated VDM with this command:
    1. nas_server -vdm <VDMNAME> -setstate mounted
    2. Restart the VDM replication from the source

Celerra data mover performance and port configuration

I had a request to review my experience with data mover performance and port configuration on our production Celerras.  When I started supporting our Celerras I had no experience at all, so my current configuration is the result of trial and error troubleshooting and tackling performance problems as they appeared.

To keep this simple, I’ll review my configuration for a Celerra with only one primary data mover and one standby.  There really is no specific configuration needed on your standby data mover, just remember to perfectly match all active network ports on both primary and standby, so in the event of a failover the port configuration matches between the two.

Our primary data mover has two Ethernet modules with four ports each (for a total of eight ports).  I’ll map out how each port is configured and then explain why I did it that way.

Cge 1-0             Failsafe Config for Primary CIFS  (combined with cge1-1), assigned to ‘CIFS1’ prod file server.

Cge 1-1             Failsafe Config for Primary CIFS (combined with cge1-0), assigned to ‘CIFS1’ prod file server.

Cge 1-2             Interface configured for backup traffic, assigned to ‘CIFSBACKUP1’ server, VLAN 1.

Cge 1-3             Interface configured for backup traffic, assigned to ‘CIFSBACKUP2’ server. VLAN 1.

Cge 2-0             Interface configured for backup traffic, assigned to ‘CIFSBACKUP3’ server, VLAN 2.

Cge 2-1             Interface configured for backup traffic, assigned to ‘CIFSBACKUP4’ server, VLAN 2.

Cge 2-2             Interface configured for replication traffic, assigned to replication interconnect.

Cge 2-3             Interface configured for replication traffic, assigned to replication interconnect.

Primary CIFS Server – You do have a choice in this case to use either link aggregation or a fail safe network configuration.  Fail safe is an active/passive configuration.  If one port fails the other will take over.  I chose a fail safe configuration for several reasons, but there are good reasons to choose aggregation as well.  I chose fail safe primarily due to the ease of configuration, as there was no need for me to get the network team involved to make changes to our production switch (fail safe is configured only on the Celerra side), and our CIFS server performance requirements don’t necessitate two active links.  If you need the extra bandwidth, definitely go for aggregation.

I originally set up the fail safe network in an emergency situation, as the single interface to our prod CIFS server went down and could not be brought back online.  EMC’s answer was to reboot the data mover.  That fixed it, but it’s not such a good solution during the middle of a business day.
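
For reference, a rough sketch of creating a fail safe network device from the CLI is below. The device and interface names are hypothetical and I haven’t verified these exact option strings on every DART release, so check the server_sysconfig man page before using it:

server_sysconfig server_2 -virtual -name fsn0 -create fsn -option "primary=cge1-0 device=cge1-0,cge1-1"
server_ifconfig server_2 -create -Device fsn0 -name cifs1_int -protocol IP <ip_address> <netmask> <broadcast_address>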

Backup Interfaces – We were having issues with our backups exceeding the time we had for our backup window.  In order to increase backup performance, I created four additional CIFS servers, all sharing the same file systems as production.  Our backup administrator splits the load on the four backup interfaces between multiple media servers and tape libraries (on different VLANs), and does not consume any bandwidth on the production interface that users need to access the CIFS shares.  This configuration definitely improved our backup performance.

Replication – All of our production file systems are replicated to another Celerra in a different country for disaster recovery purposes.   Because of the huge amount of data that needs to be replicated, I created two interfaces specifically for replication traffic.  Just like the backup interfaces, it separates replication traffic from the production CIFS server interface.  Even with the separate interfaces, I still have imposed a bandwidth limitation (no more than 50MB/s) in the interconnect configuration, as I need to share the same 100MB WAN link with our data domain for replication.

This configuration has proven to be very effective for me.  Our links never hit 100% utilization and I rarely get complaints about CIFS server performance.  The only real performance related troubleshooting I’ve had to do on our production CIFS servers has been related to file system deduplication, I’ve disabled it on certain file systems that see a high amount of activity.

Other thoughts about celerra configuration:

  1. We recently added a third data mover to the Celerra in our HQ data center because of the file system limitation on one data mover.  You can only have 2048 total filesystems on one data mover.  We hit that limitation due to the number of checkpoints that we keep for operational file restores.  If you make a checkpoint of one filesystem twice a day for a month, that would be 61 filesystems used against the 2048 total, which adds up quickly if you have a CIFS server filled with dozens of small shares.  I simply added another CIFS server and all new shares are now created on the new CIFS server.  The names and locations of the shares are transparent to all of our users as all file shares are presented to users with DFS links, so there were no major changes required for our Active Directory/Windows administrators.
  2. Use the Celerra monitor to keep an eye on CPU and Memory usage throughout the day.  Once you launch it from Unisphere, it runs independently of your Unisphere session (unisphere can be closed) and has a very small memory footprint on your laptop/PC.
  3. Always create your CIFS servers on VDMs, especially if you are replicating data for disaster recovery.   VDMs are designed specifically for Windows environments, allow for easy migration between data movers, and allow for easy recreation of a CIFS server and its shares in a replication/DR scenario.  They store all the information for local groups, shares, security credentials, audit logs, and home directory info.  If you need to recreate a CIFS server from scratch, you’ll need to re-do all of those things from scratch as well.  Always use VDMs!
  4. Write scripts for monitoring purposes.  I have only one running on my Celerras now that emails me a report of the status all replication jobs in the morning.  Of course, you can put any valid command into a bash script (adding a mailx command to email you the results), stick it in crontab, and away you go.

How to scrub/zero out data on a decommissioned VNX or Clariion


Our audit team needed to ensure that we were properly scrubbing the old disks before sending our old Clariion back to EMC on a trade in.  EMC of course offers scrubbing services that run upwards of $4,000 for an array.  They also have a built in command that will do the same job:

navicli -h <SP_IP_address> zerodisk -messner B_E_D
B = Bus
E = Enclosure
D = Disk

usage: zerodisk disk-names [start|stop|status|getzeromark]

sample: navicli -h 10.10.10.10 zerodisk -messner 1_1_12

This command will write all zeros to the disk, making any data recovery from the disk impossible.  Add this command to a Windows batch file for every disk in your array, and you’ve got a quick and easy way to zero out all the disks.
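
As a rough sketch of what that batch file might look like (the SP address and disk IDs below are made up; you’d list every disk in your array):

REM zero_disks.bat - start the zerodisk process on each disk
navicli -h 10.10.10.10 zerodisk -messner 0_0_0 start
navicli -h 10.10.10.10 zerodisk -messner 0_0_1 start
navicli -h 10.10.10.10 zerodisk -messner 0_0_2 start
REM check progress later with the getzeromark option
navicli -h 10.10.10.10 zerodisk -messner 0_0_0 getzeromark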

So, once the disks are zeroed out, how do you prove to the audit department that the work was done? I searched everywhere and could not find any documentation from EMC on this command, which is no big surprise since you need the engineering mode switch (-messner) to run it.  Here were my observations after running it:

This is the zeromark status on 1_0_4 before running navicli -h 10.10.10.10 zerodisk -messner 1_0_4 start:

 Bus 1 Enclosure 0  Disk 4

 Zero Mark: 9223372036854775807

 This is the zeromark status on 1_0_4 after the zerodisk process is complete:

(I ran navicli -h 10.10.10.10 zerodisk -messner 1_0_4 getzeromark to get this status)

 Bus 1 Enclosure 0  Disk 4

Zero Mark: 69704

 The 69704 number indicates that the disk has been successfully scrubbed.  Prior to running the command, all disks will have an extremely long zero mark (18+ digits); after the zerodisk command completes, the disks will return either 69704 or 69760 depending on the type of disk (FC/SATA).  That’s the best I could come up with to prove that the zeroing was successful.  Running the getzeromark option on all the disks before and after the zerodisk command should be sufficient to prove that the disks were scrubbed.

Strategies for implementing Multi-tiered FAST VP Storage Pools

After speaking to our local rep and attending many different classes at the most recent EMC World in Vegas, I came away with some good information and a very logical best practice for implementing multi-tiered FAST VP storage pools.

First and foremost, you have to use Flash.  High RPM Fiber Channel drives are neither capacity efficient nor performance efficient; the highest IO data needs to be hosted on Flash drives.  The most effective split of drives in a storage pool is 5% Flash, 20% Fiber Channel, and 75% SATA.

Using this example, if you have an existing SAN with 167 15,000 RPM 600GB Fiber Channel Drives, you would replace them with 97 drives in the 5/20/75 blend to get the same capacity with much improved performance:

  • 25 200GB Flash Drives
  • 34 15K 600GB Fiber Channel Drives
  • 38 2TB SATA Drives
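As a rough sanity check on that math (my numbers, not EMC's): 167 x 600GB is roughly 100TB raw, while 25 x 200GB (5TB) + 34 x 600GB (about 20TB) + 38 x 2TB (76TB) comes to roughly 101TB raw.  You keep the same usable capacity with 70 fewer spindles, and the capacity split lands almost exactly on the 5/20/75 blend.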

The ideal scenario is to implement FAST Cache along with FAST VP.  FAST Cache continuously ensures that the hottest data is served from Flash drives.  With FAST Cache, up to 80% of your data IO will come from cache (legacy DRAM cache served up only about 20%).

It can be a hard pill to swallow when you see how much the Flash drives cost, but that cost is offset by increased disk utilization and a reduction in the total number of drives and DAEs you need to buy.   With all-FC drives, disk utilization is sacrificed to get the needed performance: very little of the capacity is used, and you end up buying tons of disks just to get more spindles in the RAID groups.  Flash drives can achieve much higher utilization, reducing the effective cost.

After implementing this at my company I’ve seen dramatic performance improvements.  It’s an effective strategy that really works in the real world.

In addition to this, I've also been implementing storage pools in pairs, each sized identically.  The first pool is designated only for SP A, the second for SP B.  When I get a request for data storage, let's say for 1TB, I create a 500GB LUN in the first pool owned by SP A and a 500GB LUN in the second pool owned by SP B.  When the disks are presented to the host server, the server administrator stripes the data across the two LUNs.  Using this method, I can better balance the load across the storage processors on the back end.
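From the CLI, creating that pair of LUNs might look something like the sketch below.  The pool and LUN names are made up, and the lun -create options can vary by FLARE/VNX OE release, so treat this as a starting point rather than exact syntax:

naviseccli -h <SP IP> lun -create -type NonThin -capacity 500 -sq gb -poolName Pool_SPA -sp a -name appdata_lun_a
naviseccli -h <SP IP> lun -create -type NonThin -capacity 500 -sq gb -poolName Pool_SPB -sp b -name appdata_lun_b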

Adding/Removing modules from a datamover

I recently had an issue where a brand new datamover installed by EMC would not allow me to make it a standby for the existing datamovers.  It turns out that the hardware (specifically the number of FC and ethernet interfaces) must match PRECISELY: the number of ports and the slots the modules are installed in have to match across all datamovers.

The new datamover that was installed had an extra 4-port ethernet module in it.  Below is the procedure I used to remove the module, including all the commands to take the datamover down, reconfigure it, and bring it back up successfully.  Removing the extra module solved the problem; with its config matching the others, I could configure it as a standby.

First, log in to the CLI on the control station with root privileges.  Next, just run the commands below in order.

Turn off connecthome and emails to avoid false alarms.
 /nas/sbin/nas_connecthome -service stop
 /nas/bin/nas_emailuser -modify -enabled no
 /nas/bin/nas_emailuser -info

Save the output of this command; it lists the current datamover config.
 nas_server -i -a

Run this to shut the datamover down.  Run getreason to verify when it’s down.
 server_cpu server_<x> -halt now
 /nasmcd/sbin/getreason

Remove/replace the module now.

Power the datamover back on.
 /nasmcd/sbin/t2reset pwron -s <slot number>

Watch getreason for status
 /nasmcd/sbin/getreason
(Wait for it to reboot and say ‘Hardware Misconfigured’)

Once it is in a ‘misconfigured’ state, run setup_slot to configure it:
 /nasmcd/sbin/setup_slot -i 4

Run this command to view the current hardware config, verify that your change was made:
 server_sysconfig server_4 -p

Restart connecthome and email services.
 /nas/sbin/nas_connecthome -service start -clear
 /nas/sbin/nas_connecthome -i
 /nas/bin/nas_emailuser -modify -enabled yes
 /nas/bin/nas_emailuser -info

That's it!  Your datamover has been updated and reconfigured.

VNX Root Replication Checkpoints

Where did all my savvol space go?  I noticed last week that some of my Celerra replication jobs had stalled and were not sending any new data to the replication partner.  I then noticed that the storage pool designated for checkpoints was at 100%.  Not good. Based on the number of file system checkpoints that we perform, it didn’t seem possible that the pool could be filled up already.  I opened a case with EMC to help out.

I learned something new after opening this call: every time you create a replication job, a new checkpoint is created for that job and stored in the savvol.  You can view these in Unisphere by changing the “select a type” filter to “all checkpoints including replication”.  You’ll notice checkpoints named something like root_rep_ckpt_483_72715_1 in the list; they all begin with root_rep.   After working the case with EMC support for a little while, the engineer helped me determine that one of my replication jobs had a root_rep_ckpt that was 1.5TB in size.

Removing that checkpoint would immediately solve the problem, but there was one major drawback: deleting the root_rep checkpoint requires deleting the replication job entirely and redoing it from scratch.  The entire filesystem would have to be copied over our WAN link and resynchronized with the replication partner Celerra.  That didn’t make me happy, but there was no choice.  At least the problem was solved.

Here are a couple of tips for you if you’re experiencing a similar issue.

You can verify which storage pool the root_rep checkpoints are using by doing an info against the checkpoint from the command line and looking for the ‘pool=’ field.

nas_fs -list | grep root_rep   (the first column in the output is the ID# for the next command)

nas_fs -info id=<id from above>

You can also see the replication checkpoints and IDs for a particular filesystem with this command:

fs_ckpt <production file system> -list -all

You can check the size of a root_rep checkpoint from the command line directly with this command:

/nas/sbin/rootnas_fs -size root_rep_ckpt_883_72715_1
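If you want to check every root_rep checkpoint at once, a quick loop like this sketch works (run it on the control station; it just feeds the IDs from nas_fs -list back into nas_fs -info and pulls out the name and pool fields):

for ID in $(nas_fs -list | grep root_rep | awk '{print $1}'); do
    nas_fs -info id=$ID | grep -E 'name|pool'
    echo "---"
done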


Use the CLI to determine replication job throughput

This handy command will allow you to determine exactly how much bandwidth you are using for your Celerra replication jobs.

Run this command first, it will generate a file with the stats for all of your replication jobs:

nas_replicate -info -all > /tmp/rep.out

Run this command next:

grep "Current Transfer Rate" /tmp/rep.out |grep -v "= 0"

The output looks like this:

Current Transfer Rate (KB/s)   = 196
 Current Transfer Rate (KB/s)   = 104
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 90
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 88
 Current Transfer Rate (KB/s)   = 94
 Current Transfer Rate (KB/s)   = 89
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 108
 Current Transfer Rate (KB/s)   = 91
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 118
 Current Transfer Rate (KB/s)   = 119
 Current Transfer Rate (KB/s)   = 112
 Current Transfer Rate (KB/s)   = 27
 Current Transfer Rate (KB/s)   = 136
 Current Transfer Rate (KB/s)   = 117
 Current Transfer Rate (KB/s)   = 242
 Current Transfer Rate (KB/s)   = 77
 Current Transfer Rate (KB/s)   = 218
 Current Transfer Rate (KB/s)   = 285
 Current Transfer Rate (KB/s)   = 287
 Current Transfer Rate (KB/s)   = 184
 Current Transfer Rate (KB/s)   = 224
 Current Transfer Rate (KB/s)   = 82
 Current Transfer Rate (KB/s)   = 324
 Current Transfer Rate (KB/s)   = 210
 Current Transfer Rate (KB/s)   = 328
 Current Transfer Rate (KB/s)   = 156
 Current Transfer Rate (KB/s)   = 156

Each line represents the throughput for one of your replication jobs.  Adding all of those numbers up gives you the total bandwidth you are consuming.  In this case, I’m using about 4.56MB/s on my 100Mbps link.  If you’d rather not add them by hand, the one-liner below totals them for you.
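Here's a sketch of letting awk do the math against the same rep.out file; it sums the last field of each matching line and prints the total in both KB/s and MB/s:

grep "Current Transfer Rate" /tmp/rep.out | grep -v "= 0" | awk '{sum+=$NF} END {printf "Total: %d KB/s (%.2f MB/s)\n", sum, sum/1024}'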

This same technique can of course be applied to any part of the output file.  If you want to know the estimated completion date of each of your replication jobs, you’d run this command against the rep.out file:

grep "Estimated Completion Time" /tmp/rep.out

That will give you a list of dates, like this:

Estimated Completion Time      = Fri Jul 15 02:12:53 EDT 2011
 Estimated Completion Time      = Fri Jul 15 08:06:33 EDT 2011
 Estimated Completion Time      = Mon Jul 18 18:35:37 EDT 2011
 Estimated Completion Time      = Wed Jul 13 15:24:03 EDT 2011
 Estimated Completion Time      = Sun Jul 24 05:35:35 EDT 2011
 Estimated Completion Time      = Tue Jul 19 16:35:25 EDT 2011
 Estimated Completion Time      = Fri Jul 15 12:10:25 EDT 2011
 Estimated Completion Time      = Sun Jul 17 16:47:31 EDT 2011
 Estimated Completion Time      = Tue Aug 30 00:30:54 EDT 2011
 Estimated Completion Time      = Sun Jul 31 03:23:08 EDT 2011
 Estimated Completion Time      = Thu Jul 14 08:12:25 EDT 2011
 Estimated Completion Time      = Thu Jul 14 20:01:55 EDT 2011
 Estimated Completion Time      = Sun Jul 31 05:19:26 EDT 2011
 Estimated Completion Time      = Thu Jul 14 17:12:41 EDT 2011

Very useful stuff. 🙂


Use the CLI to quickly determine the size of your Celerra checkpoint filesystems

Need to quickly figure out which checkpoint filesystems are taking up all of your precious savvol space?  Run the CLI command below.  Filling up the savvol storage pool can cause all kinds of problems besides failing checkpoints.  It can also cause filesystem replication jobs to fail.

To view it on the screen:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size

To save it in a file:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size > checkpoints.txt

vi checkpoints.txt   (to view the file)

Here’s a sample of the output:

UserFilesystem_01
ckpt_ckpt_UserFilesystem_01_monthly_001 :   836 : 220000
ckpt_ckpt_UserFilesystem_01_monthly_002 :   649 : 220000

UserFilesystem_02
ckpt_ckpt_UserFilesystem_02_monthly_001 :   836 : 80000
ckpt_ckpt_UserFilesystem_02_monthly_002 :   649 : 80000

The numbers are in MB.
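If you want the biggest consumers at the top, you can pipe the same query through sort.  This is just a sketch and assumes the sizes come back as plain MB values in the third colon-separated field, as in the sample above:

nas_fs -query:IsRoot==False:TypeNumeric==1 -format:'%s\n%q' -fields:Name,Checkpoints -query:TypeNumeric==7 -format:'   %40s : %5d : %s\n' -fields:Name,ID,Size | grep ':' | sort -t: -k3 -rn | head -20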


Optimizing Java Memory for Navisphere / Unisphere

If you have a CLARiiON system with a large configuration in terms of disks, LUNs, initiator records, etc, you may experience a slowdown when managing the system with Navisphere or Unisphere.  If you increase the amount of memory that Java can use, you can significantly improve the response time when using the management console.

Here are the steps:

  1. Log in to the CLARiiON setup page (http://<clariion IP>/setup).  Go to Set Update Parameters > Update Interval.  Change it to 300.
  2. On the Management Server (or your local PC/laptop) go to Control Panel and launch the Java icon.
  3. Go to the Java tab and click view.
  4. Enter -Xmx128m under Java Runtime Parameters, which allocates 128MB for Java.  This number can be increased as you see fit; you may see better results with -Xmx512m or -Xmx1024m.

Celerra Health Check with CLI Commands

Here are the first commands I’ll type when I suspect there is a problem with the Celerra, or if I want to do a simple health check.

1. /nas/sbin/getreason.  This will quickly give you the current status of each data mover: 5=up, 0=down/rebooting.  Prefix the command with watch to run it with continuous updates, which is handy when you’re monitoring a datamover you’ve purposely rebooted.

10 – slot_0 primary control station
5 – slot_2 contacted
5 – slot_3 contacted

2. nas_server -list.  This lists all of the datamovers and their current state.  It’s a good way to quickly tell which datamovers are active and which are standby.

1=nas, 2=unused, 3=unused, 4=standby, 5=unused, 6=rdf

id   type  acl  slot  groupID  state  name
1    1     0    2              0      server_2
2    4     0    3              0      server_3

3. server_sysstat.  This will give you a quick overview of memory and CPU utilization.

server_2 :
threads runnable = 6
threads blocked  = 4001
threads I/J/Z    = 1
memory  free(kB) = 2382807
cpu     idle_%   = 70

4. nas_checkup.   This runs a system health check.

Check Version:  5.6.51.3
Check Command:  /nas/bin/nas_checkup
Check Log    :  /nas/log/checkup-run.110608-143203.log

-------------------------------------Checks-------------------------------------
Control Station: Checking if file system usage is under limit………….. Pass
Control Station: Checking if NAS Storage API is installed correctly…….. Pass

5. server_log server_2.  This shows the current alert log.  Alert logs are also stored in /nas/log/webui.

6. vi /nas/jserver/logs/system_log.   This is the java system log.

7. vi /var/log/messages.  This displays system messages.
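If you want to run the non-interactive pieces of this as a single daily report, here's a minimal sketch.  It assumes mailx is configured on the control station, and the recipient address is a placeholder:

#!/bin/bash
# Quick Celerra health snapshot: datamover status, server list, and checkup.
OUT=/tmp/health_$(date +%Y%m%d).txt
{
  echo "== getreason =="
  /nas/sbin/getreason
  echo "== nas_server -list =="
  nas_server -list
  echo "== nas_checkup =="
  nas_checkup
} > "$OUT" 2>&1
mailx -s "Celerra health check - $(hostname)" storage-team@example.com < "$OUT"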

Easy File Extension filtering with EMC Celerra

Are your users filling up your CIFS fileserver with MP3 files?  Sick of sending out emails outlining IT policies and asking for their removal?  However you manage it now, the best way to avoid the problem in the first place is to set up filtering on your CIFS server file shares.

So, to use the same example, let's say you don't want your users to store MP3 files on your \\PRODFILES\Public share.

1. Navigate to the \\PRODFILES\C$ administrative share.

2. Open the folder in the root directory called .filefilter

3. Create an empty text file called mp3@public in the .filefilter folder.

4. Change the windows security on the file to restrict access to certain active directory groups or individuals.

That’s it!  Once the file is created and security is set, users who are restricted by the file security will no longer be able to copy MP3 files to the public share.  Note that this will not remove any existing MP3 files from the share; it will only prevent new ones from being copied.

A guide for troubleshooting CIFS issues on the Celerra

In my experience, every CIFS issue you may have will fall into 8 basic areas, the first five being the most common.   Check all of these things and I can almost guarantee you will resolve your problem (there’s a quick command reference after the list). 🙂

1. CIFS Service.  Check and make sure the CIFS service is running, and start it if it isn't:  server_cifs server_2 -protocol CIFS -option start

2. DNS.  Check and make sure that your DNS server entries on the Celerra are correct, that you’re configured to point to at least two, and that they are up and running with the DNS Service running.

3. NTP.  Make sure your NTP server entry is correct on the Celerra, and that the IP is reachable on the network and is actively providing NTP services.

4. User Mapping.

5. Default Gateway.  Double check your default gateway in the Celerra’s routing table.  Get the network team involved if you’re not sure.

6. Interfaces.  Make sure the interfaces are physically connected and properly configured.

7. Speed/Duplex.  Make sure the speed and duplex settings on the Celerra match those of the switch port that the interfaces are plugged in to.

8. VLAN.  Double check your VLAN settings on the interfaces, make sure it matches what is configured on the connected switch.
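For reference, here are the commands I'd reach for to check most of the items above.  This is only a quick sketch; exact options can vary a bit between DART releases, so double-check against your version:

server_cifs server_2            (CIFS configuration and domain membership)
server_dns server_2             (DNS servers configured on the datamover)
server_date server_2            (current time on the datamover, to compare against your NTP source)
server_ifconfig server_2 -all   (interface configuration)
server_route server_2 -list     (routing table, including the default gateway)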
