
A Roundup of Storage Startups

The enterprise storage market has been a hotbed of innovation and entrepreneurship in the last several years. While the storage industry has consolidated through acquisitions (such as HPE’s purchase of Nimble) and some vendors have simply shut down, there always seem to be new companies waiting in the wings to take over.

These newcomers hope to prove that they have a new and better way to address the growing challenge of managing massive data growth by implementing their own take on enterprise storage technology. They all come to market with both the hope and the potential to change how the enterprise business market stores its data.

An up-to-date list of storage startups is hard to maintain, as the ranks are growing fast and companies can appear seemingly out of nowhere. This latest crop has some game-changing ideas, and I look forward to seeing how their technology will shape the future of enterprise storage. Some have been around for several years and are starting to mature, while others started less than a year ago.

Many new companies are hoping (betting?) that the market will see a need for a new data management software layer that provides improved management capabilities across multiple silos of data, both on-premises and in the cloud. Some of the emerging data management software suppliers are Actifio, Avere (now part of Microsoft), Catalogic, Cohesity, Delphix, Druva, Rubrik, Scality and Strongbox. I’m going to be focusing more on the hardware suppliers in this post, so let’s take a closer look at some of the rising stars in the enterprise storage market. I’m going to take a look at 24 companies in total, in no particular order: E8 Storage, Igneous Systems, Komprise, Portworx, Primary Data, Reduxio, Talena, Alluxio, Aparavi, Attala Systems, Datera, Datrium, Elastifile, Morro Data, Excelero, Leonovus, Minio, Nyriad, ScaleFlux, StorageOS, Storj Labs, Vexata, Wasabi and WekaIO.

I have no affiliation with any of these vendors; this is simply a list I compiled from some basic online research. The data presented is based primarily on marketing information that I gathered. For more detailed information I recommend reviewing their individual company websites; links are provided for all of them.

     E8 Storage | CEO: Zivan Ori

E8 Storage focuses on shared accelerated storage for data-intensive, top tier applications that require a large amount of IO. Their scalable solution is well suited for intense low latency workloads,  real time analytics, financial and trading applications, transactional processing and large scale file systems.  Their patented high performance shared NVMe storage solution delivers much higher performance, improved storage performance density, and lower costs when compared to legacy systems.   They promise NVMe performance without giving up reliability and availability.  The company is privately held and based in Santa Clara, CA with R&D in Tel Aviv, and they have channel partners in the US and in Europe.

Their hardware is built on industry standards, including converged ethernet with RDMA and standard 2.5″ NVMe SSDs.  Up to 96 host servers can connect to each storage controller, and each controller is concurrently linked to shared storage to deliver scalability into the petabytes.

Potential customers can purchase their software separately or as an integrated system if an appliance based solution is a better fit. Bought independently, their software allows the use of hardware from any vendor, as long as the vendor is on their pre qualified list.  It also allows businesses to take advantage of economies of scale within their own supply chains and purchase new units at a pace that suits their needs.

  Igneous Systems | CEO: Kiran Bhageshpur

Igneous Systems is a Seattle-based, venture-backed company that built a secondary storage system designed to support massive file systems. Their Hybrid Storage Cloud solution provides enterprises with a consolidated secondary storage tier with cloud support and scalability. Igneous remotely manages all on-premises cloud infrastructure, which includes monitoring, troubleshooting, and non-disruptive software upgrades.

Their infrastructure scales from 100TB to 100PB and uses their RatioPerfect architecture, which consists of distributed nano-servers that make the infrastructure resistant to hardware failures. This cloud-like architecture enables Igneous to offer cloud economics in the enterprise data center.

Unlike traditional storage equipment, Igneous Hybrid Storage Cloud uses an integrated serverless environment designed for data-centric applications. It features built-in backup and archive applications that integrate seamlessly with enterprise NAS and tier data to the cloud. Search capabilities are built directly into the infrastructure, so there are no separate backup catalogs to manage. Unlike legacy backup systems, Igneous Hybrid Storage Cloud is specifically designed for massive file systems managing billions of files.

They provide easy to deploy  storage that is a cost effective alternative to cloud data storage.  They provide a managed hardware solution on-premises and look after everything from maintenance and provisioning to performance tuning. Their pricing model is based on consumption. With a background at EMC-Isilon, the Igneous team has a great deal of experience in building infrastructure for unstructured data.

They were recognized by Gartner as a 2017 “Cool Vendor in Storage Technologies”.

     Komprise | CEO: Kumar K. Goswami

Komprise aims to address the issues of storage sprawl and rising costs with storage analytics.  They contend that storage management requires getting as close as possible to realtime insight into what is happening, and their software addresses this by providing metrics together with analytic tools to build a variety of data policies.  They then manage data placement across storage tiers and multiple clouds. Their software allows for interactively modelling multiple scenarios before moving the configuration into production.

Their intelligent data management provides an alternative to more expensive solutions from larger, more established vendors. The company’s IDM platform enables customers to lower NAS and ongoing cloud operations costs by using analytics to intelligently automate archiving and disaster recovery. The service also allows for transparent access of data across on-premises NAS storage and the cloud.

Their analytics processing identifies data that is most suited for the cloud and then transparently archives and replicates the data. User defined policies are automated to move and manage data across on-prem NAS storage tiers.
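
To make the idea of policy-driven tiering concrete, here’s a minimal sketch of an age-based “cold data” scan in Python. This is a generic illustration of the concept, not Komprise’s software or API, and the 180-day threshold and NAS path are arbitrary placeholders.

```python
import os
import time

# Hypothetical policy: files untouched for more than 180 days are "cold"
# and become candidates for archiving to a cheaper tier (e.g. object storage).
COLD_AFTER_DAYS = 180

def find_cold_files(root, cold_after_days=COLD_AFTER_DAYS):
    """Walk a NAS share and yield (path, size, age_days) for cold files."""
    cutoff = time.time() - cold_after_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                age_days = int((time.time() - st.st_atime) // 86400)
                yield path, st.st_size, age_days

if __name__ == "__main__":
    total = 0
    for path, size, age in find_cold_files("/mnt/nas/projects"):
        total += size
        print(f"cold: {path} ({size} bytes, idle {age} days)")
    print(f"~{total / 1e12:.2f} TB eligible for archiving")
```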

      Portworx | CEO: Murli Thirumale

Portworx provides storage for containers and brings persistent storage to all of the common container schedulers.  All of the most popular databases are supported in the container environment.  They are an early player in the persistent container storage field, but have signed up some big names like GE Digital and Lufthansa Systems.  They are betting on the recent trends to replace hypervisors with containers and see persistent storage as the wave of the future.

They provide scheduler-integrated data services for production enterprise containers and allow users to deploy stateful containers on-prem, in the cloud, or in hybrid clouds. In contrast to legacy storage that has container connectors built on key-value stores, their product is designed and built for cloud-native applications, making container data more portable, persistent and protected.
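
As a rough sketch of how persistent container storage gets consumed in practice, the snippet below uses the official Kubernetes Python client to request a volume through a storage class. The “portworx-sc” class name and the sizes are placeholders I made up for illustration, not a documented Portworx configuration.

```python
from kubernetes import client, config

# Assumes local kubectl credentials; "portworx-sc" is a hypothetical
# StorageClass name backed by a container storage provider.
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="postgres-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="portworx-sc",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

core_v1 = client.CoreV1Api()
core_v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
print("PVC created; a stateful pod can now mount 'postgres-data'.")
```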

Primary Data | CEO: Lance Smith

Primary Data’s storage is based on the idea of extensible metadata, using open-ended tagging of data objects to control them (e.g. life-cycle management and priority of service), but they also add telemetry to the equation to allow real-time automated data placement.

Parallel access to metadata and metrics processing has the effect of speeding up I/O performance, and they keep it cheap by implementing a “pay as you go” pricing model. Their leadership team happens to include Steve Wozniak (how cool is that?), who is listed as chief scientist. In 2017 they announced $40 million in new funding and a new version of their storage platform.

Reduxio | CEO: Mark Weiner

Reduxio’s TimeOS software delivers high performance enterprise storage solutions with unique data management capabilities. They put data at the center of their architecture and allow complete virtualization of all types of storage.

Their HX550 multi-tier storage solution with built-in BackDating allows customers to modernize and simplify their storage infrastructure and IT operations by deploying flash storage that is cost-effective and that can be used across all their applications.

Reduxio’s unified storage platform is designed to deliver near-zero RPO and RTO, while greatly simplifying the data protection process and providing built-in data replication for disaster recovery. The features in the TimeOS v3 released in June 2017 enabled a single platform for the end-to-end management of the life cycle of an application’s data.

They already have a global install base of more than 150 enterprise customers, many with multiple installed systems across a wide range of industrial sectors, including Managed Service Providers, Manufacturing, BioTech, Education, State and Local Government and Professional Services.  Their product seems to be catching on.

Talena | CEO: Srinivas Vadlamani 

Talena has developed what they call the industry’s fastest data backup and recovery solution, with built-in machine intelligence to handle huge data sets for mission-critical applications sitting on top of modern data platforms such as DataStax/Cassandra, Couchbase, Hadoop HBase/Hive, MongoDB and Vertica. Talena takes advantage of machine learning to ensure data resiliency in the event of disasters. They have the ability to back up and recover petabyte-sized and larger data sets much faster than other solutions on the market, minimizing the impact of data loss and greatly reducing downtime. Their growing customer base includes leading Fortune 500 businesses in the retail, financial services and travel industries, among others.

Targeting the big-data market, they provide backup, recovery, archiving and test data management for major unstructured databases. Their key features include deduplication and replication control via user-defined policies. The technology supports data-masking algorithms to prevent data exposure as data is moved around or used in testing.

     Alluxio | CEO: Haoyuan Li

Alluxio (formerly known as Tachyon) provides virtual distributed storage for Big Data.  They aim to become the storage abstraction layer for Big Data in the same manner that Apache Spark became the computation layer. Their memory centric architecture allows developers to interact with a single storage layer API without worrying about the configurations and complexities of the underlying storage and file systems.

Alluxio is a virtual distributed storage layer between big data computation frameworks and underlying storage systems that delivers data at memory speed to any target framework from any storage system regardless of its location. They aim to address the challenge of data locality. While in-memory storage is usually viewed as cache, their technology allows for separation of the function layer from the persistent storage layer. Organizations can run any big data framework (like Apache Spark) with any storage system or filesystem underneath (like S3, EMC, NetApp, OpenStack Swift, Red Hat GlusterFS, etc.), run it on any storage media (DRAM, SSD, HDD, etc.), and with that they support a unified global namespace by virtualizing disparate storage systems.
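
As a sketch of what that unified namespace looks like from the application side, the PySpark snippet below reads and writes data through alluxio:// URIs instead of talking to the underlying store directly. The hostname and paths are placeholders, and it assumes the Alluxio client libraries are on the Spark classpath.

```python
from pyspark.sql import SparkSession

# The application only knows the Alluxio namespace; whether the bytes live in
# S3, HDFS, NFS or memory is decided by Alluxio's mount table, not the app.
spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()

# Hypothetical master hostname, default Alluxio RPC port, and dataset path.
events = spark.read.json("alluxio://alluxio-master:19998/datasets/events/")
events.groupBy("event_type").count().show()

# Writing back goes through the same namespace and can be tiered by Alluxio.
events.filter(events.event_type == "purchase") \
      .write.mode("overwrite") \
      .parquet("alluxio://alluxio-master:19998/datasets/purchases/")
```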

The company was founded by the creators of the Alluxio open source project from UC Berkeley AMPLab.

Aparavi | Chairman:  Adrian Knapp

Aparavi offers cloud data protection and remote disaster recovery as a service. Their cloud-forward solution offers a RESTful API, a policy engine, an open data format, and a multi-tenant architecture.  Their technology can reduce a customer’s storage footprint compared to more traditional methods while making sure compliance policies are adhered to.  They aim to address the issues of evolving global regulations and the huge amounts of data now being generated with long-term data retention solutions across modern, multi-cloud architectures.

At their core, they aim to better prepare their customers to meet the challenges of long-term data retention across multi-cloud architectures. They designed and built a new software-as-a-service platform from scratch to allow companies to protect data on-prem and in the cloud. They also aim to break the typical barriers of cost, vendor lock-in, complexity and regulatory compliance requirements that cause businesses problems when utilizing more conventional solutions. The company is run by management and engineers with a ton of experience in data retention, and the issue they are attempting to resolve is something I’ve seen directly in the companies I’ve worked for. They may have something here.

         Attala Systems | CEO: Taufik Ma

Attala offers high performance computing and primary cloud storage.  Their product utilizes a scale out fabric running on standard ethernet to interconnect servers and data nodes in a data center.  Because they focus on scale out cloud storage and use an FPGA based fabric, they  are able to effectively eliminate legacy storage management layers. They tout that their product provides over ten million IOPS per scale-out node with latencies as low as 16 microseconds.

The Attala fabric includes the Model HNA host PCIe adapters, providing full hardware emulation of NVMe SSDs, thus allowing their solution to expose pooled resources as virtual SSDs. The host OS, hypervisor or driver sees the virtual SSDs as real SSDs using standard NVMe drivers, so they can be used with any OS, hypervisor, or bare-metal provisioning software.

The software also offers a fully automated orchestration layer, where the fabric dynamically and securely attaches volumes from storage resources from across the network directly to bare-metal servers, virtual machines or containers. No host agents or other software is required, so deployment and maintenance of the system across heterogeneous environments is fairly simple.

Datera | CEO: Marc Fleischmann 

Datera aims to solve what they see as some of the biggest challenges in storage. Their key-value store approach uses NVDIMM to speed up write operations, coalesce writes, and provide a cache for reads. Their access protocol can aggregate massively parallel reads, and their software tools provide many of the same compression, snapshot and replication features offered by the bigger and more established storage vendors. The software also works with orchestration tools from VMware, OpenStack, and Docker.

The product is focused on DevOps and cloud native apps use cases.  It runs on x86 servers with flash, and there is iSCSI-based native integration with OpenStack, CloudStack, VMware vSphere and container orchestration platforms such as Docker, Kubernetes and Mesos.

Some of the key features of the Datera Elastic Data Fabric (DEDF) include a RESTful interface, API-first operations to provide web-scale automation with full infrastructure programmability, policy-based configuration, self-service provisioning, a scale-out model, a flash-first design that delivers high efficiency and low latency, multi-tenancy and QoS for cloud-native and traditional workloads, and heterogeneous component support for easily scaling across commodity x86 servers.

       Datrium | CEO: Brian Biles

Datrium offers stackable appliances that act as servers. Each appliance has a flash cache, and they are linked to a back-end storage unit with larger hard drives that then serves as the primary storage. Enterprise features such as compression, deduplication and end-to-end encryption are included. They also offer an advanced snapshot tool that includes a catalog of snapshots. Their product takes a slightly different approach than a hyperconvergence vendor like Nutanix that only pools drives that are built in to its servers.

They see compute, primary storage, secondary storage and cloud storage all coming together in a configuration that is scalable and easy to manage without the need for a silo for each storage class. As most IO requests will utilize the on-board flash cache on their nodes, they can deliver excellent performance without ever having to go to the data nodes.

Compute nodes can be supplied by Datrium, or clients can use their existing infrastructure. As persistent data resides on the data nodes, compute nodes are stateless and can go offline without risking data loss or corruption. They support a wide variety of environments including vSphere 5.5-6.5, Red Hat 7.3, CentOS 7 1611, and Docker 1.2.

They have a unique spin on convergence, and their DVX system really enhances storage efficiency, which is critical to getting the most out of the flash in the data nodes.

Elastifile | CEO: Amir Aharoni

Elastifile offers scale-out file storage. Their product employs the Bizur consensus algorithm, a distributed metadata model using an adaptive data placement methodology to provide cloud-enabled storage services capable of handling transactional workloads with very low latency.

Their software is designed to help large and midsize enterprises scale up through the cloud to thousands of nodes and millions of IOPS for their most mission-critical workloads. It will run on any server and can use any type of flash (3D and TLC included). They claim to bring flash performance to all enterprise applications while reducing the capex and opex of virtualized data centers, and to simplify the adoption of hybrid cloud by extending file systems across on-prem and cloud deployments.

       Morro Data | CEO: Paul Tien

Morro Data offers file storage and hybrid cloud solutions.  Their CloudNAS service combines an on-prem cache with S3 or Backblaze cloud storage, designed to give small/midsize businesses an alternative to using local file servers.

What separates them from others is a global distributed file system that synchronizes customer data between one or more on-prem CacheDrive hardware appliances and public cloud storage. The CacheDrives store frequently accessed data on site for better performance.

Their CacheDrive also serves as a cloud storage gateway to improve the performance of file transfers to object storage in the cloud. It is also designed to optimize bandwidth in order to accommodate less than ideal connections. Morro also supports key enterprise storage features found in more established vendors’ products, such as data encryption, compression, retention policies and data recovery.

CloudNAS is designed to be used as primary storage, with the master copy of the data stored in the cloud and synchronized to the CacheDrives at local sites.

Excelero | CEO: Lior Gal

Excelero is a software-defined storage technology startup, and their primary offering is their NVMesh Server SAN software. They designed the software to pool NVMe storage from multiple servers.  The pooled storage then offers very high performance and is intended to be used as primary storage.

Using an SDS approach with an NVMe over Fabrics mesh storage stack, they aim to address issues with hyperconverged infrastructure. Accessing a drive utilizes the RDMA feature of NVMe over Fabrics, which results in very low latency and shifts the CPU load to the initiating system rather than the one holding the drive.

  Leonovus | Chair: Michael Gaffney

Leonovus’ product is an advanced blockchain storage and compute solution with a marketplace for cloud applications. Leonovus has invested over twenty million dollars in the development of distributed compute and distributed storage technology. They have been granted several patents and have numerous additional patent claims pending. Their unique software-defined storage solution has strong intellectual property protection.

Their software defined object storage technology is designed for enterprise on-prem, hybrid or public cloud users that have governance, risk management and compliance requirements. The software is designed to run on existing storage hardware.

        Minio | CEO: Anand Babu Periasamy

Minio is an easy to deploy open-source object storage server that uses an Amazon S3 compatible API.  They develop software for cloud-native and containerized applications to help businesses with the management of the exponential growth of unstructured data.  They also support Amazon AWS compatible lambda functions to perform useful actions like thumbnail generation, metadata extraction and virus scanning.

Their open source based object storage server is primarily designed to be used for cloud applications and DevOps departments. Application developers can containerize storage, apps and security simultaneously and use the same resources. The server enables applications to manage large amounts of unstructured data and enables cloud and SaaS application developers to more quickly and easily use and implement emerging cloud hosting providers like Digital Ocean, Packet and Hyper.sh.  It has proven to be popular in the Docker, Mesos and Kubernetes communities because of its cloud native architecture.
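
Because the API is S3-compatible, existing S3 tooling generally just needs to be pointed at a Minio endpoint. Here’s a minimal sketch using boto3; the endpoint URL, credentials and bucket name are placeholders.

```python
import boto3

# A self-hosted Minio server exposes the same API surface as S3,
# so the standard AWS SDK can be used with a custom endpoint URL.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.internal:9000",  # hypothetical endpoint
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

s3.create_bucket(Bucket="backups")
s3.upload_file("db-dump.tar.gz", "backups", "2018-01-15/db-dump.tar.gz")

for obj in s3.list_objects_v2(Bucket="backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```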

    Nyriad | CEO: Matthew Simmons

The core of Nyriad’s platform is their “NSULATE” technology, which uses a GPU to perform the processing. GPUs are specialized for floating point calculations, and Nyriad uses that enhanced capability to generate parity calculations that would otherwise be impractical with a CPU or a RAID controller.

They claim that NSULATE can handle dozens of simultaneous device failures in real-time and maintain consistent I/O performance. Using Netlist’s NVvault non-volatile DIMMs to create a Linux storage platform, it can scale up to millions of IOPS and allows data to be sent directly to the GPU for storage processing. As it bypasses the Linux kernel, it can offer improved performance.

NSULATE technology also allows for compute and storage to coexist on the same node. The idea is to enable storage nodes to be configured for computation in order to speed up I/O-related code, which can accelerate applications that typically hit a brick wall with storage IO bottlenecks.
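
As a toy illustration of what bulk parity computation involves (not NSULATE’s actual algorithm, which uses far more sophisticated erasure codes), here’s a simple XOR-parity sketch with NumPy. The same elementwise operation is exactly the kind of data-parallel work that maps well onto a GPU.

```python
import numpy as np

# Toy RAID-style stripe: 8 data blocks plus one XOR parity block.
# Real GPU-accelerated erasure coding uses many parity blocks (Reed-Solomon
# style), but the "combine many blocks elementwise" shape is the same.
BLOCK_SIZE = 1 << 20  # 1 MiB blocks
rng = np.random.default_rng(0)
data_blocks = [rng.integers(0, 256, BLOCK_SIZE, dtype=np.uint8) for _ in range(8)]

# Parity is the elementwise XOR of all data blocks.
parity = np.bitwise_xor.reduce(data_blocks)

# Simulate losing one block and rebuilding it from the survivors plus parity.
lost_index = 3
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
rebuilt = np.bitwise_xor.reduce(survivors + [parity])

assert np.array_equal(rebuilt, data_blocks[lost_index])
print("lost block reconstructed from parity")
```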

ScaleFlux | CEO: Hao Zhong

ScaleFlux’s Converged Cloud Subsystem (CCS) is a tightly unified software and flash-based hardware subsystem that easily and cost-effectively integrates into Big Data scale-out servers. CCS collapses the traditional scale-up storage hierarchy that usually bottlenecks data movement and processing performance by enabling high-density commodity flash to be used as an extension to memory. Deploying CCS throughout the entire data center infrastructure is designed to provide a significant boost to application performance while reducing data center TCO.

StorageOS | CEO: Chris Brandon

StorageOS is a software-based distributed storage platform designed to provide persistent container storage. It’s available on commodity hardware, virtual machines or in the cloud. With the addition of a 40MB container, developers can build scalable stateful containerized apps with fast, highly available persistent storage.

StorageOS offers simple and automated block storage to stateless containers, allowing databases and other applications that need enterprise class storage functionality to run without the normal complexity and high cost.

They aim to provide an enterprise class storage offering that is simpler, faster, easier, and cheaper than legacy IT storage. They also aim to provide automated storage provisioning to containers which can be instantiated and torn down many thousands of times a day.

So, how does the product work? It’s a fast and easy process. It installs as a container under Linux and locates accessible storage, be it direct-attached, network-attached, cloud-attached, or on connected nodes. That storage is then aggregated into a virtual multi-node pool of block storage. Volumes are carved out of the pool for containers to access, then thin provisioned, mounted, loaded up with a database and started.
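
A rough sketch of what that looks like from the developer’s side, using the Docker SDK for Python; the volume driver name and options here are assumptions for illustration rather than the vendor’s documented settings.

```python
import docker

client = docker.from_env()

# Ask the storage driver for a thin-provisioned persistent volume.
# The "storageos" driver name and the size option are placeholders.
volume = client.volumes.create(
    name="orders-db",
    driver="storageos",
    driver_opts={"size": "20"},
)

# Run a database container against the persistent volume; if the container
# is torn down and rescheduled, the volume (and the data) outlives it.
client.containers.run(
    "postgres:10",
    name="orders-postgres",
    detach=True,
    volumes={"orders-db": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
)
print("database running on persistent volume", volume.name)
```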

        Storj Labs | CSO: Shawn Wilkinson

Storj is based on blockchain technology and peer-to-peer protocols to provide secure, private, and encrypted cloud storage. Basically, it’s an open source decentralized cloud storage platform utilizing blockchain technology, and I looked at them in my previous article Blockchain and Enterprise Storage, where I dove into what exactly blockchain is, how it works, how it may be applied in the enterprise storage space, and how it’s already starting to be used in various global industries. Storj uses the spare storage capacity of its community members to store data that has been shredded and encrypted. From a blockchain perspective, Storj uses their own storage coin token that is used to buy and sell space on the network.

For data farmers looking to share storage capacity, Storj verifies the integrity of their storage with challenges that perform remote audits. As a distributed storage system, it is a highly available solution, with data sliced into multiple segments that are stored redundantly across at least five different systems. The distributed nature also helps accelerate data access, because data is retrieved from multiple sources simultaneously rather than just one.
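
The client-side idea of “shred, encrypt and scatter” can be sketched in a few lines of Python. This is a conceptual illustration only, not Storj’s actual protocol, which adds erasure coding, audits and payment logic on top.

```python
import hashlib
from cryptography.fernet import Fernet

def shred_and_encrypt(data: bytes, shard_size: int = 1 << 20):
    """Split a blob into fixed-size shards and encrypt each one.

    Only the uploader holds the key; a farmer storing a shard sees
    ciphertext plus a content hash used later for integrity checks.
    """
    key = Fernet.generate_key()
    f = Fernet(key)
    shards = []
    for offset in range(0, len(data), shard_size):
        ciphertext = f.encrypt(data[offset:offset + shard_size])
        shard_id = hashlib.sha256(ciphertext).hexdigest()
        shards.append((shard_id, ciphertext))
    return key, shards

if __name__ == "__main__":
    key, shards = shred_and_encrypt(b"contract scans and backups" * 100000)
    # In a real network each shard would be sent to several independent
    # farmer nodes, with the (shard_id -> locations) map kept by the client.
    print(f"{len(shards)} encrypted shards ready for distribution")
```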

    Vexata | CEO: Zahid Hussain

Vexata’s active data infrastructure solution aims to improve performance at scale for I/O intensive applications.   The system presents a block or unstructured data I/O interface to enable applications to access and update large volumes of data at high throughput and low latency, and can be deployed as a fully contained or cloud deployed solution.  Based on their VX-OS software, their SSD systems can be deployed in both enterprise and cloud data center environments.

Their file storage system OS is well suited for business critical enterprise data architectures, media and entertainment workflows, and high performance data analytics.  VX-OS  is a scalable, resilient file storage system that supports industry standard protocols (like NFSv3 and GPFS), while providing over 1M random file IOPS, 50GB/s read and 20GB/s write bandwidth, and up to 180TB of protected capacity. It also supports enterprise class features such as file-based snapshots/clones and replication, as well as data-at-rest encryption without the huge performance penalty.

   Wasabi | CEO: David Friend

Wasabi offers cloud based object storage as a service.  You can read more about Object storage in my Primer on Object Storage article, but in a nutshell it refers to a data storage approach that stores information as individual objects in digital buckets, as opposed to storing files in a hierarchical or block fashion.  They claim that their storage service is significantly faster and cheaper than competing products and offers the same levels of reliability, and that their service can read and write data more than six times as fast as Amazon’s S3, while maintaining 100% compatibility with the Amazon S3 API.  Their prices are also claimed to be around 1/5th the cost of S3, Microsoft Azure, and Google Cloud.

   WekaIO | CEO: Liran Zvibel

WekaIO’s core product is WekaIO Matrix, a cloud-native, scalable file system that provides all-flash storage performance with the simplicity of NAS. WekaIO Matrix offers dynamic scaling of resources based on application requirements. It is a distributed global namespace file system that can scale to thousands of compute nodes and petabytes of storage, and also provides integrated tiering to the cloud.

Their software deploys on industry-standard commodity servers. They have reference architectures for HPE Apollo and Dell EMC servers, with Supermicro and Lenovo in the pipeline.  It runs on bare-metal servers, virtual machines or in containers. The software scales from 6-240 nodes with little to no latency impact.  As a frame of reference, they note that a 30-node cluster can address up to 2PB of storage with up to 1.8M IOPS and 60GBps of bandwidth.  They also position themselves as a file access cloud storage option that sidesteps the limitations of existing Amazon storage services.


Blockchain and Enterprise Storage

The biggest value of blockchain in enterprise storage will be what it enables, not what it is.  While it has yet to be fully embraced by the enterprise, blockchain is well poised to change enterprise IT much like open source software did 20+ years ago.  Interest is steadily rising, and there is evidence that businesses are starting to investigate how blockchain technology will integrate into their future business goals and objectives. In this post I’m going to dive in to what exactly blockchain is, how it works, how it may be applied in the enterprise storage space, and how it’s already starting to be used in various global industries.

What is Blockchain technology?

Blockchain is a distributed ledger that maintains a continuously growing number of data records and transactions. It is a chain of transaction blocks built in adherence to a defined set of rules. It allows organizations that don’t trust each other to agree on database updates. Rather than using a central third party or an offline reconciliation process, Blockchain uses peer-to-peer protocols. As a distributed database, Blockchain provides a near real-time, permanent record that’s replicated among the participants. Bitcoin, probably the most well-known cryptocurrency right now, was made possible by Blockchain; it’s the core of the Bitcoin payment system.

What are the main characteristics of Blockchain?

There are a defined set of characteristics that make blockchain what it is. It is both a network and a database. It has rules and built-in security and it maintains internal integrity and its own history. Let’s take a look at the main characteristics of blockchain.

1. Decentralized.  Blockchain is decentralized; there is no central authority required to approve transactions. It is a system of peer-to-peer validating nodes. Because there are no intermediaries, transactions are made directly and each node maintains the ledger of updates.

2. External clients manage changes.  Changes to the ledger are triggered by transactions proposed by external parties through clients. When triggered by transactions, blockchain participants execute business logic and follow consensus protocols to verify the results.

3. Shared and publicly distributed.  Participants in the ledger maintain the blocks. When consensus is reached under the network’s rules, transactions and their results are grouped into cryptographically secured, immutable data blocks that are appended to the ledger by each participant. All members of the blockchain network can see the same transaction history in the same order.

4. Trusted Transactions.  The distributed nature of the network requires nodes to come to consensus, which enables transactions to be carried out between unknown parties.

5. Secure Transactions.  Strong Cryptography is added to each block. In addition to all of its transactions and their results, each block includes a cryptographic hash of the previous block, which ensures that any tampering with a particular block is easily detected. Blockchain provides transaction and data security. The ledger is an unchangeable record. Posts to it cannot be revised or tampered with, even by database operators.
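
A tiny sketch makes the chaining idea concrete: each block commits to the hash of the block before it, so editing an old block invalidates every block that follows.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash the block's canonical JSON representation."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def add_block(chain: list, transactions: list) -> None:
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify(chain: list) -> bool:
    """Recompute every link; any tampering breaks the chain."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return False
    return True

chain = []
add_block(chain, [{"from": "alice", "to": "bob", "amount": 5}])
add_block(chain, [{"from": "bob", "to": "carol", "amount": 2}])
print(verify(chain))                         # True

chain[0]["transactions"][0]["amount"] = 500  # try to rewrite history
print(verify(chain))                         # False: the hashes no longer match
```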

How Blockchain Works

Consensus in Blockchain

Consensus is at the heart of the blockchain. To keep the integrity of its database, a consensus protocol is used that considers the longest chain to always be the most trustworthy, and nodes are only allowed to add blocks to the chain if they solve an arbitrary mathematical puzzle. These rules define which changes are allowed to be made to the database, who may make them, and when they can be made. One of the most important aspects of the consensus protocol concerns the rules governing how and when blocks are added to the chain. This is vitally important because, for blockchains to be useful, they must establish an unchangeable timeline of events that is agreed upon by all nodes, so that all nodes can agree on the current state of the database. The timeline cannot be subject to censorship, thus no single node may be entrusted with control over what enters it and when.

Proof of Work is the original consensus protocol and is used by Bitcoin and Ethereum. Proof of Work is based on puzzles that are difficult to solve but have an easily verifiable solution.  It can be thought of like a jigsaw puzzle.  While many hours of effort may be required to piece a puzzle together, it takes only a momentary glance to see that it has been correctly assembled. With proof of work consensus, the effort required to solve a puzzle is the “work” and the solution is the “proof of work.”  The fact that the solution to the puzzle is known proves that someone did the work to find that solution.
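
In code, the puzzle is typically a search for a nonce that makes a block’s hash meet a difficulty target: finding the nonce takes many hash attempts, while checking a proposed solution takes just one. The toy sketch below is far simpler than Bitcoin’s actual difficulty rules, but it shows that asymmetry.

```python
import hashlib

def mine(block_data: str, difficulty: int = 4):
    """Search for a nonce so the block hash starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest  # the "proof of work"
        nonce += 1

def verify(block_data: str, nonce: int, difficulty: int = 4) -> bool:
    """One hash is enough to check the proof, like glancing at a finished puzzle."""
    digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce, digest = mine("prev=00ab12;tx=[alice->bob:5]")
print(nonce, digest)
print(verify("prev=00ab12;tx=[alice->bob:5]", nonce))  # True
```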

Blockchains that utilize proof of work consensus require proof for each new block to be added to the chain, thus requiring work to be done to create new blocks. This work is frequently referred to as mining. Proof of work consensus protocols state that the chain containing the most blocks is the correct chain because it contains the most work. Blockchains which use proof of work are regarded as secure timelines because if one node attempted to rewrite history by changing an old block, its change would invalidate the work on the block it changed and all blocks after it by making the proofs incorrect. While experimentation with different consensus mechanisms continues, proof of work is by far the most widely adopted. There are alternatives however, so let’s take a brief look at some of them.

Proof-of-Stake.  In proof of stake, participants are required to maintain stocks of the currency (or tokens) to use the system. Creators of a new block are chosen deterministically depending on their stake.

Proof-of-Activity.  In proof of activity, proof of work and proof of stake are used at the same time to help alleviate the issue of hash rate escalation. Hash rate is a measure of how much computing power the Bitcoin network is consuming to remain continuously functional.

Proof-of-Burn.   With proof of burn, instead of trying arbitrarily large numbers of hashes to solve a puzzle as with the proof of work method, the system runs a lottery and tokens are burned so a node can try to win a block.

Proof-of-Capacity.  Proof of capacity is similar to proof of stake, but it is measured in hardware capacity that is dedicated to the network.

Federated Byzantine Agreement.  This is designed for private, permissioned blockchains (like Hyperledger) where good behavior is an expectation, so it can use less resource-intensive methods. This approach offers more flexibility with trust because a fork can be agreed upon by its members.

How can Blockchain be used in Enterprise Storage?

Enterprises that need fast data access and physical control of their files, along with businesses that must adhere to strict regulatory requirements about access policies and in-country data location, may have trouble applying the technology. Blockchain doesn’t meet those requirements in a traditional sense, most notably because of its distributed nature. For enterprise environments with less stringent regulatory requirements, it could still be an attractive option. The main benefits relate to its redundancy and reduced cost, and the cost savings could be the major driver toward this technology in the enterprise. Let’s take a look at some of the primary benefits of adopting the technology in the enterprise.

The primary benefits of blockchain in the enterprise

1. Decentralization and Redundancy.  Amazon S3 achieves redundancy by spreading files across its regional data centers, but each of those data centers remains a potential point of failure. On a decentralized blockchain where data is stored on many individual nodes across the globe, it is much more difficult to create disruptions.

2. Privacy.  No third party controls user data or has access to user files. Each node only stores encrypted fragments of user data and users control their own keys.

3. Huge cost reductions.  Blockchain storage costs around $2 per terabyte per month. In comparison, S3 hosting from Amazon can cost over $20 per month per terabyte.

4. The Bottom Line.  Companies are always looking to increase revenues, cut costs, and reduce risks. Blockchain technology has the potential to address those core, bottom line issues.

The Elements of Blockchain in the Enterprise

How can blockchain be implemented in an existing enterprise storage environment? Steve Todd from Dell EMC started by defining the basic elements of blockchain and the questions that need to be asked and answered in order to implement blockchain solutions in the enterprise. I’ve copied his questions below. It’s very high level, but it’s a good start in establishing a baseline for an enterprise blockchain implementation.

1. New business logic.  What new business logic is being written, and what is its purpose? Will modern application development processes be used to develop the new logic? How will this code be deployed when compared against existing application deployment frameworks? Will your business logic be portable across blockchains?

2. Smart Contracts. How are smart contracts deployed compared to existing application deployment? Are these contracts secure (e.g. encrypted)? Are they well-written? How easy are they to consume? Do they lock-in application developers to a certain platform? Are metrics collected to measure usage? Are access attempts logged securely?

3. Cryptography. Given the liberal use of cryptography within blockchains, which libraries will be used within the underlying ledger? How are these libraries maintained and used across ledgers? What role does cryptography play in different consensus algorithms?

4. Identity / Key Management. The use of private and public keys in a blockchain is foundational. How are these keys managed in comparison to other corporate key management systems? How do corporate identities translate to shared identities with other nodes on a blockchain network?

5. Network Programmability.  How will the network between blockchain nodes be instantiated, tuned, and controlled? How will application SLAs for latency be translated into adequately-performing network operations? Will blockchain transactions be distributed as cleartext or encrypted?

6. Consensus Algorithms.  How will decisions be made to accept/reject transactions? What is the “speed to finality” of these decisions? What are the scalability limits of the consensus algorithm? How much fault tolerance is built into the consensus? How much does performance suffer when fault tolerance limits are reached?

7. Off-chain Storage.  What kind of data assets are recorded within the ledger? Are they consistently referenced? How are access permissions consistently enforced between the ledger and off-chain assets? Do all consensus nodes have the ability to verify all off-chain data assets?

8. Data Protection.  How is data consistency enforced within the ledger? Do corrupted transactions throw an exception? How are corrupted transactions repaired? Does every consensus node always store every single transaction locally? Can deduplication or compression occur? Can snapshot copies of the ledger be created for analysis purposes?

9. Integration with Legacy.  Do the ledger and consensus engine exist on the same converged platform as other business logic? Will there be integration connectors that copy and/or transform the ledger for other purposes? Is the ledger accessible to corporate analytic workspaces?

10. Multi-chain.  How will the ledger interact with the reality of a multi-chain world (e.g. Quorum, Hyperledger, Ethereum)? How will the ledger interact with non-chain ledgers (e.g. Corda)? Will there be a common API to access different blockchains?

11. Cloud automation.  Can routine blockchain tasks be automated? Will cloud providers offer non-repudiation and/or blockchain governance? Can blockchain app developers execute test/dev processes in one cloud provider environment and then push to a (different) cloud production environment?

Blockchain Cloud Storage in the Marketplace

There are multiple blockchain-powered distributed cloud storage offerings that I’m aware of, and there are likely more to come. These organizations use blockchain technology to take advantage of their users’ spare hard drive space to build decentralized competitors to services like Amazon Web Services and Dropbox.

• Storj
• Filecoin
• Sia
• MaidSafe
• Cryptyk

All of these options provide decentralized cloud-based storage. Customers who use their services allocate a portion of their local storage for cloud-based storage. It’s akin to a decentralized, blockchain-powered version of Amazon Web Services. They all show that a public ledger can be used to facilitate a distributed public cloud, but I think it’s unlikely to be used for mission critical enterprise storage in the near future, at least until some of the basic questions about the elements of blockchain in the enterprise are answered, as I detailed in the previous section.

As cloud based storage becomes more relevant over time, the number of blockchain solutions similar to these projects will surely increase. Blockchain’s decentralization, speed, and reliability give it an inherent advantage over centralized cloud services, which require the storage of data in data centers with high costs and maintenance requirements. Blockchain technology will likely play an increasingly important role in decreasing the costs and increasing the security and efficiency of how data storage is implemented.

Blockchain Storage Provider Operations

I thought it would be interesting to take a look at how these existing competitors implement blockchain and how they market their services.  In addition to the security benefits,  overall these decentralized cloud storage providers seem to be marketed as being inexpensive storage for general consumers. A terabyte of storage at Sia costs about $2 per month. Storj charges by gigabyte, starting at $0.015 per gigabyte per month.

Storj, Sia, MaidSafe and Filecoin are built with a proprietary storage marketplace where users can buy and sell storage space.  They all use mining to provide compute power for the network.

Filecoin miners are given token rewards for hosting files, but also must prove that they are continuously replicating the files for more secure storage. Miners are also rewarded for distributing content quickly, as the miner that can do this the fastest ends up with the tokens. Filecoin and Sia both support smart contracts on the blockchain that set the consensus rules and requirements for storage; Storj users, by contrast, pay only for what they consume. Filecoin also aims to allow the exchange of its tokens with fiat currencies and other tokens via wallets and exchanges.

In Maidsafe’s network, Safecoin is paid to users as data is retrieved. It’s done in a lottery system where a miner is rewarded at random. The amount of Safecoin earned is directly linked to the resources a user provides and how often their shared storage is available and online. Maidsafe miners rent their unused compute resources to the SAFE network (capacity, CPU, and bandwidth) and are paid in Safecoin. The SAFE network also supports a marketplace in which Safecoin is used to access applications, with part of the payment going to the application’s developer. Miners can also sell the coins they earn for other digital currencies, and these transactions can happen either on the network or directly between individuals.

All of these service providers store data with erasure coding.  Files are split apart and distributed across many locations and servers, which eliminates the chance of a single point of failure causing catastrophic data loss. Filecoin uses the IPFS distributed web protocol, allowing nodes to continue to communicate even if the rest of the network goes down.

Business Benefits

Blockchain technology implementation can provide a lot of benefits, most notably making interactions faster, safer and less expensive while ensuring data security. Although blockchain technology is primarily associated with the financial industry, blockchain solutions have the potential to be a disruptive force in other business sectors as well.

At a high level, what are the main benefits of blockchain in a business environment?

Fewer Intermediaries.  Blockchain avoids centralized intermediaries by using a peer to peer business network.

Faster, More Automated Processes.  Businesses can automate their data exchange and the processes that depend on it and eliminate offline or batch reconciliation. Businesses can automatically trigger actions, events, and even payments based on preset conditions, with the potential for dramatic performance improvements.

Reduced Costs.  Businesses can lower costs by accelerating transactions and eliminating settlement processes by using a trusted, shared fabric of common information instead of relying on centralized intermediaries or complex reconciliation processes.

Increased Visibility.  Businesses can gain near real-time visibility into their distributed transactions across their networks, and maintain a shared system of records.

Enhanced Security.  Businesses can reduce fraud while at the same time increase regulatory compliance with tamper-proof business-critical records. They can secure their data by using cryptographically linked blocks so that records cannot be altered without detection.

With that in mind, let’s consider the most likely scenarios for Blockchain implementation in business. How exactly is blockchain technology being used in the industry today, and how may it be used in the future?

Blockchain in the Energy industry

The German company Share&Charge and California based eMotorWerks announced they are testing the first phase of a peer-to-peer electric vehicle charging network with blockchain payments. The technology has been called an “AirBnB for EVs,” and will allow EV owners to rent out their charging stations, set their own prices and receive payments via Bitcoin. The technology aims to prove that blockchain technology can make sharing and payment easier and more efficient and at the same time decrease the range anxiety that EV drivers experience.

The companies say that the partnership is the first peer to peer charging network to use blockchain technology in North America. The new P2P network was made available in California starting in August 2017, and a planned expansion to other states is in the works.

Blockchain technology in Banking and Finance

Blockchain solutions are looking to revolutionize how we transfer funds in a business environment. As transactions within Blockchain occur without intermediaries or any kind of central authority, a direct payment flow between customers around the world is easily accomplished. Blockchain application development is booming as more and more startups attempt to innovate the payment chain. Abra, a good example of a recent Blockchain startup, offers a digital wallet mobile app using Bitcoin currency. There is intense interest in Blockchain in the finance sector. R3 CEV, a New York-based company that runs a consortium of banks, recently released a new version of its blockchain platform (Corda) that it hopes will make it easier for financial firms to use the technology. Banks and other financial institutions have been investing in the technology for the past few years in the hope that it can be used to automate some of their back office processes such as securities settlement and regulatory reporting.

A report from Accenture claimed blockchain technology could potentially reduce infrastructure costs for eight of the world’s ten largest investment banks by an average of 30%, which would result in $8 to $12 billion in annual cost savings. The savings, according to Accenture, would come from replacing traditionally fragmented database systems that support transaction processing with blockchain’s distributed ledger system. That would allow banks to reduce or eliminate reconciliation costs and improve data quality.

In addition, Accenture, J.P. Morgan Chase and Microsoft were among 30 companies that announced the formation of the Enterprise Ethereum Alliance, aimed at creating a standard version of the platform for financial transaction processing and tracking.

Blockchain in the Insurance industry

Insurance interest in blockchain appears to be growing. Blockchain has the potential to vastly improve the nature of claims processing and fraud detection in the insurance industry.

Smart contracts on a blockchain could reduce many of the typical issues involved with insurance contracts. Insured individuals usually find insurance contracts long and confusing, and insurance companies are constantly battling fraud. Using blockchain and smart contracts, both sides could benefit from managing claims in a more responsive and transparent way, and recording and verifying contracts on the blockchain could be a great start. When claims are submitted, blockchain could ensure that only valid claims are paid, as the network would know if there were multiple claims submitted for the same accident. When specific criteria are met, a blockchain could trigger payment of the claim without any human intervention, improving the time it takes to resolve claims.

Blockchain also has great potential to detect and prevent fraudulent activity as well. Because validation is at the core of blockchain technology’s decentralized repository, its historical record can independently verify the validity of customers, policies and all transactions.

In the summer of 2017, blockchain firm Bitfury teamed up with insurance broker Risk Cooperative. The partnership seeks to leverage Bitfury’s expertise in blockchain applications across a range of sectors and Risk Cooperative’s insurance placement platform and partnership model with leading insurers to spur adoption of blockchain in the insurance space.

Blockchain perspectives in Supply Chain Management

Blockchain has the potential to transform the supply chain and disrupt the way we produce, market, purchase and consume goods. The transparency and security it adds to the supply chain will make huge improvements, making our economies safer and more reliable by promoting trust and preventing the implementation of questionable business practices.

Microsoft’s blockchain supply chain group, Project Manifest, is testing the ability to track inventory on cargo ships, trains and trucks using RFID tags that link back to blockchain technologies. Though Microsoft hasn’t shared many details about the project yet, it appears it is working with partners to track things like auto parts to address cross-industry supply chains, which are very complex.

IBM offers a service that allows customers to test blockchains in a secure cloud and track high-value items through complex supply chains. The service is being used by Everledger, a firm that is trying to use the blockchain to push transparency into the diamond supply chain. Finnish startup Kouvola Innovation is working on a blockchain solution that enables smart tendering across the supply chain.

Blockchain smart contracts are being used to address everything from shipment to receipt of inventory between all parties in various supply chains. Doing so could reduce complexity and the number of counterfeit items that enter the supply chain.

Blockchain in the Healthcare Industry

There are plenty of opportunities to leverage blockchain technology in healthcare, from medical records to the pharmaceutical supply chain to smart contracts for payment distribution. While progress has been slow, there are innovations in the healthcare industry taking place.

MediLedger successfully brings pharmaceutical manufacturers and wholesalers who compete with each other to the same negotiating table. They designed and implemented a process for using blockchain technology to improve tracking and tracing capabilities for prescriptions. They also successfully developed a blockchain solution that allows full privacy with no leaking of business intelligence, while still allowing the capability of drug verification and provenance reporting.

Built to support the requirements of the U.S. Drug Supply Chain Security Act (DSCSA), MediLedger also outlines steps to build an electronic, interoperable system to identify and trace certain prescription drugs, meaning it successfully met not just the law, but the operational needs of industry.

Additional projects were kicked off by SimplyVitalHealth and Robomed, where they focused on developing an audit trail and smart contracts between healthcare providers and patients, respectively.

Blockchain solutions for Online Voting

Blockchain could be the missing link in the architecture of an effective and secure online voting system, and could resolve major issues related to the privacy, transparency, and security of online voting.

Using blockchain technology, we can make sure that those who are voting are who they say they are and are legally allowed to vote. We can also make voting online more accessible, as anyone who knows how to use a cell phone can understand the technology required for voting, all while making the election process more secure than it currently is and allowing greater participation for all legally-registered voters.

Sovereign was unveiled in September 2017 by Democracy Earth, a not-for-profit organisation in Palo Alto, California. It combines liquid democracy – which gives individuals more flexibility in how they use their votes – with blockchains, digital ledgers of transactions that keep cryptocurrencies like bitcoin secure. Sovereign’s developers hope it could signal the beginning of a democratic system that transcends national borders.

The basic concept of liquid democracy is that voters can express their wishes on an issue directly or delegate their vote to someone else they think is better-placed to decide on their behalf. In turn, those delegates can also pass those votes upwards through the chain. Crucially, users can see how their delegate voted and reclaim their vote to use themselves.  It sits on existing blockchain software platforms, such as Ethereum, but instead of producing units of cryptocurrency, it creates a finite number of tokens called “votes”. These are assigned to registered users who can vote as part of organisations who set themselves up on the network, whether that is a political party, a municipality, a country or even a co-operatively run company.

No knowledge of blockchains is required – voters simply use an app. Votes are then “dripped” into their accounts over time like a universal basic income of votes. Users can debate with each other before deciding which way to vote.

Blockchain usage in Stock Trading

Some of the most prominent stock exchanges are looking at ways to leverage blockchain to fundamentally overhaul traditional mechanisms. Blockchain could enable savings by reducing duplication of processes, settlement time, collateral requirements and operational overheads. This would minimize the need to set aside financial resources to cater to counterparty risks and achieve higher anti-money laundering standards and reduced risk exposure.

Nasdaq has been at the forefront of blockchain innovation. At the turn of 2015, Nasdaq unveiled the use of its Nasdaq Linq blockchain ledger technology to successfully complete and record private securities transactions for Chain.com—the inaugural Nasdaq Linq client. In May, Nasdaq and Citi announced an integrated payment solution using a distributed ledger to record and transmit payment instructions based on Chain’s blockchain technology. The technology overcomes challenges of liquidity in private securities by streamlining payment transactions between multiple parties.

The path to its adoption will require resolving issues such as scalability, common standards, regulation, and legislation. Blockchain could revolutionize the core infrastructure systems of capital markets around the globe, bringing in greater transparency and efficiency.

Storage Class Memory and Emerging Technologies

I mentioned in my earlier post, The Future of Storage Administration, that Flash will continue to dominate the industry and will be embraced by the enterprise, which I believe will drive newer technologies like NVMe and diminish older technologies like Fibre Channel.  While there is a lot of agreement over the latest storage technologies that are driving the adoption of flash in the enterprise, including the aforementioned NVMe, there doesn’t seem to be nearly as much agreement on what the “next big thing” will be in the enterprise storage space.  NVMe and NVMe-oF are definitely being driven by the trend towards the all-flash data center, and Storage Class Memory (SCM) is certainly a relevant trend that could be that “next big thing”.  Before I continue, what are NVMe, NVMe-oF and SCM?

  • NVMe is a protocol that allows fast access to direct-attached flash storage. NVMe is considered an evolutionary step toward exploiting the inherent parallelism built into SSDs.
  • NVMe-oF allows the advantages of NVMe to be used on a fabric connecting hosts with networked storage. With the increased adoption of low latency, high bandwidth network fabrics like 10Gb+ Ethernet and InfiniBand, it becomes possible to build an infrastructure that extends the performance advantages of NVMe over standard fabrics to access low latency persistent storage.
  • SCM (Storage Class Memory) is a technology that places memory and storage on what looks like a standard DIMM board, which can be connected over NVMe or the memory bus.  I’ll dive in a bit more later on.

In the coming years, you’ll likely see every major storage vendor rolling out their own solutions for NVMe, NVMe-oF, and SCM.  The technologies alone won’t mean anything without optimization of the OS/hypervisor, drivers, and protocols, however. The NVMe software will need to be designed to take advantage of the low latency transport and media.

Enter Storage Class Memory

SCM is a hybrid memory and storage paradigm, placing memory and storage on what looks like a standard DIMM board.  It’s been gaining a lot of attention at storage industry conferences for the past year or two.  Modern solid-state drives are a compromise: even though the media is flash, they are still packaged and accessed with all the bottlenecks of legacy disk drives, even when bundled into modern enterprise arrays.  SCM is not exactly memory and it’s not exactly storage.  It physically connects to memory slots in a mainboard just like traditional DRAM.  It is a little bit slower than DRAM, but it is persistent, so just like traditional storage all content is preserved across a power cycle.  Compared to flash, SCM is orders of magnitude faster and offers equal performance gains on read and write operations.  In addition, SCM tiers are much more resilient and do not have the same wear pattern problems as flash.

A large gap exists between DRAM as a main memory and traditional SSD and HDD storage in terms of performance vs. cost, and SCM looks to address that gap.

The next-generation technologies that will drive SCM aim to be denser than current DRAM along with being faster, more durable, and hopefully cheaper than NAND solutions.  SCM, when connected over NVMe technology or directly on the memory bus, will enable device latencies to be about 10x lower than those provided by NAND-based SSDs.  SCM can also be up to 10x faster than NAND flash although at a higher cost than NAND-based SSDs. Similarly, NAND flash started out at least 10x more expensive than the dominant 15K RPM HDD media when it was introduced. Prices will come down.

Because the expected media latencies for SCM (<2us) are lower than typical network latencies (<5us), SCM will probably end up being more common in servers than out on the network.  Either way, SCM in a storage system will help accelerate metadata access and improve overall system performance.  Using NVMe-oF to provide low-latency access to networked storage, SCM could also potentially be used to create a different tier of network storage.
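A quick back-of-the-envelope comparison makes the point. Using the figures above, plus an assumed ~80µs read latency for NAND flash (a ballpark number, not a measurement), the fabric adds only a few percent to a NAND access but multiplies an SCM access several times over:

```python
# Back-of-the-envelope latency budgets in microseconds (illustrative figures only).
NAND_MEDIA_US = 80   # assumed typical NAND flash read latency
SCM_MEDIA_US = 2     # expected SCM media latency (<2us, per the text above)
FABRIC_US = 5        # expected NVMe-oF network latency (<5us, per the text above)

scenarios = {
    "Local NAND SSD": NAND_MEDIA_US,
    "Networked NAND (NVMe-oF)": NAND_MEDIA_US + FABRIC_US,
    "Local SCM": SCM_MEDIA_US,
    "Networked SCM (NVMe-oF)": SCM_MEDIA_US + FABRIC_US,
}

for name, total in scenarios.items():
    # Show the total access time and how much of it the fabric consumes.
    overhead = FABRIC_US / total * 100 if "Networked" in name else 0
    print(f"{name:26s} ~{total:3d} us total  ({overhead:4.1f}% spent on the fabric)")
```

For NAND the network is close to a rounding error; for SCM it dominates the access time, which is exactly why server-side placement is the more natural first home for SCM.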

The SCM Vision

It sounds great, right?  The concept of Storage Class Memory has been around for a while, but it’s been a hard-to-reach, albeit very desirable, goal for storage professionals. The common vision seems to be a new paradigm where data lives in fast, DRAM-like storage, and the data in memory, rather than the compute functions, becomes the center of the computer. The main problem with this vision is how we get the system and applications to recognize that something beyond just DRAM is available for use, and that it can be used as either data storage or as persistent memory.

We know that SCM will allow huge volumes of I/O to be served from memory and potentially stored in memory.  There will be less need to create multiple copies to protect against controller or server failure.  Exactly how this will be done remains to be seen, but there are obvious benefits from not having to continuously commit writes to slow external disks.  Once all the hurdles are overcome, SCM should have broad applicability in SSDs, storage controllers, PCI or NVMe boards and DIMMs.

Software Support

With SCM, applications won’t need to execute write IOs to get data into persistent storage; a memory-level, zero-copy operation that moves data into XPoint will take care of that. That is just one example of the changes that systems and software will have to take on board when a hardware option like XPoint is treated as persistent storage-class memory.  Most importantly, the following must also be developed:

  • File systems that are aware of persistent memory
  • Operating system support for storage-class memory
  • Processors designed to use hybrid DRAM and XPoint memory

With that said, the industry is well on its way. Microsoft has added XPoint storage-class memory support to Windows Server 2016, which provides zero-copy access through Direct Access Storage volumes, known as DAX volumes.  On the Linux side, Red Hat operating system support is in place to use these devices as fast disks in sector mode with BTT (the Block Translation Table), and this use case is fully supported in RHEL 7.3.
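To show what that zero-copy, byte-addressable model looks like from an application’s point of view, here is a small generic Python sketch that memory-maps a file and updates it with plain stores. On a DAX-mounted filesystem (or a Windows DAX volume) backed by persistent memory, accesses like these bypass the block layer entirely; the mount path below is hypothetical and the example is not tied to any particular vendor SDK.

```python
# Sketch of byte-addressable access to a memory-mapped file. On a DAX-capable
# filesystem backed by persistent memory, loads and stores like these reach the
# media without any block-layer write IO. The path below is hypothetical.
import mmap
import os

PATH = "/mnt/pmem0/example.dat"   # assumed DAX mount point
SIZE = 4096

# Create and size the backing file.
fd = os.open(PATH, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)

# Map it into the address space and update it like ordinary memory.
with mmap.mmap(fd, SIZE) as buf:
    buf[0:5] = b"hello"   # a plain store, no write() system call
    buf.flush()           # ask the OS/CPU to make the update durable
os.close(fd)
```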

Hardware

SCM can be implemented with a variety of current technologies, notably Intel Optane, ReRAM, and NVDIMM-P.

Intel has introduced Optane-brand XPoint SSDs as well as XPoint DIMMs; the DIMMs sit directly on the memory bus rather than the relatively slower PCIe bus used by the NVMe XPoint drives.

Resistive Random-Access Memory (ReRAM) is still an up-and-coming technology comparable to Intel’s XPoint. It is currently under development by a number of companies and is positioned as a potential replacement for flash memory. The costs and performance of ReRAM are not yet at a level that makes the technology ready for the mass market. Developers of ReRAM technology all face similar challenges: overcoming temperature sensitivity, integrating with standard CMOS technology and manufacturing processes, and limiting the effects of sneak path currents, which would otherwise disrupt the stability of the data contained in each memory cell.

NVDIMM stands for “Nonvolatile Dual-Inline Memory Module.” The NVDIMM-P specification is being developed to support NAND flash directly on the host memory interface.  NVDIMMs use predictive software that allocates data in advance between DRAM and NAND.  NVDIMM-P is limited in that even though the NAND flash physically sits on the DIMM alongside the DRAM, the traditional memory hierarchy stays the same: the NAND still works as a storage device and the DRAM still works as main memory.

HP worked for years developing its Machine project.  Their effort revolved around memory-driven computing and an architecture aimed at big data workloads, and their goal was eliminating inefficiencies in how memory, storage, and processors interact.  While the project appears to now be dead, the technologies they developed will live on in current and future HP products. Here’s what we’ll likely see out of their research:

  • Now: ProLiant boxes with persistent memory for applications to use, using a mix of DRAM and flash.
  • Next year: Improved DRAM-based persistent memory.
  • Two-three years: True non-volatile memory (NVM) for software to use as slow but high volume RAM.
  • Three-Four years: NVM technology across many product categories.

SCM Use Cases

I think SCM’s most exciting use case for high performance computing may be its use as nonvolatile memory that is tightly coupled to an application. SCM has the potential to dramatically affect the storage landscape in high performance computing, and application and storage developers will have fantastic opportunities to take advantage of this unique new technology.

Intel touts fast storage, cache, and extended memory as the primary use cases for their Optane product line.  Fast storage or cache refers to the tiering and layering which enable a better memory-to-storage hierarchy. The Optane product provides a new storage tier that breaks through the bottlenecks of traditional NAND storage to accelerate applications, and enable more work to get done per server. Intel’s extended memory use case describes the use of an Optane SSD to participate in a shared memory pool with DRAM at either the OS or application level enabling bigger memory or more affordable memory.

What the next generation of SCM will require is for the industry to come together, agree on common terminology, and generate some standards.  Those standards will be critical to supporting innovation. Industry experts seem to be saying that the adoption of SCM will evolve around use cases and workloads, and around task-specific, engineered machines that are built with real-time analytics in mind.  We’ll see what happens.

No matter what, new NVMe-based products coming out will definitely do a lot toward enabling fast data processing at a large scale, especially solutions that support the new NVMe-oF specification. SCM combined with software-defined storage controllers and NVMe-oF will enable users to pool flash storage drives and treat them as if they are one big local flash drive. Exciting indeed.

SCM may not turn out to be a panacea, and current NVMe flash storage systems will provide enough speed and bandwidth to handle even the most demanding compute requirements for the foreseeable future.  I’m looking forward to seeing where this technology takes us.

The Future of Storage Administration

What is the future of enterprise storage administration? Will the job of enterprise storage administrator still be necessary in 10 years? A friend in IT storage management recently asked me a question related to that very relevant topic. In a word, I believe the answer to that second question is a resounding yes, but alas, things are changing and we are going to have to embrace the inevitable changes in the industry to stay on top of our game and remain relevant.  In recent years I’ve seen the need to broaden my skills and demonstrate how my skills can actually drive business value. The modern data center is undergoing a tremendous transformation, with hyper-converged systems, open source solutions, software-defined storage, large cloud-scale storage systems that companies can throw together all by themselves, and many others. This transformation is being created by the need for business agility, and it’s being fueled by software innovation.

As the expansion of data into the cloud influences and changes our day-to-day management, we will begin to see the rise of the IT generalist in the storage world.  These inevitable changes and the new tools that manage them will mean that storage will likely move toward being procured and managed by IT generalists rather than specialists like myself. Hyper-converged infrastructures will allow these generalists to manage an entire infrastructure with a single, familiar set of tools.  As overall data center responsibilities start to shift to more generalist roles, traditional enterprise storage professionals like myself will need to expand our expertise beyond storage, or focus on more strategic projects where storage performance is critical.  I personally see us starting to move away from the day-to-day maintenance of infrastructure and more toward how IT can become a real driver of business value. The glory days of on-prem SAN and storage arrays are nearing an end, but we old timers in enterprise storage can still be a key part of the success of the businesses we work for. If we didn’t embrace change, we wouldn’t be in IT, right?

Despite all of these new technologies and trends, keep in mind that there are still some good reasons to take the classic architecture into consideration for new deployments. It’s not going to disappear overnight. It’s the business that drives the need for storage, and it’s the business applications that dictate the ideal architecture for your environment. Aside from the application, businesses will also be dependent on their existing in-house skills which will of course affect the overall cost analysis of embracing the new technologies, possibly pushing them off.

So, what are we in for? The following list summarizes my view on the key changes that I think we’ll see in the foreseeable future.  I’m guessing you’d see these (along with many others) pop up in pretty much any google search about the future of storage or storage trends, but these are the most relevant to what I’m personally witnessing.

  • The public cloud is an unstoppable force.  Embrace it as a storage admin or risk becoming irrelevant.
  • Hyper-converged systems will become more and more common, driven by market demand.
  • Hardware commoditization will continue to eat away at the proprietary hardware business.
  • Storage vendors will continue to consolidate.
  • We will see the rise of RDMA in enterprise storage.
  • Open Source storage software will mature and will see more widespread acceptance.
  • Flash continues to dominate and will be embraced by the enterprise, driving newer technologies like NVMe and diminishing technologies like Fibre Channel.
  • GDPR will drive increased spending on, and an overall focus on, data security.
  • Scale-out and object solutions will become increasingly important.
  • Data Management and automation will increase in importance.

Cloud
I believe that the future of cloud computing is undeniably hybrid. The future data center will likely represent a combination of cloud based software products and on-prem compute, creating a hybrid IT solution that balances the scalability and flexibility associated with cloud against the security and control you have with a private data center. With that said, I don’t believe that cloud is a panacea, as there are always concerns about security, privacy, backups, and especially performance. In my experience, when the companies I’ve worked for have directed us to investigate cloud options for specific applications, the analysis has shown that on-premises infrastructure costs less than public cloud in the long run. Even so, there is no doubting the inexorable shift of projects, infrastructure, and spending to the cloud, and it will affect compute, networking, software, and storage. I expect I’ll see more and more push to find more efficient solutions that offer lower costs, likely resulting in hybrid solutions. When moving to the cloud, monitoring consumption is the key to cost savings. Cost management tools from the likes of Cloudability, Cloud Cruiser and Cloudyn are available and well worth looking at.
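Even a crude model helps frame that capex-versus-opex conversation. The sketch below compares an up-front on-prem purchase against pay-as-you-go cloud capacity and reports a rough break-even month; every price and growth figure in it is made up purely for illustration.

```python
# Crude capex-vs-opex storage comparison (all figures are hypothetical).
ONPREM_CAPEX = 250_000          # up-front array purchase
ONPREM_OPEX_PER_MONTH = 3_000   # power, support contract, admin time
CLOUD_PER_TB_MONTH = 23         # pay-as-you-go storage rate per TB
START_TB, GROWTH_TB_PER_MONTH = 200, 10

onprem_total = ONPREM_CAPEX
cloud_total = 0.0
for month in range(1, 61):      # five-year horizon
    capacity_tb = START_TB + GROWTH_TB_PER_MONTH * month
    onprem_total += ONPREM_OPEX_PER_MONTH
    cloud_total += capacity_tb * CLOUD_PER_TB_MONTH
    if cloud_total >= onprem_total:
        print(f"Cloud spend passes on-prem spend in month {month}")
        break
else:
    print("Cloud stays cheaper over the five-year horizon")
```

With these made-up numbers the crossover lands a little over three years in, which matches my experience that the long-run math often favors on-prem for steady, predictable workloads, while the cloud wins when growth is uncertain or capacity needs are spiky.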

I’ve also heard, “the cloud is already in our data center, it’s just private”. Contrary to popular belief, private clouds are not simply existing data centers running virtualized, legacy workloads. They are highly-modernized application and service environments running on true cloud platforms (like AWS or Azure) residing either on-prem or in a hybrid scenario with a hosting services partner. As we shift more of our data to the cloud, we’ll see industry demand for storage move from “just in case” storage (an upfront capex model) to “just in time” storage (an ongoing opex model). “Just in time” storage has been a running joke for years for me in the more traditional data center environments that I’ve been responsible for, alluding to the fact that we’d get storage budget approved, ordered and installed just days before reaching full capacity. That’s not what I’m referring to in this case… “Just in time” means online service providers run at much higher asset utilization than the typical customer and can add capacity in more granular increments. The migration to cloud allows for a much more efficient “just in time” model than I’m used to, and allows the switch to an ongoing opex model.

Hyper Converged
A hyper-converged infrastructure can greatly simplify the management of IT and yes, it could reduce the need for skilled storage administrators: the complexities of storage, servers and networking that require separate skills to manage are hidden ‘under the hood’ by that software layer, allowing it to be managed by staff with more general IT skills through a single administrative interface. Hyperconverged infrastructure is also much easier to scale and in smaller increments than traditional integrated systems. Instead of making major infrastructure investments every few years, businesses can simply add modules of hyperconverged infrastructure when they are needed.

It seems like an easy sell. It’s a data center in a box. Fewer components, a smaller data center footprint, reduced energy consumption, lower cooling requirements, reduced complexity, rapid deployment time, fewer high level skill requirements, and reduced cost. What else could you ask for?

As it turns out, there are issues. Hyper-converged systems require a massive amount of interoperability testing, which means hardware and software updates take a very long time to be tested, approved and released. A brand new Intel chipset can take half a year to be approved. There is a tradeoff between performance and interoperability. In addition, you won’t be saving any money over a traditional implementation, hyper-converged requires vendor lock-in, and performance and capacity must be scaled out at the same time. Even with those potential pitfalls, hyper-converged systems are here to stay and will continue to be adopted at a fast pace in the industry. The pros tend to outweigh the cons.

Hardware Commoditization
The commoditization of hardware will continue to eat away at proprietary hardware businesses. The cost savings from economies of scale always seem to overpower the benefits of specialized solutions. Looking at history, there has been a long pattern of mass-market products completely wiping out low-volume high-end products, even superior ones. Open source software running on off-the-shelf hardware will become more common as we move toward the commoditization of storage.

I believe most enterprises in general lack the in-house talent required to combine third-party or open source storage software with commodity hardware in a way that can guarantee the scalability and resilience that would be required. I think we’re moving in that direction, but we’re not likely to see it become prevalent in enterprise storage soon.

The mix of storage vendors in typical enterprises is not likely to change radically anytime soon, but change is coming. Startups, even with their innovative storage software, have to deal with concerns about interoperability, supportability and resilience, and those concerns aren’t going anywhere. While the endorsement of a startup by one of the major vendors could change that, I think the current largest players like Dell/EMC and NetApp might be apprehensive about accelerating the move to storage hardware commoditization.

Open Source
I believe that software innovation has decisively shifted to open source, and we’re seeing that more and more in the enterprise storage space. You can take a look at many current open source solutions in my previous blog post here. Moreover, I can’t think of a single software market that has a proprietary packaged software vendor that defines and leads the field. Open source allows fast access to innovative software at little or no cost, allowing IT organizations to redirect their budget to other new initiatives.

Enterprise architecture groups, which have generally focused on deciding which proprietary vendor to lock themselves in to, are now faced with the onerous task of selecting the appropriate open source software components, figuring out how they’ll be integrated, and doing interoperability testing, all while ensuring that they maintain a reliable infrastructure for the business. As you might expect, implementing open source requires a much higher level of technical ability than traditional proprietary solutions. Having the programming knowledge to build and support an open source solution is far different than operating someone else’s supported solution. I’m seeing some traditional vendors move to the “milk the installed base” strategy and stifle their own internal innovation. If we want to showcase ourselves as technology leaders, we’re going to have to embrace open source solutions, despite the drawbacks.

While open source software can increase complexity and include poorly tested features and bugs, the overall maturity and general usability of Open Source storage software has been improving in recent years. With the right staff, implementation risks can be managed. For some businesses, the cost benefits of moving to that model are very tangible. Open source software has become commonplace in the enterprise, especially in the Linux realm. Linux of course pretty much started the open source movement, followed by widely adopted enterprise applications like MySQL, Apache, Hadoop. Open source software can allow businesses to develop IT solutions to address challenges that are customized and innovative while at the same time bring down acquisition costs by using commodity hardware.

NVMe
Storage industry analysts have predicted the slow death of Fibre Channel based storage for a long time. I expect that trend to speed up, with the steadily increasing speed of standard Ethernet all but eliminating the need for proprietary SAN connections and the expensive Fibre Channel infrastructure that comes along with them. NVMe over Ethernet will drive it. NVMe technology is a high performance interface for solid-state drives (SSDs) and, predictably, it will be embraced by all-flash vendors moving forward.

All the current storage trends you’ve read about around efficiency, flash, performance, big data, machine learning, object storage, hyper-converged infrastructure, etc. are moving against the current Fibre Channel standard. Production deployments are not yet widespread, but they’re coming. NVMe allows vendors and customers to get the most out of flash (and other non-volatile memory) storage. The rapid growth of all-flash arrays has kept Fibre Channel alive because an all-flash array typically replaces a legacy disk or hybrid Fibre Channel array.

Legacy Fibre Channel vendors like Emulex, QLogic, and Brocade have been acquired by larger companies so the larger companies can milk the cash flow from the expensive FC hardware before their customers convert to Ethernet. I don’t see any growth or innovation in the FC market moving forward.

Flash
In case you haven’t noticed, it’s near the end of 2017 and flash has taken over. It was widely predicted, and from what I’ve seen personally, those predictions absolutely came true. While it still may not rule the data center overall, new purchases have trended that way for quite some time now. Within the past year the organizations I’ve worked for have completely eliminated spinning disk from block storage purchases, instead relying on the value proposition of all-flash, with data reduction capabilities making up for the smaller raw footprint. SSDs are now growing in capacity faster than HDDs (15TB SSDs have been announced) and every storage vendor now has an all-flash offering.

Consolidate and Innovate
The environment for flash startups is getting harder because all the traditional vendors now offer their own all-flash options. There are still startups making exciting progress in NVMe over Fabrics, object storage, hyper-converged infrastructure, data classification, and persistent memory, but only a few can grow into profitability on their own. We will continue to see acquisitions of these smaller, innovative startups as the larger companies struggle to develop similar technologies internally.

RDMA
RDMA will continue to become more prevalent in enterprise storage, as it significantly boosts performance. RDMA, or Remote Direct Memory Access, has actually been around in the storage arena for quite a while as a cluster interconnect and for HPC storage. Most high-performance scale-out storage arrays use RDMA for their cluster communications; examples include Dell FluidCache for SAN, XtremIO, VMAX3, IBM XIV, InfiniDat, and Kaminario. A Microsoft blog post I was reading showed 28% more throughput with RDMA, realized by the reduced IO latency. It also illustrated that RDMA is more CPU efficient, which leaves the CPU available to run more virtual machines. TCP/IP is of course no slouch and is absolutely still a viable deployment option. While not quite as fast and efficient as RDMA, it will remain well suited for organizations that lack the expertise needed for RDMA.

The Importance of Scale-Out
Scale-up storage is showing its age. If you’re reading this, you probably know that scale-up is bound by the scalability limits of its storage controllers and has for years led to storage system sprawl. As we move into a multi data center architecture, especially in the world of object storage, clusters will be extended by adding nodes in different geographical areas. Because object storage is geo-aware (I am in the middle of a global ECS installation), policies can be established to distribute data into these other locations. As a user accesses the storage, the object storage system will return data from the node that provides the best response time for that user. As data storage needs continue to rapidly grow, it’s critical to move towards a scale-out architecture vs. scale-up. The scalability that scale-out storage offers will help reduce costs, complexity, and resource allocation.

GDPR
The General Data Protection Regulation takes effect in 2018 and applies to any entity doing business within any EU country. Under the GDPR, companies will need to build controls around security roles and levels in regard to data access and data transfer, and must provide tight data-breach mechanisms and notification protocols. As process controls these will probably have little impact on your infrastructure; however, the two main points within the GDPR with the most potential to directly impact storage are data protection by design and data privacy by default.

The GDPR is also going to require you to think about the benefits of cloud vs. on-prem solutions. Data will have to meet the principle of privacy by default, be in an easily portable format and meet the data minimization principle. Liability under the new regulation falls on all parties, however, so cloud providers will have to have robust compliance solutions in place as well, meaning a cloud or hybrid solution could be a simpler, less expensive route in the future.

Machine Learning, Cognitive Computing, and the Storage Industry

In context with my recent posts about object storage and software defined storage, this is another topic that simply interested me enough to want to do a bit of research about the topic in general, as well as how it relates to the industry that I work in.  I discovered that there is a wealth of information on the topics of Machine Learning, Cognitive Computing, Artificial Intelligence, and Neural Networking, so much that writing a summary is difficult to do.  Well, here’s my attempt.

There is pressure in the enterprise software space to incorporate new technologies in order to keep up with the needs of modern businesses. As we move farther into 2017, I believe we are approaching another turning point in technology where many concepts that were previously limited to academic research or industry niches are now being considered for actual mainstream enterprise software applications.  I believe you’ll see Machine learning and cognitive systems becoming more and more visible in the coming years in the enterprise storage space. For the storage industry, this is very good news. As this technology takes off, it will result in the need to retain massive amounts of unstructured data in order to train the cognitive systems. Once machines can learn for themselves, they will collect and generate a huge amount of data to be stored, intelligently categorized and subsequently analyzed.

The standard joke about artificial intelligence (or machine learning in general) is that, like nuclear fusion, it has been the future for more than half a century now.  My goal in this post is to define the concepts, look at ways this technology has already been implemented, look at how it affects the storage industry, and investigate use cases for this technology.  I’m writing this paragraph before I start, so we’ll see how that goes. 🙂

 What is Cognitive Computing?

Cognitive computing is the simulation of human thought processes using computerized modeling (the best-known example is probably IBM’s Watson). It incorporates self-learning systems that use data mining, pattern recognition and natural language processing to imitate the way our brains process thoughts. The goal of cognitive computing is to create automated IT systems that are capable of solving problems without requiring human assistance.

This sounds like the stuff of science fiction, right? HAL (from the movie “2001: A Space Odyssey”) came to the logical conclusion that his crew had to be eliminated. It’s my hope that intelligent storage arrays utilizing cognitive computing will come to the conclusion that 99.9 percent of stored data has no value and therefore should be deleted.  It would eliminate the need for me to build my case for archiving year after year. 🙂

Cognitive computing systems work by using machine learning algorithms; the two are inescapably linked. They continuously gather knowledge from the data fed into them by mining that data for information. The systems progressively refine the methods they use to look for and process data until they become capable of anticipating new problems and modeling possible solutions.

Cognitive computing is a new field that is just beginning to emerge. It’s about making computers more user friendly, with an interface that understands more of what the user wants. It takes signals about what the user is trying to do and provides an appropriate response. Siri, for example, can answer questions but also understands the context of the question. She can ascertain whether the user is in a car or at home, moving quickly and therefore driving, or moving more slowly while walking. This information contextualizes the potential range of responses, allowing for increased personalization.

What Is Machine Learning?

Machine Learning is a subset of the larger discipline of Artificial Intelligence, which involves the design and creation of systems that are able to learn based on the data they collect. A machine learning system learns by experience: based on specific training, the system is able to make generalizations from its exposure to a number of cases and can then act on new or unforeseen events. Amazon already uses this technology as part of its recommendation engine. It’s also commonly used by ad feed systems that serve ads based on web surfing history.

While machine learning is a tremendously powerful tool for extracting information from data, it’s not a silver bullet for every problem. The questions must be framed and presented in a way that allows the learning algorithms to answer them. Because the data needs to be set up in the appropriate way, that can add additional complexity. Sometimes the data needed to answer the questions may not be available. Once the results are available, they also need to be interpreted to be useful, and it’s essential to understand the context. A sales algorithm can tell a salesman what’s working the best, but he still needs to know how to best use that information to increase his profits.

What’s the difference?

Without cognition there cannot be good Artificial Intelligence, and without Artificial Intelligence cognition can never be expressed. Cognitive computing involves self-learning systems that use pattern recognition and natural language processing to mimic the way the human brain works. The goal of cognitive computing is to create automated systems that are capable of solving problems without requiring human assistance. Cognitive computing is used in A.I. applications, hence Cognitive Computing is actually a subset of Artificial Intelligence.

If this seems like a slew of terms that all mean almost the same thing, you’d be right. Cognitive Computing and Machine Learning can both be considered subsets of Artificial Intelligence. What’s the difference between artificial intelligence and cognitive computing? Let’s use a medical example. In an artificial intelligence system, machine learning would tell the doctor which course of action to take based on its analysis. In cognitive computing, the system would provide information to help the doctor decide, quite possibly with a natural language response (like IBM’s Watson).

In general, Cognitive computing systems include the following ostensible characteristics:

  • Machine Learning
  • Natural Language Processing
  • Adaptive algorithms
  • Highly developed pattern recognition
  • Neural Networking
  • Semantic understanding
  • Deep learning (Advanced Machine Learning)

How is Machine Learning currently visible in our everyday lives?

Machine Learning has fundamentally changed the methods in which businesses relate to their customers. When you click “like” on a Facebook post, your feed is dynamically adjusted to contain more content like that in the future. When you buy a Sony PlayStation on Amazon and it recommends that you also buy an extra controller and a top selling game for the console, that’s their recommendation engine at work. Both of those examples use machine learning technology, and both affect most people’s everyday lives. Machine learning technology delivers educated recommendations to people to help them make decisions in a world of almost endless choices.
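The “customers who bought X also bought Y” behavior can be approximated with something as simple as counting which items appear together in past orders. The toy sketch below (with invented baskets) shows the basic idea; a production recommendation engine is obviously far more sophisticated.

```python
# Toy "customers who bought this also bought" sketch using co-occurrence counts.
from collections import Counter, defaultdict
from itertools import permutations

orders = [
    {"playstation", "controller", "game"},
    {"playstation", "controller"},
    {"playstation", "game", "headset"},
    {"headset", "game"},
]

co_counts = defaultdict(Counter)
for order in orders:
    for a, b in permutations(order, 2):   # every ordered pair in the basket
        co_counts[a][b] += 1

def recommend(item, top_n=2):
    """Return the items most often bought alongside the given item."""
    return [other for other, _ in co_counts[item].most_common(top_n)]

print(recommend("playstation"))   # e.g. ['controller', 'game']
```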

Practical business applications of Cognitive Computing and Machine Learning

Now that we have a pretty good idea of what this all means, how is this technology actually being used today in the business world? Artificial Intelligence has been around for decades, but has been slow to develop due to the storage and compute requirements being too expensive to allow for practical applications. In many fields, machine learning is finally moving from science labs to commercial and business applications. With cloud computing and robust virtualized storage solutions providing the infrastructure and necessary computational power, machine learning developments are offering new capabilities that can greatly enhance enterprise business processes.

The major approaches today include using neural networks, case-based learning, genetic algorithms, rule induction, and analytic learning. The current uses of the technology combine all of these analytic methods, or a hybrid of them, to help guarantee effective, repeatable, and reliable results. Machine learning is a reality today and is being used very effectively and efficiently. Despite what many business people might assume, it’s no longer in its infancy. It’s used quite effectively across a wide array of industry applications and is going to be part of the next evolution of enterprise intelligence business offerings.

There are many other areas where machine learning can play an important role. This is most notable in systems with so much complexity that algorithms are difficult to design, when an application requires the software to adapt to an operational environment, or with applications that need to work with large and complex data sets. In those scenarios, machine learning methods play an increasing role in enterprise software applications, especially for those types of applications that need in-depth data analysis and adaptability, like analytics, business intelligence, and big data.

Now that I’ve discussed some general business applications for the technology, I’ll dive in to how this technology is being used today, or is in development and will be in use in the very near future.

  1. Healthcare and Medicine. Computers will never completely replace doctors and nurses, but in many ways machine learning is transforming the healthcare industry. It’s improving patient outcomes and in general changing the way doctors think about how they provide quality care. Machine learning is being implemented in health care in many ways: Improving diagnostic capabilities, medicinal research (medicines are being developed that are genetically tailored to a person’s DNA), predictive analytics tools to provide accurate insights and predictions related to symptoms, diagnoses, procedures, and medications for individual patients or patient groups, and it’s just beginning to scratch the surface of personalized care. Healthcare and personal fitness devices connected via the Internet of Things (IoT) can also be used to collect data on human and machine behavior and interaction. Improving quality of life and people’s health is one of the most exciting use cases of Machine Learning technologies.
  2. Financial services. Machine Learning is being used for predicting credit card risk, managing an individual’s finances, flagging criminal activity like money laundering and fraud, as well as automating business processes like call centers and insurance claim processing with trained AI agents. Product recommendation systems for a financial advisor or broker must leverage current interests, trends, and market movements for long periods of time, and ML is well suited to that task.
  3. Automating business analysis, reporting, and work processes. Machine learning automation systems that use detailed statistical analysis to process, analyze, categorize, and report on their data exist today. Machine learning techniques can be used for data analysis and pattern discovery and can play an important role in the development of data mining applications. Machine learning is enabling companies to increase growth and optimize processes, increase customer satisfaction, and improve employee engagement. As one specific example, adaptive analytics can be used to help stop customers from abandoning a website by analyzing and predicting the first signs they might log off and causing live chat assistance windows to appear. They are also good at upselling by showing customers the most relevant products based on their shopping behavior at that moment. A large portion of Amazon’s sales are based on their adaptive analytics; you’ll notice that you always see “Customers who purchased this item also viewed” when you view an item on their web site. Businesses are presently using machine learning to improve their operations in many other ways. Machine learning technology allows businesses to personalize customer service, for example with chatbots for customer relations. Customer loyalty and retention can be improved by mining customer actions and targeting their behavior. HR departments can improve their hiring processes by using ML to shortlist candidates. Security departments can use ML to assist with detecting fraud by building models based on historical transactions and social media. Logistics departments can improve their processes by allowing contextual analysis of their supply chain. The possibilities for the application of this technology across many typical business challenges are truly exciting.
  4. Playing Games. Machine learning systems have been taught to play games, and I’m not just talking about video games: board games like Go and chess, the Jeopardy! quiz show, and modern real time strategy video games, all with great success. When Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge of February 2011, it showcased Watson’s ability to learn, reason, and understand natural language with machine learning technology. In game development, machine learning has been used for gesture recognition in Kinect and camera based interfaces, and it has also been used in some fighting style games to analyze the human player’s style of moves and mimic them, such as with the character ‘Mokujin’ in Tekken.
  5. Predicting the outcome of legal proceedings. A system developed by a team of British and American researchers was proven to be able to correctly predict a court’s decision with a high degree of accuracy. The study can be viewed here: https://peerj.com/articles/cs-93/. While computers are not likely to replace judges and lawyers, the technology could very effectively be used to assist the decision making process.
  6. Validating and Customizing News content. Machine learning can be used to create individually personalized news, and screening and filtering out “fake news” has become a more recent investigative priority, especially given today’s political landscape. Facebook’s director of AI research Yann LeCun was quoted saying that machine learning technology that could squash fake news “either exists or can be developed.” A challenge aptly named the “Fake News Challenge” was developed for technology professionals; you can view their site http://www.fakenewschallenge.org/ for more information. Whether or not it actually works is dubious at the moment, but the application of it could have far reaching positive effects for democracy.
  7. Navigation of self-driving cars. Using sensors and onboard analytics, cars are learning to recognize obstacles and react to them appropriately using Machine Learning. Google’s experimental self-driving cars currently rely on a wide range of radar/lidar and other sensors to spot pedestrians and other objects. Eliminating some or all of that equipment would make the cars cheaper and easier to design and speed up mass adoption of the technology. Google has been developing its own video-based pedestrian detection system for years using machine learning algorithms. Back in 2015, its system was capable of accurately identifying pedestrians within 0.25 seconds, with 0.07-second identification being the benchmark needed for such a system to work in real-time. This is all good news for storage manufacturers. Typical luxury cars have up to around 200 GB of storage today, primarily for maps and other entertainment functionality. Self-driving cars will likely need terabytes of storage, and not just for the car to drive itself. Storage will be needed for intelligent assistants in the car, advanced voice and gesture recognition, caching software updates, and caching files to storage to reduce peak network bandwidth utilization.
  8. Retail Sales. Applications of ML are almost limitless when it comes to retail. Product pricing optimization, sales and customer service trending and forecasting, precise ad targeting with data mining, website content customization, prospect segmentation are all great examples of how machine learning can boost sales and save money. The digital trail left by customer’s interactions with a business both online and offline can provide huge amounts of data to a retailer. All of that data is where Machine learning comes in. Machine learning can look at history to determine which factors are most important, and to find the best way to predict what will occur based on a much larger set of variables. Systems must take into account today’s market trends not only for the past year, but for what happened as recently as 1 hour ago in order to implement real-time personalization. Machine learning applications can discover which items are not selling and pull them from the shelves before a salesperson notices, and even keep overstock from showing up in the store at all with improved procurement processes. A good example of the machine learning personalized approach to customers can be found once you get in the Jackets and Vests section of the North Face website. Click on “Shop with IBM Watson” and experience what is almost akin to a human sales associate helping you choose which jacket you need.
  9. Recoloring black and white images. Ted Turner’s dream come true. 🙂 Using computers to recognize objects and learn what they should look like to humans, color can be returned to both black and white pictures and video footage. Google’s DeepDream (https://research.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html) is probably the most well-known example. It has been trained by examining millions of images of just about everything. It analyzes images in black and white and then colors them the way it thinks they should be colored. The “colorize” project is also taking up the challenge; you can view their progress at http://tinyclouds.org/colorize/ and download the code. A good online example is at Algorithmia, which allows you to upload and convert an image online: http://demos.algorithmia.com/colorize-photos/
  10. Enterprise Security. Security and loss of data are major concerns for the modern enterprise. Some storage vendors are beginning to use artificial intelligence and machine learning to prevent data loss, increase availability and reduce downtime via smart data recovery and systematic backup strategies. Machine learning allows for smart security features to detect data and packet loss during transit and within data centers. Years ago it was common practice to spend a great deal of time reviewing security logs on a daily basis. You were expected to go through everything and manually determine the severity of any of the alerts or warnings as you combed through mountains of information. As time progresses it becomes more and more unrealistic for this process to remain manual. Machine learning technology is currently implemented and is very effective at filtering out what deviates from normal behavior, be it with live network traffic or mountains of system log files (a toy sketch of this log-scoring idea follows this list). While humans are also very good at finding patterns and noticing odd things, computers are really good at doing that repetitive work at a much larger scale, complementing what an analyst can do. Interested in looking at some real world examples of machine learning as it relates to security? There are many out there. Clearcut is one example of a tool that uses machine learning to help you focus on log entries that really need manual review. David Bianco created a relatively simple Python script that can learn to find malicious activity in HTTP proxy logs. You can download David’s script here: https://github.com/DavidJBianco/Clearcut. I also recommend taking a look at the Click Security project, which also includes many code samples (http://clicksecurity.github.io/data_hacking/), as well as PatternEx, a SecOps tool that predicts cyber attacks (https://www.patternex.com/).
  11. Musical Instruments. Machine learning can also be used in more unexpected ways, even in creative outlets like making music. In the world of electronic music there are new synthesizers and hardware created and developed often, and the rise in machine learning is altering the landscape. Machine learning will give instruments the potential to be more expressive, complex and intuitive in ways previously experienced only through traditional acoustic instruments. A good example of a new instrument using machine learning is the Mogees instrument, a device that attaches to your iPhone and uses a contact microphone to pick up sound from everyday objects. Machine learning could also make it possible to use a drum machine that adapts to your playing style, learning as much about the player as the player learns about the instrument. Simply awe inspiring.
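Here is the toy log-scoring sketch promised in the security item above: it collapses each log line to a message “template” and scores lines by how rare their template is, so the unusual ones float to the top for manual review. It’s a stand-in for the general idea behind tools like Clearcut, not a description of their actual algorithms.

```python
# Toy log anomaly scorer: rare message "shapes" get high scores for manual review.
import math
import re
from collections import Counter

def template(line):
    """Collapse hex strings and numbers so similar messages share one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "Accepted password for admin from 10.1.1.5 port 22",
    "Accepted password for admin from 10.1.1.7 port 22",
    "Accepted password for admin from 10.1.1.9 port 22",
    "Failed password for root from 203.0.113.44 port 4444",
]

counts = Counter(template(l) for l in logs)
total = sum(counts.values())

# Score each line by how surprising its template is (higher = rarer = review first).
for line in logs:
    score = -math.log(counts[template(line)] / total)
    print(f"{score:5.2f}  {line}")
```

Run against these four invented lines, the three routine logins score low and the failed root login from an unfamiliar address scores highest, which is exactly the behavior you want from a triage tool.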

What does this mean for the storage industry?

As you might expect, this is all very good news for the storage industry and very well may lead to more and more disruptive changes. Machine learning has an almost insatiable appetite for data storage. It will consume huge quantities of capacity while at the same time require very high levels of throughput. As adoption of Cognitive Computing, Artificial Intelligence, and machine learning grows, it will attract a growing number of startups eager to solve the many issues that are bound to arise.

The rise of machine learning is set to alter the storage industry in very much the same way that PCs helped reshape the business world in the 1980s. Just as PCs advanced from personal productivity applications like Lotus 1-2-3 to large-scale Oracle databases, machine learning is poised to evolve from consumer type functions like Apple’s Siri to full scale data driven programs that will drive global enterprises. So, in what specific ways is this technology set to alter and disrupt the storage industry? I’ll review my thoughts on that below.

  1. Improvements in Software-Defined Storage. I recently dove into Software defined storage in a blog post (https://thesanguy.com/2017/06/15/defining-software-defined-storage-benefits-strategy-use-cases-and-products/). As I described in that post, there are many use cases and a wide variety of software defined storage products in the market right now. Artificial Intelligence and machine learning will spark faster adoption of software-defined storage, especially as products are developed that use the technology to allow storage to be self-configurable. Once storage is all software-defined, algorithms can be integrated and far-reaching enough to process and solve complicated storage management problems because of the huge amount of data they can now access. This is a necessary step to build the monitoring, tuning, healing service abilities needed for self-driving software defined storage.
  2. Overall Costs will be reduced. Enterprises are moving towards cloud storage and fewer dedicated storage arrays. Dynamic software-defined storage that integrates machine learning could help organizations more efficiently utilize the capacity that they already own.
  3. Hybrid Storage Clouds. Public vs. private cloud has been a hot topic in the storage industry, and with the rise of machine learning and software-defined storage it’s becoming more and more of a moot point. Well-designed software-defined architectures should be able to transition data seamlessly from one type of cloud to another, and machine learning will be used to implement that concept without human intervention. Data will be analyzed and logic engines will automate data movement. The hybrid cloud is very likely to flourish as machine learning technologies are adopted into this space.
  4. Flash Everywhere. Yes, the concept of “flash first” has been promoted for years now, and machine learning simply furthers that simple truth. The vast amount of data that machine learning needs to process will further increase the demand for throughput and bandwidth, and flash storage vendors will be lining up to fill that need.
  5. Parallel File Systems. Storage systems will have to deliver performance and throughput at scale in order to support machine learning technologies. Parallel file systems can effectively reduce the problems of massive data storage and I/O bottlenecks. With their focus on high performance access to large data sets, parallel file systems combined with flash could be considered an entry point to full scale machine learning systems.
  6. Automation. Software-defined storage has had a large influence on the rise of machine learning in storage environments. Adding a heterogeneous software layer abstracted from the hardware allows the software to efficiently monitor many more tasks. The additional automation allows administrators like myself much more time for more strategic work.
  7. Neural Storage. Neural storage (“deep learning”) is designed to recognize and respond to problems and opportunities without any human intervention. It will drive the need for massive amounts of storage as it is utilized in modern businesses. It uses artificial neural networks, which are simplified computer simulations of how biological neurons behave, to extract rules and patterns from sets of data. Unsurprisingly (given its name), the concept is inspired by the way biological nervous systems process information. In general, think of neural storage as many layers of processing on mountain-sized mounds of data. Data is fed through neural networks, which are logical constructions that ask a series of binary true/false questions, or extract a numerical value, of every bit of data passing through them and classify it according to the answers that were tallied up (a minimal forward-pass sketch follows this list). Deep learning work is focused on developing these networks, which is why they became what are known as Deep Neural Networks: logic networks of the complexity needed to deal with classifying enormous datasets, think Google-scale data. Using Google Images as an example, with datasets as massive and comprehensive as these and logical networks sophisticated enough to handle their classification, it becomes relatively trivial to take an image and state with a high probability of accuracy what it represents to humans.
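For anyone who has not seen one up close, the forward-pass sketch referenced in the neural storage item above is about this simple: each layer computes weighted sums of its inputs and pushes them through a squashing function. The weights here are made up by hand purely for illustration; in a real network they would be learned from data.

```python
# Minimal two-layer neural network forward pass (hand-picked, illustrative weights).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums pushed through a squashing function."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Classify a 3-feature object as "hot" or "cold" data (a made-up storage example).
features = [0.9, 0.2, 0.7]                    # e.g. access rate, age, read/write mix
hidden = layer(features, [[2.0, -1.0, 0.5],   # two hidden units
                          [-0.5, 1.5, 1.0]], [0.0, -0.5])
output = layer(hidden, [[1.5, -2.0]], [0.2])  # one output unit
print(f"probability the object is 'hot': {output[0]:.2f}")
```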

How does Machine Learning work?

At its core, Machine learning works by recognizing patterns (such as facial expressions or spoken words), extracting insight from those patterns, discovering anomalies in those patterns, and then making evaluations and predictions based on those discoveries.

The principle can be summed up with the following formula:

Machine Learning = Model Representation + Parameter Evaluation + Learning & Optimization

Model Representation: The system that makes predictions or identifications. It includes the use of an object element represented in a formal language that a computer can handle and interpret.

Parameter Evaluation: A function needed to distinguish or evaluate the good and bad objects; the factors used by the model to form its decisions.

Learning & Optimization: The method used to search among these classifiers within the language to find the highest scoring ones. This is the learning system that adjusts the parameters and compares predictions against actual outcomes.

How do we apply machine learning to a problem? First and foremost, a pattern must exist in the input data that would allow a conclusion to be drawn; the machine learning algorithm must have a pattern to deduce information from. Next, there must be a sufficient amount of data to apply machine learning to the problem. If there isn’t enough data to analyze, it will compromise the validity of the end result. Finally, machine learning is used to derive meaning from the data and perform structured learning to arrive at a mathematical approximation that describes the behavior of the problem. If those conditions aren’t met, applying machine learning to the problem will be a waste of time.
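To tie the three pieces of that formula together, here is a deliberately tiny end-to-end sketch: the model representation is a line y = w*x + b, the parameter evaluation is a squared-error loss, and the learning/optimization step is plain gradient descent. The data points are invented for illustration.

```python
# Tiny end-to-end illustration of "model + evaluation + optimization".
# Model representation: y = w*x + b
# Parameter evaluation:  squared error between predictions and actual values
# Learning/optimization: gradient descent on w and b

data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8)]   # made-up points, roughly y = 2x + 1

w, b, lr = 0.0, 0.0, 0.02
for step in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y              # prediction minus actual
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    w -= lr * grad_w                         # adjust parameters against the gradient
    b -= lr * grad_b

mse = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
print(f"learned w={w:.2f}, b={b:.2f}, mse={mse:.4f}")   # expect roughly w≈2, b≈1
```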

Summary

Machines may not have reached the point where they can make full decisions without humans, but they have certainly progressed to the point where they can make educated, accurate recommendations to us so that we have an easier time making decisions. Current machine learning systems have delivered tremendous benefits by automating tabulation and harnessing computational processing and programming to improve both enterprise productivity and personal productivity.

Cognitive systems will learn and interact to provide expert assistance to scientists, engineers, lawyers, and other professionals in a fraction of the time it now takes. While they will likely never replace human thinking, cognitive systems will extend our cognition and free us to think more creatively and effectively, and be better problem solvers.

Open Source Storage Solutions

Storage solutions can generally be grouped into four categories: SoHo NAS systems, Cloud-based/object solutions, Enterprise NAS and SAN solutions, and Microsoft Storage Server solutions. Enterprise NAS and SAN solutions are generally closed systems offered by traditional vendors like EMC and NetApp with a very large price tag, so many businesses are looking at Open Source solutions to meet their needs. This is a collection of links and brief descriptions of Open Source storage solutions currently available. Open Source of course means the software is free to use and modify; however, some projects do have commercially supported versions as well for enterprise customers who require them.

Why would an enterprise business consider an Open Source storage solution? The most obvious reason is that it’s free, and any developer can customize it to suit the needs of the business. With the right people on board, innovation can be rapid. Unfortunately, as is the case with most open source software, it can be needlessly complex and difficult to use, require expert or highly trained staff, have compatibility issues, and most don’t offer the support and maintenance that enterprise customers require. There’s no such thing as a free lunch, as they say, and using Open Source generally requires compromising on support and maintenance. I’d see some of these solutions as perfect for an enterprise development or test environment, and as an easy way for a larger company to allow their staff to get their feet wet in a new technology to see how it may be applied as a potential future solution. As I mentioned, tested and supported versions of some open source storage software is available, which can ease the concerns regarding deployment, maintenance and support.

I have the solutions loosely organized into Open Source NAS and SAN Software, File Systems, RAID, Backup and Synchronization, Cloud Storage, Data Destruction, Distributed Storage/Big Data Tools, Document Management, and Encryption tools.

Open Source NAS and SAN Software Solutions

Backblaze

Backblaze is an object data storage provider. Backblaze stores data on its customized, open source hardware platform called Storage Pods, and its cloud-based Backblaze Vault file system. It is compatible with Windows and Apple OSes. While they are primarily an online backup service, they opened up their Storage Pod design starting in 2009; it uses commodity hardware that anyone can build, packaged as self-contained 4U data storage servers. It’s interesting stuff and worth a look.

Enterprise Storage OS (ESOS)

Enterprise Storage OS is a Linux distribution based on the SCST project with the purpose of providing SCSI targets via a compatible SAN (Fibre Channel, InfiniBand, iSCSI, FCoE). ESOS can turn a server with the appropriate hardware into a disk array that sits on your enterprise Storage Area Network (SAN) and provides sharable block-level storage volumes.

OpenIO 

OpenIO is an open source object storage startup founded in 2015 by CEO Laurent Denel and six co-founders. The product is an object storage system for applications that scales from terabytes to exabytes. OpenIO specializes in software defined storage and scalability challenges, with experience in designing and running cloud platforms, and offers a general purpose object storage and data processing solution adopted by large companies for massive production workloads.

Open vStorage

Open vStorage is an open-source, scale-out, reliable, high performance, software based storage platform which offers a block & file interface on top of a pool of drives. It is a virtual appliance (called the “Virtual Storage Router”) that is installed on a host or cluster of hosts on which Virtual Machines are running. It adds value and flexibility in a hyperconverged / OpenStack provider deployment where you don’t necessarily want to be tied to a solution like VMware VSAN. Being hypervisor agnostic is a key advantage of Open vStorage.

OpenATTIC

OpenATTIC is an Open Source Ceph and storage management solution for Linux, with a strong focus on storage management in a datacenter environment. It allows for easy management of storage resources, it features a modern web interface, and supports NFS, CIFS, iSCSI and FS. It supports a wide range of file systems including Btrfs and ZFS, as well as automatic data replication using DRBD, the distributed replicated block device and automatic monitoring of shares and volumes using a built-in Nagios/Icinga instance. openATTIC 2 will support managing the Ceph distributed object store and file system.

OpenStack

OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.

The OpenStack Object Storage (swift) service provides software that stores and retrieves data over HTTP. Objects (blobs of data) are stored in an organizational hierarchy that offers anonymous read-only access, ACL defined access, or even temporary access. Object Storage supports multiple token-based authentication mechanisms implemented via middleware.
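Because Swift’s object API is plain HTTP, the round trip for storing and fetching an object is easy to sketch. The snippet below is a hedged illustration using the requests library; the endpoint, container name and token are placeholders, and a real client would first obtain the token and storage URL from the authentication service.

```python
import requests

# Placeholder values; a real client first authenticates against the
# identity service to obtain the storage URL and an auth token.
storage_url = "https://swift.example.com/v1/AUTH_demo"
headers = {"X-Auth-Token": "AUTH_tk_example"}

# Create a container, then store an object (a blob of data) via HTTP PUT.
requests.put(f"{storage_url}/backups", headers=headers)
requests.put(f"{storage_url}/backups/notes.txt",
             headers=headers, data=b"hello object storage")

# Retrieve the same object via HTTP GET.
resp = requests.get(f"{storage_url}/backups/notes.txt", headers=headers)
print(resp.status_code, resp.content)
```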

CryptoNAS

CryptoNAS (formerly CryptoBox) is a NAS project that makes encrypting your storage quick and easy. It is a multilingual, Debian based Linux live CD with a web based front end that can also be installed onto a hard disk or USB stick. CryptoNAS offers various choices of encryption algorithms (the default is AES) and encrypts disk partitions using LUKS (Linux Unified Key Setup), which means that any Linux operating system can also access them without using CryptoNAS software.

Ceph

Ceph is a distributed object store and file system designed to provide high performance, reliability and scalability. It’s built on the Reliable Autonomic Distributed Object Store (RADOS) and allows enterprises to build their own economical storage devices using commodity hardware. It has been maintained by Red Hat since their acquisition of Inktank in April 2014. It’s capable of block, object, and file storage. It is scale-out, meaning multiple Ceph storage nodes present a single storage system that easily handles many petabytes, with performance and capacity increasing simultaneously. Ceph has many basic enterprise storage features including replication (or erasure coding), snapshots, thin provisioning, auto-tiering and self-healing capabilities.
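For a feel of how applications talk to RADOS directly, here is a hedged sketch using the python-rados bindings that ship with Ceph. It assumes a reachable cluster described by the usual /etc/ceph/ceph.conf and an existing pool named "mypool"; both are placeholders.

```python
import rados  # python-rados bindings, shipped with Ceph

# Hedged sketch: assumes a reachable cluster configured in
# /etc/ceph/ceph.conf and an existing pool named "mypool".
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")
    try:
        ioctx.write_full("greeting", b"hello RADOS")   # store an object
        print(ioctx.read("greeting"))                  # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```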

FreeNAS

The FreeNAS website touts itself as “the most potent and rock-solid open source NAS software,” and it counts the United Nations, The Salvation Army, The University of Florida, the Department of Homeland Security, Dr. Phil, Reuters, Michigan State University and Disney among its users. You can use it to turn standard hardware into a BSD-based NAS device, or you can purchase supported, pre-configured TrueNAS appliances based on the same software.

RockStor 

RockStor is a free and open source NAS (Network Attached Storage) solution. Its Personal Cloud Server is a powerful local alternative to public cloud storage that mitigates the cost and risks of public cloud storage. This NAS and cloud storage platform is suitable for small to medium businesses and home users who don’t have much IT experience, but who may need to scale to terabytes of data storage. If you are more interested in Linux and Btrfs, it’s a great alternative to FreeNAS. The RockStor NAS and cloud storage platform can be managed within a LAN or over the Web using a simple and intuitive UI, and with the inclusion of add-ons (fittingly named ‘Rockons’), you can extend the feature set to include new apps, servers, and services.

Gluster

Red Hat-owned Gluster is a distributed scale-out network attached storage file system that can handle really big data, up to 72 brontobytes. It has found applications in cloud computing, streaming media services and content delivery networks. It promises high availability and performance, an elastic hash algorithm, an elastic volume manager and more. GlusterFS aggregates various storage servers over Ethernet or InfiniBand RDMA interconnect into one large parallel network file system.

72 Brontobytes? I admit that I hadn’t seen that term used yet in any major storage vendor’s marketing materials. How big is that? Really, really big.

1 Bit = Binary Digit
8 Bits = 1 Byte
1,000 Bytes = 1 Kilobyte
1,000 Kilobytes = 1 Megabyte
1,000 Megabytes = 1 Gigabyte
1,000 Gigabytes = 1 Terabyte
1,000 Terabytes = 1 Petabyte
1,000 Petabytes = 1 Exabyte
1,000 Exabytes = 1 Zettabyte
1,000 Zettabytes = 1 Yottabyte
1,000 Yottabytes = 1 Brontobyte
1,000 Brontobytes = 1 Geopbyte
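For a sense of scale, a quick bit of arithmetic using the decimal units above shows just how big 72 brontobytes is:

```python
# Rough scale of 72 brontobytes, using the decimal (power-of-1000) units above.
brontobyte = 1000 ** 9          # bytes in one brontobyte (10**27)
petabyte = 1000 ** 5            # bytes in one petabyte (10**15)

capacity = 72 * brontobyte
print(capacity)                  # 72 followed by 27 zeros, in bytes
print(capacity // petabyte)      # 72,000,000,000,000 petabytes
```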

NAS4Free

Like FreeNAS, NAS4Free allows you to create your own BSD-based storage solution from commodity hardware. It promises a low-cost, powerful network storage appliance that users can customize to their own needs.

If FreeNAS and NAS4Free sound suspiciously similar, it’s because they share a common history. Both started from the same original FreeNAS code, which was created in 2005. In 2009, the FreeNAS project moved toward a more extensible plugin architecture built around OpenZFS, and the original BSD-based codebase was continued separately under the NAS4Free name. NAS4Free dispenses with the fancy stuff and sticks with a more focused approach of “do one thing and do it well”. You don’t get bittorrent clients or cloud servers and you can’t make a virtual server with it, but many feel that NAS4Free has a much cleaner, more usable interface.

OpenFiler

Openfiler is a storage management operating system based on rPath Linux. It is a full-fledged NAS/SAN that can be implemented as a virtual appliance for VMware and Xen hypervisors. It offers storage administrators a set of powerful tools for managing complex storage environments, with support for software and hardware RAID, monitoring and alerting facilities, and volume snapshot and recovery features. Configuring Openfiler can be complicated, but there are many online resources available that cover the most typical installations. I’ve seen mixed reviews about the product online, so it’s worth a bit of research before you consider an implementation.

OpenSMT

OpenSMT is an open source storage management toolkit based on OpenSolaris. Like Openfiler, OpenSMT allows users to turn commodity hardware into a dedicated storage device with both NAS and SAN features. It uses the ZFS filesystem and includes a well-designed Web GUI.

Open Media Vault

This NAS solution is based on Debian Linux and offers plug-ins to extend its capabilities. It boasts really easy-to-use storage management with a web based interface, fast setup, multi-language support, volume management, monitoring, UPS support, and statistics reporting. Plugins allow it to be extended with LDAP support, bittorrent, and iSCSI. It is primarily designed to be used in small offices or home offices, but is not limited to those scenarios.

Turnkey Linux

The Turnkey Linux Virtual Appliance Library is a free open source project which has developed a range of Debian based pre-packaged server software appliances (a.k.a. virtual appliances). Turnkey appliances can be deployed as a virtual machine (a range of hypervisors are supported), in cloud computing infrastructures (including AWS and others) or installed in physical computers.

Turnkey offers more than 100 different software appliances based on open source software. Among them is a file server that offers simple network attached storage, hence its inclusion in this list.

Turnkey file server is an easy to use file server that combines Windows-compatible network file sharing with a web based file manager. TurnKey File Server includes support for SMB, SFTP, NFS, WebDAV and rsync file transfer protocols. The server is configured to allow server users to manage files in private or public storage. It is based on Samba and SambaDAV.

oVirt

oVirt is a free, open-source virtualization management platform. It was founded by Red Hat as the community project on which Red Hat Enterprise Virtualization is based. It allows centralized management of virtual machines, compute, storage and networking resources from an easy to use web-based front-end with platform independent access. With oVirt, IT can manage virtual machines, virtualized networks and virtualized storage via an intuitive Web interface. It’s based on the KVM hypervisor.

Kinetic Open Storage

Backed by companies like EMC, Seagate, Toshiba, Cisco, NetApp, Red Hat, Western Digital, Dell and others, Kinetic is a Linux Foundation project dedicated to establishing standards for a new kind of object storage architecture. It’s designed to meet the need for scale-out storage for unstructured data. Kinetic is fundamentally a way for storage applications to communicate directly with storage devices over Ethernet. With Kinetic, storage use cases that are targeted consist largely of unstructured data like NoSQL, Hadoop and other distributed file systems, and object stores in the cloud like Amazon S3, OpenStack Swift and Basho’s Riak.

Storj DriveShare and MetaDisk

Storj (pronounced “Storage”) is a new type of cloud storage built on blockchain and peer-to-peer technology. Storj offers decentralized, end-to-end encrypted cloud storage. The DriveShare app allows users to rent out their unused hard drive space for use by the service, and the MetaDisk Web app allows users to save their files to the service securely.

The core protocol allows for peer to peer negotiation and verification of storage contracts. Providers of storage are called “farmers” and those using the storage, “renters”. Renters periodically audit whether the farmers are still keeping their files safe and, in a clever twist on similar architectures, immediately pay out a small amount of cryptocurrency for each successful audit. Conversely, farmers can decide to stop storing a file if its owner does not audit and pay for their services on time. Files are cut up into pieces called “shards” and stored three times redundantly by default. The network will automatically determine a new farmer and move data if copies become unavailable. In the core protocol, contracts are negotiated through a completely decentralized key-value store (Kademlia). The system puts measures in place that prevent farmers and renters from cheating on each other, e.g. through manipulation of the auditing process. Other measures are taken to prevent attacks on the protocol itself.

Storj, like other similar services, offers several advantages over more traditional cloud storage solutions: since data is encrypted and cut into “shards” at source, there is almost no conceivable way for unauthorized third parties to access that data. Data storage is naturally distributed and this, in turn, increases availability and download speed thanks to the use of multiple parallel connections.
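To illustrate the sharding and audit ideas, here is a deliberately simplified toy sketch in Python; it is not Storj’s actual protocol, and the file name, shard size and hash-based audit scheme are all illustrative assumptions.

```python
import hashlib
import os

SHARD_SIZE = 8 * 1024 * 1024  # illustrative 8 MiB shards

def shard_file(path, shard_size=SHARD_SIZE):
    """Cut a file into fixed-size shards, as a renter's client might."""
    shards = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(shard_size)
            if not chunk:
                break
            shards.append(chunk)
    return shards

def audit_challenge(shard, salt):
    """A toy audit: hash the shard with a random salt. The renter keeps the
    expected answer; a farmer who still holds the shard can reproduce it."""
    return hashlib.sha256(salt + shard).hexdigest()

shards = shard_file("example.dat")          # hypothetical local file
salt = os.urandom(16)
expected = [audit_challenge(s, salt) for s in shards]
# Later, the renter sends the salt to each farmer and compares the replies
# against `expected`; matching answers earn the micro-payment.
```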

Open Source File Systems

Btrfs

Btrfs is a newer Linux filesystem being developed by Facebook, Fujitsu, Intel, the Linux Foundation, Novell, Oracle, Red Hat and some other organizations. It emphasizes fault tolerance and easy administration, and it supports files as large as 16 EiB.

It has been included in the Linux 3.10 kernel as a stable filesystem since July 2014. Because of the fast development speed, btrfs noticeably improves with every new kernel version, so it’s always recommended to use the most recent, stable kernel version you can. Rockstor always runs a very recent kernel for that reason.

One of the big draws of Btrfs is the Copy on Write (CoW) nature of the filesystem. When multiple users attempt to read or write a file, it does not make a separate copy until changes are made to the original file. This makes it cheap to preserve older versions of data, which allows file restorations from snapshots. Btrfs also has its own native RAID support built in, appropriately named Btrfs-RAID. A nice benefit of the Btrfs RAID implementation is that a RAID 6 volume does not need additional re-syncing upon creation of the RAID set, greatly reducing the time requirement.
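Creating a snapshot of a Btrfs subvolume is a one-line operation thanks to CoW. The following is a small hedged sketch that shells out to the standard btrfs command; the mount point and snapshot directory are hypothetical and must already exist, and the command needs root privileges.

```python
import subprocess
from datetime import datetime

# Hedged sketch: assumes /mnt/data is an existing Btrfs subvolume with a
# .snapshots directory, and that btrfs-progs is installed. Paths are
# hypothetical placeholders.
source = "/mnt/data"
snapshot = f"/mnt/data/.snapshots/{datetime.now():%Y%m%d-%H%M%S}"

# Because Btrfs is copy-on-write, the snapshot is created instantly and
# consumes extra space only as the original data changes.
subprocess.run(["btrfs", "subvolume", "snapshot", "-r", source, snapshot],
               check=True)
print(f"read-only snapshot created at {snapshot}")
```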

Ext4

This is the latest version of one of the most popular filesystems for Linux. One of its key benefits is the ability to handle very large amounts of data: 16 TB maximum per file and 1 EB (exabyte, or 1 million terabytes) maximum per filesystem. It is the evolution of the most used Linux filesystem, Ext3. In many ways, Ext4 is a deeper improvement over Ext3 than Ext3 was over Ext2. Ext3 was mostly about adding journaling to Ext2, but Ext4 modifies important data structures of the filesystem, such as the ones used to store file data.

GlusterFS

Owned by Red Hat, GlusterFS is a scale-out distributed file system designed to handle petabytes worth of data. Features include high availability, fast performance, a global namespace, an elastic hash algorithm and an elastic volume manager.

GlusterFS combines the unused storage space on multiple servers to create a single, large, virtual drive that you can mount like a legacy filesystem using NFS or FUSE on a client PC. It also provides the ability to add more servers or remove existing servers from the storage pool on the fly. GlusterFS functions like a “network RAID” device, and many RAID concepts are apparent during setup. It really shines when you need to store huge quantities of data, have redundant file storage, or write data very quickly for later access. Geo-replication lets you mirror data on a volume across the wire; the target can be a single directory or another GlusterFS volume. It handles multiple petabytes easily and is also very easy to install and manage.

Lustre

Designed for “the world’s largest and most complex computing environments,” Lustre is a high-performance scale-out file system. It boasts that it can handle tens of thousands of nodes and petabytes of data with very fast throughput.

Lustre file systems are highly scalable and can be part of multiple computer clusters with tens of thousands of client nodes, multiple petabytes of storage on hundreds of servers, and more than 1TB/s of aggregate I/O throughput. This makes Lustre file systems a popular choice for businesses with large data centers.

OpenZFS

OpenZFS is an outstanding storage platform that encompasses the functionality of traditional filesystems, volume managers, and more, with consistent reliability, functionality and performance. This popular file system is incorporated into many other open source storage projects. It offers excellent scalability and data integrity, and it’s available for most Linux distributions.

IPFS

IPFS is short for “Interplanetary File System,” and is an unusual project that uses peer-to-peer technology to connect all computers with a single file system. It aims to supplement, or possibly even replace, the Hypertext Transfer Protocol that runs the web now. According to the project owner, “In some ways, IPFS is similar to the Web, but IPFS could be seen as a single BitTorrent swarm, exchanging objects within one Git repository.”

IPFS isn’t exactly a well-known technology yet, even among many in the Valley, but it’s quickly spreading by word of mouth among folks in the open-source community. Many are excited by its potential to greatly improve file transfer and streaming speeds across the Internet.

Open Source RAID Solutions

DRBD

DRBD is a distributed replicated storage system for the Linux platform. It is implemented as a kernel driver, several userspace management applications and some shell scripts. It is typically used in high availability (HA) computer clusters, but beginning with v9 it can also be used to create larger software defined storage pools with more of a focus on cloud integration. Support and training are available through the project owner, LinBit.

DRBD’s replication technology is very fast and efficient, and if you can live with an active-passive setup it is a solid storage replication solution. It keeps data synchronized between multiple nodes, including nodes in different datacenters, and failover between two nodes is quick.

Mdadm

This companion tool to the Linux kernel’s md driver makes it possible to set up and manage your own software RAID array using standard hardware. While it is terminal-based, it offers a wide variety of options for monitoring, reporting, and managing RAID arrays.
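As a rough illustration, the sketch below wraps a couple of typical mdadm invocations in Python. The device names are placeholders, the commands need root, and running them for real would wipe those disks.

```python
import subprocess

# Hedged sketch: device names are placeholders and creating the array
# would destroy any data on them. mdadm must be installed and run as root.
devices = ["/dev/sdb1", "/dev/sdc1"]

# Create a two-disk RAID 1 (mirror) array at /dev/md0.
subprocess.run(["mdadm", "--create", "/dev/md0", "--level=1",
                "--raid-devices=2", *devices], check=True)

# Report the array's current state.
detail = subprocess.run(["mdadm", "--detail", "/dev/md0"],
                        capture_output=True, text=True, check=True)
print(detail.stdout)
```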

Raider

Raider applies RAID 1, 4, 5, 6 or 10 to hard drives. It is able to convert a single Linux system disk into a software RAID 1, 4, 5, 6 or 10 system with a simple two-pass command. Raider is a bash shell script that deals with the specific oddities of several Linux distros (Ubuntu, Debian, Arch, Mandriva, Mageia, openSuSE, Fedora, CentOS, PCLinuxOS, Linux Mint, Scientific Linux, Gentoo, Slackware and more; see the README) and uses Linux software RAID (mdadm, see http://en.wikipedia.org/wiki/Mdadm and https://raid.wiki.kernel.org/ ) to execute the conversion.

Open Source Backup and Synchronization Solutions

Zmanda

From their marketing staff… “Zmanda is the world’s leading provider of open source backup and recovery software. Our open source development and distribution model enables us to deliver the highest quality backup software such as Amanda Enterprise and Zmanda Recovery Manager for MySQL at a fraction of the cost of software from proprietary vendors. Our simple-to-use yet feature-rich backup software is complemented by top-notch services and support expected by enterprise customers.”

Zmanda offers a community and enterprise edition of their software. The enterprise edition of course offers a much more complete feature set.

AMANDA

The core of Amanda is the Amanda server, which handles all the backup operations, compression, indexing and configuration tasks. You can run it on any Linux server as it doesn’t cause conflicts with other processes, but it is recommended to run it on a dedicated machine, as that removes the associated processing load from the client machines and prevents the backup from negatively affecting client performance.

Overall it is an extremely capable file-level backup tool that can be customized to your exact requirements. While it lacks a GUI, the command line controls are simple and the level of control you have over your backups is exceptional. Because it can be called from within your own scripts, it can be incorporated into your own custom backup scheme no matter how complex your requirements are. Paid support and a cloud-based version are available through Zmanda, which is owned by Carbonite.

Areca Backup

Areca Backup is a free backup utility for Windows and Linux.  It is written in Java and released under the GNU General Public License. It’s a good option for backing up a single system and it aims to be simple and versatile. Key features include compression, encryption, filters and support for delta backup.

Backup

Backup is a system utility for Linux and Mac OS X, distributed as a RubyGem, that allows you to easily perform backup operations. It provides an elegant DSL in Ruby for modeling your backups. Backup has built-in support for various databases, storage protocols/services, syncers, compressors, encryptors and notifiers which you can mix and match. It was built with modularity, extensibility and simplicity in mind.

BackupPC

Designed for enterprise users, BackupPC claims to be “highly configurable and easy to install and maintain.” It backs up to disk only (not tape) and offers features that reduce storage capacity and IO requirements.

Bacula

Another enterprise-grade open source backup solution, Bacula offers a number of advanced features for backup and recovery, as well as a fairly easy-to-use interface. Commercial support, training and services are available through Bacula Systems.

Back In Time

Similar to FlyBack (see below), Back in Time offers a very easy-to-configure snapshot backup solution. GUIs are available for both Gnome and KDE (4.1 or greater).

Backupninja

This tool makes it easier to coordinate and manage backups on your network. With the help of programs like rdiff-backup, duplicity, mysqlhotcopy and mysqldump, Backupninja offers common backup features such as remote, secure and incremental file system backups, encrypted backup, and MySQL/MariaDB database backup. You can selectively enable status email reports, and can back up general hardware and system information as well. One key strength of backupninja is a built-in console-based wizard (called ninjahelper) that allows you to easily create configuration files for various backup scenarios. The downside is that backupninja requires other “helper” programs to be installed in order to take full advantage of all its features. While backupninja’s RPM package is available for Red Hat-based distributions, backupninja’s dependencies are optimized for Debian and its derivatives. Thus it is not recommended to try backupninja for Red Hat based systems.

Bareos

Short for “Backup Archiving Recovery Open Sourced,” Bareos is a 100% open source fork of the backup project from bacula.org. The fork has been in development since late 2010 and has gained a lot of new features. The source is published on GitHub and licensed under the AGPLv3. It offers features like LTO hardware encryption, efficient bandwidth usage and practical console commands. A commercially supported version of the same software is available through Bareos.com.

Box Backup

Box Backup describes itself as “an open source, completely automatic, online backup system.” It creates backups continuously and can support RAID. Box Backup is stable but not yet feature complete. All of the facilities to maintain reliable encrypted backups and to allow clients to recover data are, however, already implemented and stable.

BURP

BURP, which stands for “BackUp And Restore Program,” is a network backup tool based on librsync and VSS. It’s designed to be easy to configure and to work well with disk storage. It attempts to reduce network traffic and the amount of space that is used by each backup.

Clonezilla

Conceived as a replacement for True Image or Norton Ghost, Clonezilla is a disk imaging application that can do system deployments as well as bare metal backup and recovery. Two types of Clonezilla are available: Clonezilla live and Clonezilla SE (server edition). Clonezilla live is suitable for single machine backup and restore, while Clonezilla SE is for massive deployment and can clone many (40+) computers simultaneously. Clonezilla saves and restores only the used blocks on the hard disk, which increases clone efficiency. With some high-end hardware in a 42-node cluster, a multicast restore rate of 8 GB/min has been reported.

Create Synchronicity

Create Synchronicity’s claim to fame is its lightweight size of just a couple of hundred kilobytes. It’s also very fast, and it offers an intuitive interface for backing up standalone systems. Create Synchronicity is an easy, fast and powerful backup application: it synchronizes files and folders, has a nice interface, and can schedule backups to keep your data safe. Plus, it’s open source, portable and multilingual. Windows 2000, Windows XP, Windows Vista, and Windows 7 are supported. To run Create Synchronicity, you must install the .NET Framework, version 2.0 or later.

DAR

DAR is a command-line backup and archiving tool that uses selective compression (not compressing already compressed files) and strong encryption, may split an archive into multiple files of a given size, and provides on-the-fly hashing. DAR knows how to perform full, differential, incremental and decremental backups. It provides testing, diffing, merging, listing and, of course, data extraction from existing archives. The archive’s internal catalog allows very quick restoration of even a single file from a very large, possibly sliced, compressed and encrypted archive. DAR saves *all* UNIX inode types, takes care of hard links, sparse files as well as Extended Attributes (MacOS X file forks, Linux ACLs, SELinux tags, user attributes), has support for ssh, and is suitable for tapes and disks (floppy, CD, DVD, hard disks, …). An optional GUI is available from the DarGUI project.

DirSync Pro

DirSync Pro is a small but powerful utility for file and folder synchronization. DirSync Pro can be used to synchronize the content of one or many folders recursively. For example, use it to synchronize files from your desktop PC to a USB stick, external HD, PDA or notebook, and then use that device to synchronize files to another desktop PC. It also features incremental backups, a user friendly interface, a powerful schedule engine, and real-time synchronization. It is written in Java.

Duplicati

Duplicati is designed to backup your network to a cloud computing service like Amazon S3, Microsoft OneDrive, Google Cloud or Rackspace. It includes AES-256 encryption and a scheduler, as well as features like filters, deletion rules, transfer and bandwidth options. Save space with incremental backups and data deduplication. Run backups on any machine through the web-based interface or via command line interface. It has an auto-updater.

Duplicity

Based on the librsync library, Duplicity creates encrypted archives and uploads them to remote or local servers. It can use GnuPG to encrypt and sign archives if desired.

Duplicity backs up directories by producing encrypted tar-format volumes and uploading them to a remote or local file server. Because duplicity uses librsync, the incremental archives are space efficient and only record the parts of files that have changed since the last backup. Because duplicity uses GnuPG to encrypt and/or sign these archives, they will be safe from spying and/or modification by the server.

The duplicity package also includes the rdiffdir utility. Rdiffdir is an extension of librsync’s rdiff to directories—it can be used to produce signatures and deltas of directories as well as regular files. These signatures and deltas are in GNU tar format.

FlyBack

Similar to Apple’s TimeMachine, FlyBack provides incremental backup capabilities and allows users to recover their systems from any previous time. The interface is very easy to use, but little customization is available. FlyBack creates incremental backups of files, which can be restored at a later date. FlyBack presents a chronological view of a file system, allowing individual files or directories to be previewed or retrieved one at a time. Flyback was originally based on rsync when the project began in 2007, but in October 2009 it was rewritten from scratch using Git.

FOG

An imaging and cloning solution, FOG makes it easy for administrators to back up networks of all sizes. FOG can be used to image Windows XP, Vista, Windows 7 and Windows 8 PCs using PXE, PartClone, and a Web GUI to tie it all together. It includes features like memory and disk tests, disk wipe, AV scan and task scheduling.

FreeFileSync

FreeFileSync is a free Open Source software that helps you synchronize files and synchronize folders for Windows, Linux and Mac OS X. It is designed to save your time setting up and running data backups while having nice visual feedback along the way. This file and folder synchronization tool can be very useful for backup purposes. It can save a lot of time and receives very good reviews from its users.

FullSync

FullSync is a powerful tool that helps you keep multiple copies of various data in sync. For example, it can update your website using (S)FTP, back up your data, or refresh a working copy from a remote server. It offers flexible rules, a scheduler and more. Built for developers, FullSync offers synchronization capabilities suitable for backup purposes or for publishing Web pages. Features include multiple modes, flexible tools, support for multiple file transfer protocols and more.

Grsync

Grsync provides a graphical interface for rsync, a popular command line synchronization and backup tool. It’s useful for backup, mirroring, replication of partitions, and similar tasks. A Windows (win32) port of Piero Orsoni’s original GTK-based Grsync is also available.

LuckyBackup

Award-winning LuckyBackup offers simple, fast backup. Note that while it is available in a Windows version, that port is still under development. It features backup using snapshots, various checks to keep data safe, a simulation mode, remote connections, an easy restore procedure, the ability to add or remove any rsync option, folder synchronization, excluding data from tasks, executing other commands before or after a task, scheduling, tray notification support, and e-mail reports.

Mondo Rescue

Mondo Rescue is a GPL disaster recovery solution. It supports Linux (i386, x86_64, ia64) and FreeBSD (i386). It’s packaged for multiple distributions (Fedora, RHEL, openSuSE, SLES, Mandriva, Mageia, Debian, Ubuntu, Gentoo). It supports tapes, disks, network and CD/DVD as backup media, multiple filesystems, LVM, software and hardware Raid, BIOS and UEFI.

Obnam

Winner of the most original name for backup software (“OBligatory NAMe”), Obnam performs snapshot backups that can be stored on local disks or online storage services. Features include easy usage, snapshot backups, data de-duplication across files and backup generations, and encrypted backups, and it supports both push (run on the client) and pull (run on the server) methods.

Partimage

Partimage is open source disk backup software. It saves partitions having a supported filesystem on a sector basis to an image file. Although it runs under Linux, Windows and most Linux filesystems are supported. The image file can be compressed to save disk space and transfer time, and can be split into multiple files to be copied to CDs or DVDs. Partitions can be saved across the network using the Partimage network support, or using Samba / NFS (Network File Systems). This provides the ability to perform a hard disk partition recovery after a disk crash. Partimage can be run as part of your normal system or stand-alone from the live SystemRescueCd, which is helpful when the operating system cannot be started. SystemRescueCd comes with most of the Linux data recovery software that you may need.

Partimage will only copy data from the used portions of the partition (this is why it only works for supported filesystems). For speed and efficiency, free blocks are not written to the image file, unlike other tools which also copy unused blocks. Since the partition is processed on a sequential sector basis, disk transfer time is maximized and seek time is minimized, and Partimage also works well for very full partitions. For example, a full 1 GB partition may be compressed down to 400 MB.

Redo

Easy rescue system with GUI tools for full system backup, bare metal recovery, partition editing, recovering deleted files, data protection, web browsing, and more. Uses partclone (like Clonezilla) with a UI like Ghost or Acronis. Runs from CD/USB.

Rsnapshot

Rsnapshot is a filesystem snapshot utility for making backups of local and remote systems. Using rsync and hard links, it is possible to keep multiple, full backups instantly available. The disk space required is just a little more than the space of one full backup, plus incrementals. Depending on your configuration, it is quite possible to set up in just a few minutes. Files can be restored by the users who own them, without the root user getting involved. There are no tapes to change, so once it’s set up, you may never need to think about it again. rsnapshot is written entirely in Perl. It should work on any reasonably modern UNIX compatible OS, including: Debian, Redhat, Fedora, SuSE, Gentoo, Slackware, FreeBSD, OpenBSD, NetBSD, Solaris, Mac OS X, and even IRIX.

Rsync

Rsync is a fast and extraordinarily versatile file copying tool for both remote and local files. Rsync uses a delta-transfer algorithm which provides a very fast method for bringing remote files into sync. It does this by sending just the differences in the files across the link, without requiring that both sets of files are present at one of the ends of the link beforehand. At first glance this may seem impossible because the calculation of diffs between two files normally requires local access to both files.
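The delta-transfer idea can be pictured with a toy block-checksum sketch. Real rsync also uses a weak rolling checksum so matches can be found at arbitrary byte offsets, so treat the snippet below only as a conceptual illustration, not rsync’s actual algorithm.

```python
import hashlib

BLOCK = 4096  # illustrative block size

def block_sums(data, block=BLOCK):
    """Checksums of fixed-size blocks, as the receiver would compute them."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def delta(new_data, old_sums, block=BLOCK):
    """Emit a block index when the receiver already has the block,
    otherwise emit the literal bytes that need to be sent."""
    out = []
    for i in range(0, len(new_data), block):
        chunk = new_data[i:i + block]
        digest = hashlib.md5(chunk).hexdigest()
        out.append(old_sums.index(digest) if digest in old_sums else chunk)
    return out

old = b"A" * 8192 + b"B" * 4096
new = b"A" * 8192 + b"C" * 4096          # only the last block changed
print(delta(new, block_sums(old)))       # [0, 0, b'CCC...']: one literal block sent
```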

SafeKeep

SafeKeep is a centralized and easy to use backup application that combines the best features of a mirror and an incremental backup. It sets up the appropriate environment for compatible backup packages and simplifies the process of running them. For Linux users only, SafeKeep focuses on security and simplicity. It’s a command line tool that is a good option for a smaller environment.

Synkron

This application allows you to keep your files and folders updated and synchronized. Key features include an easy to use interface, blacklisting, analysis and restore. It is also cross-platform.

Synbak

Synbak is a tool designed to unify several backup methods. Synbak provides a powerful reporting system and a very simple interface for configuration files. Synbak is a wrapper for several existing backup programs, supplying the end user with a common configuration method that manages the execution logic for every single backup and gives detailed reports of backup results. Synbak can make backups using rsync over ssh, the rsync daemon, smb and cifs protocols (using internal automount functions), tar archives (tar, tar.gz and tar.bz2), tape devices (including multi-loader changer tapes), LDAP databases, MySQL databases, Oracle databases, CD-RW/DVD-RW, and wget to mirror HTTP/FTP servers. Official support is offered only for Red Hat Enterprise Linux and Fedora Core distributions.

SnapBackup

Designed to be as easy to use as possible, SnapBackup backs up files with just one click. It can copy files to a flash drive, external hard drive or the cloud, and it includes compression capabilities. The first time you run Snap Backup, you configure where your data files reside and where to create backup files. Snap Backup will also copy your backup to an archive location, such as a USB flash drive (memory stick), external hard drive, or cloud backup. Snap Backup automatically puts the current date in the backup file name, freeing you from the tedious task of renaming your backup file every time you back up. The backup file is a single compressed file that can be read by zip programs such as gzip, 7-Zip, The Unarchiver, and Mac’s built-in Archive Utility.

Syncovery

File synchronization and backup software. Back up data and synchronize PCs, Macs, servers, notebooks, and online storage space. You can set up as many different jobs as you need and run them manually or using the scheduler. Syncovery works with local hard drives, network drives and any other mounted volumes. In addition, it comes with support for FTP, SSH, HTTP, WebDAV, Amazon S3, Google Drive, Microsoft Azure, SugarSync, box.net and many other cloud storage providers. You can use ZIP compression and data encryption. On Windows, the scheduler can run as a service – without users having to log on. There are powerful synchronization modes, including Standard Copying, Exact Mirror, and SmartTracking. Syncovery features a well designed GUI to make it an extremely versatile synchronizing and backup tool.

XSIbackup

XSIbackup can back up VMware ESXi environments version 5.1 or greater. It’s a command line tool with a scheduler, and it runs directly on the hypervisor. XSIBackup is a free alternative to commercial software like Veeam Backup.

UrBackup

A client-server system, UrBackup does both file and image backups. UrBackup is an easy to setup Open Source client/server backup system, that through a combination of image and file backups accomplishes both data safety and a fast restoration time. File and image backups are made while the system is running without interrupting current processes. UrBackup also continuously watches folders you want backed up in order to quickly find differences to previous backups. Because of that, incremental file backups are really fast. Your files can be restored through the web interface, via the client or the Windows Explorer while the backups of drive volumes can be restored with a bootable CD or USB-Stick (bare metal restore). A web interface makes setting up your own backup server easy.

Unison

This file synchronization tool goes beyond the capabilities of most backup systems, because it can reconcile several slightly different copies of the same file stored in different places. It can work between any two (or more) computers connected to the Internet, even if they don’t have the same operating system. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.

Unison shares a number of features with tools such as configuration management packages (CVS, PRCS, Subversion, BitKeeper, etc.), distributed filesystems (Coda, etc.), uni-directional mirroring utilities (rsync, etc.), and other synchronizers (Intellisync, Reconcile, etc). Unison runs on both Windows and many flavors of Unix (Solaris, Linux, OS X, etc.) systems. Moreover, Unison works across platforms, allowing you to synchronize a Windows laptop with a Unix server, for example. Unlike simple mirroring or backup utilities, Unison can deal with updates to both replicas of a distributed directory structure. Updates that do not conflict are propagated automatically. Conflicting updates are detected and displayed.

Win32DiskImager

This program is designed to write a raw disk image to a removable device or backup a removable device to a raw image file. It is very useful for embedded development, namely Arm development projects (Android, Ubuntu on Arm, etc). Averaging more than 50,000 downloads every week, this tool is a very popular way to copy a disk image to a new machine. It’s very useful for systems administrators and developers.

Open Source Cloud Data Storage Solutions

Camlistore

Camlistore is short for “Content-Addressable Multi-Layer Indexed Storage.” Camlistore is a set of open source formats, protocols, and software for modeling, storing, searching, sharing and synchronizing data in the post-PC era. Data may be files or objects, tweets or 5TB videos, and you can access it via a phone, browser or FUSE filesystem. It is still under active development. If you’re a programmer or fairly technical, you can probably get it up and running and get some utility out of it. Many bits and pieces are actively being developed, so be prepared for bugs and unfinished features.

CloudStack

Apache’s CloudStack project offers a complete cloud computing solution, including cloud storage. Key storage features include tiering, block storage volumes and support for most storage hardware.

CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide an on-premises (private) cloud offering, or as part of a hybrid cloud solution.

CloudStack is a turnkey solution that includes the entire “stack” of features most organizations want with an IaaS cloud: compute orchestration, Network-as-a-Service, user and account management, a full and open native API, resource accounting, and a first-class User Interface (UI). It currently supports the most popular hypervisors: VMware, KVM, Citrix XenServer, Xen Cloud Platform (XCP), Oracle VM server and Microsoft Hyper-V.

CloudStore

CloudStore synchronizes files between multiple locations. It is similar to Dropbox, but it’s completely free and, as noted by the developer, does not require the user to trust a US company.

Cozy

Cozy is a personal cloud solution that allows users to “host, hack and delete” their own files. It stores calendar and contact information in addition to documents, and it also has an app store with compatible applications.

DREBS

Designed for Amazon Web Services users, DREBS stands for “Disaster Recovery for Elastic Block Store.” It runs on Amazon’s EC2 service and takes periodic snapshots of EBS volumes for disaster recovery purposes. It is designed to be run on the EC2 host to which the EBS volumes being snapshotted are attached.

DuraCloud

DuraCloud is a hosted service and open technology developed by DuraSpace that makes it easy for organizations and end users to use cloud services. DuraCloud leverages existing cloud infrastructure to enable durability and access to digital content. It is particularly focused on providing preservation support services and access services for academic libraries, academic research centers, and other cultural heritage organizations. The service builds on the pure storage from expert storage providers by overlaying the access functionality and preservation support tools that are essential to ensuring long-term access and durability. DuraCloud offers cloud storage across multiple commercial and non commercial providers, and offers compute services that are key to unlocking the value of digital content stored in the cloud. DuraCloud provides services that enable digital preservation, data access, transformation, and data sharing. Customers are offered “elastic capacity” coupled with a “pay as you go” approach. DuraCloud is appropriate for individuals, single institutions, or for multiple organizations that want to use cross-institutional infrastructure. DuraCloud became available as a limited pilot in 2009 and was released broadly as a service of the DuraSpace not-for-profit organization in 2011.

FTPbox

This app allows users to set up cloud-based storage services on their own servers. It supports FTP, SFTP or FTPS file syncing.

Pydio

Pydio is a mature open source alternative to Dropbox and Box for the enterprise. Formerly known as AjaXplorer, this app helps enterprises set up a file-sharing service on their own servers. It’s very easy to install and offers an attractive, intuitive interface.

Seafile

With Seafile you can set up your own private cloud storage server, using the free community edition or the paid professional edition, or use their hosted service, which is free for up to 1GB. Seafile is an open source cloud storage system with privacy protection and teamwork features. Collections of files are called libraries, and each library can be synced separately. A library can also be encrypted with a user chosen password. Seafile also allows users to create groups and easily share files within them.

SparkleShare

Another self-hosted cloud storage solution, SparkleShare is a good storage option for files that change often and are accessed by a lot of people. (It’s not as good for complete backups.) Because it was built for developers, it also includes Git. SparkleShare is open-source client software that provides cloud storage and file synchronization services. By default, it uses Git as a storage backend. SparkleShare is comparable to Dropbox, but the cloud storage can be provided by the user’s own server, or a hosted solution such as GitHub. The advantage of self-hosting is that the user retains absolute control over their own data. In the simplest case, self-hosting only requires SSH and Git.

Syncany

Syncany is a cloud storage and filesharing application with a focus on security and abstraction of storage. It is similar to Dropbox, but you can use it with your own server or one of the popular public cloud services like Amazon, Google or Rackspace. It encrypts files locally, adding security for sensitive files.

Syncthing

Syncthing was designed to be a secure and private alternative to public cloud backup and synchronization services. It is a continuous file synchronization program. It synchronizes files between two or more computers. It offers strong encryption and authentication capabilities and includes an easy-to-use GUI.

PerlShare

PerlShare is another Dropbox alternative, allowing users to set up their own cloud storage servers. Windows and OS X support is under development, but it works on Linux today.

Storage Management / SDS

OpenSDS

OpenSDS provides advanced APIs that enable enterprise storage features to be fully utilized by OpenStack. For end users, OpenSDS offers freedom of choice, letting you select solutions from different vendors as you transform your IT infrastructure into a platform for cloud-native workloads and accelerate new business rollouts.

CoprHD

CoprHD is an open source software defined storage controller and API platform by Dell EMC. It enables policy-based management and cloud automation of storage resources for block, object and file storage providers.

REX-Ray

REX-Ray is a Dell EMC open source project. It’s a container storage orchestration engine enabling persistence for cloud native workloads. New updates and features contribute to enterprise readiness, and the {code} team at Dell EMC works through REX-Ray and libStorage with industry organizations to ensure long-lasting interoperability of storage in cloud native environments through a universal Container Storage Interface.

Nexenta

From their website: “Nexenta is the global leader in Open Source-driven Software-Defined Storage – what we call Open Software-Defined Storage (OpenSDS). We uniquely integrate software-only “Open Source” collaboration with commodity hardware-centric “Software-Defined Storage” (SDS) innovation.”

Libvirt Storage Management

libvirt is an open source API, daemon and management tool for managing platform virtualization. It can be used to manage KVM, Xen, VMware ESX, QEMU and other virtualization technologies. These APIs are widely used in the orchestration layer of hypervisors in the development of cloud-based solutions.

OHSM

Online Hierarchical Storage Manager (OHSM) is the first attempt at an enterprise level open source data storage manager that automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise’s data on slower devices and then copy data to faster disk drives when needed. In effect, OHSM turns the fast disk drives into caches for the slower mass storage devices. Data center administrators set policies governing which data can safely be moved to slower devices and which data should stay on the fast devices; done manually, these migrations tend to cause downtime and changes in the namespace, which OHSM avoids. Policy rules specify both initial allocation destinations and relocation destinations as priority-ordered lists of placement classes. Files are allocated in the first placement class in the list if free space permits, in the second class if no free space is available in the first, and so forth.
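The priority-ordered placement class idea is easy to picture with a toy allocation function. The tier names and capacities below are hypothetical, and this is not OHSM’s actual policy syntax.

```python
# Toy sketch of priority-ordered placement classes: a file lands in the first
# class with enough free space, the second if not, and so on.
# Tier names and capacities are hypothetical.

placement_classes = [
    {"name": "ssd-tier",  "free_bytes": 50 * 10**9},
    {"name": "sas-tier",  "free_bytes": 400 * 10**9},
    {"name": "tape-tier", "free_bytes": 10**15},
]

def allocate(file_size, classes=placement_classes):
    for tier in classes:                     # walk the list in priority order
        if tier["free_bytes"] >= file_size:
            tier["free_bytes"] -= file_size
            return tier["name"]
    raise RuntimeError("no placement class has enough free space")

print(allocate(30 * 10**9))   # lands on ssd-tier
print(allocate(30 * 10**9))   # ssd-tier is now too small, falls to sas-tier
```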

Open Source Data Destruction Solutions

BleachBit

With BleachBit you can free cache, delete cookies, clear Internet history, shred temporary files, delete logs, and discard junk you didn’t know was there. Designed for Linux and Windows systems, it wipes clean thousands of applications including Firefox, Internet Explorer, Adobe Flash, Google Chrome, Opera, Safari, and more. Beyond simply deleting files, BleachBit includes advanced features such as shredding files to prevent recovery, wiping free disk space to hide traces of files deleted by other applications, and vacuuming Firefox to make it faster.

Darik’s Boot And Nuke

Darik’s Boot and Nuke (“DBAN”) is a self-contained boot image that securely wipes the hard disks of most computers. DBAN is appropriate for bulk or emergency data destruction. This app can securely wipe an entire disk so that the data cannot be recovered. The owner of the app, Blancco, also offers related paid products, including some that support RAID.

Eraser

Eraser is a secure data removal tool for Windows. It completely removes sensitive data from your hard drive by overwriting it several times with carefully selected patterns. It erases residue from deleted files, erases MFT and MFT-resident files (for NTFS volumes) and Directory Indices (for FAT), and has a powerful and flexible scheduler.

FileKiller

FileKiller is another option for secure file deletion. It allows the user to determine how many times deleted data is overwritten depending on the sensitivity of the data being deleted. It offers fast performance and can handle large files.

It features high performance, the ability to choose the number of overwrite iterations (1 to 100), and a choice of overwrite methods: blanks, random data, or a user defined ASCII character. Filenames are deleted along with file data. No setup is needed, you get just a single executable, and it requires .NET 3.5.
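The overwrite-then-delete approach is simple to sketch. The snippet below is a generic illustration of the technique (not FileKiller’s code), and the file name is hypothetical.

```python
import os

def shred(path, passes=3):
    """Toy illustration of overwrite-then-delete secure deletion.
    Each pass rewrites the file in place with random bytes before unlinking."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())   # push this pass to disk before the next one
    os.remove(path)

shred("secret.txt", passes=3)      # hypothetical file name
```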

Open Source Distributed Storage/Big Data Solutions

BigData

BigData describes itself as an ultra high-performance graph database supporting the RDF data model. It can scale to 50 billion edges on a single machine. Paid commercial support is available for this product.

Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. This project is so well known that it has become nearly synonymous with big data.
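The “simple programming models” mentioned above usually means MapReduce. As a hedged illustration, here is the classic word-count example written so it could be run with Hadoop Streaming; the jar location and HDFS paths in the comment are placeholders that depend on your installation.

```python
#!/usr/bin/env python3
"""Word-count mapper and reducer in the Hadoop Streaming style: the mapper
reads lines on stdin and emits "word<TAB>1" pairs; Hadoop sorts them by key
and the reducer sums the counts per word. Example invocation (paths are
installation-specific placeholders):
  hadoop jar hadoop-streaming.jar -mapper "python3 wc.py map" \
    -reducer "python3 wc.py reduce" -input /in -output /out
"""
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:          # input arrives sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```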

HPCC

HPCC Systems (High Performance Computing Cluster) is an open source, massively parallel-processing computing platform for big data processing and analytics, intended as an alternative to Hadoop. It is a distributed data storage and processing platform that scales to thousands of nodes. It was developed by LexisNexis Risk Solutions, which also offers paid enterprise versions of the software.

Sheepdog

Sheepdog is a distributed object storage system for volume and container services that manages disks and nodes intelligently. Sheepdog features ease of use and simplicity of code, and it can scale out to thousands of nodes. The block level volume abstraction can be attached to QEMU virtual machines and the Linux SCSI Target, and supports advanced volume management features such as snapshots, cloning, and thin provisioning. The object level container abstraction is designed to be OpenStack Swift and Amazon S3 API compatible and can be used to store and retrieve any amount of data with a simple web services interface.

Open Source Document Management Systems (DMS) Solutions

bitfarm-Archiv Document Management

bitfarm-Archiv document management is an intuitive, award-winning software package with fast user acceptance. Its extensive, practical functionality and excellent adaptability make this open source DMS one of the most powerful document management, archiving, and ECM solutions for institutions in all sectors, at low cost. A paid enterprise version and paid services are available.

DSpace

Highly rated DSpace describes itself as “the software of choice for academic, non-profit, and commercial organizations building open digital repositories.” It offers a Web-based interface and very easy installation.

Epiware

Epiware offers customizable, Web-based document capture, management, storage, and sharing. Paid support is also available.

LogicalDOC

LogicalDOC is a Web-based, open source document management software that is very simple to use and suitable for organizations of any size and type. It uses best-of-breed Java technologies such as Spring, Hibernate and AJAX and can run on any system, from Windows to Linux or Mac OS X. The features included in the community edition, including a light workflow, version control and a full-text search engine, help manage the document lifecycle, encourage cooperation, and allow you to quickly find the documents you need without wasting time. The application is implemented as a plugin system that lets you easily add new features through its predefined extension points. Moreover, the presence of Web services ensures that LogicalDOC can be easily integrated with other systems.

OpenKM

OpenKM integrates all essential document management, collaboration and advanced search functionality into one easy to use solution. The system also includes administration tools to define the roles of various users, access control, user quotas, document security levels, detailed activity logs and automation setup. OpenKM builds a highly valuable repository of corporate information assets to facilitate knowledge creation and improve business decision making, boosting workgroup and enterprise productivity through shared practices, better customer relations, faster sales cycles, improved product time-to-market, and better-informed decision making.

Open Source Encryption Solutions

AxCrypt

Downloaded nearly 3 million times, AxCrypt is one of the leading open source file encryption tools for Windows. It works with the Windows file manager and with cloud-based storage services like Dropbox, Live Mesh, SkyDrive and Box.net. It offers personal privacy and security with AES-256 file encryption and compression for Windows, and a double-click automatically decrypts and opens documents.

Crypt

Extremely lightweight at just 44KB, Crypt promises very fast encryption and decryption. You don’t need to install it, and it can run from a thumb drive. This tool is command line only, as you’d expect for such a lightweight application.

Gnu Privacy Guard (GPG)

GNU Privacy Guard (GnuPG or GPG) is a free software replacement for Symantec’s PGP cryptographic software suite. GnuPG is compliant with RFC 4880, the IETF standards-track specification of OpenPGP. GNU’s implementation of the OpenPGP standard allows users to encrypt and sign data and communications. It’s a very mature project that has been under active development for well over a decade.
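As a quick illustration of how GnuPG is typically driven from a script, here is a minimal, hedged Python sketch that shells out to the gpg command line to encrypt and sign a file. The recipient key ID and file names are placeholders, and it assumes gpg is installed and the recipient’s public key is already in your keyring.

```python
import subprocess

RECIPIENT = "alice@example.com"   # hypothetical key ID/email already in your keyring
PLAINTEXT = "report.txt"          # placeholder input file
CIPHERTEXT = "report.txt.gpg"     # encrypted output file

# Encrypt the file to the recipient's public key.
subprocess.run(
    ["gpg", "--encrypt", "--recipient", RECIPIENT,
     "--output", CIPHERTEXT, PLAINTEXT],
    check=True,
)

# Create a detached signature so the recipient can verify integrity and origin.
subprocess.run(
    ["gpg", "--detach-sign", "--output", PLAINTEXT + ".sig", PLAINTEXT],
    check=True,
)

# The recipient would then decrypt with their private key:
#   gpg --decrypt --output report.txt report.txt.gpg
# and verify the signature with:
#   gpg --verify report.txt.sig report.txt
```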

gpg4win (GNU privacy guard for Windows)

See above. This is a port of the Linux version of GPG. It’s easy to install and includes plug-ins for Outlook and Windows Explorer.

GPG Tools

See above. This project ports GPG to the Mac.

TrueCrypt

TrueCrypt is a discontinued source-available freeware utility used for on-the-fly encryption (OTFE). It can create a virtual encrypted disk within a file, or encrypt a partition or an entire storage device (with pre-boot authentication). Extremely popular, this utility has been downloaded millions of times.

 

Defining Software-Defined Storage: Benefits, Strategy, Use Cases, and Products

This blog entry was made out of personal interest.  I was curious about the current state of software defined storage in the industry and decided to get myself up to speed.  I’ve done some research and reading on SDS off and on over the course of the last week and this is a summary of what I’ve learned from various sources around the internet.

What is SDS?

First things first.  What is “Software-Defined Storage?”  The term is used very broadly to describe many different products with various features and capabilities. It seems to me to be an overused and not very well defined term, but it is the preferred label for the trend toward data storage becoming independent of the underlying hardware. In general, SDS describes data storage software that provides policy-based provisioning and management of data storage independent of the underlying hardware. The term itself is open to interpretation among industry experts and vendors, but it usually encompasses software abstraction from hardware, policy-based provisioning and data management, and a hardware-agnostic implementation.

How the industry defines SDS

Because of the ambiguity surrounding the definition, I looked up multiple respected sources on how the term is defined in the industry. I first looked at IDC and Gartner. IDC defines software defined storage solutions as solutions that deploy controller software (the storage software platform) that is decoupled from underlying hardware, runs on industry standard hardware, and delivers a complete set of enterprise storage services. Gartner defines SDS in two separate parts, Infrastructure and Management:

  • Infrastructure SDS (what most of us are familiar with) utilizes commodity hardware such as x86 servers, JBOD, JBOF, or other commodity components, and offers features through software orchestration. It creates and provides data center services to replace or augment traditional storage arrays.
  • Management SDS controls hardware but also controls legacy storage products to integrate them into a SDS environment. It interacts with existing storage systems to deliver greater agility of storage services.

The general characteristics of SDS

So, based on what I just discussed, what follows is my summary and explanation of the general defining characteristics of software defined storage. These key characteristics are common among all vendor offerings.

  • Hardware and Software Abstraction. SDS always includes abstraction of logical storage services and capabilities from the underlying physical storage systems.
  • Storage Virtualization. External controller-based arrays include storage virtualization to manage usage and access across the drives within their own pools; other products exist independently to manage across arrays and/or directly attached server storage.
  • Automation and Orchestration. SDS includes automation with policy-driven storage provisioning, where service-level agreements (SLAs) generally replace the precise details of the actual hardware (a hypothetical sketch follows this list). This requires management interfaces that span traditional storage-array products.
  • Centralized Management. SDS includes management capabilities with a centralized point of management.
  • Enterprise storage features. SDS includes support for all the features desired in an enterprise storage offering, such as compression and deduplication, replication, snapshots, data tiering, and thin provisioning.
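To make the policy-driven provisioning idea more concrete, here is a purely hypothetical sketch (not any particular vendor’s API) of what requesting storage by service level rather than by hardware details might look like. The policy names, fields, and the provision() helper are all invented for illustration.

```python
# Hypothetical service-level policies: the consumer asks for "gold" or "capacity"
# storage and never specifies arrays, RAID levels, or disk types.
POLICIES = {
    "gold":     {"max_latency_ms": 2,  "iops_per_tb": 5000, "protection": "replica-3",   "snapshots": True},
    "capacity": {"max_latency_ms": 20, "iops_per_tb": 200,  "protection": "erasure-8+2", "snapshots": False},
}

def provision(name: str, size_tb: int, policy: str) -> dict:
    """Return a provisioning request; a real SDS controller would translate
    this into placement decisions across whatever hardware it manages."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    return {"volume": name, "size_tb": size_tb, "sla": POLICIES[policy]}

# Example: a 10 TB volume for a database, described only by its SLA.
request = provision("erp-db-01", 10, "gold")
print(request)
```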

Choosing a strategy for SDS

There are a host of considerations when developing a software defined storage strategy. Below is a list of some of the important items to consider during the process.

  • Cloud Integration. It’s important to ensure any SDS implementation will integrate with the cloud, even if you’re not currently using cloud services in your environment. The storage industry is moving heavily to cloud workloads and you need to be ready to accommodate business demands in that area. In addition, Amazon’s S3 interface has become the default protocol for cloud communication, so choose an SDS solution that supports S3 for seamless integration.
  • Storage Management Analysis. A deep understanding of how SDS is managed alongside all your legacy storage will be needed. You’ll need a clear understanding of the capacity and performance being used in your environment. Determine where you might need more performance and where you might need more capacity. It’s common in the industry now to not have a deep understanding of how your storage impacts the business, to lack a service catalog portfolio, and have limited resources managing your critical storage. If your organization is on top of those common issues, you’re well ahead of the game.
  • Research your options well. SDS really marks the end of large isolated storage environments. It allows organizations to move away from silos and customize solutions to their specific business needs. SDS allows organizations to build a hybrid of pretty much anything.   Taking advantage of high density NL-SAS disks right next to the latest high performance all-flash array is easily done, and the environment can be tuned to specific needs and use cases.
  • Pay attention to Vendor Support. There are also concerns about support. A software vendor will of course support its own software-defined storage product, but will they offer support when there is a conflict between a heterogeneous hardware environment and their software? Organizations should plan and architect the environment very carefully. All competent software vendors will offer a support matrix for hardware, but only so much can be done if there is a bug in the underlying hardware.
  • Performance Impact analysis. Just like any traditional storage implementation, predictability of performance is an important item to consider when implementing an SDS architecture. A workload analysis and a working knowledge of your precise performance requirements will go a long way toward a successful implementation. Many organizations run SDS on general-purpose, server-class servers and not the purpose-built systems designed solely for storage. Performance predictability can be especially concerning when SDS is implemented into a hyper-converged environment, as the hosts must run the SDS software while also running business applications.
  • Implementation Timeframe. SDS technology can make initial implementation more time consuming and difficult, especially if you choose a software only solution. The flexibility SDS offers provides a storage architect with many more design options, which of course translates into a much more extensive hardware selection process.   Organizations must carefully evaluate the various SDS components and the total amount of time it will consume to select the appropriate storage, networking, and server hardware for the project.
  • Overall Cost and ROI. I’m sure you’ll hear this from your vendor – they will promise that SDS will decrease both acquisition and operational costs while simultaneously increasing storage infrastructure flexibility. Your results may vary, and be aware that the software based products more closely resemble the original intention of this technology and are the best suited to provide those promised benefits. A software based SDS architecture will likely involve a more complex initial implementation with higher costs. While bundled products may offer a better implementation experience, they may limit flexibility.   Determining if software solutions and bundled hardware solutions are a better fit largely depends on whether your IT team has the time and skills required to research and identify the required hardware components on their own. If so, a software-only product can provide for significant savings and provide maximum flexibility.
  • Avoid Forklift Upgrades. One of the original purposes of SDS was to be hardware agnostic so there should be no reason to remove and replace all of your existing hardware. Organizations should research solutions that enable you to protect your existing investment in hardware as opposed to requiring a forklift upgrade of all of your hardware. A new SDS implementation should complement your environment and protect your investment in existing servers, storage, networking, management tools and employee skill sets.
  • Expansion and Upgrade capability. Before you buy new hardware to expand your environment, confirm that the additional hardware can seamlessly integrate with your existing cloud or datacenter environment. Organizations should look for products that allow easy and non-disruptive hardware and software expansions & upgrades, without the need for additional time consuming customization.
  • Storage architecture. The fundamental design of the hardware can expose both efficiencies and deficiencies in the solution stack. Everything should be scrutinized, down to the tiniest details. Pay particular attention to features that affect storage overhead (deduplication, compression, etc).
  • Test your application workloads. Often overlooked is the fact that a storage infrastructure exists entirely to facilitate data access by applications. It’s a common mistake to downplay the importance of an application workload analysis. Consider a proof of concept or extensive testing with a value added reseller using your own data if possible; it’s the only way to ensure the solution will meet your expectations when it’s placed into a production environment. If possible, test SDS software solutions with your existing storage infrastructure before a purchase is made, as it will help reveal just how hardware independent the SDS software actually is.

Potential use cases and justification for SDS 

SDS solutions will continue to have a significant impact on the traditional storage market moving into the future. IDC research suggests that traditional stand-alone hybrid systems are expected to start declining while new all-flash, hyper-converged and software-defined storage adoption grows at a much faster rate.  So, on to some potential use cases:

  • Non-disruptive data migrations.  This is where appliance and storage controller based virtualization solutions have already been used successfully for many years. I have experience installing and managing the VPLEX storage virtualization device in an existing storage infrastructure, and it was used extensively for non-disruptive data migrations in the environment that I supported. By inserting an appliance or storage controller based SDS solution into an existing storage network between the server and backend storage, it’s easy to virtualize the storage volumes on both existing and new storage arrays and then migrate data seamlessly and non-disruptively from old arrays to new ones. Weekend outages were turned into much shorter non-disruptive upgrades that the application was completely unaware of.  Great stuff.
  • Better managing deployments of archival/utility storage.  Organizations in general seem to have a growing need for large amounts of archive storage in their environments (low cost, high density disk). It’s not uncommon to have vast amounts of data with an undefined business value, but data that is sufficiently valuable that it cannot easily or justifiably be deleted. In cases like that, storage that is reliable, stable, and economical, performs moderately well, and remains easy to manage and scale is a good fit for an SDS solution. The storage this data resides on needs very few extra enterprise features like auto-tiering, VM integration, deduplication, etc. Cheap and deep storage will work here, and SDS solutions work well in these environments. Whether the SDS software resides on a storage controller or on an appliance, more storage capacity can be quickly and easily added to these environments and then easily managed and scaled. Many of the interoperability and performance issues that have hurt SDS deployments in the past don’t make much difference when it’s simply archive data.
  • Managing heterogeneous storage environments. One of the big issues with appliance and storage controller-based SDS solutions early on was that they attempted to do it all, virtualizing storage arrays from every vendor under the sun, yet failed to create a single pane of glass to manage all of the storage capacity while providing a common, standardized set of storage management features.  That capability is now a game changer in complex environments and is offered by most vendors. Implementing SDS can dramatically reduce administrative time and allow your top staff to focus on more important business needs.

The benefits of SDS

What follows is a summary of some of the key benefits of implementing SDS. This list is what you’re most likely to hear from your friendly local salesperson and in the marketing materials from each vendor. 🙂

  • Non-disruptive hardware expansion. SDS solutions can enable storage capacity expansion without disruption.   New arrays can be added to the environment and data can be migrated completely non-disruptively.
  • Cloud Automation. SDS provides an optimal storage platform for next generation infrastructure of on-prem & private data centers that offers public cloud scale economics, universal access and self-service automation to private clouds.
  • Economics. SDS has potential to significantly reduce operational and management expenses using policy based automation, ease of deployment, programmable flexibility, and centralized management while providing hardware independence and using off the shelf industry-standard components to lower storage system costs. Some vendor offerings will allow the user to leverage existing hardware.
  • Increased ROI. SDS allows policy-driven data center automation that provides the ability to provision storage resources immediately based on VM workload demand. This capability of SDS will encourage organizations to deploy SDS offerings to improve their opex and capex, providing a quick return on investment (ROI).
  • Real-time scalability. SDS offers tiered capacity by service level and the ability to provision storage on demand, which enables optimal capacity based on current business requirements. It also provides detailed metrics for reporting on storage infrastructure usage.
  • High Availability. SDS architectures can provide for improved business continuity. In the event of a hardware failure, an SDS environment can shift load and data automatically to another available node. Because the storage infrastructure sits above the physical hardware, any hardware can be used to replace a failed node. Older systems could even be recycled to improve disaster recovery provisions in SDS, further improving your ROI.

 

The trends for SDS in 2017 and beyond

The SAN guy is not a fortune teller, but these predictions are all creating a buzz in the industry and you’re likely to see them start to materialize in 2017.

  • SDS catches up to traditional storage. SDS is finally catching up with traditional storage. Now that enterprise-class storage features like inline deduplication, compression and QoS have been introduced across the market leaders in SDS solutions, it’s finally becoming a more mainstream solution. The rapidly declining cost of EFD along with the performance and reliability of SDS are really making it well suited for the virtual workloads of many organizations.
  • Multiple Cloud implementations. Analysts are predicting that SDS will introduce a new multi-cloud era in 2017 by leveraging the power of a software-defined infrastructure that is not tied to a specific hardware platform and configuration. SDS users will finally have a defined cloud strategy that is evolutionary to what they are doing today. As a result, IT has to be prepared to support new application models designed to bring the simplicity and agility of cloud to on-premises infrastructure. At the same time, new software-defined infrastructure enables a flexible multi-cloud architecture that extends a common and consistent operating environment from on-prem to off-prem, including public clouds.
  • Management integration improves. Integration will continue to improve. The continued integration of management into hypervisor tools, computational platforms, hyper-converged systems, and next-generation service based infrastructures will continue to enhance SDS capabilities.
  • Storage leaves the island.   Traditional storage implementations typically have many different islands of storage in independent silos. It’s been difficult to break that mold based on business requirements and the hardware and software available to provide the necessary multi-tenancy and still meet those requirements. SDS will begin to allow organizations to consolidate those islands of storage and break the artificial barriers.
  • Increased Hybrid SDS deployments. The use of SDS will continue to move toward hybrid implementations. Organizational requirements will drive the change. It’s no secret that more workloads are moving toward the cloud, and SDS will help break down that boundary. SDS will also start to blur the lines between data that is in the cloud and data that is locally stored, helping make data mobility more seamless and improving fluidity while taking into account regulatory requirements, cost, and performance.
  • The Software-Defined Data Center starts to materialize. The ultimate goal for SDS is the software defined data center. Implementing a Hyper-converged infrastructure (HCI) is important to reach that goal, but in order to achieve it HCI must deliver consistent and predictable performance to all elements of data center management, not just storage. SDS and HCI are the stepping stones for that goal.

Software Defined Storage Vendors

Now that we have an idea of what SDS is and what it can be used for, let’s take a look at the vendors that offer SDS solutions. I put together a vendor list below along with a brief description of the product that is based mostly on the company’s marketing materials.

SwiftStack

SwiftStack’s design goal is to make it easy to deploy, operate, and scale, as well as to provide the fastest experience when deploying and managing a private cloud storage system. Another key design element is to enable large-scale growth without any disruption to performance. It has no fixed hardware configurations and can be configured using any server hardware. It is also licensed for the amount of data capacity utilized, not the total amount of hardware capacity deployed, allowing organizations to pay-as-they-grow using annual licenses.

SwiftStack offers a reliable, massively scalable, software defined storage platform. It seamlessly integrates with existing IT infrastructures, running on standard hardware, and replicated across globally distributed data centers.

HPE StoreVirtual VSA

HPE StoreVirtual VSA is storage software that runs on commodity hardware in a virtual machine in any virtualized server environment, including VMware, Hyper-V, and KVM. It turns any media presented to it via the hypervisor into shared storage. It presents the storage to all physical and virtual hosts in the environment as an iSCSI array. Additionally, StoreVirtual VSA is part of an integrated family of solutions that all share the same storage operating system, including StoreVirtual arrays and HPE’s hyper-converged systems. It has a full enterprise storage feature set that provides the capabilities and performance you would expect from a traditional storage area network. It provides low cost data protection that delivers fast, efficient, and scalable backup and does not require dedicated hardware.

HPE StoreOnce VSA

StoreOnce VSA is a SDS solution that provides backup and recovery for virtualized environments. It enables organizations to reduce the cost of secondary storage by eliminating the need for a dedicated backup appliance. It shares the same deduplication algorithm and storage features as the StoreOnce Disk Backup family, including the ability to replicate bi-directionally from a physical backup appliance to SDS.

Metalogix StoragePoint

StoragePoint is a SharePoint storage optimization solution that offloads unstructured SharePoint content data, which is known as Binary Large Objects (BLOBs), from SharePoint’s underlying SQL database to alternate tiers of storage. BLOBs quickly overwhelm the SQL database that powers SharePoint, resulting in poor performance that is expensive to maintain and grow. Many rich media formats are too large to store in SQL Server due to technical limitations, resulting in a collaboration platform that cannot address all the content needs of an organization.

StoragePoint optimizes SharePoint Storage using Remote Blob Storage (RBS). It provides a method to address file content storage issues related to large file size, slow user query times and backup failures. It externalizes SharePoint content so it can be stored and managed anywhere. An automated rules engine places content in the most appropriate storage locations based on the type, criticality, age and frequency of use.

VMware vSAN

Previously known as VMware Virtual SAN, vSAN addresses hyper-converged infrastructure systems. It aggregates locally attached disks in a vSphere cluster to create storage that can be provisioned and managed from vCenter and vSphere Web Client tools. This enables organizations to evolve their existing virtualization environment with the only natively integrated vSphere solution and leverages multiple server hardware platforms. It reduces TCO due to the cost savings of utilizing server side storage, with more affordable flash storage, on demand scaling, and simplified storage management. It can also be expanded into a complete SDS solution that can provide the foundation for a cloud architecture.

Using the VMware SDS model, the data level that’s responsible for storing data and implementing data services such as replication and snapshots is virtualized by abstracting physical hardware resources, and aggregating them into logical pools of capacity (called virtual datastores) that can be used and managed with a high degree of flexibility. By making the virtual disk the basic unit of management for storage operations in the virtual datastores, precise combinations of hardware resources and storage services can be configured and controlled independently for each virtual machine.

Microsoft S2D

Microsoft Storage Spaces Direct (or S2D) is a part of Windows Server 2016. It can be combined with Storage Replica (SR) along with resilient file system cache tiering to create scale-out, converged and hyper-converged infrastructure SDS for Windows Servers and Hyper-V environments. It has the capability to use existing tools and has many flexible configuration and deployment options.

Infinidat InfiniBox

InfiniBox is based upon a fully abstracted set of software driven storage functions layered on top of industry standard hardware, and delivers a fast, highly available, and easy-to-deploy storage system. Extreme reliability and performance are delivered through their innovative self-healing architecture, high performance double-parity RAID, and comprehensive end-to-end data verification capability. They also feature an efficient data distribution architecture that uses all of the installed drives all the time. It has a large flash cache that delivers ultra-high performance that can match or exceed 12GB/s throughput (yes, it’s a marketing number).

Pivot3

Pivot3’s virtual storage and compute operating environment, known as vSTAC, is designed to maximize overall resource utilization, providing efficient fault tolerance and giving IT the flexibility to deploy on a wide range of commodity x86 hardware. A distributed scale-out architecture pools compute and storage from each HCI node into high-availability clusters, accessible by every VM and application. Its Scalar Erasure Coding is said to be more efficient than network RAID or replication protection schemes, and it maintains performance during degraded mode conditions. Pivot3 owns multiple SDS patents, one covering their technology that creates a cross-node virtual SAN that can be accessed as a unified storage target by any application running on the cluster. By converging compute, storage and VM management, they automate system management with self-optimizing, self-healing and self-monitoring features. Their vCenter plugin provides a single pane of glass to simplify management of single and multi-site deployments.

EMC VIPR Controller

EMC ViPR Controller provides Software Defined Storage automation that centralizes and transforms multivendor storage into a simple and extensible platform. It also performs infrastructure provisioning on VCE Vblock Systems. It abstracts and pools resources to deliver automated, policy-driven storage-as-a-service on demand through a self-service catalog across a multi-vendor environment. It integrates with cloud stacks like VMware, OpenStack, and Microsoft, and offers RESTful APIs for integration with other management systems along with multi-vendor platform support.

EMC ECS (Elastic Cloud Storage)

ECS provides a complete software-defined cloud storage platform for commodity infrastructure. Deployed as a software-only solution or as a turnkey appliance, ECS offers all the cost savings of commodity infrastructure with enterprise reliability, availability, and serviceability. EMC launched it as its next generation hyper-scale object-based storage solution; it was originally designed to overcome the limitations of Centera. It is used to store, archive, and access unstructured content at scale. It’s designed to allow businesses to deploy massively scalable storage in a private or public cloud, and allows customizable metadata for data placement, protection, and lifecycle policies. Data protection is provided by a hybrid encoding approach that utilizes local and distributed erasure coding for site-level and geographic protection.

EMC ScaleIO

ScaleIO is a software-only, server-based storage area network that combines storage and compute resources to form a single layer. It uses existing local disks and LANs so that the host can realize a virtual SAN with all the benefits of external storage. It provides virtual and bare metal environments with scale, elasticity, multi-tenant capabilities, and service quality that enables service providers to build high performance, low cost cloud offerings. It enables full data protection and persistence. The software ensures enterprise-grade resilience through meshed mirroring of randomly sliced and distributed data chunks across multiple servers.

IBM Spectrum Storage

IBM Spectrum software is part of a comprehensive family of software-defined storage solutions. It is specifically structured to meet changing storage needs, including hybrid cloud, and is designed for organizations just starting out with software-defined storage as well as those with established infrastructures who need to expand their capabilities.

NetApp StorageGRID

NetApp’s SDS offerings include the NetApp clustered Data ONTAP OS, NetApp OnCommand, the NetApp FAS series, and NetApp FlexArray virtualization software. Features of NetApp’s SDS include virtualized storage services (provisioning of data storage and access based on service levels), multiple hardware options (supporting deployment on a variety of enterprise platforms), and application self-service (APIs for workflow automation and custom applications).

DataCore

DataCore’s storage virtualization software allows organizations to seamlessly manage and scale data storage architectures, delivering massive performance gains at a much lower cost than solutions offered by legacy storage hardware vendors. DataCore has a large customer base around the globe. Their adaptive, self-learning and self-healing technology eases management, and their solution is completely hardware agnostic.

Nexenta

Nexenta integrates software-only “Open Source” collaboration with commodity hardware. Their software is installed in thousands of companies around the world serving a wide variety of workloads and business-critical situations. It powers some of the world’s largest cloud deployments. With their complete Software-Defined Storage portfolio and recent updates to NexentaConnect for VMware VSAN and the launch of NexentaEdge, they offer a robust SDS solution.

Hitachi Data Systems G-Series

Hitachi Virtual Storage Platform G1000 provides the always-available, agile and automated foundation needed for an on-prem or hybrid cloud infrastructure. Their software enables IT agility and a low TCO. They deliver a top notch combination of enterprise-ready software-defined storage and global storage virtualization along with efficient, scalable, and high performance hardware. It also supports self-managing, policy-driven management. Their SDS implementation includes the Hitachi Virtual Storage Platform G1000 (VSP G1000) and the Hitachi Storage Virtualization Operating System (SVOS).

StoneFly SCVM

The StoneFly Storage Concentrator Virtual Machine (SCVM) Software-Defined Unified Storage (SDUS) is a Virtual IP Storage Software Appliance that creates a virtual network storage appliance using the existing resources of an organization’s virtual server infrastructure. It is a virtual SAN storage platform for VMware vSphere ESX/ESXi, VMware vCloud and Microsoft Hyper-V environments and provides an advanced, fully featured iSCSI, Fibre Channel SAN and NAS within a virtual machine to form a Virtual Storage Appliance.

Nutanix

Nutanix’s software-driven Xtreme Computing Platform natively converges compute, virtualization and storage into a single solution. It offers predictable performance, linear scalability and cloud-like infrastructure consumption. Nutanix also acquired PernixData, whose FVP software is a 100% software solution that clusters server flash and RAM to create a low latency I/O acceleration tier for any shared storage environment.

StorPool

StorPool is a storage software solution that runs on standard commodity servers and builds a scalable, high-performance SDS system. It offers great flexibility and can be deployed either converged with compute or on separate storage nodes. It has an advanced, fully-distributed architecture and is one of the fastest and most efficient cloud-ready block-storage software solutions available.

Hedvig

Hedvig collapses traditional tiers of storage into a single, software platform designed for primary and secondary data. Their patented “Universal Data Plane” architecture stores, protects, and replicates data across multiple private and public clouds. The Hedvig Distributed Storage Platform is a single software-defined storage solution that is designed to meet the needs of primary, secondary, and cloud data requirements. It is a distributed system that provides cloud-like elasticity, simplicity, and flexibility.

Amax StorMax SDS

StorMax SDS is a highly available software-defined storage solution that delivers unified file and block storage services with enterprise-grade data management, data integrity, and performance that can scale from tens of terabytes to petabytes. It is seamlessly integrated with NexentaStor, and the plug and play appliances are designed to be a simple swap-in replacement for legacy block and file storage appliances, offering unlimited file system sizes, unlimited snapshots and clones, and inline data reduction for additional storage cost savings. It’s well suited for VMware, OpenStack, or CloudStack backend storage, generic NAS file services, home directory storage, and near-line archive and large backup & archive repositories.

Atlantis USX

USX is Atlantis’ SDS software solution. It includes policy-based management of storage resources, storage pooling and automation of storage functions. It also provides a REST API to allow organizations to automate storage functions. It promises to deliver the performance of an all-flash storage array at a lower cost than that of traditional SAN or NAS. The marketing materials state that you can pool any SAN, NAS or DAS storage and accelerate its performance by up to 10x, while at the same time consolidating storage to increase storage capacity by up to 10x.

LizardFS

The LizardFS SDS solution is a distributed, scalable, fault-tolerant and highly available file system that runs on commodity hardware. It allows users to combine disk space located on many servers into a single namespace that is visible on Unix and Windows. LizardFS ensures file security by keeping all data in multiple replicas spread over the available servers. Disk and server failures are handled transparently without any downtime or loss of data. As your storage requirements grow it scales by adding new servers without any downtime. The system automatically redistributes data to newly added servers as it continuously balances disk usage across all connected nodes. Removing servers is as easy as adding a new one.

That’s a large portion of the SDS vendor playing field, but there are others. You can also check out the offerings from Maxta, Tarmin, Coraid, Cohesity, Scality, Starwind, and Red Hat Storage Server (Ceph).

There were long pauses in between as I worked on this blog post in an on and off manner, so I may make some editorial changes and additions in the coming weeks.  Feedback is welcomed and appreciated.

What’s the difference between a Storage Engineer and a Storage Administrator?

During my recent job search a recruiter asked me if there was a difference between a Storage Administrator and a Storage Engineer. He had no idea. I was initially a bit surprised at the question, as I’ve always assumed that it was widely accepted that an engineer is more involved in the architecture of systems whereas an administrator is responsible for managing them.  While his question was about Storage, it applies to many different disciplines in the IT industry as both the Administrator and Engineer titles are routinely appended to “System”, “Network”, “Database”, etc.  Many companies use the terms completely interchangeably and many storage professionals perform both roles. In my experience HR Departments generally label all technical IT employees as “Analysts”, no matter which discipline you specialize in.

From my own personal perspective, I present the following definitions:

Storage Engineer: A person who uses a disciplined, methodical approach to the design, realization, technical management, operation, and life-cycle management of a storage environment.

Storage Administrator: A person who is responsible for the daily upkeep, technical configuration, support, and reliable operation of a storage environment.

To all of my recruiter friends and associates, please think of the System Engineer as the person who is responsible for laying the foundation and ensuring that it is implemented properly.  Afterwards, the Administrator is responsible for carrying out the daily routines and supporting the vision of the engineer.

Does one title outrank the other? No, in my opinion they’re equal. As I mentioned before, HR departments generally don’t distinguish between the two, both are usually in the same pay grade, and the overlap of responsibilities is such that many people perform the duties of both regardless of which title they hold. In my experience performing both roles at multiple companies, a Storage Engineer at any given company is given a problem and, in a nutshell, their job is to find the best solution for it. What is the normal process for finding the best solution?  The Engineer researches and develops the best possible combinations of network, compute, and storage resources along with all the required software features and functionality after investigating a multitude of different vendors and technologies.  Storage industry trends and new technologies are usually researched as well. Following that research, they determine the best fit based on cost, the specific business use case, expansion and scalability, and performance testing in a lab or onsite with a proof of concept, all while taking into account the ease of administration and supportability of the hardware and software from both a vendor and an internal admin standpoint. A Storage Administrator is generally heavily involved in this decision-making process, as they will be responsible for tuning the environment to optimize performance and reliability for the customer; as a result, their opinion during the research phase is crucial.  Administrator feedback based on job experience is critical in the research and testing phase across the board; it’s simply not something that’s taught in a book or degree program.
With the considerable overlap between the two jobs in most companies, it’s not surprising the titles are used so interchangeably and that there is general market confusion about the difference. A company isn’t going to hire a group of storage administrators to simply sit at a desk and monitor a group of storage arrays; they will be required to understand the process of building a complex storage environment and how it fits into the specific business environment. Engineering and administering a petabyte-scale global storage environment is very complex no matter which title you’re given. A seasoned Storage Administrator or Storage Engineer should both be up to the task, regardless of how you define their roles. At the end of the day, I’m proud to be a SeniorSANStorageAnalystAdminEngineerSpecialist professional. 🙂

Did I get any of this wrong?  Share your feedback with me in the comments section.

A Primer on Object Storage


I recently took a new job in the Enterprise storage group at a large company, and as part of my new responsibilities I will be implementing, configuring, and managing our object storage.  We will be using EMC’s ECS solution, however I’ve been researching object storage as a platform in general to learn more about it: its capabilities, its use cases, and its general management. This blog post is a summary of the information I’ve assimilated about object storage along with some additional information about the vendors who offer object storage solutions.

At its core, what is object based storage?  Object based storage is an architecture that manages data as objects as opposed to other architectures such as file systems that manage data as a file hierarchy, and block storage that manages data as blocks within sectors & tracks.  To put it another way, Object Storage is not directly accessed by the operating system and is not seen as a local or remote filesystem by an operating system. Interaction with data occurs only at the application level via an API.  While Block Storage and File Storage are designed to be consumed by your operating system, Object Storage is designed to be consumed by your application.  What are the implications of this?

  • Byte level interaction with data is no longer possible. Data objects are stored or retrieved in their entirety with a single command. This results in powerful scalability by making all I/O sequential, which offers much higher performance than random I/O.
  • It allows for easier application development by providing a higher level of abstraction than traditional storage platforms.
  • Interaction can happen through a single API endpoint (see the sketch after this list). This removes complex storage network topologies from the design of the application infrastructure, and dramatically reduces security vulnerabilities since the only available access is the HTTP/HTTPS API and the service providing the API functionality.
  • Filesystem level utilities cannot interact directly with Object Storage.
  • Object Storage is one giant volume, resulting in almost all storage management overhead of Block and File Storage being eliminated.
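To make the “the application talks to an API endpoint” point concrete, here is a minimal sketch using Python and boto3 against a generic S3-compatible endpoint (most object stores, including ECS, expose an S3-compatible API). The endpoint URL, credentials, bucket, and object key are all placeholders.

```python
import boto3

# Placeholders: point these at your object store's S3-compatible endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store an entire object in one call, with custom metadata attached to it.
s3.put_object(
    Bucket="archive",
    Key="projects/2017/report.txt",
    Body=b"example file contents",
    Metadata={"owner": "finance", "retention": "7y"},
)

# Retrieve the whole object; there is no byte-level access or filesystem mount.
obj = s3.get_object(Bucket="archive", Key="projects/2017/report.txt")
data = obj["Body"].read()
meta = obj["Metadata"]
```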

Object storage is designed to be more scalable than traditional block and file storage and it specifically targets unstructured content.  In order to achieve improved scalability over traditional file storage it bundles the data with additional metadata tags and a unique identifier. The metadata is completely customizable, allowing an administrator to input much more identifying information for each data object. The objects are also stored in a flat address space which makes it much easier to locate and retrieve data across geographic regions.

Object storage began as a niche technology for enterprises, however it quickly became one of the basic general underlying technologies of cloud storage.  Object storage has become a valid alternative to file based storage due to rapid data growth and the proliferation of big data in the enterprise, an increased demand for private and hybrid cloud storage and a growing need for customizable and scalable storage infrastructures.  The number of object storage products has been rapidly expanding from both major storage vendors and startup companies in recent years to accommodate the increasing demand.  Because many vendors offer Object Storage platforms that run on commodity hardware, even with data protection overhead the price point is typically very attractive when compared to traditional storage.

WHAT’S THE DIFFERENCE BETWEEN OBJECT AND FILE STORAGE?

Object storage technology has been finding its way into file based use cases at many companies.  In some cases object storage vendors are positioning their products as viable NAS alternatives. To address the inherent limitations of traditional file and block level storage in reliably supporting a huge amount of data while remaining cost-effective, object storage focuses on scalability, resiliency, security and manageability. So, what’s the difference between the two?  In general, the differences lie in performance, geographic distribution, scalability, and analytics.

OBJECT STORAGE IS HIGHLY SCALABLE

Scalability is a major issue in storage, and it’s only increasing as time goes on. If you need to scale into the petabytes and beyond, you may need to scale an order of magnitude beyond what a traditional single storage system is capable of.  As traditional storage systems aren’t going to scale to that magnitude, a different type of storage needs to be considered that can still be cost effective, and object storage fills that need very well.

Object storage overcomes many of the scalability limitations that file storage faces.  I really liked the warehouse example that Cloudian used on their website, so I’m going to summarize that.  If you think of file storage as a large warehouse, when you first store a box of files your warehouse looks almost empty and your available space looks infinite. As your storage needs expand that warehouse fills up before you know it, and being in the big city there’s no room to build another warehouse next to it.  In this case, think of object storage as a warehouse without a roof.  New boxes of files can be added almost indefinitely.

While that warehouse with the infinite amount of space sounds good in theory, you may have some trouble finding a specific box in that warehouse as it expands into infinity. Object storage addresses that limitation by allowing customizable metadata.  While a file storage system may only allow metadata to save the date, owner, location, and size of the box, the object storage system can customize the metadata, and object metadata lives directly in the object, rather than in a separate inode (this is useful as the amount of metadata that is desirable in a storage platform that is tens or hundreds of Petabytes is generally an order of magnitude greater than what conventional storage is designed to handle at scale).  Getting back to the warehouse example, along with the date, owner, location, and size, the object metadata could include the exact coordinates of the box, a detailed log of changes and access, and a list of the contents of the box.  Object storage systems replace the limited and rigid file system attributes of file level storage with highly customizable metadata that captures common object characteristics and can also hold application-specific information. Because object storage uses a flat namespace performance may suffer as your data warehouse explodes in size, but you’re not going to have to worry about finding what you need when you need it.

In addition, object storage systems replace the locking that file level storage uses to prevent multiple concurrent updates with object versioning, which enables rollback features and the undeleting of objects as well as the ability to access prior versions of objects.

Object vs. File.  Here’s a brief overview of the main differences.

  • Performance.  Object storage performs best for big data and high storage throughput, while file storage performs better for smaller files.  Scality’s RING offers high performance configurations for applications such as email and databases, however traditional storage in general still offers better performance for those use cases.
  • Geography.  Object storage data can be stored and shared across multiple geographic regions, while file storage data is typically shared locally.  File data spread throughout geographic regions is typically read-only replicated copies of data.
  • Scalability.  Object storage offers almost infinite scalability, file storage does not scale nearly as well when you get into millions of files in a volume and petabytes and beyond of total capacity.
  • Analytics.  Object storage offers customizable metadata in a flat namespace and is not limited in the number of metadata tags, file storage is limited in that respect.

OBJECT STORAGE AND RESILIENCY

So, we now understand that object storage offers much greater scalability than traditional file storage, but what about resiliency?  Traditional file storage systems are hampered by their inherent limitations in supporting massive capacity, most importantly with a sufficient amount of data protection.  As any backup administrator knows, it’s unrealistic to try to back up hundreds of petabytes (or more) of data in any reasonable amount of time.  Object systems directly address that issue: they are designed not to need backups at all.  Rather than relying on traditional backups, an object storage infrastructure is designed to store data with sufficient redundancy so that data is never lost even when multiple components of the infrastructure have failed.

How is this achieved? Primarily by keeping multiple replicas of objects, ideally across a wide geographic area.  Because of the additional storage that replication requires, object storage systems also implement an efficient erasure coding data protection method to supplement data replication. What is erasure coding? It uses an algorithm to create additional information that allows data to be recreated from a subset of the original data, similar to RAID protection’s parity bits.  The degree of resiliency is generally configurable on all object storage systems; the higher the level of resiliency the administrator chooses, the larger the storage capacity requirement. Erasure coding saves capacity but impacts performance, especially if erasure coding is performed across geographically dispersed nodes. Different vendors handle the performance balance between erasure coding and replication differently.  Geographic erasure coding is generally supported, however using erasure coding locally and replicating data geographically (with data reduction) often strikes a good balance between performance and resiliency.
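As a simple illustration of the erasure coding idea (not any vendor’s actual implementation), the sketch below uses single XOR parity, the same principle RAID parity relies on: split the data into k chunks, store one extra parity chunk, and rebuild any one lost chunk from the survivors. Real object stores use more general k+m schemes (for example 12+4), which tolerate the loss of up to m chunks at an overhead of m/k, versus 200% overhead for three-way replication.

```python
from functools import reduce

def xor_chunks(chunks):
    """Bytewise XOR of a list of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

# Split data into k equal chunks (padded), then compute one XOR parity chunk.
k = 4
data = b"object storage erasure coding demo!!"
size = -(-len(data) // k)                     # ceiling division for chunk size
chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
parity = xor_chunks(chunks)

# Simulate losing chunk 2 (e.g. a failed node), then rebuild it from the rest.
lost = 2
survivors = [c for i, c in enumerate(chunks) if i != lost] + [parity]
rebuilt = xor_chunks(survivors)
assert rebuilt == chunks[lost]

# Overhead comparison: 3x replication stores 300% of the original data;
# this k=4, m=1 scheme stores 125%; a 12+4 scheme stores roughly 133%.
```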

OBJECT STORAGE EASES MANAGEMENT

Object storage systems are designed to minimize storage administration through automation, policy engines and self-correcting capabilities. They are designed for zero downtime, and across all vendors administration tasks can be performed without service disruption.  This includes adding capacity, hardware maintenance and upgrades, and even migrating to a different data center.  Object storage policy engines enable the automation of features such as when to change the number of replicas to address spikes in usage, when to use replication vs. erasure coding, and which data centers to store objects in based on the relevant metadata.

OBJECT STORAGE AND APPLICATION ACCESSIBILITY

As you might expect, each object storage vendor has implemented its own proprietary REST API for accessing the various storage functions.  All object storage products also support the industry standard Amazon S3 API, which enjoys perhaps the largest level of application support.  That’s not surprising, as the Amazon S3 API has extensive capabilities and supports complex storage operations.  Be aware that some object storage vendors only support a subset of the S3 API, and understanding the limitations of each implementation is absolutely key to ensuring the widest level of application support in your environment.

In addition to S3, most object storage vendors also support the OpenStack Swift API.  File system protocol support is common in object storage systems, but implementations of course vary by product.  As I mentioned earlier, the company I work for went with ECS.  EMC ECS has geo-distributed active/active NFS support, a key feature, and it offers a Hadoop HDFS interface which allows Hadoop to directly access data in the object store.  With the ECS system’s consistency support it’s a very strong geo-distributed NAS product.  Different vendors of course have different strengths and weaknesses.  A strong competitor of EMC, Scality, claims that it has EMC Isilon-level NAS performance (although I haven’t tested that), and NetApp StorageGRID Webscale now offers protocol duality with a one-to-one relationship between objects and files.  Other object storage products provide different unique features as well.  Some offer file system support through their own or third-party cloud storage gateways, and some provide S3-compliant connectors that allow Hadoop to use object storage as an alternative to HDFS.

OBJECT STORAGE AND DATA ENCRYPTION

Public cloud storage is a very common use case for object storage, and encryption is obviously a must for public cloud storage.  Most object storage products support both at-rest and in-transit encryption, and most use an at-rest encryption approach where encryption keys are generated dynamically without a need for a separate key management system.  Some vendors (such as Cloudian and Amplidata) support client-managed encryption keys in addition to server-side managed encryption keys which gives cloud providers an option to allow their customers to manage their own keys.  LDAP and Active Directory authentication support of users accessing the object store are also commonly supported in current object storage systems.  If support of AWS v2 or v4 authentication is needed to provide access to vaults and vault objects, do your research as support is less common.
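For vendors whose S3 implementation supports it, client-managed keys commonly follow the S3 SSE-C pattern, where the application supplies its own key with each request and the object store uses it only transiently to encrypt or decrypt. The hedged boto3 sketch below illustrates the idea; whether a given object store honors these parameters depends entirely on the vendor’s S3 API coverage, so verify before relying on it. The endpoint, credentials, bucket, key name, and key material are all placeholders.

```python
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # hypothetical S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Client-managed 256-bit key: the server never stores it, so you must keep it
# safe and present the same key on every subsequent read of this object.
customer_key = os.urandom(32)

s3.put_object(
    Bucket="secure-bucket",
    Key="payroll/2017-06.csv",
    Body=b"example payload",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)

obj = s3.get_object(
    Bucket="secure-bucket",
    Key="payroll/2017-06.csv",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
```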

THE BEST USE CASES FOR OBJECT STORAGE

Object storage’s ability to scale and its accessibility via APIs make it suitable for use cases where traditional storage systems just can’t compete, even in the NAS arena.  So, now that you know what object storage is, what can it be used for, and how can you take advantage of the improved scalability and accessibility?  While object storage is typically not well suited for relational databases or any data that requires a large amount of random I/O, it has many possible use cases, which I’ve outlined below.

Advantages

  • Highly Scalable capacity and performance
  • Low cost on commodity hardware at petabyte scale
  • Simplified management
  • Single Access Point/namespace for data

Disadvantages

  • No random access to files
  • Lower performance on a per-object basis compared to traditional storage
  • Integration may require modification of application and workflow logic
  • POSIX utilities will not work directly with object storage

Use Cases

  • Logging.  It is often used to capture large amounts of log data generated by devices and applications which are ingested into the object store via a message broker.
  • NAS.  Many companies are considering object storage as a NAS alternative, most notably if there is another use case that requires an object storage system and the two use cases can be combined.
  • Big Data.  Several object storage products offer certified S3 Hadoop Distributed File System interfaces that allow Hadoop to directly access data on the object store.
  • Content distribution network.  Many companies use an object storage implementation to globally distribute content (like media files) using policies to govern access, along with features like automatic object deletion based on expiration dates.
  • Backup and Archive of structured and semi-structured data.  Because object storage systems are cost-effective, many companies are looking to them as a highly scalable backup and archive solution.
  • Content Repositories.  Object storage is often used as a content repository for images, videos and general media content accessed directly through applications or through file system protocols.
  • Enterprise Collaboration.  Because of the scale and resiliency of object storage across large geographic regions, distributed object storage systems are often used as collaboration platforms in large enterprises where content is accessed and shared around the globe.
  • Storage as a Service (SaaS).  Object storage is often used for private and public clouds of enterprises and internet service providers.

OBJECT STORAGE VENDORS AND PRODUCTS

There are numerous object storage vendors in the market today.  You can purchase a detailed vendor comparison of all the object storage vendors from the Evaluator Group (Evaluator Group Object Storage Comparison Matrix), and view the Gartner object storage comparison matrix for more detailed information.  According to Gartner, DellEMC, Scality, and IBM are the current market leaders, with Hitachi as the strongest challenger.

Gartner Market Leaders for Object Storage:

EMC Elastic Cloud Storage (ECS)

EMC delivers ECS as a turnkey integrated appliance or as a software package to be installed and run on commodity hardware.  It features highly efficient strong consistency on access of geo-distributed objects, and is designed from the ground up with geo-distribution in mind.

Scality RING

It is delivered as software only, to run on commodity hardware. It stores metadata in a custom-developed distributed database, and Scality claims EMC Isilon-level performance when it’s used as NAS.

IBM Cloud Object Storage

It is delivered as software only, to run on certified hardware. It has a multi-tiered architecture with no centralized servers, and it offers extreme scalability enabled by peer-to-peer communication of storage nodes.

Other Object Storage Vendors:

Hitachi Content Platform (HCP)

It is delivered as a turnkey integrated appliance or as a software package to run on commodity hardware or as a managed service hosted by HDS.  It offers extreme density with a single cluster able to support up to 800 million objects and 497 PB of addressable capacity, and an integrated portfolio: HCP cloud storage, HCP Anywhere File Sync & Share, and Hitachi Data Ingestor (HDI) for remote and branch offices.

NetApp StorageGRID Webscale

It is delivered as a software appliance or as a turnkey integrated appliance, and it stores metadata (including the physical location of objects) in a distributed NoSQL Cassandra database.

DDN WOS Object Storage

It is delivered as a turnkey integrated appliance or as a software package to run on commodity hardware.  It can start as a single node and scale to hundreds of petabytes.

Caringo Swarm 8

It is delivered as a software package to run on commodity hardware.  It offers out-of-the-box integration with Elasticsearch for fast object searching.

Red Hat Ceph

It is delivered as a software package to run on commodity hardware and is based on the open-source Reliable Autonomic Distributed Object Store (RADOS).  It features strong consistency on writes.

Cloudian HyperStore

It is delivered as a turnkey integrated appliance or as a software package to run on commodity hardware.  It stores metadata alongside the objects, but also in a distributed NoSQL Cassandra database for speed.

HGST Amplidata

It is delivered as a turnkey rack-level system and uses HGST helium-filled hard drives for power efficiency, reliability, and capacity.

SwiftStack Object Storage System

It is delivered as a software package to run on commodity hardware and is based on OpenStack Swift; it is the enterprise offering of Swift, adding cluster management tools and 24×7 support.

What is VPLEX?

We are looking at implementing a storage virtualization device and I started doing a bit of research on EMC’s product offering.  Below is a summary of some of the information I’ve gathered, including a description of what VPLEX does as well as some pros and cons of implementing it.  This is all info I’ve gathered by reading various blogs, looking at EMC documentation and talking to our local EMC reps.  I don’t have any first-hand experience with VPLEX yet.

What is VPLEX?

VPLEX at its core is a storage virtualization appliance. It sits between your arrays and hosts and virtualizes the presentation of storage arrays, including non-EMC arrays.  Instead of presenting storage to the host directly you present it to the VPLEX. You then configure that storage from within the VPLEX and then zone the VPLEX to the host.  Basically, you attach any storage to it, and like in-band virtualization devices, it virtualizes and abstracts them.

There are three VPLEX product offerings, Local, Metro, and Geo:

Local.  VPLEX Local manages multiple heterogeneous arrays from a single interface within a single data center location. VPLEX Local allows increased availability, simplified management, and improved utilization across multiple arrays.

Metro.  VPLEX Metro with AccessAnywhere enables active-active, block level access to data between two sites within synchronous distances.  Host application stability needs to be considered; depending on the application, it is generally recommended that round-trip latency for Metro be 5 ms or less.  The combination of virtual storage with VPLEX Metro and virtual servers allows for the transparent movement of VMs and storage across longer distances and improves utilization across heterogeneous arrays and multiple sites.

Geo.  VPLEX Geo with AccessAnywhere enables active-active, block level access to data between two sites at asynchronous distances.  Geo improves the cost efficiency of resources and power.  It provides the same distributed device flexibility as Metro but extends the distance up to 50 ms of network latency.


What are some advantages of using VPLEX? 

1. Extra Cache and Increased IO.  VPLEX has a large cache (64GB per node) that sits in-between the host and the array. It offers additional read cache that can greatly improve read performance on databases because the additional cache is offloaded from the individual arrays.

2. Enhanced options for DR with RecoverPoint. The DR benefits increase when integrating RecoverPoint with VPLEX Metro or Geo to replicate the data using real time replication. It includes a capacity based journal for very granular rollback capabilities (think of it as a DVR for the data center).  You can also use the native bandwidth reduction features (compression & deduplication) or disable them if you have WAN optimization devices installed like those from Riverbed.  If you want active/active read/write access to data across a large distance, VPLEX is your only option.  NetApp’s V-Series and HDS USPV can’t do it unless they are in the same data center. Here are a few more advantages:

  • DVR-like recovery to any point in time
  • Dynamic synchronous and asynchronous replication
  • Customized recovery point objectives that support any-to-any storage arrays
  • WAN bandwidth reduction of up to 90% of changed data
  • Non-disruptive DR testing

3. Non-disruptive data mobility & reduced maintenance costs. One of the biggest benefits of virtualizing storage is that you’ll never have to take downtime for a migration again. It can take months to migrate production systems, and without virtualization downtime is almost always required. Migration is also expensive: it takes a great deal of resources from multiple groups, along with the cost of keeping the older array on the floor during the process. Overlapping maintenance costs add up too.  By shortening the migration timeframe, hardware maintenance costs will drop, saving money.  Maintenance can be a significant part of the storage TCO, especially if the arrays are older or are going to be used for a longer period of time.  Virtualization can be a great way to reduce those costs and improve the return on assets over time.

4. Flexibility based on application IO.  The ability to move and balance LUN I/O among multiple smaller arrays non-disruptively allows you to balance workloads and increases your ability to respond to performance demands quickly.  Note that underlying LUNs can be aggregated or simply passed through the VPLEX.

5. Simplified management and vendor neutrality.  Implementing VPLEX for all storage related provisioning tasks would reduce complexity when working with multiple vendor arrays.  It allows you to manage multiple heterogeneous arrays from a single interface.  It also makes zoning easier, as all hosts only need to be zoned to the VPLEX rather than to every array on the floor, which makes it faster and easier to provision new storage to a new host.

6. Increased leverage among vendors.  This advantage would be true with any virtualization device.  When controller based storage virtualization is employed, there is more flexibility to pit vendors against each other to get the best hardware, software and maintenance costs.  Older arrays could be commoditized which could allow for increased leverage to negotiate the best rates.

7. Use older arrays for archiving. Data could be seamlessly demoted or promoted to different arrays based on an array’s age, its performance levels and its related maintenance costs.  Older arrays could be retained for capacity and demoted to a lower tier of service, and even with the increased maintenance costs it could still save money.

8. Scale.  You can scale it out and add more nodes for more performance when needed.  With a VPLEX Metro configuration, you could configure VPLEX with up to 16 nodes in the cluster between the two sites.

What are some possible disadvantages of VPLEX?

1. Licensing costs. VPLEX is not cheap.  Also, it can be licensed per frame on VNX but must be licensed per TB on the CX series, so your large, older CX arrays will cost a lot more to license.

2. It’s one more device to manage.   The VPLEX is an appliance, and it’s one more thing (or things) that has to be managed and paid for.

3. Added complexity to infrastructure.  Depending on the configuration, there could be multiple VPLEX appliances at every site, adding considerable complexity to the environment.

4. Managing mixed workloads in virtual environments.  When heavy workloads are all mixed together on the same array there is no way to isolate them, and the ability to migrate a workload non-disruptively to another array is one of the reasons to implement VPLEX.  In practice, however, those VMs may end up being moved to another array with the same storage limitations as the one they came from.  The VPLEX may simply be solving a problem temporarily by moving it to a different location.

5. Lack of advanced features. The VPLEX has no advanced storage features such as snapshots, deduplication, replication, or thin provisioning; it relies on the underlying storage array for those types of features.  As an example, you may want to add block based deduplication to an HDS array by placing a NetApp V-Series in front of it and using NetApp’s dedupe.  That is only possible with a NetApp V-Series or HDS USP-V type device; the VPLEX can’t do it.

6. Write cache performance is not improved.  The VPLEX uses write-through caching, while its competitors’ storage virtualization devices use write-back caching. When there is a write I/O in a VPLEX environment the I/O is cached on the VPLEX, but it is also passed all the way back to the virtualized storage array before an acknowledgement is sent to the host.  The NetApp V-Series and HDS USPV store the I/O in their own cache and immediately return an acknowledgement to the host; the I/Os are then flushed to the back-end storage array using their respective write coalescing and cache flushing algorithms.  Because of that write-back behavior, a performance gain above and beyond the performance of the underlying storage arrays is possible due to the caching on those controllers.  With VPLEX’s write-through design, there is no write I/O performance gain beyond what the existing storage provides.

What is EMC’s CAVA / Common Event Enabler?

I was recently asked to do a bit of research on EMC’s CAVA product, as we are looking for antivirus solutions for our CIFS based shares.  I found very little info with general Google searches about exactly what CAVA is and what it does, so I thought I’d share some of the information that I did find after a bit of research and talking to my local EMC rep.

Basically, CAVA is a service that runs on the Celerra (or VNX) data mover in conjunction with a Windows server running a 3rd party anti-virus engine (along with EMC’s CAVA API agent) to handle the conversation.  It only facilitates the communication to an existing AV server; EMC doesn’t provide the actual AV software.  It supports Symantec, McAfee, eTrust, Sophos, Kaspersky, and Trend Micro.  In a nutshell, CAVA employs three key components:  software on the data mover (the VC client), software on a Windows AV server (CAVA), and your 3rd party AV engine on a Windows server.

CAVA used to stand for “Celerra Anti Virus Agent”, but was changed to “Common AntiVirus Agent”.  Quite convenient that they could re-use the “C” without changing the acronym, right? The product is now officially known as “Common Event Enabler for Windows” by EMC and the package includes CEPA, or the EMC Common Event Publishing Agent, and CAVA, the aforementioned Common Antivirus Agent.  For this post I’m focusing on the Antivirus agent.

CAVA is a fairly straightforward install; however, if implemented incorrectly it can adversely affect your performance. It’s important to know how it scans your files, and essential to know how to troubleshoot it and do performance monitoring.  There is definitely a performance hit when using CAVA.

When are files scanned for a virus? 

Each time the Celerra receives a file, it is locked for read access first, and a request is sent to the AV server (or servers) to scan the file.  The Celerra sends the UNC path name to the Windows server and waits for verification that the file is not infected.  Once that verification is complete, the file is made available for user access.

CAVA will scan a file in the following instances: 

  • The first time a file is read following the initial implementation of CAVA, and again after any update to the virus definitions
  • When a file is created, modified, or moved
  • When a file (or files) is restored from backup
  • When a file is renamed with a different file extension
  • Whenever an administrator performs a full file system scan (with the server_viruschk command)

What are the features of CAVA? 

  • Automatic virus definition updates; files opened after the update will be re-scanned
  • CAVA Calculator (a free sizing tool to assist with implementation)
  • User notifications on virus detection, configurable by administrators to be sent as notifications to the client, event log entries, or both
  • Scan-on-read can be enabled
  • Event reporting and configuration

What are some implementation considerations? 

  • EMC recommends that an MPFS client system not be configured as the AV server system.
  • CAVA doesn’t support a data mover CIFS server using share level access.
  • Always update the viruschecker.conf file to avoid scanning temp files (see the sketch after this list).  It can be modified with the Celerra AV Management snap-in.
  • It’s CIFS only.  There is no support for NFS or FTP; if those protocols are used to open, modify, or move files, the files will not be scanned.
  • You must check for compatibility with your installed 3rd party AV software.
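For reference, a viruschecker.conf typically contains entries along the lines of the sketch below.  This is only an illustration pulled together from my notes, not a verified template; the parameter names and values shown here (the file masks, the exclusion list, and the AV server addresses) are assumptions, so confirm the exact syntax against the CAVA documentation for your DART release before using any of it.

# Illustrative viruschecker.conf excerpt (verify parameter names against EMC documentation for your DART release)
# Scan all file types...
masks=*.*
# ...but keep temp and scratch files out of the scan path
excl=*.tmp:*.temp:~$*
# IP addresses of the Windows servers running the CAVA agent and the 3rd party AV engine
addr=10.0.0.21:10.0.0.22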

How is it licensed, and how much does it cost?

CAVA is licensed per array; on the VNX series it is part of the Security and Compliance Suite.  Pricing will vary of course, but it’s not very expensive relative to the cost of the array.  It should be in the range of thousands rather than tens of thousands of dollars.

 

Making a case for file archiving

We’ve been investigating options for archiving unstructured (file based) data that resides on our Celerra for a while now. There are many options available, but before looking into a specific solution I was asked to generate a report that showed exactly how much of the data has been accessed by users for the last 60 days and for the last 12 months.  As I don’t have permissions to the shared folders from my workstation I started looking into ways to run the report directly from the Celerra control station.  The method I used will also work on VNX File.

After a little bit of digging I discovered that you can access all of the file systems from the control station by navigating to /nas/quota/slot_.  The slot_2 folder would be for the server_2 data mover, slot_3 would be for server_3, etc.  With full access to all the file systems, I simply needed to write a script that scanned each folder and counted the number of files that had been modified within a certain time window.

I always use Excel for scripts I know are going to be long.  I copy the file system list from Unisphere, then put the necessary commands in different columns, and end with a concatenate formula that pulls it all together.  If you put echo -n in A1, “Users_A,” in B1, and >/home/nasadmin/scripts/Users_A.dat in C1, you’d just need to type the formula “=CONCATENATE(A1,B1,C1)” into cell D1.  D1 would then contain echo -n “Users_A,” > /home/nasadmin/scripts/Users_A.dat. It’s a simple and efficient way to make long scripts very quickly.

In this case, the script needed four different sections.  All four of the sections I’m about to go over were copied into a single shell script and saved in my /home/nasadmin/scripts directory.  After creating the .sh file, I always do a chmod +x and chmod 777 on the file.  Be prepared for this to take a very long time to run.  It of course depends on the number of file systems on your array, but for me this script took about 23 hours to complete.

First, I create a text file for each file system that contains the name of the filesystem (and a comma) which is used later to populate the first column of the final csv output.  It’s of course repeated for each file system.

echo -n "Users_A," > /home/nasadmin/scripts/Users_A.dat
echo -n "Users_B," > /home/nasadmin/scripts/Users_B.dat

... <continued for each filesystem>
 Second, I use the ‘find’ command to walk each directory tree and count the number of files that haven’t been touched in more than 60 days (the command uses modification time via -mtime, so it’s really measuring when files were last changed rather than last read).  The output is written to another text file that will be used in the csv output file later.
find /nas/quota/slot_2/Users_A -mtime +60 | wc -l > /home/nasadmin/scripts/Users_A_wc.dat

find /nas/quota/slot_2/Users_B -mtime +60 | wc -l > /home/nasadmin/scripts/Users_B_wc.dat

... <continued for each filesystem>
 Third, I want to count the total number of files in each file system.  A third text file is written with that number, again for the final combined report that’s generated at the end.
find /nas/quota/slot_2/Users_A | wc -l > /home/nasadmin/scripts/Users_A_total.dat

find /nas/quota/slot_2/Users_B | wc -l > /home/nasadmin/scripts/Users_B_total.dat

... <continued for each filesystem>
 Finally, the per-filesystem files are combined into the final report.  (The comma.dat file referenced below is just a one-character file containing a comma, which can be created with echo -n "," > /home/nasadmin/scripts/comma.dat; it serves as the field separator between the two counts.)  The output shows each filesystem with two columns, total files and files last accessed more than 60 days ago.  You can then easily update the report in Excel and add columns that show files accessed in the last 60 days, the percentage of files accessed in the last 60 days, etc., with some simple math.
cat /home/nasadmin/scripts/Users_A.dat /home/nasadmin/scripts/Users_A_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_A_total.dat | tr -d "\n" > /home/nasadmin/scripts/fsoutput.csv; echo "" >> /home/nasadmin/scripts/fsoutput.csv

cat /home/nasadmin/scripts/Users_B.dat /home/nasadmin/scripts/Users_B_wc.dat /home/nasadmin/scripts/comma.dat /home/nasadmin/scripts/Users_B_total.dat | tr -d "\n" >> /home/nasadmin/scripts/fsoutput.csv; echo "" >> /home/nasadmin/scripts/fsoutput.csv

... <continued for each filesystem>
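For what it’s worth, the same report can be generated with a single loop instead of building one line per file system in Excel.  The sketch below is just that, a sketch: it assumes the same /nas/quota/slot_2 mount point and /home/nasadmin/scripts output directory used above, and it reads the file system names from a hypothetical fs_list.txt file (one name per line) that you would create from the Unisphere file system list.

#!/bin/bash
# Sketch: build the same CSV report by looping over a list of file systems.
# Assumes the file systems are visible under /nas/quota/slot_2 on the control
# station and that fs_list.txt contains one file system name per line (e.g. Users_A).

BASE=/nas/quota/slot_2
OUT=/home/nasadmin/scripts/fsoutput.csv

echo "FileSystem,Accessed60PlusDaysAgo,TotalFiles" > $OUT

while read FS; do
    # Files not modified in the last 60 days, then the total file count
    OLD=$(find $BASE/$FS -mtime +60 | wc -l)
    TOTAL=$(find $BASE/$FS | wc -l)
    echo "$FS,$OLD,$TOTAL" >> $OUT
done < /home/nasadmin/scripts/fs_list.txt

Either way, the raw CSV ends up with one row per file system that you can then dress up in Excel.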

My final output looks like this:

File System     Total Files     Accessed 60+ days ago     Accessed in last 60 days     % Accessed in last 60 days
Users_A             827,057                   734,848                       92,209                          11.15
Users_B              61,975                    54,727                        7,248                          11.70
Users_C             150,166                   132,457                       17,709                          11.79
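To spell out the math for one row: Users_A has 827,057 total files, and 734,848 of them were last accessed more than 60 days ago, so 827,057 − 734,848 = 92,209 files were accessed in the last 60 days, which works out to 92,209 / 827,057 ≈ 11.15%.  The last two columns for each filesystem are calculated the same way in Excel.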

The three example filesystems above show that only about 11% of the files have been accessed in the last 60 days.  Most user data has a very short lifecycle: it’s ‘hot’ for a month or less, then activity dramatically tapers off as the business value of the data drops.  These file systems would be prime candidates for archiving.

My final report definitely supported the need for archiving, but we’ve yet to start a project to implement it.  I like the possibility of using EMC’s Cloud Tiering Appliance, which can archive data directly to the cloud service of your choice.  I’ll make another post in the future about archiving solutions once I’ve had more time to research them.