I recently took a new job in the Enterprise storage group at a large company, and as part of my new responsibilities I will be implementing, configuring, and managing our object storage. We will be using EMC's ECS solution; however, I've been researching object storage as a platform in general to learn more about it: its capabilities, its use cases, and its general management. This blog post is a summary of the information I've assimilated about object storage, along with some additional information about the vendors who offer object storage solutions.
At its core, what is object based storage? Object based storage is an architecture that manages data as objects as opposed to other architectures such as file systems that manage data as a file hierarchy, and block storage that manages data as blocks within sectors & tracks. To put it another way, Object Storage is not directly accessed by the operating system and is not seen as a local or remote filesystem by an operating system. Interaction with data occurs only at the application level via an API. While Block Storage and File Storage are designed to be consumed by your operating system, Object Storage is designed to be consumed by your application. What are the implications of this?
- Byte-level interaction with data is no longer possible. Data objects are stored or retrieved in their entirety with a single command. This results in powerful scalability by making object I/O sequential, which performs far better than random I/O.
- It allows for easier application development by providing a higher level of abstraction than traditional storage platforms.
- Interaction can happen through a single API endpoint. This removes complex storage network topologies from the design of the application infrastructure, and dramatically reduces security vulnerabilities as the only available access is the HTTP/HTTPS API and the service providing the API functionality.
- Filesystem level utilities cannot interact directly with Object Storage.
- Object Storage presents one giant volume, eliminating almost all of the storage management overhead of Block and File Storage.
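To make the access model concrete, here is a minimal in-memory sketch (plain Python, with made-up names, not any vendor's actual API) of the put/get-whole-object interaction an application has with an object store, in contrast to seeking and writing bytes within files:

```python
import uuid

class ObjectStore:
    """Toy object store: a flat namespace mapping unique IDs to whole objects."""

    def __init__(self):
        self._objects = {}  # flat address space: no directories, no hierarchy

    def put(self, data: bytes) -> str:
        """Store an entire object in one operation and return its unique ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Retrieve the entire object; there is no byte-level seek or partial write."""
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"report contents")
assert store.get(oid) == b"report contents"
```

In a real system the ID would be returned by (or passed to) an HTTP PUT, and `get` would be an HTTP GET against the API endpoint, but the whole-object semantics are the same.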
Object storage is designed to be more scalable than traditional block and file storage and it specifically targets unstructured content. In order to achieve improved scalability over traditional file storage it bundles the data with additional metadata tags and a unique identifier. The metadata is completely customizable, allowing an administrator to input much more identifying information for each data object. The objects are also stored in a flat address space which makes it much easier to locate and retrieve data across geographic regions.
Object storage began as a niche technology for enterprises, however it quickly became one of the basic general underlying technologies of cloud storage. Object storage has become a valid alternative to file based storage due to rapid data growth and the proliferation of big data in the enterprise, an increased demand for private and hybrid cloud storage and a growing need for customizable and scalable storage infrastructures. The number of object storage products has been rapidly expanding from both major storage vendors and startup companies in recent years to accommodate the increasing demand. Because many vendors offer Object Storage platforms that run on commodity hardware, even with data protection overhead the price point is typically very attractive when compared to traditional storage.
WHAT’S THE DIFFERENCE BETWEEN OBJECT AND FILE STORAGE?
Object storage technology has been finding its way into file-based use cases at many companies, and in some cases object storage vendors are positioning their products as viable NAS alternatives. To address the inherent limitations of traditional file and block level storage in reliably and cost-effectively supporting huge amounts of data, object storage focuses on scalability, resiliency, security and manageability. So, what's the difference between the two? In general, the difference lies in performance, geographic distribution, scalability, and analytics.
OBJECT STORAGE IS HIGHLY SCALABLE
Scalability is a major issue in storage, and it's only becoming more pressing as time goes on. If you need to scale into the petabytes and beyond, you may need to scale an order of magnitude beyond what a traditional single storage system is capable of. As traditional storage systems aren't going to scale to that magnitude, a different type of storage needs to be considered that can still be cost effective, and object storage fills that need very well.
Object storage overcomes many of the scalability limitations that file storage faces. I really liked the warehouse example that Cloudian used on their website, so I’m going to summarize that. If you think of file storage as a large warehouse, when you first store a box of files your warehouse looks almost empty and your available space looks infinite. As your storage needs expand that warehouse fills up before you know it, and being in the big city there’s no room to build another warehouse next to it. In this case, think of object storage as a warehouse without a roof. New boxes of files can be added almost indefinitely.
While that warehouse with the infinite amount of space sounds good in theory, you may have some trouble finding a specific box as the warehouse expands toward infinity. Object storage addresses that limitation with customizable metadata. While a file storage system may only record the date, owner, location, and size of the box, an object storage system lets you customize the metadata, and object metadata lives directly with the object rather than in a separate inode. This is useful because the amount of metadata that is desirable in a storage platform holding tens or hundreds of petabytes is generally an order of magnitude greater than what conventional storage is designed to handle at scale. Getting back to the warehouse example: along with the date, owner, location, and size, the object metadata could include the exact coordinates of the box, a detailed log of changes and access, and a list of the contents of the box. Object storage systems replace the limited and rigid file system attributes of file level storage with highly customizable metadata that captures common object characteristics and can also hold application-specific information. Because object storage uses a flat namespace, performance may suffer as your data warehouse explodes in size, but you're not going to have to worry about finding what you need when you need it.
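The metadata idea can be sketched in a few lines: each object carries an arbitrary key/value metadata dictionary, and the store can be searched by any combination of tags across the flat namespace. This is a toy illustration, not any vendor's actual API:

```python
class MetadataStore:
    """Toy sketch: objects carry arbitrary key/value metadata,
    and that metadata is searchable across the flat namespace."""

    def __init__(self):
        self._objects = {}  # object_id -> (data, metadata dict)

    def put(self, object_id, data, **metadata):
        self._objects[object_id] = (data, metadata)

    def find(self, **criteria):
        """Return IDs of every object whose metadata matches all criteria."""
        return [oid for oid, (_, md) in self._objects.items()
                if all(md.get(k) == v for k, v in criteria.items())]

store = MetadataStore()
# The warehouse example: each "box" records far more than a filesystem would
store.put("box-1", b"...", owner="finance", contents="invoices", row="A", shelf=3)
store.put("box-2", b"...", owner="legal", contents="contracts", row="A", shelf=4)
print(store.find(owner="finance"))  # ['box-1']
```

A real object store indexes these tags at massive scale, but the principle is the same: the metadata travels with the object and is the primary way you find things.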
In addition, object storage systems replace the locking mechanisms that file level storage uses to prevent multiple concurrent updates with object versioning, which enables rollback features and the undeleting of objects, as well as the ability to access prior versions of objects.
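The versioning model behind those features can be sketched simply: each write appends a new immutable version instead of locking and overwriting in place, and a delete is just a marker that can be removed. Again, a toy sketch under those assumptions, not a real product's implementation:

```python
class VersionedStore:
    """Toy sketch: puts append immutable versions rather than overwriting,
    so prior versions stay readable and deletes are reversible markers."""

    def __init__(self):
        self._versions = {}  # object_id -> list of versions (None = delete marker)

    def put(self, object_id, data):
        self._versions.setdefault(object_id, []).append(data)

    def get(self, object_id, version=-1):
        """Fetch the latest version by default, or any prior version by index."""
        return self._versions[object_id][version]

    def delete(self, object_id):
        self.put(object_id, None)  # append a delete marker, destroy nothing

    def undelete(self, object_id):
        if self._versions[object_id][-1] is None:
            self._versions[object_id].pop()  # removing the marker restores the object

store = VersionedStore()
store.put("doc", b"v1")
store.put("doc", b"v2")
assert store.get("doc") == b"v2"     # latest version
assert store.get("doc", 0) == b"v1"  # rollback / prior version access
store.delete("doc")
store.undelete("doc")
assert store.get("doc") == b"v2"     # object restored
```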
Object vs. File. Here’s a brief overview of the main differences.
- Performance. Object storage performs best for big data and high storage throughput; file storage performs better for smaller files. Scality's RING offers high performance configurations for applications such as email and databases; however, traditional storage in general still offers better performance for those use cases.
- Geography. Object storage data can be stored and shared across multiple geographic regions; file storage data is typically shared locally. File data spread throughout geographic regions is typically read-only replicated copies of data.
- Scalability. Object storage offers almost infinite scalability; file storage does not scale nearly as well once you get into millions of files in a volume and petabytes and beyond of total capacity.
- Analytics. Object storage offers customizable metadata in a flat namespace and is not limited in the number of metadata tags; file storage is limited in that respect.
OBJECT STORAGE AND RESILIENCY
So, we now understand that object storage offers much greater scalability than traditional file storage, but what about resiliency? Traditional file storage systems are hampered by their inherent limitations in supporting massive capacity, most importantly with a sufficient amount of data protection. As any backup administrator knows, it's unrealistic to try to back up hundreds of petabytes (or more) of data in any reasonable amount of time. Object storage systems directly address that issue: they are designed not to need backups. Rather than relying on traditional backups, an object storage infrastructure is designed to store data with sufficient redundancy that data is never lost even when multiple components of the infrastructure have failed.
How is this achieved? Primarily by keeping multiple replicas of objects, ideally across a wide geographic area. Because of the additional storage that replication requires, object storage systems implement an efficient erasure coding data protection method to supplement data replication. What is erasure coding? It uses an algorithm to create additional information that allows data to be recreated from a subset of the original data, similar to RAID protection's parity bits. The degree of resiliency is generally configurable on all object storage systems; of course, the higher the level of resiliency the administrator chooses, the larger the storage capacity requirement. Erasure coding saves capacity but impacts performance, especially if erasure coding is performed across geographically dispersed nodes. Different vendors handle the performance balance between erasure coding and replication differently. Geographic erasure coding is generally supported; however, using erasure coding only locally and replicating data geographically with data reduction seems to strike a good balance between performance and resiliency.
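The parity analogy can be shown with the simplest possible erasure code: split an object into k data fragments plus one XOR parity fragment, and any single lost fragment can be rebuilt from the survivors. Real object stores use Reed-Solomon-style codes that tolerate multiple simultaneous losses; this single-parity sketch only illustrates the principle:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_fragments(data: bytes, k: int):
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)  # ceiling division
    frags = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = bytes(size)       # start from all zeros
    for f in frags:
        parity = xor(parity, f)
    return frags + [parity]

def rebuild(frags, lost):
    """Recreate the fragment at index `lost` by XOR-ing every surviving fragment."""
    size = len(next(f for f in frags if f is not None))
    out = bytes(size)
    for i, f in enumerate(frags):
        if i != lost:
            out = xor(out, f)
    return out

frags = make_fragments(b"object payload!!", k=4)  # 4 data fragments + 1 parity
saved = frags[1]
frags[1] = None                                   # simulate a failed disk or node
assert rebuild(frags, 1) == saved                 # recovered from the survivors
```

With a k+m Reed-Solomon scheme the capacity overhead is m/k (e.g. 12+4 costs 33% extra) versus 100% or more for full replicas, which is exactly the trade-off described above.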
OBJECT STORAGE EASES MANAGEMENT
Object storage systems are designed to minimize storage administration through automation, policy engines, and self-correcting capabilities. They are designed for zero downtime, and across all vendors administration tasks can be performed without service disruption. This includes adding capacity, hardware maintenance and upgrades, and even migrating to a different data center. The object storage policy engines enable the automation of object storage features such as when to change the number of replicas to address spikes in usage, when to use replication vs. erasure coding, and in which data centers to store objects, based on the relevant metadata.
OBJECT STORAGE AND APPLICATION ACCESSIBILITY
As you might expect, each object storage vendor has implemented its own proprietary REST API to expose the various storage functions. Virtually all object storage products also support the industry standard Amazon S3 API, which enjoys perhaps the largest level of application support. That's not surprising, as Amazon S3 has extensive capabilities and supports complex storage operations. Be aware that some object storage vendors only support a subset of the S3 API, and understanding the S3 API implementation's limitations is absolutely key to ensuring the widest level of application support in your environment.
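One practical wrinkle of those implementations: the S3 API addresses objects either with the bucket in the hostname (virtual-hosted style) or with the bucket in the path (path style), and on-premises S3-compatible stores frequently support only the latter. A small sketch of the two URL forms (the endpoint name here is made up for illustration):

```python
def object_url(endpoint: str, bucket: str, key: str, path_style: bool = True) -> str:
    """Build the S3 request URL for an object.

    Virtual-hosted style puts the bucket in the hostname; path style puts it
    in the URL path. Many on-prem S3-compatible stores only support path style.
    """
    if path_style:
        return f"https://{endpoint}/{bucket}/{key}"
    return f"https://{bucket}.{endpoint}/{key}"

# "ecs.example.com" is a hypothetical on-prem endpoint
print(object_url("ecs.example.com", "backups", "2016/db.dump"))
# https://ecs.example.com/backups/2016/db.dump
print(object_url("ecs.example.com", "backups", "2016/db.dump", path_style=False))
# https://backups.ecs.example.com/2016/db.dump
```

Most S3 client libraries let you choose the addressing style and override the endpoint, which is how applications written for Amazon S3 get pointed at a private object store.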
In addition to S3, most object storage vendors also support the OpenStack Swift API. File system protocol support is common in object storage systems, but implementations of course vary by product. As I mentioned earlier, the company I work for went with ECS. EMC ECS has geo-distributed active/active NFS support, a key feature, and it offers Hadoop HDFS interfaces that allow Hadoop to directly access data in an object store. With the ECS system's consistency support it's a very strong geo-distributed NAS product. Different vendors of course have different strengths and weaknesses. A strong competitor of EMC, Scality, claims that it has EMC Isilon-level NAS performance (although I haven't tested that), and the NetApp StorageGRID Webscale now offers protocol duality by having a one-to-one relationship between objects and files. Other object storage products provide different unique features as well. Some offer file system support through their own or third-party cloud storage gateways, and some provide S3-compliant connectors that allow Hadoop to use object storage as an alternative to HDFS.
OBJECT STORAGE AND DATA ENCRYPTION
Public cloud storage is a very common use case for object storage, and encryption is obviously a must for public cloud storage. Most object storage products support both at-rest and in-transit encryption, and most use an at-rest encryption approach where encryption keys are generated dynamically without a need for a separate key management system. Some vendors (such as Cloudian and Amplidata) support client-managed encryption keys in addition to server-side managed encryption keys which gives cloud providers an option to allow their customers to manage their own keys. LDAP and Active Directory authentication support of users accessing the object store are also commonly supported in current object storage systems. If support of AWS v2 or v4 authentication is needed to provide access to vaults and vault objects, do your research as support is less common.
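On the authentication point: AWS Signature Version 4 derives a per-day, per-region, per-service signing key from the secret access key through a chain of HMAC-SHA256 operations, and an S3-compatible store must reproduce this scheme to support v4 clients. A sketch of the published key-derivation chain (the credentials below are obviously made up):

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS Signature Version 4 signing key via chained HMAC-SHA256."""
    def sign(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)  # date as yyyymmdd
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

key = sigv4_signing_key("not-a-real-secret", "20160101", "us-east-1", "s3")
assert len(key) == 32  # a SHA-256 digest, used to sign the request's string-to-sign
```

Because the derived key changes with the date, region, and service, a v4-capable object store has to implement this full chain; v2 by contrast signs with the raw secret key directly, which is one reason v4 support lags on some platforms.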
THE BEST USE CASES FOR OBJECT STORAGE
The ability of object storage to scale, combined with its accessibility via APIs, makes it suitable for use cases where traditional storage systems just can't compete, even in the NAS arena. So, now that you know what object storage is, what can it be used for, and how can you take advantage of the improved scalability and accessibility? While object storage is typically not well suited for relational databases or any data that requires a large amount of random I/O, it has many possible use cases that I've outlined below. First, a quick summary of the benefits and tradeoffs:
- Highly scalable capacity and performance
- Low cost on commodity hardware at petabyte scale
- Simplified management
- Single access point/namespace for data
- No random access to files
- Lower performance on a per-object basis compared to traditional storage
- Integration may require modification of application and workflow logic
- POSIX utilities will not work directly with object storage
With those tradeoffs in mind, here are the use cases where object storage excels:
- Logging. It is often used to capture large amounts of log data generated by devices and applications which are ingested into the object store via a message broker.
- NAS. Many companies are considering object storage as a NAS alternative, most notably if there is another use case that requires an object storage system and the two use cases can be combined.
- Big Data. Several object storage products offer certified S3 Hadoop Distributed File System interfaces that allow Hadoop to directly access data on the object store.
- Content distribution network. Many companies use an object storage implementation to globally distribute content (like media files) using policies to govern access, along with features like automatic object deletion based on expiration dates.
- Backup and Archive of structured and semi-structured data. Because object storage systems are cost-effective, many companies are looking to them as a highly scalable backup and archive solution.
- Content Repositories. Object storage is often used as a content repository for images, videos and general media content accessed directly through applications or through file system protocols.
- Enterprise Collaboration. Because of the scale and resiliency of object storage across large geographic regions, distributed object storage systems are often used as collaboration platforms in large enterprises where content is accessed and shared around the globe.
- Storage as a Service (STaaS). Object storage is often used for the private and public clouds of enterprises and internet service providers.
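The expiration-based deletion mentioned under content distribution is typically driven by the policy engine evaluating object metadata. A toy sketch of such a rule (the names are illustrative, not any vendor's actual policy syntax):

```python
from datetime import datetime, timedelta

def expired_objects(objects, policy_days, now=None):
    """Return the IDs of objects whose age exceeds the policy's expiration window.

    `objects` maps object_id -> metadata dict containing an 'uploaded' timestamp.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=policy_days)
    return [oid for oid, meta in objects.items() if meta["uploaded"] < cutoff]

catalog = {
    "promo-video.mp4": {"uploaded": datetime(2016, 1, 1)},
    "press-kit.zip":   {"uploaded": datetime(2016, 5, 1)},
}
# With a 90-day expiration policy evaluated on 2016-06-01, only the January
# object falls past the cutoff and would be deleted by the policy engine.
print(expired_objects(catalog, 90, now=datetime(2016, 6, 1)))  # ['promo-video.mp4']
```

Real policy engines apply the same idea continuously and at scale, using the object metadata to decide expiration, replica counts, and placement.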
OBJECT STORAGE VENDORS AND PRODUCTS
There are numerous object storage vendors in the market today. You can purchase a detailed vendor comparison of all the object storage vendors from the Evaluator Group (Evaluator Group Object Storage Comparison Matrix), and view the Gartner object storage comparison matrix for more detailed information. According to Gartner, DellEMC, Scality, and IBM are the current market leaders, with Hitachi as the strongest challenger.
Gartner Market Leaders for Object Storage:
EMC: ECS is delivered as a turnkey integrated appliance or as a software package to be installed and run on commodity hardware. It features highly efficient strong consistency on access of geo-distributed objects, and is designed from the ground up with geo-distribution in mind.
Scality: It is provided as software only to run on commodity hardware, it stores metadata in a custom-developed distributed database, and Scality claims EMC Isilon-level performance when it's used as NAS.
IBM: It is provided as software only to run on certified hardware, it is a multi-tiered architecture with no centralized servers, and it offers extreme scalability enabled by peer-to-peer communication of storage nodes.
Other Object Storage Vendors:
Hitachi (HDS): It is delivered as a turnkey integrated appliance, as a software package to run on commodity hardware, or as a managed service hosted by HDS. It offers extreme density, with a single cluster able to support up to 800 million objects and 497 PB of addressable capacity, and an integrated portfolio: HCP cloud storage, HCP Anywhere File Sync & Share, and Hitachi Data Ingestor (HDI) for remote and branch offices.
It is delivered as software appliance or as a turnkey integrated appliance, it stores metadata (including the physical location of objects) in a distributed NoSQL Cassandra database.
It is delivered as a turnkey integrated appliance or as a software package to run on commodity hardware; it can be configured as small as one node to start and is able to scale to hundreds of petabytes.
It is delivered as a software package to run on commodity hardware. It offers out-of-the-box integration with Elasticsearch for fast object searching.
It is delivered as a software package to run on commodity hardware and is based on the open-source Reliable Autonomic Distributed Object Store (RADOS), the foundation of Ceph. It features strong consistency on writes.
It is delivered as a turnkey integrated appliance or as a software package to run on commodity hardware. It stores metadata with the objects themselves, but also in a distributed NoSQL Cassandra database for speed.
It is delivered as a turnkey rack-level system and uses HGST Helium filled hard drives for power efficiency, reliability and capacity.
It is delivered as a software package to run on commodity hardware and is an enterprise offering of OpenStack Swift, adding cluster and management tools and 24×7 support.