Analytics for Object Storage Simplified - Unified File and Object for Hadoop
Sandeep R Patil, STSM, Master Inventor, IBM Spectrum Scale
Smita Raut, Object Development Lead, IBM Spectrum Scale
Acknowledgement: Bill Owen, Tomer Perry, Dean Hildebrand, Piyush Chaudhary, Yong Zeng, Wei Gong, Theodore Hoover Jr, Muthuannamalai Muthiah.
Agenda
• Part 1: Need as well as Design Points for Unified File and Object
  ➢ Introduction to Object Storage
  ➢ Unified File & Object Access
  ➢ Use Cases Enabled by Unified File and Object (UFO)
• Part 2: Analytics with Unified File and Object
  ➢ Big Data and Challenges
  ➢ Design Points, Approach and Solution
Part 1: Need as well as Design Points for Unified File and Object
➢ Object Storage Introduction
Introduction to Object Store
• Object storage is highly available, distributed, eventually consistent storage.
• Data is stored as individual objects, each with a unique identifier.
• Flat addressing scheme that allows for greater scalability.
• Has simpler data management and access.
• REST-based data access.
• Simple atomic operations: PUT, POST, GET, DELETE.
• Usually software based, running on commodity hardware.
• Capable of scaling to 100s of petabytes.
• Uses replication and/or erasure coding for availability instead of RAID.
• Access through a RESTful API over HTTP, which is a great fit for cloud and mobile applications.
  – Amazon S3, Swift, CDMI API
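The flat addressing scheme and the whole-object REST operations listed above can be sketched in a few lines. This is a minimal illustration, not a client implementation: the endpoint, account name and token are hypothetical, the URL layout follows the OpenStack Swift convention of `/v1/{account}/{container}/{object}`, and the requests are only constructed, never sent.

```python
from typing import Optional
from urllib.request import Request

# Hypothetical endpoint, account and token, used only for illustration.
ENDPOINT = "https://objectstore.example.com/v1"
ACCOUNT = "AUTH_demo"
TOKEN = "example-token"


def object_url(container: str, name: str) -> str:
    """Flat address of an object: account, then container, then object name."""
    return f"{ENDPOINT}/{ACCOUNT}/{container}/{name}"


def build_request(method: str, container: str, name: str,
                  body: Optional[bytes] = None) -> Request:
    """Prepare (but do not send) one whole-object REST operation."""
    return Request(
        object_url(container, name),
        data=body,
        method=method,
        headers={"X-Auth-Token": TOKEN},
    )


# The "/" inside the object name is part of the name, not a directory level.
put = build_request("PUT", "videos", "raw/clip-001.mp4", b"...bytes...")
get = build_request("GET", "videos", "raw/clip-001.mp4")
delete = build_request("DELETE", "videos", "raw/clip-001.mp4")

for req in (put, get, delete):
    print(req.get_method(), req.full_url)
```

Each operation acts on a complete object addressed by a single URL, which is what keeps the interface simple compared with byte-range file semantics.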
Object Storage Enables The Next Generation of Data Management
• Simple APIs/Semantics (Swift/S3, Versioning, Whole File Updates)
• Scalable Metadata Access
• Ubiquitous Access
• Multi-Tenancy
• Simpler management and flatter namespace
• Scalable and Highly-Available
• Multi-Site Cloud Storage
• Cost Savings
But Does It Create Yet Another Storage Island in Your Data Center?
➢ Unified File and Object Access
What is Unified File and Object Access?
• Accessing objects using file interfaces (SMB/NFS/POSIX) and accessing files using object interfaces (REST) helps legacy applications designed for file to seamlessly start integrating into the object world.
• It allows object data to be accessed using applications designed to process files. It allows file data to be published as objects.
• Multi-protocol access for file and object in the same namespace (with common user ID management capability) allows supporting and hosting data oceans of different types of data with multiple access options.
• Optimizes various use cases and solution architectures, resulting in better efficiency as well as cost savings.
Diagram: NFS/SMB/POSIX and Object (HTTP) clients both reach a <Container>, served by Swift (with Swift on File) on top of a <Clustered file system>. Four flows are shown: data ingested as objects, objects accessed as files, data ingested as files, files accessed as objects. File exports are created at container level, or POSIX access is given from container level.
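One way to picture the two directions of access is as a path mapping between an object's address and the file that backs it. The sketch below assumes a Swift-on-File style layout in which each object is a regular file under an account/container directory; the mount point and helper names are hypothetical, and a real deployment may insert additional directory levels.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical mount point of the file system fileset that backs the object
# store; the real location depends on the deployment.
OBJECT_ROOT = PurePosixPath("/mnt/clusterfs/object_fileset")


def object_to_file_path(account: str, container: str, obj: str) -> PurePosixPath:
    """File path at which an object PUT would be visible to file clients."""
    return OBJECT_ROOT / account / container / obj


def file_path_to_object(path: PurePosixPath) -> Tuple[str, str, str]:
    """Inverse mapping: which (account, container, object) a file represents."""
    rel = path.relative_to(OBJECT_ROOT)
    account, container, *rest = rel.parts
    if not rest:
        raise ValueError("path does not point inside a container")
    return account, container, "/".join(rest)


path = object_to_file_path("AUTH_demo", "videos", "raw/clip-001.mp4")
print(path)
print(file_path_to_object(path))
```

Because the container maps to a directory, an NFS or SMB export created at container level exposes exactly the objects of that container as files.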
Flexible Identity Management Modes
• Two identity management modes
• Administrators can choose based on their need and use case
Local_Mode
• Suitable when auth schemes for file and object are different and unified access is for applications.
• An object created from the object interface will be owned by the internal "swift" user.
• An application processing the object data from the file interface will need the required file ACL to access the data.
• Object authentication setup is independent of file authentication setup.
Unified_Mode
• Suitable for unified file and object access for end users. Leverages common ILM policies for file and object data based on data ownership.
• An object created from the object interface should be owned by the user doing the object PUT (i.e. the file will be owned by the UID/GID of the user).
• The owner of the object will own and have access to the data from the file interface.
• Users from object and file are expected to use common auth and come from the same directory service (only AD+RFC 2307 or LDAP).
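The practical difference between the two modes is who owns the backing file after an object PUT. A minimal sketch of that rule follows; the mode strings, the `Owner` type and the numeric IDs are illustrative assumptions, not product configuration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    name: str
    uid: int
    gid: int


# Hypothetical IDs: the internal "swift" service user and one directory user.
SWIFT_USER = Owner("swift", 160, 160)
EXAMPLE_USER = Owner("riya", 1001, 2000)


def file_owner_after_put(mode: str, requester: Owner) -> Owner:
    """Owner of the backing file after an object PUT, per identity mode."""
    if mode == "local_mode":
        # File-side applications then need a suitable file ACL to read the data.
        return SWIFT_USER
    if mode == "unified_mode":
        # The file carries the UID/GID of the user who did the PUT.
        return requester
    raise ValueError(f"unknown identity management mode: {mode}")


print(file_owner_after_put("local_mode", EXAMPLE_USER))
print(file_owner_after_put("unified_mode", EXAMPLE_USER))
```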
➢ Use Cases Enabled by Unified File and Object
Use case 1: Process Object Data with File-Oriented Applications and Publish Outcomes as Objects
Setting: Media House OpenStack Cloud Platform (Tenant = Media House Subsidiaries), with one VM farm for video processing per subsidiary (Subsidiary 1, Subsidiary 2), each running virtual machine instances.
• Ingest: media objects are ingested into containers (Container 1, Container 1', Container 2) through the object interface.
• Object to File access: raw media content is sent for media processing, which happens over files. Swift on File backs the containers, and each container has an NFS export (on Container 1, Container 1', Container 2).
• Manila shares (NFS) are exported per subsidiary: one set only for Subsidiary 1, another only for Subsidiary 2.
• File to Object access: files are converted into objects for publishing.
• Publish: final processed videos are available as objects in the container which is used for external publishing, and the final video (as objects) is available for streaming through the publishing channels.
Use case 2: Users Read/Write Data via File and Object with Common User Authentication and Identity
• Users (for example Riya and John) access common data using the same user credentials across all protocols: Object, NFS and SMB.
• Credentials come from the corporate user directory (Active Directory/LDAP). Example entry: User: Riya, UID: 1001, GID: 2000, Domain: XYZ.
• Riya's data read/written from Object should be owned by Riya when accessed from File (SMB/NFS/POSIX).
• All data resides on the clustered file system.
That completes Part 1: Need as well as Design Points for Unified File and Object. Let us now dive into Part 2: Analytics with Unified File and Object.
➢ Big Data and Challenges
Big Data
§ Big data is a term for data sets that are so large or complex that traditional data processing applications (such as database management tools) are inadequate.
§ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Characteristics
§ Volume: the quantity of generated and stored data.
§ Variety: the type and nature of the data.
§ Velocity: the speed at which the data is generated and processed.
§ Variability: inconsistency of data sets can hamper the processes that handle and manage them.
§ Veracity: the quality of captured data can vary greatly, affecting accuracy.
Challenges with the Early Big Data Storage Models
Workflow: Ingest data at various end points → Move data to the analytics engine → Perform analytics → Repeat!
• It's not just one type of analytics.
• Key business processes now depend on the analytics.
• More data sources than ever before: not just data you own, but public or rented data.
• It takes hours or days to move the data!
• Can't just throw away data, due to regulations or business requirements.
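The "hours or days" claim is easy to check with back-of-the-envelope arithmetic. The sketch below uses assumed figures (100 TB and 1 PB of data, a 10 Gbit/s link, decimal units) that are not from the slides; only the formula matters.

```python
def transfer_hours(data_terabytes: float, link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move a data set over a network link.

    efficiency is the assumed fraction of the nominal link speed that the
    copy actually achieves (1.0 means the full line rate).
    """
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# Assumed scenario: copy data to a separate analytics cluster at 10 Gbit/s.
hours_100tb = transfer_hours(100, 10)
hours_1pb = transfer_hours(1000, 10)
print(f"100 TB: {hours_100tb:.1f} h, 1 PB: {hours_1pb / 24:.1f} days")
```

Even at full line rate the copy of 100 TB takes close to a day, and the data being analyzed is already that much older than the source by the time the job starts.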
➢ Design Points, Approach & Solution
What are the Solution Design Points that we came across?
1. Single namespace to house all your data (files and objects)
2. Unified data access with File & Object
3. Encryption for protection of your data
4. Optimize economics based on the value of the data
5. Geographically dispersed management of data, including disaster recovery
6. Bring analytics to the data
Diagram: traditional applications, new-gen applications and a compute farm share a common namespace, powered by Flash, Disk, Tape, Shared Nothing Cluster and Off Premise storage.
How Did We Approach the Solution and Address the Design Points? Took the Data Ocean Approach
• Meeting Design Point 1: global namespace powered by a clustered file system.
• Meeting Design Point 2: Unified File and Object, as explained previously.
• Meeting Design Point 3: transparent encryption.
• Meeting Design Point 4: automated data placement and data migration across Flash, Disk, Tape, Shared Nothing Cluster, Spectrum Scale RAID (JBOD/JBOF) and Transparent Cloud Tier.
• Meeting Design Point 5: worldwide data distribution across sites (Site A, Site B, Site C) and a DR site.
• Meeting Design Point 6: analytics interfaces (HDFS, Spark) on the same namespace.
Diagram: users and applications (client workstations, traditional applications, new-gen applications, compute farm) reach the namespace through File (POSIX, NFS, SMB), Block (iSCSI), Analytics (HDFS, Spark), OpenStack (Cinder, Manila, Glance) and Object (Swift, S3) interfaces. 4000+ customers.
Meeting Design Point 6: Bring Analytics to Data
Apache Hadoop: Key Platform for Big Data and Analytics
§ An open-source software framework and the most popular big data and analytics (BD&A) platform
§ Designed for distributed storage and processing of very large data sets on computer clusters built from commodity hardware
§ The core of Hadoop consists of:
  o A processing part called MapReduce
  o A storage part, known as Hadoop Distributed File System (HDFS)
  o Hadoop common libraries and components
§ Leading Hadoop distributions: Hortonworks, Cloudera, MapR, IBM IOP/BigInsights
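For readers new to the MapReduce processing model, the classic word-count example can be sketched in plain Python. This runs in a single process and does not use any Hadoop API; the function names are illustrative, and a real cluster would run the map and reduce phases in parallel on the nodes holding the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data", "Big analytics"])
print(counts)
```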
Meeting Design Point 6: Bring Analytics to Data
HDFS Shortcomings
§ HDFS is a shared-nothing architecture, which is very inefficient for high-throughput jobs (disks and cores grow in the same ratio)
§ Costly data protection: uses 3-way replication; limited RAID/erasure coding
§ Works only with Hadoop, i.e. weak support for File or Object protocols
§ Clients have to copy data from enterprise storage to HDFS in order to run Hadoop jobs, which can result in running on stale data
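The cost of 3-way replication can be made concrete by comparing the raw capacity it needs against an erasure-coded layout. The 8+2 erasure-coding scheme and the 100 TB data set below are assumptions chosen for illustration, not figures from the slides.

```python
def raw_capacity_needed(usable_tb: float, data_units: int,
                        redundancy_units: int) -> float:
    """Raw capacity required to protect usable_tb of data.

    Every data_units units of data are stored together with
    redundancy_units extra units (copies or parity fragments).
    """
    return usable_tb * (data_units + redundancy_units) / data_units


usable = 100.0  # TB of actual data (assumed)

# 3-way replication: each unit of data plus two extra copies.
replicated = raw_capacity_needed(usable, data_units=1, redundancy_units=2)

# Assumed 8+2 erasure coding: 8 data fragments plus 2 parity fragments.
erasure_coded = raw_capacity_needed(usable, data_units=8, redundancy_units=2)

print(f"3-way replication: {replicated:.0f} TB raw")
print(f"8+2 erasure coding: {erasure_coded:.0f} TB raw")
```

Under these assumptions replication needs 300 TB of raw disk for 100 TB of data, while the erasure-coded layout needs 125 TB and still tolerates two failures.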
Meeting Design Point 6: How to Bring Analytics to Data?
Desired Solution: in-place analytics (no copies required). The clustered file system should support HDFS connectors.