Alluxio: Open Source Data Orchestration for Analytics and AI in the Cloud Haoyuan (H.Y.) Li | Founder, Chairman & CTO | haoyuan@alluxio.com 2019-11-18 @ PDSW 2019
The Alluxio Story Originated as the Tachyon project at UC Berkeley's AMPLab by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li. 2013 Open Source project established & company to commercialize Alluxio founded 2015 Goal: Orchestrate Data at Memory Speed for the Cloud for data-driven apps such as Big Data Analytics, ML and AI. 2018 2018 2019
Early Days: Contributors Growth
[Chart: number of contributors per release]
v0.1 (Dec '12): 1
v0.2 (Apr '13): 3
v0.3 (Oct '13): 15
v0.4 (Feb '14): 30
v0.5 (Jul '14): 46
v0.6 (Mar '15): 70
v0.7 (Jul '15): 100+
Open Source
Started from UC Berkeley AMPLab
1000+ contributors & growing
Apache 2.0 Licensed
4000+ Git Stars
GitHub's Top 100 Most Valuable Repositories, out of 96 million
Join the conversation on Slack: slackin.alluxio.io
Companies Running Alluxio (Learn More) Financial Services Retail & Entertainment Data & Analytics Services Technology Consumer Telco & Media Travel & Transportation
4 big trends driving the need for a new architecture
1. Separation of Compute & Storage
2. Hybrid and multi-cloud environments
3. Rise of the object store
4. Self-service data across the enterprise
Data Ecosystem - Beta vs. Data Ecosystem 1.0
[Diagrams: each ecosystem drawn as a COMPUTE layer and a STORAGE layer]
Data Ecosystem 1.0: The Challenges
[Diagram: COMPUTE and STORAGE layers]
Complex
Low performance
Expensive
Data stack journey and innovation paths
Journey:
§ Co-located compute & HDFS on the same cluster: typically compute-bound clusters over 100% capacity; compute & I/O need to be scaled together even when not needed
§ Disaggregated compute & HDFS: compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
Innovation paths:
§ Support more frameworks: support Presto, Spark across DCs without app changes
§ HDFS for Hybrid Cloud: burst HDFS data in the cloud, public or private
§ Transition to Object store: enable & accelerate big data on object stores
[Diagram labels: MR / Hive, Hive, HDFS]
Independent scaling of compute & storage
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Alluxio: Data Orchestration for the Cloud
Under-store drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
APIs to Interact with data in Alluxio
Applications have great flexibility to read / write data with many options
Spark: > rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto: CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/')
POSIX: $ cat /mnt/alluxio/myInput
Java: FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
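The Spark and Presto examples address Alluxio with an alluxio://host:port/path URI, while the POSIX and Java examples use a bare namespace path. As a minimal sketch using only Python's standard library, the URI from the Spark example can be split into the master address and the namespace path. The helper name `split_alluxio_uri` is hypothetical and is not part of any Alluxio client.

```python
from urllib.parse import urlparse


def split_alluxio_uri(uri):
    """Split an alluxio:// URI into (host, port, path).

    Hypothetical helper for illustration; not an Alluxio API.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"not an alluxio URI: {uri}")
    return parsed.hostname, parsed.port, parsed.path


# URI taken from the Spark example on the slide.
host, port, path = split_alluxio_uri("alluxio://localhost:19998/myInput")
print(host, port, path)
```

The path component (/myInput) is the same name the POSIX and Java examples use, which is the point of the slide: one namespace, several access APIs.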
Challenges with supporting more frameworks across data centers
Use case: Support more frameworks across data centers (on-premise satellite compute clusters)
§ Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads
§ Orchestrating data to compute clusters in another data center is typically a manual and time-consuming effort
§ Storing and managing multiple copies of the data becomes expensive
[Diagram labels: Data center A, Presto, Alluxio, Hive, MapReduce, Data center B]
Challenges with running workloads on cloud storage
Use case: Accelerate analytical frameworks on the public cloud (compute caching for S3 / GCS)
§ S3 performance is variable and consistent query SLAs are hard to achieve
§ S3 metadata operations are expensive, making workloads run longer
§ S3 egress costs add up, making the solution expensive
§ S3 is eventually consistent, making it hard to predict query results
[Diagram: Spark instances, each paired with Alluxio on the same instance / container]
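The "compute caching for S3 / GCS" idea on this slide can be illustrated with a toy read-through cache. This is a sketch under stated assumptions, not Alluxio's implementation: `SlowObjectStore` and `ReadThroughCache` are hypothetical classes, the object key is made up, and the object store is simulated with an in-memory dictionary that counts remote requests.

```python
class SlowObjectStore:
    """Stand-in for a remote object store; counts every remote GET."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.remote_gets = 0

    def get(self, key):
        self.remote_gets += 1
        return self._objects[key]


class ReadThroughCache:
    """Serve repeated reads locally; go to the store only on a miss."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]


# Hypothetical object key, used only for this example.
store = SlowObjectStore({"s3://example-bucket/part-0": b"rows"})
cache = ReadThroughCache(store)
for _ in range(3):
    data = cache.read("s3://example-bucket/part-0")
print(store.remote_gets)  # 1: only the first read reaches the object store
```

Three reads cost one remote request, which is how a cache next to the compute reduces both request latency variance and egress traffic in this toy model.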
Challenges with Hybrid Cloud
Use case: Burst big data workloads in hybrid cloud environments (HDFS for Hybrid Cloud)
Challenges:
§ Accessing data over WAN is too slow
§ Copying data to the compute cloud is time consuming and complex
§ Using another storage system like S3 means expensive application changes
§ Using S3 via HDFS connector leads to extremely low performance
Solution Benefits:
§ Same performance as local
§ Same end-user experience
§ 100% of I/O is offloaded
[Diagram: Presto instances, each paired with Alluxio on the same instance / container]
Challenges running Big Data on Object Stores & Alluxio Solution
Use case: Dramatically speed up big data on object stores on premise (transition to object store)
Challenges:
§ Object store performance for big data workloads can be very poor
§ No native support for popular frameworks
§ Expensive metadata operations reduce performance even more
§ No support for hybrid environments directly
Solution Benefits:
§ Same performance as HDFS
§ Uses HDFS APIs
§ Same end-user experience
§ Storage at a fraction of the cost of HDFS
[Diagram: Presto instances, each paired with Alluxio on the same container / machine]
Use Cases Alluxio Enables
§ Accelerate big data frameworks on the public cloud
§ Burst big data workloads in hybrid cloud environments
§ Dramatically speed up big data on object stores on premise
[Diagrams: Presto, Spark and Hive instances, each paired with Alluxio on the same instance / container / machine]
Advanced Use Cases
§ Enable big data on object stores across single or multiple clouds
§ Orchestrate data frameworks on the public cloud
[Diagram labels: Spark, Hive, Presto; Alluxio Standalone; Any public / private cloud; Same data center / region; Any Cloud / Multi Cloud]
Alluxio: Key innovations
§ Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
§ Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
§ Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
§ Read & write buffering
§ Transparent to the app
§ Tiers: RAM (hot), SSD (warm), HDD (cold)
§ Policies for pinning, promotion/demotion, TTL
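The pinning and promotion/demotion policies named on this slide can be sketched with a toy tiered block store. Everything here is an assumption made for illustration: the class `TieredStore`, capacities counted in blocks, LRU as the demotion order, and promote-on-read are not taken from the talk, and TTL is left out to keep the example short.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store: promote on read, demote LRU on overflow."""

    def __init__(self, capacities):
        # capacities: list of (tier name, max blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def _remove(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                return blocks.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # dropped from the slowest tier (still in the under store)
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        while len(blocks) > cap:
            # Demote the least recently used block that is not pinned.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy allows the overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)  # promote to the fastest tier
        return data

    def pin(self, block_id):
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


store = TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    store.write(block, block.encode())
store.read("a")  # cold block "a" is promoted; "c" and "b" are demoted
print(store.tier_of("a"), store.tier_of("c"), store.tier_of("b"))
```

With one block per tier, reading the coldest block moves it to RAM and pushes the other two down one tier each, which is the hot/warm/cold behaviour the slide describes.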
Data Accessibility via popular APIs and API Translation
Convert from client-side interface to native storage interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under-store drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver
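Converting a client-side interface to a native storage interface is, in shape, an adapter: one file-style call on top, one driver per storage system underneath. The sketch below is illustrative only. `UnderStoreDriver`, `FakeHdfsDriver`, `FakeS3Driver` and `FileApi` are hypothetical names, and both backends are in-memory stand-ins rather than real HDFS or S3 clients.

```python
from abc import ABC, abstractmethod


class UnderStoreDriver(ABC):
    """Native storage interface that each driver implements."""

    @abstractmethod
    def read_bytes(self, location):
        ...


class FakeHdfsDriver(UnderStoreDriver):
    """In-memory stand-in addressed as authority/path."""

    def __init__(self, files):
        self.files = files

    def read_bytes(self, location):
        _authority, _, path = location.partition("/")
        return self.files["/" + path]


class FakeS3Driver(UnderStoreDriver):
    """In-memory stand-in addressed as bucket/key."""

    def __init__(self, buckets):
        self.buckets = buckets

    def read_bytes(self, location):
        bucket, _, key = location.partition("/")
        return self.buckets[bucket][key]


class FileApi:
    """Client-side, file-style interface that picks a driver by URI scheme."""

    def __init__(self, drivers):
        self.drivers = drivers

    def read(self, uri):
        scheme, _, location = uri.partition("://")
        return self.drivers[scheme].read_bytes(location)


# Hypothetical hosts, buckets and file names.
api = FileApi({
    "hdfs": FakeHdfsDriver({"/2019/q3.csv": b"hdfs-data"}),
    "s3": FakeS3Driver({"sales": {"2019/q3.csv": b"s3-data"}}),
})
print(api.read("hdfs://nn1/2019/q3.csv"), api.read("s3://sales/2019/q3.csv"))
```

The caller uses the same `read` call for both systems; only the driver knows whether the location means a path or a bucket and key.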
Data Elasticity via Unified Namespace
Enables effective data management across different under stores. Uses mounting with transparent naming.
Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram labels: HDFS #1, Object Store, NFS, HDFS #2]
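Mounting several under stores into one namespace can be pictured as a mount table that maps a namespace path to an under-store URI by longest-prefix match. This is a minimal sketch of that idea, not Alluxio's code: `MountTable` is a hypothetical class, and the host, bucket and file names are invented for the example.

```python
class MountTable:
    """Toy unified namespace: namespace path -> under-store URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, namespace_path, ufs_uri):
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the deepest (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


table = MountTable()
# Hypothetical under stores mounted by central IT.
table.mount("/", "hdfs://namenode1:8020/data")
table.mount("/reports", "s3://example-bucket/reports")

print(table.resolve("/logs/app.log"))
print(table.resolve("/reports/2019/q3.parquet"))
```

Applications only ever see /logs/... and /reports/...; which storage system holds the bytes is decided by the table, so a mount can be changed without touching application paths.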
Alluxio Reference Architecture
[Diagram: Applications use an Alluxio Client to reach Alluxio Workers (RAM / SSD / HDD); the workers reach Under Store 1 and Under Store 2 over the WAN; an Alluxio Master with a Standby Master is coordinated via Zookeeper / RAFT]
Policy Driven under File System Migration hdfs://host:port/directory/ Sales Reports
Research Directions
• Machine-learning based Data Orchestration Policies
• Scalable and High-performance File System Metadata service
• Optimization for in-memory data partition / format
• Cross-layer optimization for distributed compute and storage systems
JD.com | Performance Use Case in DC
$70B e-commerce retailer
Project:
• Offload HDFS with separate clusters of Presto and Spark
Problem:
• HDFS cluster is compute and network bound
• Performance is inconsistent
Alluxio solution:
• Alluxio offloads the network I/O as well as the compute
Result:
• Teams can run additional workloads without taxing the existing HDFS cluster
[Diagram: Presto and Spark as separate compute in front of a 3000-node HDFS datacenter, shown without and with Alluxio]
DBS Bank | Performance & Hybrid
Largest bank in Southeast Asia
Initial Project:
• Digital Bank Initiative
• Solve scaling challenges by separating compute and using object storage
Problem:
• Coupled systems were not flexible to scale
Alluxio solution:
1. Alluxio provides intelligent caching layer for object storage
2. Burst workloads to hybrid cloud
Result:
• Enables data on-demand; Alluxio now considered a mature layer in the stack
[Diagram labels: AI & Analytics Frameworks, Alluxio, Object Store, HDFS, AWS, Datacenter]
Walmart | Performance Use Case in Cloud
Project:
• Utilize Presto for interactive queries on cloud object store
Problem:
• Low performance: queries too slow to be usable
• Inconsistent performance of queries
Alluxio solution:
• Alluxio provides intelligent distributed compute caching layer for object storage
Result:
• High performance queries
• Consistent performance
• Interactive query performance for analysts
[Diagram: Presto on an object store in the public cloud, shown without and with Alluxio]