Enabling Ultra-fast Presto in the Cloud with Alluxio Haoyuan (H.Y.) Li | Founder & CTO | Alluxio | haoyuan@alluxio.com | alluxio.io/slack 2019-12-11 @ Presto Summit NYC ALLUXIO 2019
Outline • Alluxio Overview: History and its Open Source Community • Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases • Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto
Alluxio Overview: History and Open Source Community
The Alluxio Story Originated as the Tachyon project at UC Berkeley's AMPLab by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li. 2013 Open Source project established & company to commercialize Alluxio founded 2015 Goal: Orchestrate Data for Analytics & ML in the Cloud for data-driven apps such as Big Data Analytics, ML and AI. 2018 2018 2019
Open Source Started From UC Berkeley AMPLab • 1000+ contributors & growing • Apache 2.0 Licensed • GitHub's Top 100 Most Valuable Repositories Out of 96 Million • 4000+ Git Stars • Join the conversation on Slack: slackin.alluxio.io
Companies Running Alluxio (Learn More) Financial Services Retail & Entertainment Data & Analytics Services Technology Consumer Telco & Media Travel & Transportation
Four trends driving the need for a new architecture • Separation of Compute & Storage • Hybrid – Multi cloud environments • Self-service data across the enterprise • Rise of the object store
Data Ecosystem - Beta vs. Data Ecosystem 1.0 Diagram: COMPUTE and STORAGE layers in each ecosystem
Data Ecosystem 1.0 – The Challenges • Complex • Low performance • Expensive Diagram: COMPUTE and STORAGE layers
Data silos across data centers, regions, clouds Diagram: compute spread across many different frameworks (Presto, Spark, Hive, TensorFlow), connected over WAN to data in disparate storage systems (S3, Azure object store, HDFS, NFS)
Alluxio: an Open Source Data Orchestration System
Data Platform using a Data Orchestration Approach Diagram: compute spread across many different frameworks (Hive, Presto, TensorFlow, Spark, any data app), each with a data orchestration layer, on top of data in disparate storage systems (S3, NFS, HDFS)
Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases
Why Presto on Alluxio § Distributed Data Orchestration (including caching) on Demand • Faster: Lower query latency • SLA: More consistent performance • Efficiency: More concurrency and less data transfer § Deeper Presto Alluxio Integration (now available as Developer Preview in v2.1) • New Alluxio catalog service • New Alluxio transformation service
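How Presto Works with Alluxio Diagram: without Alluxio, Presto reads/writes metadata through the Hive Metastore (location=s3://bucket/table) and reads/writes data directly in the underlying storage; with Alluxio, Presto reads/writes metadata through the Hive Metastore (location=alluxio:///table) and reads/writes data through Alluxio, with the underlying storage mounted to Alluxio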
How to Use Alluxio in Presto CLI Create A Table on Alluxio > CREATE TABLE alluxio_table (id varchar) WITH (external_location = 'alluxio:///table'); Read A Table from Alluxio > SELECT * FROM alluxio_table;
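The location rewrite implied by the two slides above (a table at s3://bucket/table becomes alluxio:///table once the bucket is mounted) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `to_alluxio_uri`, `create_table_sql`, and the `mount_source` / `mount_point` parameters are hypothetical helper names, not part of Presto or Alluxio; only the `external_location` table property and the example URIs come from the slides.

```python
def to_alluxio_uri(ufs_uri: str, mount_source: str, mount_point: str) -> str:
    """Map an under-store URI to its path in the Alluxio namespace.

    Hypothetical helper: mount_source is the under-store URI assumed to be
    mounted (e.g. "s3://bucket"), mount_point is the Alluxio path it is
    assumed to be mounted at (e.g. "/").
    """
    source = mount_source.rstrip("/")
    if ufs_uri != source and not ufs_uri.startswith(source + "/"):
        raise ValueError(f"{ufs_uri} is not under mount source {mount_source}")
    relative = ufs_uri[len(source):].lstrip("/")
    base = mount_point.rstrip("/")
    # Empty authority ("alluxio:///...") mirrors the form used on the slide.
    return f"alluxio://{base}/{relative}"


def create_table_sql(table: str, columns: str, location: str) -> str:
    """Build the CREATE TABLE statement shown on the slide for a location."""
    return (
        f"CREATE TABLE {table} ({columns}) "
        f"WITH (external_location = '{location}');"
    )


if __name__ == "__main__":
    location = to_alluxio_uri("s3://bucket/table", "s3://bucket", "/")
    print(create_table_sql("alluxio_table", "id varchar", location))
```

With the bucket assumed to be mounted at the Alluxio root, the generated statement matches the one on the slide; a different mount point (for example "/s3") would simply prefix the path.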
Challenges with running workloads on cloud storage Compute caching for S3 / GCS: Accelerate analytical frameworks on the public cloud Diagram: Spark and Presto, each with Alluxio on the same instance / container ▪ S3 performance is variable and consistent query SLAs are hard to achieve ▪ S3 metadata operations are expensive, making workloads run longer ▪ S3 egress costs add up, making the solution expensive ▪ S3 is eventually consistent, making it hard to predict query results or
Challenges with Hybrid Cloud HDFS for Hybrid Cloud: Burst big data workloads in hybrid cloud environments ▪ Accessing data over WAN too slow ▪ Copying data to compute cloud time consuming and complex ▪ Using another storage system like S3 means expensive application changes ▪ Using S3 via HDFS connector leads to extremely low performance Solution Benefits ▪ Same performance as local ▪ Same end-user experience ▪ 100% of I/O is offloaded Diagram: Presto with Alluxio on the same instance / container
Challenges running Big Data on Object Stores & Alluxio Solution Transition to Object store: Dramatically speed-up big data on object stores on premise ▪ Object stores performance for big data workloads can be very poor ▪ No native support for popular frameworks ▪ Expensive metadata operations reduce performance even more ▪ No support for hybrid environments directly Solution Benefits ▪ Same performance as HDFS ▪ Uses HDFS APIs ▪ Same end-user experience ▪ Storage at fraction of the cost of HDFS Diagram: Presto with Alluxio on the same container / machine
Robolox Use Case | Compute Caching for Cloud Diagram: Presto reading directly from AWS S3 vs. Presto reading through Alluxio from AWS S3 ▪ Cache hot data in Alluxio, leaving all data in S3 ▪ Reduce Presto query time from 10 sec to sub-second ▪ Faster time to provide data scientists insights
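The "cache hot data, leave all data in S3" pattern can be illustrated with a toy read-through cache. This is a minimal sketch, not Alluxio's implementation: the `ReadThroughCache` class, its LRU eviction policy, and the `fetch` callback standing in for a remote object store read are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy LRU read-through cache (illustrative only, not Alluxio code)."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch          # stand-in for a slow remote read
        self._capacity = capacity    # max number of cached objects
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._blocks:
            # Warm read: served locally, remote store is not touched.
            self.hits += 1
            self._blocks.move_to_end(key)
            return self._blocks[key]
        # Cold read: fetch from the remote store, then keep a local copy.
        self.misses += 1
        data = self._fetch(key)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            # Evict the least recently used entry; the source of truth
            # remains in the remote store, so nothing is lost.
            self._blocks.popitem(last=False)
        return data


if __name__ == "__main__":
    remote = {"part-0": b"a", "part-1": b"b"}
    cache = ReadThroughCache(fetch=remote.__getitem__, capacity=1)
    cache.read("part-0")  # cold read, goes to the remote store
    cache.read("part-0")  # warm read, served from the cache
    print(cache.hits, cache.misses)
```

Only the first read of each hot object pays the remote round trip, which is the effect the use case describes; the full dataset stays in the remote store.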
NetEase Games: Leading Online Game Company in China Use Case | On-premise Caching for Presto Diagram: Presto reading directly from HDFS vs. Presto reading through Alluxio from HDFS ▪ Large query variance during peak hours before ▪ Alluxio brings data local to Presto to reduce the latency during peak hours https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
Architecture: Colocate Alluxio with Presto • Black/Red line – Large Query variance without Alluxio • Green line - Stable query time with Alluxio
JD.com: Leading Online Retailer in China Use Case | On-premise Satellite Cluster for Presto Diagram: Spark and Presto reading directly from HDFS vs. Spark and Presto reading through Alluxio from HDFS ▪ Presto workers may read remotely from HDFS datanodes -> large query variance ▪ Data local to Presto accelerates workloads https://www.slideshare.net/Alluxio/alluxio-in-jd
Architecture: Colocate Alluxio with Presto
Performance Evaluation • Yellow line - Stable query time with Alluxio: < 1 sec after first query (cold read) • Green line - JD Presto without Alluxio: > 10 sec
More Examples Details: www.alluxio.io/powered-by-alluxio/ www.alluxio.io/data-orchestration-summit-2019/
Common Use Cases • Zero-copy burst workloads in hybrid cloud environments • On-premise satellite compute clusters across data centers • Accelerate query performance as cloud storage caching Diagram: Spark, Presto and Hive each paired with Alluxio; a Satellite Presto Cluster alongside the Main Hadoop Cluster (Hive, Spark)
Advanced Use Cases • Enable big data on object stores across single or multiple clouds • Orchestrate data frameworks on the public cloud Diagram: Spark, Hive and Presto with Alluxio (same data center / region; any public / private cloud) and standalone Alluxio (any cloud / multi cloud)
Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto (now available as Developer Preview in v2.1)
Impedance Mismatch: Storage Systems vs. SQL Frameworks • Files/Objects vs. Tables • Directories vs. Schemas • Raw Bytes vs. Rows/Columns • Cost-efficiency, Durability vs. Compute-optimized Computation Further Expand Benefits!
Benefits of Alluxio Data Orchestration (between Storage Systems and SQL Frameworks) • Caching • Unified Interface/Namespace • Schema-Aware Optimizations • Compute-Optimized Formats • Physical Data Independence
Alluxio Structured Data Service (from v2.1) Diagram: Presto (Alluxio Connector, Hive Connector) on top of the Alluxio Catalog Service, Alluxio Caching Service and Alluxio Transformation Service, backed by the Hive Metastore and Storage
Alluxio Structured Data Service Summary • Significantly speed up queries! • Detailed presentation: www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/ • Try it out!
Next Step § Check out more tutorials: https://www.alluxio.io/presto/ § More Video & Slides: https://www.alluxio.io/data-orchestration-summit-2019/ § Additional Reads: • Starburst Presto + Alluxio = better together https://www.starburstdata.com/technical-blog/starburst-presto-alluxio-better-together/ • Top 5 performance tips running Presto with Alluxio https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1 • Presto + Alluxio + Hive Metastore on your Laptop in 10 min https://www.alluxio.io/blog/tutorial-presto-alluxio-hive-metastore-on-your-laptop-in-10-min/ • Alluxio Structured Data Service: https://www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/
Thank you! Questions? www.alluxio.io | slackin.alluxio.io | @alluxio | haoyuan@alluxio.com