Building Data Orchestration for Big Data Analytics in the Cloud
Bin Fan | Founding Engineer | Alluxio
binfan@alluxio.com | 07/17/2019
About Me
Bin Fan | Founding Engineer & Open Source Maintainer | Alluxio
binfan@alluxio.com | @apc999
The Alluxio Story
▪ 2013: Originated as the Tachyon project at UC Berkeley's AMPLab, by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li
▪ 2015: Open source project established & company founded to commercialize Alluxio
▪ Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
Incredible Open Source Momentum with a Growing Community
▪ 1000+ contributors & Apache 2.0 licensed
▪ 4000+ Git stars
▪ Hundreds of thousands of downloads
Join the conversation on Slack: alluxio.io/slack
Data Ecosystem – Beta vs. Data Ecosystem 1.0
[diagram: COMPUTE frameworks layered over STORAGE systems in each era]
Data Stack Journey and Innovation Paths
▪ Co-located compute & HDFS on the same cluster (MR / Hive on HDFS): typically compute-bound clusters running over 100% capacity; compute & I/O need to be scaled together even when not needed
▪ Disaggregated compute & HDFS: compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
▪ Support more frameworks: Presto, Spark without app changes
▪ HDFS for hybrid cloud: burst data in HDFS to the cloud, public or private, across DCs
▪ Transition to object store: enable & accelerate big data on object stores
Independent Scaling of Compute & Storage
▪ Client interfaces: POSIX Interface | Java File API | HDFS Interface | S3 Interface | REST API
▪ Data Orchestration for the Cloud
▪ Under-store drivers: HDFS Driver | Swift Driver | S3 Driver | NFS Driver
APIs to Interact with Data in Alluxio
Applications have great flexibility to read / write data with many options:
▪ Spark:  rdd = sc.textFile("alluxio://localhost:19998/myInput")
▪ Presto: CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/')
▪ POSIX:  $ cat /mnt/alluxio/myInput
▪ Java:   FileSystem fs = FileSystem.Factory.get();
          FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
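All four APIs above name the same logical file through an alluxio:// URI or a mounted path. As a minimal sketch (plain Python; `parse_alluxio_uri` is a hypothetical helper, not part of the Alluxio client API), the URI decomposes into the master endpoint and the path inside the Alluxio namespace:

```python
from urllib.parse import urlparse

def parse_alluxio_uri(uri):
    """Toy helper: split an alluxio:// URI into (authority, path).
    Illustrative only -- not the Alluxio client's actual API."""
    parts = urlparse(uri)
    if parts.scheme != "alluxio":
        raise ValueError("expected an alluxio:// URI, got: " + uri)
    return parts.netloc, parts.path

# The Spark example addresses the same file the Java client opens:
authority, path = parse_alluxio_uri("alluxio://localhost:19998/myInput")
print(authority, path)  # localhost:19998 /myInput
```

The POSIX path `/mnt/alluxio/myInput` reaches that same `/myInput` through the FUSE mount point rather than through a URI.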
Use Case: Distributed Caching for Cloud Storage
Accelerate analytical frameworks on the public cloud — compute-side caching for S3 / GCS, with Alluxio co-located with Spark on the same instance / container.
▪ S3 performance is variable, so consistent query SLAs are hard to achieve
▪ S3 metadata operations are expensive, making workloads run longer
▪ S3 egress costs add up, making the solution expensive
▪ S3 is eventually consistent, making it hard to predict query results
Use Case: Data Federation with Hybrid Cloud
Burst big data workloads in hybrid cloud environments (HDFS for hybrid cloud), with Alluxio co-located with Presto on the same instance / container.
Challenges:
▪ Accessing data over the WAN is too slow
▪ Copying data to the compute cloud is time-consuming and complex
▪ Using another storage system like S3 means expensive application changes
▪ Using S3 via an HDFS connector leads to extremely low performance
Solution benefits:
▪ Same performance as local
▪ Same end-user experience
▪ 100% of I/O is offloaded
Abstract & Orchestrate Data Across Data Silos
▪ Compute spread across many different frameworks: Hive, Spark, TensorFlow, Presto, any data app
▪ A data orchestration layer sits between compute and data in disparate storage systems: S3, HDFS, NFS
Alluxio – Key Innovations
▪ Data Locality with intelligent multi-tiering: accelerate big data workloads with transparent tiered local data
▪ Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
▪ Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage: RAM / SSD / HDD for hot / warm / cold data.
▪ Read & write buffering, transparent to the app
▪ Policies for pinning, promotion / demotion, TTL
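The tiering behavior above can be sketched in a few lines. This is an illustrative toy model, not Alluxio's actual eviction code: here the hot tier uses LRU order, demotes its least-recently-used unpinned block to the cold tier when full, and promotes a cold block back on access (the class and policy names are my own):

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier cache: hot tier (e.g., RAM) over cold tier (e.g., HDD)."""
    def __init__(self, mem_capacity):
        self.mem = OrderedDict()      # hot tier, kept in LRU order
        self.hdd = {}                 # cold tier
        self.mem_capacity = mem_capacity
        self.pinned = set()           # pinned blocks are never demoted

    def write(self, block_id, data):
        self._put_mem(block_id, data)

    def read(self, block_id):
        if block_id in self.mem:              # hot-tier hit
            self.mem.move_to_end(block_id)
            return self.mem[block_id]
        data = self.hdd.pop(block_id)         # cold-tier hit: promote
        self._put_mem(block_id, data)
        return data

    def pin(self, block_id):
        self.pinned.add(block_id)

    def _put_mem(self, block_id, data):
        self.mem[block_id] = data
        self.mem.move_to_end(block_id)
        while len(self.mem) > self.mem_capacity:
            # demote the least-recently-used unpinned block
            victim = next(b for b in self.mem if b not in self.pinned)
            self.hdd[victim] = self.mem.pop(victim)

cache = TieredCache(mem_capacity=2)
cache.write("b1", b"hot")
cache.write("b2", b"warm")
cache.write("b3", b"new")   # hot tier full: b1 is demoted to the cold tier
print("b1" in cache.mem, "b1" in cache.hdd)  # False True
```

A TTL policy would add an expiry timestamp per block and evict on scan; the slide's pinning policy is modeled by the `pinned` set, which the demotion loop skips.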
Data Accessibility via Popular APIs and API Translation
Convert from a client-side interface (POSIX Interface, REST API, Java File API, HDFS Interface, S3 Interface) to the native storage interface (HDFS Driver, S3 Driver, Swift Driver, NFS Driver).
Data Elasticity via Unified Namespace
Enables effective data management across different under stores — uses mounting with transparent naming.
Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally (e.g., HDFS #1, HDFS #2, an object store, and NFS mounted under one namespace).
Supports: HDFS, NFS, OpenStack, Ceph, Amazon S3, Azure, Google Cloud
IT-Ops friendly:
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
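Mounting with transparent naming can be pictured as a mount table plus longest-prefix resolution: an Alluxio path maps to (under store, native path). The sketch below is illustrative — the mount points, bucket names, and the `resolve` helper are made up for the example, not Alluxio's implementation:

```python
# Toy mount table: Alluxio path prefix -> under-store URI prefix.
mount_table = {
    "/":       "hdfs://nn1:8020/",
    "/sales":  "s3://bucket-sales/",
    "/backup": "hdfs://nn2:8020/archive/",
}

def resolve(alluxio_path):
    """Map an Alluxio path to its under-store URI via longest-prefix match."""
    best = max((m for m in mount_table
                if alluxio_path == m.rstrip("/")
                or alluxio_path.startswith(m.rstrip("/") + "/")),
               key=len)
    rest = alluxio_path[len(best.rstrip("/")):].lstrip("/")
    return mount_table[best] + rest

print(resolve("/sales/2019/q2.parquet"))  # s3://bucket-sales/2019/q2.parquet
print(resolve("/logs/app.log"))           # hdfs://nn1:8020/logs/app.log
```

Applications only ever see the Alluxio-side paths, which is what lets central IT remount or migrate under stores without application changes.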
Companies Using Alluxio
Bazaarvoice | Leading Digital Marketing Company in Austin
Use Case | Compute Caching for Cloud: Hive with co-located Alluxio over AWS S3
▪ Cache hot data in Alluxio, keep all data in S3
▪ Faster time to insights with seamless data orchestration
▪ Accelerated workloads by 10x with a memory-first data approach
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
China Unicom | Leading Chinese Telco Serving 320 Million Subscribers
Use Case | Data Orchestration for Agility: Spark on Kubernetes and Spark ETL over a data orchestration layer spanning HDFS, object store, and HBase
▪ Single namespace to access & address all data
▪ Data local to compute accelerates workloads
Architecture & Data Flow
Alluxio Reference Architecture
▪ Applications embed an Alluxio client and talk to Alluxio workers (RAM / SSD / HDD), which serve and cache data from under stores 1…n, possibly across the WAN
▪ An Alluxio master plus standby masters coordinate the cluster, using Zookeeper / RAFT
Alluxio Files and Blocks
▪ Files are immutable once completed
▪ Blocks are stored on Alluxio workers; blocks of a file can be on different workers
▪ Flexible block sizes — the default block size is 512 MB
• If the under-store block size is greater: the file will only take up as much space as needed
• If the under-store block size is smaller: the file will be split up among multiple blocks
• The last block of a file is not required to be a full block size
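The block layout rule above (fixed-size blocks, short last block) is easy to make concrete. A minimal sketch, using the slide's 512 MB default:

```python
BLOCK_SIZE = 512 * 1024 * 1024  # Alluxio's default block size, per the slide

def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks holding a file of file_size bytes:
    full blocks plus (optionally) a smaller last block."""
    if file_size == 0:
        return []
    full, last = divmod(file_size, block_size)
    return [block_size] * full + ([last] if last else [])

# A 1300 MB file becomes two full 512 MB blocks plus a 276 MB last block,
# and those three blocks may land on different workers:
sizes = block_layout(1300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [512, 512, 276]
```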
Alluxio Master – Metadata Service
▪ The master is responsible for managing metadata: the file system namespace (inode tree) and block / worker info
▪ Standby masters are used for checkpointing and fault tolerance
▪ Zookeeper / RAFT is used for leader election
▪ The master writes a journal for durable operations; standby masters replay changes from the journal
▪ Performs under-store metadata operations
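The journal-and-replay idea can be sketched as an append-only log of namespace mutations. This is an illustrative toy (the entry format and `record`/`apply` helpers are invented, not Alluxio's journal format): the primary logs each mutation before applying it, and a standby rebuilds identical state by replaying the log:

```python
def apply(namespace, entry):
    """Apply one journaled mutation to an in-memory namespace (a toy inode map)."""
    op = entry[0]
    if op == "create":
        namespace[entry[1]] = {"blocks": []}
    elif op == "add_block":
        namespace[entry[1]]["blocks"].append(entry[2])
    elif op == "delete":
        del namespace[entry[1]]

journal = []

def record(namespace, entry):
    journal.append(entry)   # durable journal write first...
    apply(namespace, entry) # ...then mutate the in-memory tree

primary = {}
record(primary, ("create", "/myInput"))
record(primary, ("add_block", "/myInput", "block-1"))
record(primary, ("create", "/tmp"))
record(primary, ("delete", "/tmp"))

standby = {}
for entry in journal:       # a standby replays the journal to catch up
    apply(standby, entry)
print(standby == primary)   # True
```

Checkpointing, mentioned on the slide, amounts to snapshotting `standby` so that replay can start from the snapshot instead of the beginning of the journal.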
Efficient Metadata Operations: Alluxio on S3
▪ Efficient bucket listing
• A key operation for SparkSQL / Presto query planning
• Object metadata is cached in Alluxio after the first read
▪ Efficient file rename
• A slow operation on S3, implemented as a copy followed by a delete
• Alluxio implements "persist after rename"
• Enables speculative execution
▪ Batching UFS operations to S3
Alluxio Workers – Data Service
▪ Workers are responsible for storing and serving block data (an RPC service plus a data transfer service)
▪ Each worker manages the metadata for the block data it stores
▪ Workers store block data on various local storage media: memory, SSD, HDD
▪ Data is kept outside of the worker JVM
▪ Performs under-store data operations
Key Innovations & Optimizations in the Data Service
▪ Avoid JVM GC: store blocks off-heap (e.g., on a RAMDISK)
▪ Data capacity: tiered storage management using MEM, SSD, HDD
▪ Data throughput: fine-grained block locking for high concurrency; gRPC-based streaming RPC service stub
▪ Async data archival to S3: apps write to Alluxio (at Alluxio speed), then Alluxio persists the data to S3 asynchronously (at S3 speed)
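The async-archival pattern in the last bullet decouples the app's write latency from S3's. A minimal sketch with a background persister thread (the dict-backed "stores" and queue protocol are simulations, not Alluxio's code):

```python
import queue
import threading

memory_store, s3_store = {}, {}   # simulated Alluxio memory tier and S3
persist_queue = queue.Queue()

def write(path, data):
    memory_store[path] = data     # fast path: returns at memory speed
    persist_queue.put(path)       # schedule async persistence to S3

def persister():
    while True:
        path = persist_queue.get()
        if path is None:          # shutdown sentinel
            break
        s3_store[path] = memory_store[path]  # slow path: S3-speed write
        persist_queue.task_done()

t = threading.Thread(target=persister)
t.start()
write("/myInput", b"data")        # returns before S3 has the object
persist_queue.join()              # only to make this demo deterministic
persist_queue.put(None)
t.join()
print(s3_store["/myInput"])       # b'data'
```

The key property is that `write` never blocks on the slow store; durability in S3 trails the write by however long the queue takes to drain.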
Interacting with Data in Alluxio – Flexible App Patterns
Applications have great flexibility to read / write data with many options.
Writing data:
• Write only to Alluxio
• Write only to the under store
• Write synchronously to Alluxio and the under store
• Write to Alluxio and asynchronously write to the under store
• Write to Alluxio and replicate to N other workers
• Write to Alluxio and asynchronously write to multiple under stores
Reading data:
• From the under store
• From a co-located Alluxio node
• From a different Alluxio node
Read data in Alluxio, on the same node as the client: memory-speed read of data
(application → Alluxio client → co-located Alluxio worker's RAM / SSD / HDD, with the Alluxio master coordinating)
Read data not in Alluxio, plus caching: network / disk speed read of data
(the Alluxio client fetches from the under store via a worker, which caches the data for subsequent reads)
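The two read flows above come down to a source-selection policy: prefer a block cached on the local worker, then any remote worker, then fall back to the under store. A minimal sketch of that decision (an illustrative policy with invented names, not Alluxio's client code):

```python
def choose_read_source(block_id, local_worker, remote_workers, under_store):
    """Pick the fastest available source for a block, per the read flows above."""
    if block_id in local_worker:
        return "local"          # memory-speed read, same node as the client
    for worker in remote_workers:
        if block_id in worker:
            return "remote"     # network-speed read from another worker
    assert block_id in under_store
    return "under_store"        # slowest path; the worker caches it on the way

local = {"b1"}
remotes = [{"b2"}, set()]
ufs = {"b1", "b2", "b3"}        # the under store holds everything
print(choose_read_source("b1", local, remotes, ufs))  # local
print(choose_read_source("b2", local, remotes, ufs))  # remote
print(choose_read_source("b3", local, remotes, ufs))  # under_store
```

After the `"under_store"` path runs once for `b3`, caching would add it to a worker's set, so the next read of `b3` resolves to a worker instead.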
Write data only to Alluxio, on the same node as the client: memory-speed write of data
(application → Alluxio client → co-located Alluxio worker's RAM / SSD / HDD, with the Alluxio master coordinating)