Building a Distributed Data Access Layer for Analytics on Any Cloud - PowerPoint PPT Presentation

Building a Distributed Data Access Layer for Analytics on Any Cloud
Bin Fan | Founding Engineer & VP Open Source | Alluxio
binfan@alluxio.com

About Me @binfan binfan@alluxio.com

The journey to a fragmented data world
▪ More data generated every day
▪ More people & teams need access to this data
▪ New storage technologies created every 3-8 years

4 big trends driving the need for a new architecture
▪ Separation of Compute & Storage
▪ Hybrid – Multi cloud environments
▪ Rise of the object store
▪ Self-service data across the enterprise

Data Ecosystem - Beta | Data Ecosystem 1.0
[Diagrams: COMPUTE and STORAGE layers for each ecosystem]

Big data journey and innovation options for enterprises
▪ Co-located: compute & HDFS on the same cluster
▪ Disaggregated: compute & HDFS on the same cluster
▪ HDFS for Hybrid Cloud: burst HDFS data in the cloud, public or private
▪ Support more frameworks: support Presto, Spark and other computes without app changes
▪ Transition to Object store: enable & accelerate big data on object stores
[Diagram labels: Hive, MR / Hive, HDFS]

Challenges with the transition

HDFS for Hybrid Cloud
▪ Accessing data over WAN too slow
▪ Copying data to compute cloud time consuming and complex
▪ Using another storage system like S3 means expensive application changes
▪ Using S3 via HDFS connector leads to extremely low performance

Support more frameworks
▪ Copying data to multiple compute clouds time consuming and error prone
▪ Migrating applications for new storage systems is complex & time consuming
▪ Storing and managing multiple copies of the data becomes expensive

Transition to Object store
▪ Object stores performance for big data workloads can be very poor
▪ No native support for popular frameworks
▪ Expensive metadata operations reduce performance even more
▪ No support for hybrid environments directly

Independent scaling of compute & data
Data Orchestration for the Cloud
▪ Client-side interfaces: POSIX Interface, REST API, Java File API, HDFS Interface, S3 Interface
▪ Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver

Use Cases: Data Orchestration Enables
▪ Accelerate big data frameworks on the public cloud
▪ Burst big data workloads in hybrid cloud environments
▪ Dramatically speed-up big data on object stores on premise
[Diagrams: Hive, Presto and Spark, each with Alluxio on the same instance / container / machine; storage on premise]

Advanced Use Cases
▪ Enable big data on object stores across single or multiple clouds
▪ Orchestrate data frameworks on the public cloud
[Diagrams: Spark, Hive and Presto with Alluxio, standalone or in the same data center / region; any public / private cloud; Any Cloud / Multi Cloud]

Alluxio – Key innovations
▪ Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
▪ Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
▪ Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
▪ Read & Write Buffering
▪ Transparent to App
▪ Tiers: RAM (Hot), SSD (Warm), HDD (Cold)
▪ Policies for pinning, promotion/demotion, TTL
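The hot / warm / cold tiering with promotion and demotion described on this slide can be sketched as a toy model. This is only an illustration under stated assumptions, not Alluxio's implementation: the class name `TieredStore`, the tier capacities, and the choice of LRU demotion with promote-on-read are all hypothetical.

```python
from collections import OrderedDict

# Hypothetical tier names and capacities (in blocks), fastest tier first.
TIERS = [("RAM", 2), ("SSD", 2), ("HDD", 4)]


class TieredStore:
    """Toy multi-tier block store: LRU demotion, promote-on-read."""

    def __init__(self, tiers=TIERS):
        self.order = [name for name, _ in tiers]
        self.capacity = dict(tiers)
        # One LRU-ordered map per tier; the oldest entry comes first.
        self.blocks = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        """Place a block in tier `level`; demote the LRU block if the tier overflows."""
        if level >= len(self.order):
            return  # fell off the coldest tier: the block is evicted
        name = self.order[level]
        tier = self.blocks[name]
        tier[block_id] = data
        if len(tier) > self.capacity[name]:
            victim_id, victim = tier.popitem(last=False)
            self._insert(level + 1, victim_id, victim)

    def _remove(self, block_id):
        """Remove a block from whichever tier holds it; return (found, data)."""
        for name in self.order:
            if block_id in self.blocks[name]:
                return True, self.blocks[name].pop(block_id)
        return False, None

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)  # new data lands in the hottest tier

    def read(self, block_id):
        found, data = self._remove(block_id)
        if found:
            self._insert(0, block_id, data)  # promote on access
        return data

    def tier_of(self, block_id):
        for name in self.order:
            if block_id in self.blocks[name]:
                return name
        return None
```

Writing three blocks into a two-block RAM tier pushes the oldest one down to SSD; reading it back promotes it to RAM and demotes the next-oldest block instead.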

Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
▪ Client-side interfaces: FUSE Interface, REST API, Java File API, HDFS Interface, S3 Interface
▪ Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver

Data Elasticity via Unified Namespace
Enables effective data management across different Under Stores. Uses mounting with transparent naming.

Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available locally

SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud

IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption

[Diagram: HDFS #1, Object Store, NFS and HDFS #2 mounted into one namespace]
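Mounting several under stores into one namespace amounts to a prefix lookup from a logical path to a storage URI. The sketch below illustrates that idea only; `MountTable`, its methods, and the host and bucket names are hypothetical and are not Alluxio's API.

```python
class MountTable:
    """Toy unified namespace: maps logical paths to under-store URIs."""

    def __init__(self):
        self.mounts = {}  # logical mount point -> under-store URI prefix

    def mount(self, mount_point, ufs_uri):
        point = mount_point.rstrip("/") or "/"
        self.mounts[point] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Return the under-store URI for a logical path (longest mount point wins)."""
        best = None
        for point in self.mounts:
            prefix = "" if point == "/" else point
            if path == point or path.startswith(prefix + "/"):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        # Strip the mount point, then re-attach the remainder to the URI.
        remainder = path if best == "/" else path[len(best):]
        remainder = remainder.strip("/")
        base = self.mounts[best]
        return f"{base}/{remainder}" if remainder else base


# Hypothetical under stores: an HDFS cluster at the root, an S3 bucket under /logs.
table = MountTable()
table.mount("/", "hdfs://namenode-1:8020/warehouse")
table.mount("/logs", "s3://example-bucket/logs")
```

With these two mounts, `table.resolve("/logs/2020/a.txt")` maps to the S3 bucket while `table.resolve("/tables/t1")` maps to HDFS, and the application only ever sees the logical path.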

Abstract & orchestrate data across data silos
[Diagram: compute spread across many different frameworks (Hive, Spark, Presto, TensorFlow, any app), each above a DATA ORCHESTRATION layer, with data in disparate storage systems (S3, HDFS, NFS)]

Demos in Office Hour: ● Spark + Alluxio + S3 & Azure ● TPC-DS on Spark+S3 vs Spark+Alluxio+S3

Interacting with data in Alluxio – variety of APIs
Applications have great flexibility to read / write data with many options
▪ Spark: > rdd = sc.textFile("alluxio://localhost:19998/myInput")
▪ Hadoop: $ hadoop fs -cat alluxio://localhost:19998/myInput
▪ POSIX: $ cat /mnt/alluxio/myInput
▪ Java: FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

Deployment Approaches
▪ Co-locate Alluxio Workers with Spark for optimal I/O performance (same instance / container, in front of any cloud storage)
▪ Deploy Alluxio as standalone cluster between Spark and Storage (same data center / region, in front of any cloud storage)

Alluxio Reference Architecture
[Diagram components: Application with Alluxio Client (x2); Alluxio Worker with RAM / SSD / HDD (x2); Alluxio Master; Standby Master; Zookeeper / RAFT; Under Store 1; Under Store 2; Object Store; WAN]

Interacting with data in Alluxio – flexible app patterns
Applications have great flexibility to read / write data with many options

Writing Data
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
• Write to Alluxio and replicate to N other workers
• Write to Alluxio and async write to multiple Under stores

Reading Data
• From under store
• From a co-located Alluxio node
• From a different Alluxio node

Read data in Alluxio, on same node as client
Memory Speed Read of Data
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Alluxio Master]

Read data not in Alluxio
Network / Disk Speed Read of Data
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Alluxio Master, Under Store]
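The two read paths on these slides (data already on the local worker versus data that must first come from the under store) follow a read-through caching pattern. A minimal sketch, assuming a dictionary stands in for the under store; `ReadThroughWorker` and its counter are hypothetical names for illustration, not Alluxio classes.

```python
class ReadThroughWorker:
    """Toy worker: serve from the local cache, else fetch from the under store."""

    def __init__(self, under_store):
        self.under_store = under_store  # dict standing in for S3 / HDFS
        self.cache = {}                 # stands in for RAM / SSD / HDD on the worker
        self.under_store_reads = 0

    def read(self, path):
        if path in self.cache:
            return self.cache[path]     # hit: the local, memory-speed path
        self.under_store_reads += 1     # miss: the network / disk-speed path
        data = self.under_store[path]
        self.cache[path] = data         # keep a copy so later reads are local
        return data


worker = ReadThroughWorker({"/myInput": b"hello"})
first = worker.read("/myInput")   # fetched from the under store
second = worker.read("/myInput")  # served from the worker's cache
```

Only the first read touches the under store; the second is served locally, which is the behavior the "memory speed" slide describes.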

Write data only to Alluxio on same node as client
Memory Speed Write of Data
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Alluxio Master]

Write data to Alluxio and Under Store synchronously
Network / Disk Speed Write of Data
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Alluxio Master, Under Store]
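The write patterns listed earlier (Alluxio only, under store only, synchronous to both, asynchronous to both) differ only in where the bytes land before the write returns. The sketch below models that difference with plain dictionaries; the `WriteMode` names and `ToyWriter` class are hypothetical labels for the slide's options, not Alluxio's configuration values.

```python
from enum import Enum


class WriteMode(Enum):
    # Hypothetical labels for the write patterns named on the slides.
    ALLUXIO_ONLY = "alluxio_only"
    UNDER_STORE_ONLY = "under_store_only"
    SYNC_BOTH = "sync_both"
    ASYNC_BOTH = "async_both"


class ToyWriter:
    def __init__(self):
        self.alluxio = {}      # stands in for worker storage
        self.under_store = {}  # stands in for S3 / HDFS
        self.pending = []      # paths waiting for asynchronous persistence

    def write(self, path, data, mode):
        if mode is not WriteMode.UNDER_STORE_ONLY:
            self.alluxio[path] = data
        if mode in (WriteMode.UNDER_STORE_ONLY, WriteMode.SYNC_BOTH):
            self.under_store[path] = data  # persisted before write() returns
        if mode is WriteMode.ASYNC_BOTH:
            self.pending.append(path)      # persisted later by flush()

    def flush(self):
        """Persist queued writes; a real system would do this in the background."""
        while self.pending:
            path = self.pending.pop(0)
            self.under_store[path] = self.alluxio[path]
```

The trade-off is visible in the model: `ALLUXIO_ONLY` and `ASYNC_BOTH` return without touching the under store (fast, but not yet durable there), while `SYNC_BOTH` pays the slower path up front.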

Interacting with data in Alluxio – data management
Applications have great flexibility to read / write data with many options

Data Management
• Pinning
• Prefetch/free
• Cross storage copy and move operations
• TTL
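Pinning, freeing and TTL can be pictured as three small rules over a cache. The sketch below is a toy model with hypothetical names (`ManagedCache`, `pin`, `free`, `expire`); it assumes that pinning only protects an entry from `free`, and that TTL expiry applies regardless of pinning. Both are simplifying assumptions for illustration, not statements about Alluxio's behavior.

```python
class ManagedCache:
    """Toy cache with pinning, explicit free, and TTL expiry."""

    def __init__(self):
        self.entries = {}    # path -> (data, expires_at or None)
        self.pinned = set()

    def put(self, path, data, now, ttl=None):
        # `now` is passed in explicitly so the example stays deterministic.
        expires_at = None if ttl is None else now + ttl
        self.entries[path] = (data, expires_at)

    def pin(self, path):
        self.pinned.add(path)

    def free(self, path):
        """Drop the cached copy unless it is pinned; return True if removed."""
        if path in self.pinned or path not in self.entries:
            return False
        del self.entries[path]
        return True

    def expire(self, now):
        """Remove entries whose TTL has passed; return the removed paths, sorted."""
        expired = sorted(
            path
            for path, (_, expires_at) in self.entries.items()
            if expires_at is not None and expires_at <= now
        )
        for path in expired:
            del self.entries[path]
        return expired
```

For example, a pinned path survives a `free` call, while an unpinned one is dropped immediately and an entry written with `ttl=10` disappears once `expire` is called with a time of 10 or later.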

China Unicom
Leading Chinese Telco serving 320 million subscribers
Use case | Data orchestration for agility
▪ Single namespace to access & address all data
▪ Data local to compute accelerates workloads
[Diagram: Spark and Spark ETL on Kubernetes, a DATA ORCHESTRATION layer, and HDFS, object and HBase storage]

Two Sigma
Fastest growing big hedge fund managing $46 billion for investors
Use case | Cloud bursting on-premise data
▪ Compute scales elastically independent of storage
▪ Faster time to insights with seamless data orchestration
▪ Accelerated workloads with memory-first data approach
[Diagram: Spark in the public cloud, a DATA ORCHESTRATION layer, and HDFS]

Enterprises moving towards independent compute & storage

Join the Alluxio Open Source Community www.alluxio.org/slack
