Strata NY 2018 September 12, 2018 Apache Hadoop Ingestion & Dispersal Framework Danny Chen dannyc@uber.com, Omkar Joshi omkar@uber.com Eric Sayle esayle@uber.com Uber Hadoop Platform Team
Agenda ● Mission ● Overview ● Need for Hadoop ingestion & dispersal framework ● Deep Dive ○ High Level Architecture ○ Abstractions and Building Blocks ● Configuration & Monitoring of Jobs ● Completeness & Data Deletion ● Learnings
Uber Apache Hadoop Platform Team Mission Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Hadoop ecosystem.
Overview
● Any source to any sink
● Ease of onboarding
● Business impact & importance of data & data store location
● Suite of Hadoop ecosystem tools
Introducing
Open Sourced in September 2018 https://github.com/uber/marmaray Blog Post: https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
Marmaray (Ingestion): Why?
● Raw data needed in the Hadoop data lake
● Ingested raw data -> derived datasets
● Reliable and correct schematized data
● Maintenance of multiple data pipelines
Marmaray (Dispersal): Why?
● Derived datasets in Hive
● Need arose to serve live traffic
● Duplicate and ad hoc dispersal pipelines
● Future dispersal needs
Marmaray: Main Features
● Released to production end of 2017
● Automated schema management
● Integration w/ monitoring & alerting systems
● Fully integrated with workflow orchestration tool
● Extensible architecture
● Open sourced
Marmaray: Uber Eats Use Case
Hadoop Data Ecosystem at Uber
Hadoop Data Ecosystem at Uber
[Diagram: Marmaray Ingestion feeds the Hadoop Data Lake, which powers Analytical Processing; Marmaray Dispersal moves data back out to serving stores such as Schemaless.]
High-Level Architecture & Technical Deep Dive
High-Level Architecture
[Diagram: a Source Connector reads from the input storage system and a Sink Connector writes to the output storage system, with a chain of converters (Converter 1, Converter 2) in between. Supporting components: Datafeed Config Store, Schema Service, Error Tables, Metadata Manager (checkpoint store), Work Unit Calculator, and the M3 Monitoring & Alerting System.]
High-Level Architecture
[Architecture diagram repeated, zooming in on the Schema Service.]
Schema Service
[Diagram: on the read path, a Reader/Decoder gets a schema by name & version from the Schema Service and decodes binary data into a GenericRecord; on the write path, a Writer/Encoder gets the schema and encodes GenericRecords back into binary data.]
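As a rough sketch of the read/write paths above, assuming Avro and a hypothetical SchemaService lookup interface (not Marmaray's actual API):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    interface SchemaService {
        // Hypothetical lookup: "get schema by name & version" from the diagram.
        Schema getSchema(String name, int version);
    }

    final class SchemaServiceCodec {
        private final SchemaService schemaService;

        SchemaServiceCodec(SchemaService schemaService) { this.schemaService = schemaService; }

        // Reader/decoder side: binary payload -> GenericRecord.
        GenericRecord decode(String name, int version, byte[] payload) throws IOException {
            Schema schema = schemaService.getSchema(name, version);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        }

        // Writer/encoder side: GenericRecord -> binary payload.
        byte[] encode(String name, int version, GenericRecord record) throws IOException {
            Schema schema = schemaService.getSchema(name, version);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        }
    }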
High-Level Architecture
[Architecture diagram repeated, zooming in on the Metadata Manager.]
Metadata Manager
[Diagram: the Metadata Manager keeps an in-memory copy of job metadata backed by persistent storage (e.g., HDFS). init() is called on job start; set(key, value) and get(key) are called zero or more times by different JobDAG components; persist() is called after the job finishes.]
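A minimal sketch of that lifecycle, using a hypothetical class shape rather than Marmaray's actual metadata-manager interface:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    abstract class JobMetadataManager {
        private final Map<String, String> inMemoryCopy = new HashMap<>();

        // Called once on job start: load previously persisted checkpoints.
        void init() { inMemoryCopy.putAll(loadFromStore()); }

        // Called zero or more times by JobDAG components during the run.
        void set(String key, String value) { inMemoryCopy.put(key, value); }

        Optional<String> get(String key) { return Optional.ofNullable(inMemoryCopy.get(key)); }

        // Called after the job finishes: flush the in-memory copy to persistent storage.
        void persist() { saveToStore(new HashMap<>(inMemoryCopy)); }

        // Persistent-storage hooks (e.g., an HDFS-backed implementation).
        protected abstract Map<String, String> loadFromStore();
        protected abstract void saveToStore(Map<String, String> snapshot);
    }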
Fork Operator: Why is it needed?
● Avoid reprocessing input records
● Avoid re-reading input records (or, in Spark, re-executing input transformations)
[Diagram: input records are split into schema-conforming records and error records.]
Fork Operator & Fork Function
[Diagram: a fork function tags each input record (r1 ... rx) with success/failure; the tagged records are persisted using Spark's disk/memory persistence level; a success filter function extracts the schema-conforming records and a failure filter function extracts the error records.]
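A rough Spark (Java) sketch of the fork pattern, with assumed record and class names rather than Marmaray's real ForkOperator/ForkFunction types:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    final class ForkSketch {
        // "Tagged record": the payload plus a success/failure flag (the r1..rx, S/F above).
        static JavaRDD<Tuple2<String, Boolean>> tag(JavaRDD<String> input) {
            return input.map(record -> new Tuple2<>(record, conformsToSchema(record)));
        }

        static void run(JavaRDD<String> input) {
            JavaRDD<Tuple2<String, Boolean>> tagged =
                tag(input).persist(StorageLevel.MEMORY_AND_DISK()); // avoid re-executing input transformations

            JavaRDD<String> schemaConforming = tagged.filter(t -> t._2).map(t -> t._1);  // success filter
            JavaRDD<String> errorRecords     = tagged.filter(t -> !t._2).map(t -> t._1); // failure filter

            // ... write schemaConforming to the sink and errorRecords to the error table ...
            tagged.unpersist();
        }

        private static boolean conformsToSchema(String record) {
            return record != null && !record.isEmpty();  // placeholder validation
        }
    }

Persisting the tagged RDD once means both filters read from cache instead of re-running the source read and converters.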
Easy to Add Support for a New Source & Sink
[Diagram: the data lake works with GenericRecord, so sources and sinks such as Hive, Kafka, S3, Cassandra, or a new source plug in around that common format.]
Support for Writing into Multiple Systems
[Diagram: the same GenericRecord data in the data lake can be written out to multiple systems, e.g., Hive Table 1, Hive Table 2, and Kafka.]
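A small sketch of the idea behind the last two diagrams, with hypothetical interfaces (not Marmaray's actual source/sink APIs): every source produces Avro GenericRecords and every sink consumes them, so a new source or sink never needs to know about the other side, and one read can fan out to many writes.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.api.java.JavaRDD;

    interface RecordSource {
        JavaRDD<GenericRecord> read();           // e.g., Kafka, Hive, S3, or a new source
    }

    interface RecordSink {
        void write(JavaRDD<GenericRecord> data); // e.g., Hive Table 1, Hive Table 2, Kafka, Cassandra
    }

    final class AnySourceToAnySinks {
        // Fan one source out into multiple sinks, as in the "multiple systems" slide.
        static void disperse(RecordSource source, Iterable<RecordSink> sinks) {
            JavaRDD<GenericRecord> records = source.read().cache(); // read once, write many
            for (RecordSink sink : sinks) {
                sink.write(records);
            }
            records.unpersist();
        }
    }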
JobDag & JobDagActions
[Diagram: JobDagActions attached to a JobDAG, e.g., report metrics for monitoring and register the table in Hive.]
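A minimal sketch of JobDagActions as small hooks attached to a JobDAG, using hypothetical class names and the two examples from the slide (not Marmaray's exact interface):

    interface JobDagAction {
        void execute(boolean jobSucceeded);
    }

    final class ReportMetricsAction implements JobDagAction {
        @Override
        public void execute(boolean jobSucceeded) {
            // e.g., emit record counts and latency to the monitoring system.
            System.out.println("reporting metrics, success=" + jobSucceeded);
        }
    }

    final class RegisterHiveTableAction implements JobDagAction {
        @Override
        public void execute(boolean jobSucceeded) {
            // e.g., create or refresh the Hive table/partition for the newly written data.
            if (jobSucceeded) {
                System.out.println("registering table in Hive");
            }
        }
    }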
Need for Running Multiple JobDags Together
● Frequency of data arrival
● Number of messages
● Average record size & complexity of schema
● A Spark job has a driver + one or more executors
● Not an efficient model for handling spikes
● Too many topics to ingest (2,000+)
JobManager
● A single Spark job runs ingestion for 300+ topics
● Executes multiple JobDAGs
● Manages execution ordering for multiple JobDAGs
● Manages the shared Spark context
● Maintains a waiting queue for JobDAGs
● Enables job- and tier-level locking
[Diagram: one Spark job in which the Job Manager runs JobDAG 1 (ingesting kafka-topic 1) through JobDAG N (ingesting kafka-topic N).]
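A rough sketch of that structure (assumed shape, not the real JobManager): one Spark application whose driver drains a waiting queue of JobDAGs, one per topic, against a shared JavaSparkContext, with a bound on how many run at once.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.spark.api.java.JavaSparkContext;

    final class JobManagerSketch {
        interface JobDag { void execute(JavaSparkContext sharedContext); }

        private final JavaSparkContext sharedContext;
        private final Queue<JobDag> waitingQueue = new ConcurrentLinkedQueue<>();

        JobManagerSketch(JavaSparkContext sharedContext) { this.sharedContext = sharedContext; }

        void submit(JobDag dag) { waitingQueue.add(dag); }   // e.g., one JobDAG per Kafka topic

        // Drain the waiting queue with a bounded number of concurrent JobDAGs.
        void runAll(int maxConcurrentDags) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentDags);
            JobDag dag;
            while ((dag = waitingQueue.poll()) != null) {
                JobDag current = dag;
                pool.submit(() -> current.execute(sharedContext));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }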
Completeness
[Diagram: records are counted in 10-minute buckets at the source (Kafka) and at the sink (Hive), including the latest bucket on each side.]
Completeness, continued
● Why not periodically run queries on the source and sink datasets?
  ○ Possible for very small datasets
  ○ Won't work for billions of records; far too expensive
● Bucketizing records
  ○ Create time-based buckets, say every 2 or 10 minutes
  ○ Count records at the source and the sink on every run
  ○ Does this give a 100% guarantee? No, but with high probability it comes close.
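As a simple sketch of the bucketizing idea (assuming records carry an event timestamp; not Marmaray's actual implementation): count records per 10-minute bucket on each side, then flag buckets whose source and sink counts disagree.

    import java.util.HashMap;
    import java.util.Map;

    final class CompletenessCheck {
        private static final long BUCKET_MS = 10 * 60 * 1000L;   // 10-minute buckets

        static long bucketOf(long eventTimeMs) { return eventTimeMs - (eventTimeMs % BUCKET_MS); }

        // Count records per bucket given their event timestamps.
        static Map<Long, Long> bucketCounts(Iterable<Long> eventTimesMs) {
            Map<Long, Long> counts = new HashMap<>();
            for (long ts : eventTimesMs) {
                counts.merge(bucketOf(ts), 1L, Long::sum);
            }
            return counts;
        }

        // Returns the buckets whose source and sink counts disagree, as {sourceCount, sinkCount}.
        static Map<Long, long[]> mismatches(Map<Long, Long> source, Map<Long, Long> sink) {
            Map<Long, long[]> bad = new HashMap<>();
            source.forEach((bucket, srcCount) -> {
                long sinkCount = sink.getOrDefault(bucket, 0L);
                if (srcCount != sinkCount) {
                    bad.put(bucket, new long[] {srcCount, sinkCount});
                }
            });
            return bad;
        }
    }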
Completeness: High-Level Approach
[Diagram: Marmaray reads Input Records (IR) from Kafka; the source converter produces Input Success Records (ISR) and sends Input Error Records (IER) to the error table; the sink converter produces Output Records (OR) for the Hoodie sink (Hive) and sends Output Error Records (OER) to the error table. Comparing IR against IER, OER, and OR shows whether every input record is accounted for.]
Hadoop's Old Way of Storing Kafka Data
[Diagram: each Kafka topic is stored in date partitions (2014/01 ... 2018/08). Older partitions hold stitched Parquet files (~4 GB each, ~400 files per partition); the latest date partition holds non-stitched Parquet files (~40 MB each, ~20-40K files per partition).]
Data Deletion (Kafka)
● The old architecture is designed to be append/read-only
● No indexes
  ○ Need to scan an entire partition to find out whether a record is present
● The only way to update is to rewrite the entire partition
  ○ GDPR requires data to be cleaned up once a user requests deletion, which means rewriting entire partitions
● This is a big architectural change, and many companies are struggling to solve it
Marmaray + Hudi (Hoodie) to the Rescue
Hoodie Data Layout
[Diagram: the Kafka topic's date partitions (2014/01 ... 2018/08) contain Parquet files versioned by commit timestamp (e.g., f1_ts1.parquet ... f4_ts1.parquet, f5_ts2.parquet ... f7_ts2.parquet); updates rewrite the affected files under a new commit (e.g., f1_ts3.parquet, f8_ts3.parquet). A .hoodie metadata directory records the commits (ts1.commit, ts2.commit, ts3.commit).]
Configuration
common:
  hadoop:
    fs.defaultFS: "hdfs://namenode/"
hoodie:
  table_name: "mydb.table1"
  base_path: "/path/to/my.db/table1"
  metrics_prefix: "marmaray"
  enable_metrics: true
  parallelism: 64
kafka:
  conn:
    bootstrap.servers: "kafkanode1:9092,kafkanode2:9092"
    fetch.wait.max.ms: 1000
    socket.receive.buffer.bytes: 5242880
    fetch.message.max.bytes: 20971520
    auto.commit.enable: false
    fetch.min.bytes: 5242880
  source:
    topic_name: "topic1"
    max_messages: 1024
    read_parallelism: 64
error_table:
  enabled: true
  dest_path: "/path/to/my.db/table1/.error"
  date_partitioned: true
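For illustration only, a sketch of reading such a job config into nested maps using SnakeYAML (the file name and the use of SnakeYAML are assumptions, not Marmaray's own config loader):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Map;
    import org.yaml.snakeyaml.Yaml;

    final class ConfigSketch {
        @SuppressWarnings("unchecked")
        public static void main(String[] args) throws Exception {
            // Hypothetical file name holding the YAML shown above.
            try (InputStream in = new FileInputStream("marmaray-job.yaml")) {
                Map<String, Object> config = (Map<String, Object>) new Yaml().load(in);
                Map<String, Object> kafka = (Map<String, Object>) config.get("kafka");
                Map<String, Object> source = (Map<String, Object>) kafka.get("source");
                System.out.println("topic = " + source.get("topic_name"));
                System.out.println("read_parallelism = " + source.get("read_parallelism"));
            }
        }
    }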
Monitoring & Alerting
Learnings
- Spark
  - Off-heap memory usage of Spark and YARN killing our containers
  - External shuffle server overloading
- Parquet
  - Better record compression with column alignment
- Kafka
  - Be gentle while reading from Kafka brokers
- Cassandra
  - Cassandra SSTable streaming (no throttling), no monitoring
  - No backfill for dispersal
External Acknowledgments: Apache Gobblin (https://gobblin.readthedocs.io/en/latest/)
Other Relevant Talks
● Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber - Wed 11:20 am
● Hudi: Unifying storage and serving for batch and near-real-time analytics - Wed 5:25 pm
We are hiring! Positions available: Seattle, Palo Alto & San Francisco email : hadoop-platform-jobs@uber.com
Useful links
● https://github.com/uber/marmaray
● https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
● https://github.com/uber/hudi
● https://eng.uber.com/michelangelo/
● https://eng.uber.com/m3/
Q & A?
Follow our Facebook page: www.facebook.com/uberopensource