Apache Hadoop Ingestion & Dispersal Framework


  1. Strata NY 2018, September 12, 2018. Apache Hadoop Ingestion & Dispersal Framework. Danny Chen (dannyc@uber.com), Omkar Joshi (omkar@uber.com), Eric Sayle (esayle@uber.com). Uber Hadoop Platform Team

  2. Agenda ● Mission ● Overview ● Need for Hadoop ingestion & dispersal framework ● Deep Dive ○ High Level Architecture ○ Abstractions and Building Blocks ● Configuration & Monitoring of Jobs ● Completeness & Data Deletion ● Learnings

  3. Uber Apache Hadoop Platform Team Mission Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Hadoop ecosystem.

  4. Overview ● Any source to any sink ● Ease of onboarding ● Business impact & importance of data & data store location ● Suite of Hadoop ecosystem tools

  5. Introducing

  6. Open Sourced in September 2018 https://github.com/uber/marmaray Blog Post: https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

  7. Marmaray (Ingestion): Why? ● Raw data needed in Hadoop data lake ● Ingested raw data -> derived datasets ● Reliable and correct schematized data ● Maintenance of multiple data pipelines

  8. Marmaray (Dispersal): Why? ● Derived datasets in Hive ● Need arose to serve live traffic ● Duplicate and ad hoc dispersal pipelines ● Future dispersal needs

  9. Marmaray: Main Features ● Released to production at the end of 2017 ● Automated schema management ● Integration w/ monitoring & alerting systems ● Fully integrated with workflow orchestration tool ● Extensible architecture ● Open sourced

  10. Marmaray: Uber Eats Use Case

  11. Hadoop Data Ecosystem at Uber

  12. Hadoop Data Ecosystem at Uber [Diagram: Marmaray ingestion into the Hadoop data lake, analytical processing on top, and Marmaray dispersal out to online stores such as Schemaless]

  13. High-Level Architecture & Technical Deep Dive

  14. High-Level Architecture [Architecture diagram: Datafeed Config Store, Schema Service, Error Tables, Metadata Manager (checkpoint store), Work Unit Calculator, source and sink storage-system connectors joined by a chain of converters (Converter 1 -> Converter 2), and the M3 Monitoring & Alerting System]

  15. High-Level Architecture [Same architecture diagram, repeated]

  16. Schema Service [Diagram: a service reader/decoder gets the schema by name & version from the Schema Service and decodes binary data into a GenericRecord; a service writer/encoder gets the schema and encodes a GenericRecord back into binary data]
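
The decode path on this slide can be sketched with Avro's GenericDatumReader. This is a minimal sketch: the SchemaServiceClient interface and its getSchema(name, version) method are hypothetical stand-ins for the internal schema service, not the actual Marmaray API; only the Avro classes are real.

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public class SchemaServiceDecoder {

        // Hypothetical client for the schema service shown on the slide.
        public interface SchemaServiceClient {
            Schema getSchema(String name, int version);
        }

        private final SchemaServiceClient schemaService;

        public SchemaServiceDecoder(SchemaServiceClient schemaService) {
            this.schemaService = schemaService;
        }

        // Fetch the schema by name & version, then decode the binary payload
        // into an Avro GenericRecord.
        public GenericRecord decode(String schemaName, int version, byte[] payload) throws IOException {
            Schema schema = schemaService.getSchema(schemaName, version);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            return reader.read(null, decoder);
        }
    }

The write path is the mirror image: fetch the same schema and use a GenericDatumWriter with a binary encoder to turn the GenericRecord back into bytes.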

  17. High-Level Architecture [Same architecture diagram, here showing a Topic Config Store in place of the Datafeed Config Store]

  18. Metadata Manager [Diagram: init() is called on job start; set(key, value) and get(key) -> value are called zero or more times by different JobDag components against an in-memory copy; persist() is called after the job finishes, writing the copy to persistent storage (e.g. HDFS)]
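
That checkpoint-store lifecycle can be captured as a small interface. The sketch below follows the method names on the slide; the actual Marmaray class may differ.

    import java.io.IOException;
    import java.util.Optional;

    // Checkpoint-store contract implied by the slide: an in-memory copy that
    // JobDag components read and write, persisted (e.g. to HDFS) after the job.
    public interface MetadataManager {

        // Called on job start: load the previous checkpoint into memory.
        void init() throws IOException;

        // Called zero or more times by JobDag components during the run.
        void set(String key, String value);

        Optional<String> get(String key);

        // Called after the job finishes: write the in-memory copy to
        // persistent storage such as HDFS.
        void persist() throws IOException;
    }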

  19. Fork Operator - Why is it needed? ● Avoid reprocessing input records ● Avoid re-reading input records (or in Spark, re-executing input transformations) [Diagram: input records split into schema-conforming records and error records]

  20. Fork Operator & Fork Function [Diagram: the fork function tags each input record (r1 ... rx) with success/failure; a success filter function emits schema-conforming records and a failure filter function emits error records; tagged records are persisted using Spark's disk/memory persistence level]
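
A rough illustration of the idea in Spark's Java API (not Marmaray's actual ForkOperator; the conformsToSchema check is a placeholder): tag each record once, persist the tagged RDD, then derive the schema-conforming and error outputs by filtering, so the input is never re-read and the upstream transformations are never re-executed.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    public class ForkSketch {

        // Stand-in for the real schema-conformance check.
        static boolean conformsToSchema(String record) {
            return !record.isEmpty();
        }

        public static void fork(JavaRDD<String> input) {
            // Tag each record once with success/failure.
            JavaRDD<Tuple2<String, Boolean>> tagged =
                    input.map(r -> new Tuple2<String, Boolean>(r, conformsToSchema(r)));

            // Persist so the two filters below do not re-execute the input transformations.
            tagged.persist(StorageLevel.MEMORY_AND_DISK());

            JavaRDD<String> conforming = tagged.filter(t -> t._2()).map(t -> t._1());
            JavaRDD<String> errors = tagged.filter(t -> !t._2()).map(t -> t._1());

            // conforming -> sink converter/writer; errors -> error tables.
            System.out.println("conforming=" + conforming.count() + " errors=" + errors.count());
        }
    }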

  21. Easy to Add Support for New Source & Sink [Diagram: sources and sinks (Kafka, Hive, S3, Cassandra, new sources) all connect through the data lake with GenericRecord as the common format]

  22. Support for Writing into Multiple Systems [Diagram: a Kafka source read into the data lake as GenericRecords and written out to both Hive Table 1 and Hive Table 2]

  23. JobDag & JobDagActions [Diagram: after a JobDag runs, its JobDagActions report metrics for monitoring and register the table in Hive]
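
A minimal sketch of the post-run hook pattern the slide describes; the interface and class names are illustrative, not the exact Marmaray API.

    // Hook that runs after a JobDag completes.
    interface JobDagAction {
        void execute(boolean dagSucceeded);
    }

    // Example actions matching the slide: report metrics, register the Hive table.
    class ReportMetricsAction implements JobDagAction {
        @Override
        public void execute(boolean dagSucceeded) {
            System.out.println("reporting job metrics, success=" + dagSucceeded);
        }
    }

    class RegisterHiveTableAction implements JobDagAction {
        @Override
        public void execute(boolean dagSucceeded) {
            if (dagSucceeded) {
                System.out.println("registering output table in Hive");
            }
        }
    }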

  24. Need for Running Multiple JobDags Together ● Frequency of data arrival ● Number of messages ● Avg record size & complexity of schema ● A Spark job has a driver + executors (1 or more) ● Not an efficient model to handle spikes ● Too many topics to ingest: 2000+

  25. JobManager ● Single Spark job for running ingestion for 300+ topics ● Executes multiple JobDAGs ● Manages execution ordering for multiple JobDAGs ● Manages shared Spark context ● Waiting queue for JobDAGs ● Enables job- and tier-level locking [Diagram: 1 Spark job; the Job Manager dispatches JobDAG 1 ... JobDAG N, each ingesting one Kafka topic]
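
A heavily simplified sketch of that pattern (illustrative only; the real JobManager also handles execution ordering and job/tier-level locking, which are omitted here): one shared Spark context and a waiting queue of JobDags executed within a single Spark application.

    import java.util.Queue;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class JobManagerSketch {

        // Illustrative stand-in for a Marmaray JobDag.
        interface JobDag {
            String name();
            void execute(JavaSparkContext sharedContext);
        }

        public static void runAll(Queue<JobDag> waitingQueue) {
            SparkConf conf = new SparkConf().setAppName("marmaray-ingestion");
            try (JavaSparkContext sharedContext = new JavaSparkContext(conf)) {
                // One Spark application executes many JobDags (e.g. one per Kafka topic),
                // sharing executors instead of launching a separate job per topic.
                while (!waitingQueue.isEmpty()) {
                    JobDag dag = waitingQueue.poll();
                    dag.execute(sharedContext);
                }
            }
        }
    }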

  26. Completeness [Diagram: 10-minute buckets at the source (Kafka) compared against the same 10-minute buckets at the sink (Hive), up to the latest bucket]

  27. Completeness contd. ● Why not run queries on the source and sink datasets periodically? ○ Possible for very small datasets ○ Won't work for billions of records; very expensive!! ● Bucketizing records ○ Create time-based buckets, say every 2 or 10 minutes ○ Count records at source and sink during every run ■ Does it give a 100% guarantee? No, but w.h.p. it is close to it.
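
A toy sketch of the bucketing idea (illustrative only; in practice the counts would be computed inside the ingestion job rather than over in-memory lists): assign each record's event time to a 10-minute bucket, count per bucket at source and sink, and flag buckets whose counts diverge.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CompletenessCheck {

        private static final long BUCKET_MS = 10 * 60 * 1000L; // 10-minute buckets

        static long bucketOf(long eventTimeMs) {
            return eventTimeMs - (eventTimeMs % BUCKET_MS);
        }

        static Map<Long, Long> countPerBucket(List<Long> eventTimesMs) {
            Map<Long, Long> counts = new HashMap<>();
            for (long ts : eventTimesMs) {
                counts.merge(bucketOf(ts), 1L, Long::sum);
            }
            return counts;
        }

        // Report buckets where the sink count does not match the source count.
        static void compare(Map<Long, Long> source, Map<Long, Long> sink) {
            source.forEach((bucket, srcCount) -> {
                long sinkCount = sink.getOrDefault(bucket, 0L);
                if (sinkCount != srcCount) {
                    System.out.println("bucket " + bucket + ": source=" + srcCount + " sink=" + sinkCount);
                }
            });
        }
    }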

  28. Completeness - High-Level Approach [Diagram: inside Marmaray, Input Records (IR) from Kafka pass through the source converter, producing Input Success Records (ISR) or Input Error Records (IER); the sink converter produces Output Records (OR) written to the Hoodie sink (Hive) or Output Error Records (OER) written to the error table; the IR, IER, OER, and OR counts are compared]

  29. Hadoop's Old Way of Storing Kafka Data [Diagram: date-partitioned layout for a Kafka topic; older partitions (2014, 2015) hold stitched parquet files (~4 GB, ~400 files per partition), while the latest date partitions (2018) hold non-stitched parquet files (~40 MB, ~20-40K files per partition)]

  30. Data Deletion (Kafka) ● Old architecture is designed to be append/read only ● No indexes ○ Need to scan the entire partition to find out whether a record is present ● Only way to update is to rewrite the entire partition ○ GDPR requires data to be cleaned up once a user requests deletion, which means rewriting entire partitions ● This is a big architectural change and many companies are struggling to solve it

  31. Marmaray + HUDI (Hoodie) to the rescue

  32. Hoodie Data Layout [Diagram: date partitions (2014/01, 2015/01, 2015/02, ..., 2018/06, 2018/08) contain versioned parquet files (f1_ts1.parquet ... f8_ts3.parquet); updates to the Kafka topic create new file versions (e.g. f1_ts1.parquet -> f1_ts3.parquet), and a .hoodie directory holds the commit metadata (ts1.commit, ts2.commit, ts3.commit)]

  33. Configuration
    common:
      hadoop:
        fs.defaultFS: "hdfs://namenode/"
    hoodie:
      table_name: "mydb.table1"
      base_path: "/path/to/my.db/table1"
      metrics_prefix: "marmaray"
      enable_metrics: true
      parallelism: 64
    kafka:
      conn:
        bootstrap.servers: "kafkanode1:9092,kafkanode2:9092"
        fetch.wait.max.ms: 1000
        socket.receive.buffer.bytes: 5242880
        fetch.message.max.bytes: 20971520
        auto.commit.enable: false
        fetch.min.bytes: 5242880
      source:
        topic_name: "topic1"
        max_messages: 1024
        read_parallelism: 64
    error_table:
      enabled: true
      dest_path: "/path/to/my.db/table1/.error"
      date_partitioned: true

  34. Monitoring & Alerting

  35. Learnings
    - Spark
      - Off-heap memory usage of Spark and YARN killing our containers
      - External shuffle server overloading
    - Parquet
      - Better record compression with column alignments
    - Kafka
      - Be gentle while reading from Kafka brokers
    - Cassandra
      - Cassandra SSTable streaming (no throttling), no monitoring
      - No backfill for dispersal

  36. External Acknowledgments https://gobblin.readthedocs.io/en/latest/

  37. Other Relevant Talks ● Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber - Wed 11:20am ● Hudi: Unifying storage and serving for batch and near-real-time analytics - Wed 5:25pm

  38. We are hiring! Positions available: Seattle, Palo Alto & San Francisco. Email: hadoop-platform-jobs@uber.com

  39. Useful links ● https://github.com/uber/marmaray ● https://eng.uber.com/marmaray-hadoop-ingestion-open-source/ ● https://github.com/uber/hudi ● https://eng.uber.com/michelangelo/ ● https://eng.uber.com/m3/

  40. Q & A?

  41. Follow our Facebook page: www.facebook.com/uberopensource
