Power Pig with Spark Kelly Zhang (liyun.zhang@intel.com) Apache Big Data Europe 2016

Agenda ● Background ● Why Pig on Spark? ● Design Architecture ● Benchmark ● Optimization ● Current Status & Future Work ● Q&A

Background

Apache Pig ● Procedural scripting language ● Pig Latin: similar to SQL ● Heavily used for ETL ● Data with or without a schema: Pig eats everything

Spark ● Speed ● Generality ● Ease of use

Why Pig on Spark ● Better Performance ○ No intermediate data between stages ○ In-memory caching abstraction ○ Executor JVM reuse ● Lets Pig users try Spark conveniently

Design Architecture

Pig Latin to RDD<Tuple> transformations

Operator Mapping
Pig Operator    Spark Operator
Load            newAPIHadoopFile
Store           saveAsNewAPIHadoopFile
Filter          filter
GroupBy         groupby/reduceBy
Join            CoGroupRDD
ForEach         mapPartitions
Sort            sortByKey
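To make the mapping concrete, here is a minimal, Spark-free sketch in plain Python. It models an RDD as a list of partitions and implements local stand-ins named after the Spark operations in the table. The Pig script in the comment, the sample data, and every function name are assumptions made for illustration; none of this is the Pig on Spark implementation, and no Spark API is called.

```python
from operator import itemgetter

# Hypothetical Pig Latin script being modeled:
#   A = LOAD 'visits' AS (user, url, ms);
#   B = FILTER A BY ms > 100;
#   C = GROUP B BY user;
#   D = FOREACH C GENERATE group, COUNT(B);
#   E = ORDER D BY group;

def filter_rdd(partitions, predicate):
    """Filter -> filter: keep matching tuples, partition by partition."""
    return [[t for t in part if predicate(t)] for part in partitions]

def group_by_key(partitions, key_fn, num_partitions):
    """GroupBy -> group: route tuples to a partition by hash of the key (a toy shuffle)."""
    buckets = [dict() for _ in range(num_partitions)]
    for part in partitions:
        for t in part:
            key = key_fn(t)
            buckets[hash(key) % num_partitions].setdefault(key, []).append(t)
    return [list(bucket.items()) for bucket in buckets]

def map_partitions(partitions, fn):
    """ForEach -> mapPartitions: apply fn to an iterator over each partition."""
    return [list(fn(iter(part))) for part in partitions]

def sort_by_key(partitions):
    """Sort -> sortByKey: for simplicity, produce one globally sorted partition."""
    merged = [t for part in partitions for t in part]
    return [sorted(merged, key=itemgetter(0))]

def run_pipeline(visits):
    b = filter_rdd(visits, lambda t: t[2] > 100)
    c = group_by_key(b, itemgetter(0), num_partitions=2)
    d = map_partitions(c, lambda rows: ((user, len(bag)) for user, bag in rows))
    return sort_by_key(d)

visits = [
    [("alice", "/a", 250), ("bob", "/b", 50)],
    [("alice", "/c", 120), ("carol", "/a", 300)],
]
print(run_pipeline(visits))
```

Each Pig statement becomes one transformation on the partitioned data, which is the shape the "Pig Latin to RDD<Tuple> transformations" slide describes.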

Benchmark Overview
Component    Version
Pig          Spark branch
Hadoop       2.6.0
Spark        1.6.2
PigMix       Trunk

Basic Configuration
spark.master=yarn-client
spark.executor.memory=6553m
spark.yarn.executor.memoryOverhead=1638
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.network.timeout=1200000
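As a rough sanity check on these settings, the sketch below parses the properties and adds the executor heap to the YARN memory overhead. It assumes the usual Spark-on-YARN behaviour, where the container request per executor is the heap plus the overhead; YARN may round the request up to its own allocation increment. The helper names `parse_properties` and `to_mib` are hypothetical, written for this example only.

```python
BENCHMARK_CONF = """
spark.master=yarn-client
spark.executor.memory=6553m
spark.yarn.executor.memoryOverhead=1638
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.network.timeout=1200000
"""

def parse_properties(text):
    """Parse key=value lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def to_mib(value):
    """Convert '6553m', '6g', or a bare MiB count to an integer number of MiB."""
    value = value.strip().lower()
    if value.endswith("g"):
        return int(value[:-1]) * 1024
    if value.endswith("m"):
        return int(value[:-1])
    return int(value)

props = parse_properties(BENCHMARK_CONF)
heap = to_mib(props["spark.executor.memory"])
overhead = to_mib(props["spark.yarn.executor.memoryOverhead"])
container = heap + overhead  # assumed per-executor container request, in MiB
print(container)  # 8191
```

Under that assumption each executor asks YARN for just under 8 GiB, shared by 8 cores.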

Benchmark Overview (cont’d)

Optimize GroupBy/Join

Skewed Key Sort

Salted Key Solution
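The slides do not spell out the salting scheme, so the following is only a minimal sketch of the general technique, in plain Python rather than Spark. It assumes that each sort key is paired with a random salt, that range partitioning is done on the (key, salt) pair, and that the salt is dropped after each partition is sorted. The function name `salted_sort`, its parameters, and the sample data are hypothetical.

```python
import random
from bisect import bisect_right

def salted_sort(records, num_partitions, salt_range, seed=0):
    """Range-partition and sort (key, value) records on (key, salt).

    Returns the globally sorted records and the size of each partition.
    With salt_range=1 every salt is 0, which models the unsalted case.
    """
    rng = random.Random(seed)
    # Attach a random salt so rows sharing a hot key get distinct composite keys.
    salted = [((key, rng.randrange(salt_range)), value) for key, value in records]

    # Pick partition boundaries from the sorted composite keys.
    # (A real system would estimate these from a sample.)
    ordered = sorted(composite for composite, _ in salted)
    step = len(ordered) / num_partitions
    boundaries = [ordered[int(step * i)] for i in range(1, num_partitions)]

    # Route each row to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for composite, value in salted:
        partitions[bisect_right(boundaries, composite)].append((composite, value))

    # Sort partitions independently, then concatenate them in order and drop the salt.
    result, sizes = [], []
    for part in partitions:
        part.sort(key=lambda row: row[0])
        sizes.append(len(part))
        result.extend((composite[0], value) for composite, value in part)
    return result, sizes

# One hot key ("hot") dominates the input.
records = [("a", i) for i in range(5)] + [("hot", i) for i in range(90)] + [("z", i) for i in range(5)]

unsalted_result, unsalted_sizes = salted_sort(records, num_partitions=4, salt_range=1)
salted_result, salted_sizes = salted_sort(records, num_partitions=4, salt_range=1000)
print(unsalted_sizes)  # [5, 0, 0, 95]
print(salted_sizes)
```

Without salt, every row with the hot key lands in a single partition. With salt, the same rows spread across partitions, and the output is still ordered by key because rows with equal keys may appear in any relative order.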

Skewed Key Sort Performance ● Significant performance improvement in the sort case (L10) and the skewed key sort case (L9)

Current Status: Nearing end of Milestone 1 ● Functional completeness: DONE ● All Unit Tests Pass: DONE ● Merge Spark Branch to Master: In Code Review

Ongoing Work towards Milestone 2 ● Implement Optimizations ○ Optimize Group by/Join - PIG-4797: DONE ○ FR Join - PIG-4771: DONE ○ Merge Join - PIG-4810: DONE ○ Skewed Join: UNDER REVIEW ● Enhance Test Infrastructure ○ Use “local-cluster” mode to run unit tests ● Spark Integration ○ Improved error, progress, stats reporting ○ YARN Cluster Mode

Future work: Milestone 3 ● Implement More Optimizations ○ Split / MultiQuery using RDD.cache() ○ Compute optimal Shuffle Parallelism ○ Optimize/Redesign Spark Plan ● Code Stabilization, Bug Fixes

Contributions welcome ● Git: ○ https://github.com/apache/pig/tree/spark ● Wiki: ○ https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark ● Umbrella JIRA: ○ PIG-4059
