A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
Eli Levine & Julian Hyde
Apache: Big Data, Miami, 2017/05/16
About us
Eli Levine (@teleturn): PMC member of Phoenix; ASF member
Julian Hyde (@julianhyde): Original developer of Calcite; PMC member of Calcite, Drill, Eagle, Kylin; ASF member
Apache Calcite
Apache top-level project since October 2015
Query planning framework
➢ Relational algebra, rewrite rules
➢ Cost model & statistics
➢ Federation via adapters
➢ Extensible
Packaging
➢ Library
➢ Optional SQL parser, JDBC server
➢ Community-authored rules, adapters
Apache Pig
Apache top-level project
Platform for analyzing large datasets
➢ Uses Pig Latin language
  ○ Relational operators (join, filter)
  ○ Functional operators (map, reduce)
➢ Runs as MapReduce (also Tez)
➢ ETL
➢ Extensible
  ○ LOAD/STORE
  ○ UDFs
Outline
➢ Batch compute on Force.com Platform (Eli Levine)
➢ Apache Calcite deep dive (Julian Hyde)
➢ Building the Pig adapter for Calcite (Eli Levine)
➢ Q&A
Salesforce Platform
Object-relational data model in the cloud
Contains standard objects that users can customize, or users can add their own
SQL-like query language, SOQL (used both real-time and for batch/ETL):

  SELECT DEPT.NAME
  FROM EMPLOYEE
  WHERE FIRST_NAME = 'Eli'

Federated data store: Oracle, HBase, external
User queries span data sources (federated joins)
Salesforce Platform - Current Batch/ETL
Called Async SOQL
➢ REST API
➢ Users supply SOQL and info about where to deposit results
➢ SOQL -> Pig Latin script
➢ Pig loaders move data/computation to HDFS for federated query execution
➢ Own SOQL parsing, no Calcite
Query Planning in Async SOQL
(diagram: current vs. next-generation query-planning pipeline)
Apache Calcite for Next-Gen Optimizer
- Strong relational algebra foundation
- Support for different physical engines
- Pluggable cost model
- Optimization rules
- Federation-aware
[Julian talks about Calcite]
- Building blocks for building a query engine or DB
- Federation-aware
- Represent and optimize the logical plan
- Convert to a physical plan
Planning queries

  select p.productName, count(*) as c
  from splunk.splunk as s
  join mysql.products as p
    on s.productId = p.productId
  where s.action = 'purchase'
  group by p.productName
  order by c desc

Logical plan (leaves first):
  scan [Splunk, table: splunk]    scan [MySQL, table: products]
  -> join [key: productId]
  -> filter [condition: action = 'purchase']
  -> group [key: productName, agg: count]
  -> sort [key: c desc]
Optimized query

  select p.productName, count(*) as c
  from splunk.splunk as s
  join mysql.products as p
    on s.productId = p.productId
  where s.action = 'purchase'
  group by p.productName
  order by c desc

Optimized plan: the filter on action = 'purchase' is pushed below the join, into the Splunk scan:
  scan [Splunk, table: splunk] -> filter [condition: action = 'purchase']
  joined [key: productId] with scan [MySQL, table: products]
  -> group [key: productName, agg: count]
  -> sort [key: c desc]
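The push-down shown above can be sketched as a toy rewrite rule. The Rel/Scan/Filter/Join types below are illustrative stand-ins for Calcite's RelNode classes, and refersOnlyToLeft is a crude stand-in for Calcite's column-lineage analysis; none of these names are Calcite's API.

```java
import java.util.Objects;

// Toy relational operators (hypothetical stand-ins for Calcite's RelNode tree).
interface Rel {}
record Scan(String source, String table) implements Rel {}
record Filter(Rel input, String condition) implements Rel {}
record Join(Rel left, Rel right, String key) implements Rel {}

public class FilterPushDown {
    // Rewrite Filter(Join(l, r)) -> Join(Filter(l), r) when the condition
    // refers only to the left input -- the shape of the optimization above,
    // where action = 'purchase' moves next to the Splunk scan.
    static Rel pushDown(Rel rel) {
        if (rel instanceof Filter f
                && f.input() instanceof Join j
                && refersOnlyToLeft(f.condition())) {
            return new Join(new Filter(j.left(), f.condition()), j.right(), j.key());
        }
        return rel;  // rule does not match; leave the plan unchanged
    }

    // Assumption for this demo schema: only the Splunk (left) input
    // has an 'action' field.
    static boolean refersOnlyToLeft(String condition) {
        return Objects.requireNonNull(condition).startsWith("action");
    }

    public static void main(String[] args) {
        Rel plan = new Filter(
            new Join(new Scan("splunk", "splunk"),
                     new Scan("mysql", "products"),
                     "productId"),
            "action = 'purchase'");
        System.out.println(pushDown(plan));
    }
}
```

In Calcite the same idea is expressed as a planner rule that fires on a Filter-over-Join pattern; the cost model then decides whether the rewritten plan is cheaper.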
Adapters
Connect to a data source
How to push down logic to the data source?
➢ A set of planner rules
➢ Maybe also a calling convention
➢ Maybe also a query model & query generator
Calcite Pig Adapter

  SELECT DEPT_ID
  FROM EMPLOYEE
  GROUP BY DEPT_ID
  HAVING COUNT(DEPT_ID) > 10

  EMPLOYEE = LOAD 'EMPLOYEE' ... ;
  EMPLOYEE = GROUP EMPLOYEE BY (DEPT_ID);
  EMPLOYEE = FOREACH EMPLOYEE GENERATE
      COUNT(EMPLOYEE.DEPT_ID) AS DEPT_ID__COUNT_, group AS DEPT_ID;
  EMPLOYEE = FILTER EMPLOYEE BY (DEPT_ID__COUNT_ > 10);
Building the Pig Adapter
1. Implement Pig-specific RelNodes, e.g. PigFilter
2. RelNode factories
3. Various RelOptRule types for converting abstract RelNodes to PigRelNodes
4. Schema implementation (several iterations)
5. Unit tests run local Pig
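Step 1 can be sketched as follows: each Pig-specific RelNode's job is to append Pig Latin statements to a script being built up. This is a self-contained toy, not the adapter's actual classes; the PigRel interface, its implement() method, and both operator classes here are illustrative names only (the real PigFilter extends Calcite's Filter and carries RexNode conditions rather than strings).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal analogue of the adapter's code-generating RelNodes.
interface PigRel {
    void implement(List<String> script);  // append this node's Pig Latin
    String alias();                       // Pig relation name this node defines
}

class PigTableScan implements PigRel {
    private final String table;
    PigTableScan(String table) { this.table = table; }
    public String alias() { return table; }
    public void implement(List<String> script) {
        script.add(table + " = LOAD '" + table + "';");
    }
}

class PigFilter implements PigRel {
    private final PigRel input;
    private final String condition;  // condition already rendered in Pig syntax
    PigFilter(PigRel input, String condition) {
        this.input = input;
        this.condition = condition;
    }
    // Reuse the input's alias, as in the generated script on the previous
    // slide (EMPLOYEE = FILTER EMPLOYEE BY ...).
    public String alias() { return input.alias(); }
    public void implement(List<String> script) {
        input.implement(script);  // emit upstream statements first
        script.add(alias() + " = FILTER " + alias() + " BY (" + condition + ");");
    }
}

public class PigScriptDemo {
    public static void main(String[] args) {
        PigRel rel = new PigFilter(new PigTableScan("EMPLOYEE"), "DEPT_ID > 10");
        List<String> script = new ArrayList<>();
        rel.implement(script);
        script.forEach(System.out::println);
    }
}
```

Walking the tree bottom-up this way yields a complete script, which the adapter then hands to Pig for execution (in unit tests, a local Pig instance).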
Lessons Learned
- Calcite is very flexible (both good and bad)
- Lots available out of the box
- Little documentation on writing optimization and conversion rules
- Dynamic code generation using Janino can produce cryptic errors
- SQL maps well to Pig; the inverse is not always true
Thank you! Eli Levine @teleturn Julian Hyde @julianhyde http://calcite.apache.org http://pig.apache.org