SparkSQL
Where are we?

[Stack diagram: query languages (Pig Latin, HiveQL, SQL, ...) sit on top of engines (Pig, Hive, ???), which run on Hadoop MapReduce and Spark RDD, which in turn run on HDFS. The "???" marks the missing SQL engine for Spark.]
Shark (Spark on Hive)
- A small side project that aimed to run RDD jobs on Hive data using HiveQL
- Still limited to the data model of Hive
- Tied to the Hadoop world
SparkSQL
- Redesigned around the Spark query model
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the Dataframe API as an enhancement to the RDD API
- Dataframe = RDD + schema
Dataframes
- SparkSQL's counterpart to relations or tables in an RDBMS
- Consists of rows and columns
- A dataframe is NOT in 1NF. Why?
- Can be created from various data sources:
  - CSV file
  - JSON file
  - MySQL database
  - Hive
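As a quick illustration, here is a minimal sketch of creating dataframes from a few of these sources (the file names and MySQL connection details are hypothetical; sparkS is the SparkSession created on the Code Setup slide later):

  // Hypothetical file names and connection settings, for illustration only
  Dataset<Row> csvDf  = sparkS.read().option("header", "true").csv("people.csv");
  Dataset<Row> jsonDf = sparkS.read().json("people.json");
  Dataset<Row> jdbcDf = sparkS.read().format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb") // requires a MySQL JDBC driver
      .option("dbtable", "people")
      .load();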
Dataframe vs. RDD

  Dataframe                            RDD
  ---------                            ---
  Lazy execution                       Lazy execution
  Spark is aware of the data model     The data model is hidden from Spark
  Spark is aware of the query logic    The transformations and actions are black boxes
  Can optimize the query               Cannot optimize the query
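To make the contrast concrete, here is a sketch of the same selection written both ways (log_file comes from the later Code Setup slide; lines is a hypothetical JavaRDD<String> of raw log lines, and the column index in the lambda is assumed):

  // RDD version: the lambda is a black box that Spark cannot inspect
  JavaRDD<String> okRdd = lines.filter(line -> line.split("\t")[2].equals("200"));

  // Dataframe version: the predicate is an expression Spark can parse,
  // so the optimizer knows exactly which column and value are involved
  Dataset<Row> okDf = log_file.filter("response = 200");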
Built-in operations in SparkSQL
- Filter (Selection)
- Select (Projection)
- Join
- GroupBy (Aggregation)
- Load/Store in various formats
- Cache
- Conversion to/from RDDs (back and forth), as sketched below
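A minimal sketch exercising several of these operations on the NASA log dataframe used in the later examples (the column names response and bytes come from that example; the output path is hypothetical):

  import static org.apache.spark.sql.functions.sum;

  Dataset<Row> projected = log_file.select("response", "bytes");           // Projection
  Dataset<Row> filtered  = projected.filter("bytes > 1024");               // Selection
  Dataset<Row> grouped   = filtered.groupBy("response").agg(sum("bytes")); // Aggregation
  grouped.cache();                        // Keep the result in memory for reuse
  grouped.write().json("bytes_per_code"); // Store in JSON format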
SparkSQL Examples
Project Setup

  # In the dependencies section of pom.xml
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.1</version>
  </dependency>
Code Setup

  // Create a SparkSession, the entry point to SparkSQL
  SparkSession sparkS = SparkSession
      .builder()
      .appName("Spark SQL examples")
      .master("local")
      .getOrCreate();

  // Load a tab-separated log file into a dataframe,
  // reading the header line and inferring column types
  Dataset<Row> log_file = sparkS.read()
      .option("delimiter", "\t")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("nasa_log.tsv");
  log_file.show();
Filter Example

  // Select OK lines
  Dataset<Row> ok_lines = log_file.filter("response=200");
  long ok_count = ok_lines.count();
  System.out.println("Number of OK lines is "+ok_count);

  // Grouped aggregation using SQL
  // (the dataframe must first be registered as a view
  //  so the SQL query can refer to it by name)
  log_file.createOrReplaceTempView("log_lines");
  Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
      "SELECT response, sum(bytes) FROM log_lines GROUP BY response");
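For comparison, the same grouped aggregation can be written with the dataframe API instead of an SQL string (a sketch using the built-in sum function):

  import static org.apache.spark.sql.functions.sum;

  Dataset<Row> bytesPerCode2 = log_file.groupBy("response").agg(sum("bytes"));
  bytesPerCode2.show();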
SparkSQL Features
- Catalyst query optimizer
- Code generation
- Integration with libraries
SparkSQL Query Plan

[Pipeline diagram: a SQL AST or DataFrame starts as an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects one Physical Plan; Code Generation turns it into RDDs.]

DataFrames and SQL share the same optimization/execution pipeline.
Credits: M. Armbrust
Catalyst Query Optimizer
- Extensible rule-based optimizer
- Users can define their own rules

[Example diagram, filter push-down:
  Original plan:          Project(name) <- Filter(id = 1) <- Project(id, name) <- People
  After filter push-down: Project(name) <- Project(id, name) <- Filter(id = 1) <- People]
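Catalyst's work can be observed directly: explain(true) prints the parsed, analyzed, optimized, and physical plans, and on a query like the sketch below (using the log_file dataframe from the earlier examples) the optimizer pushes the filter below the projection, as in the diagram above:

  // Projection followed by a selection; Catalyst reorders them
  log_file.select("response", "bytes")
          .filter("response = 200")
          .explain(true); // prints all intermediate query plans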
Code Generation
- Shift from black-box UDFs to expressions

Example:

  // Filter
  Dataset<Row> ok_lines = log_file.filter("response=200");

  // Grouped aggregation
  Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
      "SELECT response, sum(bytes) FROM log_lines GROUP BY response");

SparkSQL understands the logic of user queries and rewrites them in a more efficient form.
Integration
- SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR
- SparkSQL is also integrated with the RDD interface; the two can be mixed in one program, as sketched below
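A minimal sketch of mixing the two interfaces, assuming the log_file dataframe and sparkS session from the earlier slides:

  // Dataframe -> RDD: each row becomes an element of a JavaRDD<Row>
  JavaRDD<Row> logRdd = log_file.javaRDD();

  // Arbitrary RDD logic (opaque to the optimizer), e.g. a 1% sample
  JavaRDD<Row> sample = logRdd.sample(false, 0.01);

  // RDD -> Dataframe: reattach the schema to return to SparkSQL
  Dataset<Row> sampleDf = sparkS.createDataFrame(sample, log_file.schema());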
Further Reading
- Documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
- SparkSQL paper: M. Armbrust et al. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015