CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data
Instructor: Duen Horng (Polo) Chau
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (Manheim, GT)

What is Spark?
http://spark.apache.org
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

What is Spark SQL? (Formerly called Shark)
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

Project History [latest: v1.1]
Spark project started in 2009 at UC Berkeley AMPLab, open sourced 2010
Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scale to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …
http://en.wikipedia.org/wiki/Apache_Spark

Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Require faster data sharing across parallel jobs

Up for debate … as of 10/7/2014
Is MapReduce dead?
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/

Data Sharing in MapReduce
[Figure: an iterative job (iter. 1, iter. 2, …) reads from and writes to HDFS around every iteration; interactive queries (query 1, query 2, query 3, …) each read the input from HDFS to produce result 1, result 2, result 3.]
Slow due to replication, serialization, and disk IO

Data Sharing in Spark
[Figure: the input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) then share data through that memory.]
10-100× faster than network and disk
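The in-memory sharing pictured above can be sketched as an iterative job that caches its input once and then reuses it on every pass. This is only an illustration under stated assumptions: it assumes spark-core is on the classpath and runs in local mode, and the object name `IterativeSharing`, the helper `fit`, and the toy task (estimating a mean by gradient steps) are hypothetical, not taken from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example: names and the toy problem are illustrative only.
object IterativeSharing {
  // Estimate the mean of `data` with repeated gradient steps.
  // The RDD is marked for caching once; every iteration then runs a job
  // over the in-memory partitions instead of re-reading the source.
  def fit(data: RDD[Double], iterations: Int, lr: Double = 0.5): Double = {
    data.cache()
    val n = data.count() // first action: materializes the cache
    (1 to iterations).foldLeft(0.0) { (w, _) =>
      val grad = data.map(x => w - x).sum() / n // one job per iteration
      w - lr * grad
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterative-sharing").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize((1 to 100).map(_.toDouble))
      println(fit(data, iterations = 30))
    } finally {
      sc.stop()
    }
  }
}
```

Without the `cache()` call the same loop would still work, but each iteration would recompute the RDD from its source, which is the pattern the MapReduce slide describes as slow.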

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala, Python console
» Supported languages: Java, Scala, Python
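A minimal sketch of the RDD ideas above, assuming spark-core is available and running in local mode. The object name `RddBasics` and the helper `evenSquares` are hypothetical. It shows parallel operators (`filter`, `map`) building a new RDD, `cache()` marking it for in-memory reuse, and `toDebugString` printing the lineage Spark would use to rebuild lost partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example: names are illustrative only.
object RddBasics {
  // Transformations are lazy: this only records how to derive the new RDD.
  def evenSquares(nums: RDD[Int]): RDD[Int] =
    nums.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Distribute a local collection across 4 partitions.
      val nums = sc.parallelize(1 to 10, numSlices = 4)
      val squares = evenSquares(nums).cache()

      // Lineage: the chain of transformations used for recovery on failure.
      println(squares.toDebugString)

      // Actions trigger execution; the second one reuses the cached data.
      println(squares.collect().mkString(", "))
      println(squares.reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}
```

The same statements can be typed one at a time into the interactive Scala shell, where a SparkContext is already provided.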

http://www.scala-lang.org/old/faq/4
Java vs Scala: http://www.toptal.com/scala/why-should-i-learn-scala

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")                // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))        // Transformed RDD
    val messages = errors.map(_.split('\t')(2))             // keep the third tab-separated field
    val cachedMsgs = messages.cache()                       // keep the messages in memory

    cachedMsgs.filter(_.contains("foo")).count              // Action

[Figure: the Driver sends tasks to three Workers; each Worker holds one block of the input (Block 1, Block 2, Block 3).]
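The log-mining example can be tried without a cluster. The sketch below assumes spark-core on the classpath and local mode, and replaces the elided HDFS path with a few made-up, tab-separated log lines. The sample data, the object name `LogMining`, and the helper `errorMessages` are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example: sample log lines and names are illustrative only.
object LogMining {
  // Keep only ERROR lines and extract the third tab-separated field.
  def errorMessages(lines: RDD[String]): RDD[String] =
    lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2))

  // Made-up log lines in the assumed format: level, date, message.
  val sampleLog: Seq[String] = Seq(
    "ERROR\t2014-10-07\tdisk foo failed",
    "INFO\t2014-10-07\tjob started",
    "ERROR\t2014-10-07\tnetwork timeout",
    "ERROR\t2014-10-07\tfoo not found"
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-mining").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val cachedMsgs = errorMessages(sc.parallelize(sampleLog)).cache()

      // The first action computes and caches the messages;
      // later searches run against the in-memory copy.
      println(cachedMsgs.filter(_.contains("foo")).count())
      println(cachedMsgs.filter(_.contains("timeout")).count())
    } finally {
      sc.stop()
    }
  }
}
```

Swapping `sc.parallelize(sampleLog)` for `sc.textFile(...)` with a real path gives the shape used on the slide.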
