BIG DATA STATE OF THE ART: SPARK AND THE SQL RESURGENCE
Dean Wampler, Ph.D.
Typesafe
Monday, September 29, 2014

Dean Wampler
dean.wampler@typesafe.com
polyglotprogramming.com/talks
@deanwampler
Books: Functional Programming for Java Developers (Dean Wampler); Programming Hive (Edward Capriolo, Dean Wampler & Jason Rutherglen)
About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com.

It’s 2014

Hadoop has been very successful.

But it’s not perfect

MapReduce
The limitations of MapReduce have become increasingly significant...

1 Map step + 1 Reduce step
You get one map step and one reduce step. You can sequence jobs together when necessary.
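To make the "one map step, one reduce step" shape concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class name OneMapOneReduce, the run helper, and the word-count example are hypothetical names chosen for illustration, and the grouping loop only stands in for the shuffle that a real cluster performs between the two steps.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical illustration only: an in-memory model of a single map step
// followed by a single reduce step. No Hadoop classes are used.
final class OneMapOneReduce {

    // Applies the mapper to every input record, groups the emitted
    // (key, value) pairs by key, then calls the reducer once per key.
    static <I, K, V, R> Map<K, R> run(List<I> input,
                                      Function<I, List<Map.Entry<K, V>>> mapper,
                                      BiFunction<K, List<V>, R> reducer) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> pair : mapper.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<K, R> output = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count expressed as exactly one map step and one reduce step.
    static Map<String, Integer> wordCount(List<String> lines) {
        Function<String, List<Map.Entry<String, Integer>>> mapper = line -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            StringTokenizer tokens = new StringTokenizer(line.toLowerCase());
            while (tokens.hasMoreTokens()) {
                pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
            }
            return pairs;
        };
        BiFunction<String, List<Integer>, Integer> reducer = (word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            return sum;
        };
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
            "Hadoop provides MapReduce and HDFS",
            "HBase stores data in HDFS")));
    }
}
```

Anything that does not fit this single map-group-reduce pass has to be expressed as a sequence of such runs, each one consuming the output of the previous one.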

Problems: Limited programming model
MapReduce is a restrictive model. Writing jobs requires arcane, specialized skills that few master. It’s not easy to map arbitrary algorithms onto this model. For example, algorithms that are naturally iterative are especially hard, because MR doesn’t support iteration efficiently. For a good overview, see http://lintool.github.io/MapReduceAlgorithms/. The limited model doesn’t just impede developer productivity; it also makes it much harder to build tools on top of the model, as we’ll discuss.

Problems: The Hadoop API is horrible
And Hadoop’s Java API only makes the problem harder, because it’s very low level and offers limited or no support for many common idioms.

Example: Inverted Index
To compare and contrast MR with Spark, let’s use this classic algorithm.

Inverted Index: Web Crawl
[Figure: the web crawl produces an index of (document, contents) records, stored in blocks:
  wikipedia.org/hadoop: Hadoop provides MapReduce and HDFS
  wikipedia.org/hbase: HBase stores data in HDFS
  wikipedia.org/hive: Hive queries HDFS files and HBase tables with SQL]
First we crawl the web to build a data set of document names/ids and their contents. Then we “invert” it; we tokenize the contents into words and build a new index from each word to the list of documents that contain the word and the count in each document. This is a basic data set for search engines.

Inverted Index: Compute Inverted Index
[Figure: a “Miracle!!” step turns the crawl index into the inverse index, also stored in blocks:
  hadoop: (.../hadoop,1)
  hbase: (.../hbase,1),(.../hive,1)
  hdfs: (.../hadoop,1),(.../hbase,1),(.../hive,1)
  hive: (.../hive,1)
  and: (.../hadoop,1),(.../hive,1)]

Inverted Index: Altogether
[Figure: the two previous diagrams combined: the Web Crawl index, the “Miracle!!” step (Compute Inverted Index), and the resulting inverse index.]
We’ll implement the “miracle”.
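Before looking at the Hadoop version, it may help to see the "miracle" as a single-machine sketch in plain Java. This is an illustration under stated assumptions, not the implementation from the talk: the class InvertedIndexSketch and its methods are hypothetical names, the three sample documents are the ones shown in the figures, and tokenization is a simple lowercase whitespace split.

```java
import java.util.*;

// Hypothetical single-machine sketch of the inversion step. It maps each
// word to the documents that contain it, with a per-document count.
final class InvertedIndexSketch {

    // Input:  document name -> contents (the crawl index).
    // Output: word -> (document name -> occurrences in that document).
    static Map<String, Map<String, Integer>> invert(Map<String, String> crawl) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : crawl.entrySet()) {
            StringTokenizer tokens = new StringTokenizer(doc.getValue().toLowerCase());
            while (tokens.hasMoreTokens()) {
                index.computeIfAbsent(tokens.nextToken(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // The three documents shown in the slides' figures.
    static Map<String, String> sampleCrawl() {
        Map<String, String> crawl = new LinkedHashMap<>();
        crawl.put("wikipedia.org/hadoop", "Hadoop provides MapReduce and HDFS");
        crawl.put("wikipedia.org/hbase", "HBase stores data in HDFS");
        crawl.put("wikipedia.org/hive", "Hive queries HDFS files and HBase tables with SQL");
        return crawl;
    }

    public static void main(String[] args) {
        invert(sampleCrawl()).forEach((word, docs) ->
            System.out.println(word + " -> " + docs));
    }
}
```

On one machine the whole job is a nested loop over documents and tokens. The difficulty the talk is concerned with is expressing the same logic when the documents are spread across blocks on many machines.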

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class LineIndexer {
      public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(LineIndexer.class);
        conf.setJobName("LineIndexer");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
I’m not going to explain many of the details. The point is to notice all the boilerplate that obscures the problem logic. Everything is in one outer class. We start with a main routine that sets up the job. I used yellow for method calls, because methods do the real work!! But notice that most of the functions in this code don’t really do a whole lot of work for us...

        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        conf.setMapperClass(LineIndexMapper.class);
        conf.setReducerClass(LineIndexReducer.class);
        client.setConf(conf);
boilerplate...

        try {
          JobClient.runJob(conf);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
main ends with a try-catch clause to run the job.

      public static class LineIndexMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private final static Text word = new Text();
        private final static Text location = new Text();
        public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
          FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.

          String fileName = fileSplit.getPath().getName();
          location.set(fileName);
          String line = val.toString();
          StringTokenizer itr = new StringTokenizer(line.toLowerCase());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
          }
        }
      }
The rest of the LineIndexMapper class and map method.

      public static class LineIndexReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
          boolean first = true;
          StringBuilder toReturn = new StringBuilder();
          while (values.hasNext()) {
            if (!first) toReturn.append(", ");
            first = false;
            toReturn.append(values.next().toString());
          }
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word, list-string) output.

          output.collect(key, new Text(toReturn.toString()));
        }
      }
    }
EOF
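To see what the job above actually emits, the following plain-Java sketch mirrors the logic of LineIndexMapper.map and LineIndexReducer.reduce without any Hadoop dependency. The class LineIndexerSketch, its run method, and the in-memory grouping are assumptions made for illustration; in a real job the framework does the grouping and does not guarantee the order of values passed to the reducer.

```java
import java.util.*;

// Hypothetical in-memory stand-in for the LineIndexer job. Only the map and
// reduce logic is mirrored; the grouping in run() replaces Hadoop's shuffle.
final class LineIndexerSketch {

    // Mirrors LineIndexMapper.map: one (word, fileName) pair per token.
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            pairs.add(new String[] { itr.nextToken(), fileName });
        }
        return pairs;
    }

    // Mirrors LineIndexReducer.reduce: joins the values with ", ".
    static String reduce(Iterator<String> values) {
        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) toReturn.append(", ");
            first = false;
            toReturn.append(values.next());
        }
        return toReturn.toString();
    }

    // Input: file name -> lines. Output: word -> comma-separated file names.
    static Map<String, String> run(Map<String, List<String>> files) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String[] pair : map(file.getKey(), line)) {
                    grouped.computeIfAbsent(pair[0], w -> new ArrayList<>()).add(pair[1]);
                }
            }
        }
        Map<String, String> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getValue().iterator()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("hadoop", Arrays.asList("Hadoop provides MapReduce and HDFS"));
        files.put("hbase", Arrays.asList("HBase stores data in HDFS"));
        run(files).forEach((word, locations) ->
            System.out.println(word + "\t" + locations));
    }
}
```

Note that this reducer neither counts nor deduplicates: a word that occurs twice in one file yields that file name twice in the output string, so the per-document counts described for the inverted index are not produced by this version.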
