CS535 Big Data | Computer Science | Colorado State University
Week 4-B (2/12/2020), Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535
FAQs
• Submission deadline for the GEAR Session 1 review: Feb 25
• Presenters: please upload your slides to Canvas at least 2 hours before the presentation session

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • DataFrame
  • Spark SQL
  • Datasets
• 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
  • Apache Storm model
  • Parallelism
  • Grouping methods

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

What is Spark SQL?
• A Spark module for structured data processing
• Spark provides two interfaces to it: SQL and the Dataset API
• Spark SQL can execute SQL queries
  • Available from the command line or over JDBC/ODBC

What are Datasets?
• A Dataset is a distributed collection of data
• A new interface added in Spark (since v1.6) that provides:
  • the benefits of RDDs (strong typing, the ability to use lambda functions)
  • the benefits of Spark SQL's optimized execution engine
• Available in Scala and Java
  • Python does not support the Dataset API
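To make the Dataset benefits concrete, here is a minimal sketch (not from the slides; written for a spark-shell style session) contrasting a strongly typed lambda on a Dataset with the equivalent untyped DataFrame operation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("typed vs untyped sketch").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// Typed: every element is a Person, so field access is checked at compile time
val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
ds.filter(p => p.age >= 21).show()

// Untyped: the column name is a string resolved at runtime,
// so a typo only fails when the query runs
ds.toDF().filter($"age" >= 21).show()

Both forms run on Spark SQL's optimized execution engine, which is the second benefit listed above.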

What are DataFrames?
• A DataFrame is a Dataset organized into named columns
  • Like a table in a relational database or a data frame in R/Python
  • Comes with a strengthened optimization scheme
• Available in Scala, Java, Python, and R

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Getting Started

Creating a SparkSession: Starting Point
• SparkSession is the entry point to all functionality in Spark

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

Creating DataFrames
• With a SparkSession, applications can create DataFrames from:
  • an existing RDD
  • a Hive table
  • Spark data sources

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

Find the full example code in the Spark repo:
examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala

Untyped Dataset Operations (a.k.a. DataFrame Operations)
• In the Scala and Java APIs, DataFrames are simply Datasets of Rows, so operations on them are untyped transformations
• "Typed" operations, in contrast, are those on the strongly typed Scala/Java Datasets

// This import is needed to use the $-notation
import spark.implicits._

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+
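The same untyped style extends to aggregations; the Getting Started example in the Spark SQL programming guide (not shown on the slides) adds the following on the same df:

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+
// (row order may vary)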

Untyped Dataset Operations (continued)

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

Running SQL Queries

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

Global Temporary View
• Temporary views in Spark SQL are session-scoped
  • They disappear when the session that created them terminates
• A global temporary view is shared among all sessions and kept alive until the Spark application terminates
  • It lives in a system-preserved database, global_temp

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// A global temporary view is tied to the system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// A global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
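To see the two scoping rules side by side, here is a short sketch (not on the slides) that assumes the views registered above; spark.catalog is Spark's catalog API for managing views:

// A session-scoped temp view is NOT visible from a new session;
// uncommenting the next line would fail with "Table or view not found: people"
// spark.newSession().sql("SELECT * FROM people").show()

// The global view is visible from any session, but must be qualified with global_temp
spark.newSession().sql("SELECT * FROM global_temp.people").show()

// Both kinds of views can be dropped explicitly through the catalog
spark.catalog.dropTempView("people")
spark.catalog.dropGlobalTempView("people")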

Creating Datasets
• Datasets are similar to RDDs
• However, objects are serialized with Spark's Encoder rather than standard Java or Kryo serialization
• Many Spark Dataset operations can be performed without deserializing the objects

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class;
// mapping is done by field name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Interoperating with RDDs

Interoperating with RDDs
• Spark SQL supports two ways of converting RDDs into Datasets:
  • Case 1: use reflection to infer the schema of an RDD
  • Case 2: use a programmatic interface to construct a schema and apply it to an existing RDD (see the sketch at the end of these notes)

Interoperating with RDDs: 1. Using Reflection
• Automatically converts an RDD containing case classes to a DataFrame
• The case class defines the schema of the table
  • E.g., the names of the arguments to the case class are read using reflection and become the names of the columns
• Case classes can also be nested or contain complex types such as Seqs or Arrays
• The RDD is implicitly converted to a DataFrame and can then be registered as a table

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
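The slides show code only for Case 1; as a sketch of Case 2 (following the programmatic-schema pattern from the Spark SQL programming guide, reusing the same people.txt file), the schema is built as a StructType and applied to an RDD of Rows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The schema is encoded in a string; in practice it might come from user input
val schemaString = "name age"

// Generate the schema: one nullable string field per column name
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert the records of the RDD to Rows
val rowRDD = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD of Rows
val peopleDF2 = spark.createDataFrame(rowRDD, schema)
peopleDF2.createOrReplaceTempView("people2")

spark.sql("SELECT name FROM people2").show()

This path is useful when the structure of the records is not known until runtime, so a case class cannot be written in advance.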
