CS535 Big Data | 2/12/2020 | Week 4-B | Sangmi Lee Pallickara
CS535 Big Data | Computer Science | Colorado State University

FAQs
• Submission deadline for the GEAR Session 1 review: Feb 25
• Presenters: please upload your slides to Canvas at least 2 hours before the presentation session

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • In-Memory Cluster Computing: Apache Spark
  • SQL, DataFrames, and Datasets
• 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
  • Apache Storm model
  • Parallelism
  • Grouping methods

What is Spark SQL?
• A Spark module for structured data processing
• Two interfaces are provided by Spark: SQL and the Dataset API
• Spark SQL executes SQL queries
  • Available from the command line or over JDBC/ODBC

What are Datasets?
• A Dataset is a distributed collection of data
• A new interface, added in Spark 1.6, that provides:
  • The benefits of RDDs (strong typing, the ability to use lambda functions)
  • The benefits of Spark SQL's optimized execution engine
• Available in Scala and Java
• Python does not support the Dataset API
• (a sketch of what these typed operations look like follows below)
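Not on the original slide, but a minimal sketch of the Dataset benefits listed above; it assumes a SparkSession named spark (created as shown later in these slides) with spark.implicits._ imported:

    // Illustration only: a typed Dataset combines lambda-style
    // transformations with compile-time type checking.
    case class Person(name: String, age: Long)

    val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

    // `p` is a Person, so p.age is checked at compile time;
    // Spark SQL's optimized engine still plans the execution.
    val adults = people.filter(p => p.age >= 21)
    adults.show()
    // +----+---+
    // |name|age|
    // +----+---+
    // |Andy| 32|
    // +----+---+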
What are DataFrames?
• A DataFrame is a Dataset organized into named columns
  • Like a table in a relational database or a data frame in R/Python
• Comes with a strengthened optimization scheme
• Available in Scala, Java, Python, and R

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Getting Started

Create a SparkSession: Starting Point
• SparkSession is the entry point into all functionality in Spark

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // For implicit conversions like converting RDDs to DataFrames
    import spark.implicits._

Creating DataFrames
• With a SparkSession, applications can create DataFrames from:
  • An existing RDD
  • A Hive table
  • Spark data sources

    val df = spark.read.json("examples/src/main/resources/people.json")

    // Displays the content of the DataFrame to stdout
    df.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

• Find the full example code in the Spark repo:
  examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala

Untyped Dataset Operations (a.k.a. DataFrame Operations)
• In the Scala and Java APIs, DataFrames are just Datasets of Rows
• DataFrame operations are therefore "untyped" transformations
  • In contrast, "typed operations" are those of the strongly typed Scala/Java Datasets

    // This import is needed to use the $-notation
    import spark.implicits._

    // Print the schema in a tree format
    df.printSchema()
    // root
    //  |-- age: long (nullable = true)
    //  |-- name: string (nullable = true)

    // Select only the "name" column
    df.select("name").show()
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+
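One further untyped operation, sketched here against the same df as an aside (groupBy/count appears in the same Spark example file; the row order of the output is not guaranteed):

    // Count people by age
    df.groupBy("age").count().show()
    // +----+-----+
    // | age|count|
    // +----+-----+
    // |  19|    1|
    // |null|    1|
    // |  30|    1|
    // +----+-----+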
Untyped Dataset Operations (a.k.a. DataFrame Operations), continued

    // Select everybody, but increment the age by 1
    df.select($"name", $"age" + 1).show()
    // +-------+---------+
    // |   name|(age + 1)|
    // +-------+---------+
    // |Michael|     null|
    // |   Andy|       31|
    // | Justin|       20|
    // +-------+---------+

    // Select people older than 21
    df.filter($"age" > 21).show()
    // +---+----+
    // |age|name|
    // +---+----+
    // | 30|Andy|
    // +---+----+

Running SQL Queries
• Example: SELECT * FROM people

    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    val sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

Global Temporary View
• Temporary views in Spark SQL are session-scoped
  • They disappear when the session that created them terminates
• A global temporary view, by contrast, is:
  • Shared among all sessions and kept alive until the Spark application terminates
  • Tied to a system-preserved database (global_temp)
• (a sketch contrasting the two scopes follows the example below)

    // Register the DataFrame as a global temporary view
    df.createGlobalTempView("people")

    // Global temporary views are tied to the system preserved database `global_temp`
    spark.sql("SELECT * FROM global_temp.people").show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

    // Global temporary views are cross-session
    spark.newSession().sql("SELECT * FROM global_temp.people").show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+
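A short sketch (not on the slides) making the session-scoping concrete; the failure mode is the AnalysisException that Spark 2.x raises for an unknown table:

    import scala.util.Try

    // An ordinary temporary view is visible in the session that created it...
    df.createOrReplaceTempView("people")
    spark.sql("SELECT count(*) FROM people").show()

    // ...but a fresh session cannot see it: the query below fails with
    // an AnalysisException ("Table or view not found: people").
    val attempt = Try(spark.newSession().sql("SELECT * FROM people").show())
    println(attempt.isFailure) // true

    // The global view, by contrast, resolves from any session via `global_temp`
    spark.newSession().sql("SELECT * FROM global_temp.people").show()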
Creating Datasets
• Datasets are similar to RDDs
• However, objects are serialized with Spark's Encoder, not standard Java/Kryo serialization
  • Many Spark Dataset operations can be performed without deserializing objects

    case class Person(name: String, age: Long)

    // Encoders are created for case classes
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    caseClassDS.show()
    // +----+---+
    // |name|age|
    // +----+---+
    // |Andy| 32|
    // +----+---+

    // Encoders for most common types are automatically provided by importing spark.implicits._
    val primitiveDS = Seq(1, 2, 3).toDS()
    primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

    // DataFrames can be converted to a Dataset by providing a class.
    // Mapping will be done by name
    val path = "examples/src/main/resources/people.json"
    val peopleDS = spark.read.json(path).as[Person]
    peopleDS.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Interoperating with RDDs

Interoperating with RDDs
• Two ways to convert RDDs into Datasets:
  • Case 1: Use reflection to infer the schema of an RDD
  • Case 2: Use a programmatic interface to construct a schema and then apply it to an
    existing RDD (a sketch of this case follows the reflection example below)

Interoperating with RDDs: 1. Using Reflection
• Automatically converts an RDD (containing case classes) to a DataFrame
• The case class defines the schema of the table
  • E.g., the names of the arguments to the case class are read using reflection and
    become the names of the columns
• Case classes can also be nested or contain complex types such as Seqs or Arrays
• The RDD is implicitly converted to a DataFrame and then registered as a table

    // For implicit conversions from RDDs to DataFrames
    import spark.implicits._

    // Create an RDD of Person objects from a text file, convert it to a DataFrame
    val peopleDF = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")

    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
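The slides name Case 2 but do not show code for it; below is a sketch of the programmatic approach, adapted from the standard Spark SQL example (it assumes the same people.txt file and an active spark session):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // 1. Create an RDD of Rows from the original RDD
    val rowRDD = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))

    // 2. Construct the schema programmatically (here: two nullable string columns)
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", StringType, nullable = true)))

    // 3. Apply the schema to the RDD of Rows
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people").show()

This path is useful when the schema is not known until runtime (e.g., it arrives as a string of column names), so no case class can be written in advance.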