CS535 Big Data | 2/5/2020 | Week 3-B | Sangmi Lee Pallickara

FAQs
• PA1
• GEAR Session 1 signup is available
  • See the announcement in Canvas
• Feedback policy
  • Quiz, TP proposal: 1 week
  • Email: 24 hrs

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535 | Spring 2020 | Colorado State University

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • Introduction to Spark
  • Operations: transformations, actions, persistence

In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)

RDD (Resilient Distributed Dataset)
• A read-only, memory-resident, partitioned collection of records
• A fault-tolerant collection of elements that can be operated on in parallel
• RDDs are the core unit of data in Spark
  • Most Spark programming involves performing operations on RDDs

Creating RDDs [1/3]
• Loading an external dataset

    val lines = sc.textFile("/path/to/README.md")

• Parallelizing a collection in your driver program

    val lines = sc.parallelize(List("pandas", "i like pandas"))

https://spark.apache.org/docs/latest/rdd-programming-guide.html
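As a rough sketch of what parallelize() does conceptually, the plain-Python function below (an analogy, not Spark code; the name `parallelize` and the chunking scheme are assumptions for illustration) splits a driver-side collection into roughly equal partitions, which is the "partitioned collection of records" an RDD represents:

```python
# Plain-Python sketch (not Spark itself) of what sc.parallelize() does
# conceptually: split a driver-side collection into partitions.
def parallelize(data, num_partitions=2):
    """Split `data` into `num_partitions` roughly equal contiguous chunks."""
    size = len(data)
    bounds = [size * i // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

partitions = parallelize(["pandas", "i like pandas"], num_partitions=2)
print(partitions)  # [['pandas'], ['i like pandas']]
```

In real Spark, each such chunk would live on a different worker so operations can run on the partitions in parallel.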

Creating RDDs [2/3]

    1: val lines = sc.textFile("data.txt")
    2: val lineLengths = lines.map(s => s.length)
    3: val totalLength = lineLengths.reduce((a, b) => a + b)

• Line 1: defines a base RDD from an external file
  • The dataset is not loaded into memory
• Line 2: defines lineLengths as the result of a map transformation
  • It is not immediately computed
• Line 3: performs the reduce and computes the result

Creating RDDs [3/3]

    lineLengths.persist()

• If you want to use lineLengths again later

Spark Programming Interface to RDD [1/3]

    1: val lines = sc.textFile("data.txt")
    2: val lineLengths = lines.map(s => s.length)   // transformation

• Transformations
  • Operations that create RDDs
  • Return pointers to new RDDs
  • e.g. map, filter, and join
• RDDs can only be created through deterministic operations on either
  • Data in stable storage
  • Other RDDs
• Not to be confused with the "transformation" of the Scala language

Spark Programming Interface to RDD [2/3]

    3: val totalLength = lineLengths.reduce((a, b) => a + b)   // action

• Actions
  • Operations that return a value to the application or export data to a storage system
  • e.g. count: returns the number of elements in the dataset
  • e.g. collect: returns the elements themselves
  • e.g. save: outputs the dataset to a storage system

Spark Programming Interface to RDD [3/3]

    lineLengths.persist()

• Persistence
  • Indicates which RDDs the user wants to reuse in future operations
  • Spark keeps persistent RDDs in memory by default
  • If there is not enough RAM, it can spill them to disk
• Users are allowed to
  • store the RDD only on disk
  • replicate the RDD across machines
  • specify a persistence priority on each RDD

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence
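The transformation/action/persist distinction can be mimicked with plain Python (an analogy under stated assumptions, not Spark's implementation): a lazy `map` object plays the role of a transformation, a `reduce` call plays the role of an action that forces evaluation, and materializing with `list()` plays the role of persist().

```python
# Rough plain-Python analogy for lazy evaluation in Spark:
# map() builds an unevaluated plan; only the reduce-style "action" computes.
from functools import reduce

lines = ["spark is fast", "rdds are lazy"]

line_lengths = map(len, lines)  # lazy: no lengths computed yet (a "transformation")
total_length = reduce(lambda a, b: a + b, line_lengths)  # forces evaluation (an "action")

# A persist()-like step: materialize once so values can be reused
# without recomputing them from `lines`.
cached_lengths = list(map(len, lines))

print(total_length)         # 26
print(sum(cached_lengths))  # 26
```

Note that the `map` object above is single-use, which loosely mirrors why an unpersisted RDD must be recomputed for each action that touches it.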

What are the Transformations?
• An RDD's transformations create a new dataset from an existing one
  1. Simple transformations
  2. Transformations with multiple RDDs
  3. Transformations with the Pair RDDs

Simple Transformations
• These transformations create a new RDD from an existing RDD
• e.g. map(), filter(), flatMap(), sample()

map() vs. filter() [1/2]
• The map(func) transformation takes in a function and applies it to each element in the RDD; the result of the function becomes the new value of each element in the resulting RDD
• The filter(func) transformation takes in a function and returns an RDD that contains only the elements that pass the filter function

    inputRDD {1,2,3,4}
      --map(x => x * x)-->  mappedRDD   {1,4,9,16}
      --filter(x != 1)-->   filteredRDD {2,3,4}

map() vs. filter() [2/2]
• map() that squares all of the numbers in an RDD

    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))

map() vs. flatMap() [1/2]
• As a result of flatMap(), we have an RDD of the elements
  • Instead of an RDD of lists of elements

    RDD1 {"coffee panda", "happy panda", "happiest panda party"}

    RDD1.map(tokenize)
      mappedRDD {["coffee","panda"], ["happy","panda"], ["happiest","panda","party"]}

    RDD1.flatMap(tokenize)
      flatMappedRDD {"coffee","panda","happy","panda","happiest","panda","party"}
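The three transformations above can be sketched with plain-Python comprehensions (the RDDs are emulated with local lists; this is an illustration of the semantics, not Spark code):

```python
# Plain-Python sketch of map, filter, and flatMap on the slide's data.
input_rdd = [1, 2, 3, 4]
mapped = [x * x for x in input_rdd]          # map(x => x * x)
filtered = [x for x in input_rdd if x != 1]  # filter(x != 1)

lines = ["coffee panda", "happy panda", "happiest panda party"]
mapped_lists = [line.split(" ") for line in lines]            # map(tokenize): element per line (a list)
flat_mapped = [w for line in lines for w in line.split(" ")]  # flatMap(tokenize): element per word

print(mapped)        # [1, 4, 9, 16]
print(filtered)      # [2, 3, 4]
print(flat_mapped)   # ['coffee', 'panda', 'happy', 'panda', 'happiest', 'panda', 'party']
```

The key contrast: `mapped_lists` has one list-valued element per input line, while `flat_mapped` concatenates those lists into one flat sequence of words.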

map() vs. flatMap() [2/2]
• Using flatMap() that splits lines into multiple words

    val lines = sc.parallelize(List("hello world", "hi"))
    val words = lines.flatMap(line => line.split(" "))
    words.first()   // returns "hello"

Example [1/2]
• Input data (one line per university):

    Colorado State University
    Ohio State University
    Washington State University
    Boston University

• Using map

    >>> wc = data.map(lambda line: line.split(" "))
    >>> llist = wc.collect()
    >>> for line in llist: print(line)   # print the list

  Output?

• Using flatMap

    >>> fm = data.flatMap(lambda line: line.split(" "))
    >>> for word in fm.collect(): print(word)   # print the list

  Output?

Example -- Answers [2/2]
• Using map: one list of words per input line

    ['Colorado', 'State', 'University']
    ['Ohio', 'State', 'University']
    ['Washington', 'State', 'University']
    ['Boston', 'University']

• Using flatMap: a single flat sequence of words

    Colorado
    State
    University
    Ohio
    State
    University
    Washington
    State
    University
    Boston
    University

map() vs. mapPartitions() [1/2]
• map(func) converts each element of the source RDD into a single element of the result RDD by applying a function
• mapPartitions(func) converts each partition of the source RDD into multiple elements of the result (possibly none)
  • Similar to map, but runs separately on each partition (block) of the RDD
  • func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T

    val sc = new SparkContext(master, "BasicAvgMapPartitions",
      System.getenv("SPARK_HOME"))
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input
      .mapPartitions(partition => Iterator(AvgCount(0, 0).merge(partition)))
      .reduce((x, y) => x.merge(y))
    println(result)

map() vs. mapPartitions(): Performance [2/2]
• Does map() perform faster than mapPartitions()?
  • Assume that they are performed over the same RDD with the same number of partitions

repartition() vs. coalesce()
• repartition(numPartitions)
  • Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them
  • This always shuffles all data over the network
• coalesce(numPartitions)
  • Decreases the number of partitions in the RDD to numPartitions
  • Useful for running operations more efficiently after filtering down a large dataset
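The AvgCount example above can be sketched in plain Python (the slide's AvgCount helper class is not shown, so the `(sum, count)` tuple below is an assumed stand-in; partition boundaries are also assumed): each partition is folded into a single `(sum, count)` pair by the mapPartitions step, and the pairs are then merged by the reduce step.

```python
# Plain-Python sketch of the mapPartitions average: one (sum, count)
# pair per partition, then pairs merged into the final average.
from functools import reduce

partitions = [[1, 2], [3, 4]]  # parallelize(List(1, 2, 3, 4)) with 2 assumed partitions

def avg_count(partition):
    """mapPartitions step: fold a whole partition into a (sum, count) pair."""
    total, count = 0, 0
    for x in partition:
        total += x
        count += 1
    return (total, count)

def merge(a, b):
    """reduce step: combine two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

pairs = [avg_count(p) for p in partitions]  # [(3, 2), (7, 2)]
total, count = reduce(merge, pairs)
print(total / count)  # 2.5
```

This also shows why mapPartitions can be cheaper than map for this job: per-partition state (the running sum and count) is set up once per partition rather than once per element.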
