CS535 Big Data | Computer Science | Colorado State University
2/10/2019 Week 4-A
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• PA2 description will be posted this week

Weekly Reading List
• [W4R1] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641
• [W4R2] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
• In-Memory Cluster Computing: Apache Spark
  • RDD: Transformations, Actions, and Persistence
  • Spark cluster
  • RDD dependency
  • Job scheduling
  • Closure

Actions [1/2]
• An action returns a final value to the driver program, or writes data to an external storage system
• collect()
  • Retrieves the entire RDD to the driver
  • The entire dataset (RDD) must fit in memory on a single machine
  • Useful if the RDD has been filtered down to a very small dataset

Actions [2/2]
• Continuing the log-file analysis example:
  • take() retrieves a small number of elements of the RDD at the driver program
  • The driver then iterates over them locally to print out information
• For a very large RDD, store the results in external storage (e.g., S3 or HDFS) with the saveAsTextFile() action, as sketched after the code below

println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
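The one-liner below is a minimal sketch of the saveAsTextFile() route, assuming the badLinesRDD from the example above; the HDFS output URI is a hypothetical placeholder.

// Write the large RDD to external storage instead of collecting it at the
// driver; Spark writes one part-* file per partition under this directory.
badLinesRDD.saveAsTextFile("hdfs://namenode:9000/logs/bad-lines")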

reduce()
• Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type
• The function should be commutative and associative so that it can be computed correctly in parallel

val rdd1 = sc.parallelize(List(1, 2, 5))
val sum = rdd1.reduce{ (x, y) => x + y }
// result: sum: Int = 8

reduce() vs. fold()
• fold() is similar to reduce(), but it takes a 'zero value' (initial value)
• The function should be commutative and associative so that it can be computed correctly in parallel

def fold(zeroValue: T)(op: (T, T) ⇒ T): T
• Aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value"
• The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2

scala> val rdd1 = sc.parallelize(List( ("maths", 80), ("science", 90) ))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21
scala> rdd1.partitions.length
res8: Int = 8
scala> val additionalMarks = ("extra", 4)
additionalMarks: (String, Int) = (extra,4)
scala> val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
         val sum = acc._2 + marks._2
         ("total", sum)
       }

What will be the result (sum)?
// result: sum: (String, Int) = (total,206)
// The zero value is folded in once per partition (8 partitions) and once more
// when the per-partition results are merged: (4 × (8 + 1)) + 80 + 90 = 206

take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
• It may therefore represent a biased collection
• It does not return the elements in the order you might expect
• Useful for unit testing

In-Memory Cluster Computing: Apache Spark
RDD: Persistence

Persistence
• Caches a dataset across operations
• Nodes store partitions of results from previous operation(s) in memory and reuse them in other actions
• An RDD to be persisted is specified with persist() or cache()
• The persisted RDD can be stored using a different storage level, by passing a StorageLevel object to persist(); a sketch follows below
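A minimal sketch of persist() with an explicit storage level, assuming a running SparkContext sc; the squares RDD is purely illustrative.

import org.apache.spark.storage.StorageLevel

val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk,
// so later actions read them back instead of recomputing the lineage.
squares.persist(StorageLevel.MEMORY_AND_DISK)

squares.count()   // first action materializes and stores the partitions
squares.count()   // second action reuses the persisted partitions

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// unpersist() releases the stored partitions.
squares.unpersist()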

Persistence levels

Level                  Space used  CPU time  In memory / On disk  Comment
MEMORY_ONLY            High        Low       Y / N
MEMORY_ONLY_SER        Low         High      Y / N                Stores the RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK        High        Medium    Some / Some          Spills to disk if there is too much data to fit in memory
MEMORY_AND_DISK_SER    Low         High      Some / Some          Spills to disk if there is too much data to fit in memory; stores a serialized representation in memory
DISK_ONLY              Low         High      N / Y

In-Memory Cluster Computing: Apache Spark
Spark Cluster

Spark cluster and resources
[Figure: the driver program (SparkContext) connects to a cluster manager (Hadoop YARN, Mesos, or Standalone), which launches executors on worker nodes; each executor holds a cache and runs tasks]

Spark cluster [1/3]
• Each application gets its own executor processes (see the configuration sketch after slide [3/3])
  • They must be up and running for the duration of the entire application
  • They run tasks in multiple threads
• This isolates applications from each other
  • On the scheduling side: each driver schedules its own tasks
  • On the executor side: tasks from different applications run in different JVMs
• Data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system

Spark cluster [2/3]
• Spark is agnostic to the underlying cluster manager
• As long as Spark can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g., Mesos/YARN)

Spark cluster [3/3]
• The driver program must listen for and accept incoming connections from its executors throughout its lifetime
• The driver program must be network addressable from the worker nodes
• The driver program should run close to the worker nodes
  • Preferably on the same local area network
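As a sketch of the driver side of this arrangement, the snippet below shows a driver acquiring its own executors from a standalone cluster manager; the master URL, application name, and resource settings are hypothetical, and "yarn" or a mesos:// URL could stand in for the master.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone-cluster configuration: each application gets its own
// executor processes, sized here, for the lifetime of the application.
val conf = new SparkConf()
  .setAppName("cs535-log-analysis")
  .setMaster("spark://master-node:7077")   // standalone cluster manager
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")

val sc = new SparkContext(conf)
// ... build RDDs and run actions; executors run the tasks in multiple threads ...
sc.stop()   // releases this application's executors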
