Distributed Computing Using Spark Practical / Praktikum WS17/18 October 18th, 2017 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Anas Alzogbi Victor Anthony Arrascue Ayala
Agenda Introduction to Spark Case-study: Recommender system for scientific papers Organization Hands-on session 18.10.2017 Distributed Computing Using Spark WS17/18 2
Introduction to Spark Distributed programming MapReduce Spark 18.10.2017 Distributed Computing Using Spark WS17/18 4
Distributed programming - problem Data grows faster than processing capabilities - Web 2.0: users generate content - Social networks, online communities, etc. Source: https://www.flickr.com/photos/will-lion/2595497078 18.10.2017 Distributed Computing Using Spark WS17/18 5
Big Data Source: http://www.bigdata-startups.com/open-source-tools/ Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/ 18.10.2017 Distributed Computing Using Spark WS17/18 6
Big Data Buzzword Often less structured data Requires different techniques, tools and approaches - To solve new problems, or old ones in a better way 18.10.2017 Distributed Computing Using Spark WS17/18 7
Network Programming Models Require a communication protocol for programming parallel computers (slow) - e.g. MPI Data and code placement across the network has to be managed manually No failure management Network problems (e.g. stragglers) are not handled 18.10.2017 Distributed Computing Using Spark WS17/18 8
Data Flow Models Higher level of abstraction: algorithms are parallelized on large clusters Fault recovery by means of data replication Job divided into a set of independent tasks - Code is shipped to where the data is located Good scalability 18.10.2017 Distributed Computing Using Spark WS17/18 9
MapReduce – Key ideas 1. The problem is split into smaller sub-problems (map step) 2. The sub-problems are solved in parallel 3. The solutions to the sub-problems are combined into a solution of the original problem (reduce step) 18.10.2017 Distributed Computing Using Spark WS17/18 10
MapReduce – Overview [Diagram: the input data is divided into splits (split 0, split 1, split 2, …); a Map task processes each split and emits <k,v> pairs; Reduce tasks aggregate the grouped pairs into output 0, output 1, …] The target problem has to be parallelizable! 18.10.2017 Distributed Computing Using Spark WS17/18 11
MapReduce – Wordcount example Input documents: "Google Maps charts new territory into businesses" | "Google selling new tools for businesses to build their own maps" | "Google promises consumer experience for businesses with Maps Engine Pro" | "Google is trying to get its Maps service used by more businesses" Resulting word counts: Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, … 18.10.2017 Distributed Computing Using Spark WS17/18 12
MapReduce – Wordcount’s map The first map task processes "Google Maps charts new territory into businesses" and "Google selling new tools for businesses to build their own maps" and emits: Google 2, Charts 1, Maps 2, Territory 1, … The second map task processes "Google promises consumer experience for businesses with Maps Engine Pro" and "Google is trying to get its Maps service used by more businesses" and emits: Google 2, Businesses 2, Maps 2, Service 1, … 18.10.2017 Distributed Computing Using Spark WS17/18 13
MapReduce – Wordcount’s reduce The reduce tasks group the intermediate pairs by key and sum the counts: Google 2 + Google 2 → Google 4; Maps 2 + Maps 2 → Maps 4; Businesses 2 + Businesses 2 → Businesses 4; Charts 1 → Charts 1; Territory 1 → Territory 1; … 18.10.2017 Distributed Computing Using Spark WS17/18 15
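The map/shuffle/reduce flow above can be sketched in plain, single-machine Python (a minimal illustration only; the documents and the map_doc helper are made up, and this is not Hadoop's actual API):

from collections import defaultdict

docs = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]

# Map step: each document is processed independently into (word, count) pairs.
def map_doc(doc):
    pairs = defaultdict(int)
    for word in doc.lower().split():
        pairs[word] += 1
    return pairs.items()

# Shuffle step: pairs with the same key are grouped together.
groups = defaultdict(list)
for doc in docs:
    for word, count in map_doc(doc):
        groups[word].append(count)

# Reduce step: the counts of each group are summed.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals["google"], totals["maps"], totals["businesses"])   # 4 4 4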
MapReduce Handled automatically: - Partitioning and distribution of data - Parallelization and assignment of tasks - Scalability, fault tolerance, scheduling 18.10.2017 Distributed Computing Using Spark WS17/18 17
Apache Hadoop Open-source implementation of MapReduce Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php 18.10.2017 Distributed Computing Using Spark WS17/18 18
MapReduce – Parallelizable algorithms Matrix-vector multiplication Power iteration (e.g. PageRank) Gradient descent methods Stochastic SVD Matrix factorization (tall-and-skinny QR) etc. 18.10.2017 Distributed Computing Using Spark WS17/18 19
MapReduce – Limitations Inefficient for multi-pass algorithms No efficient primitives for data sharing: state between steps is materialized on distributed storage Slow due to replication and disk I/O Source: http://stanford.edu/~rezab/sparkclass 18.10.2017 Distributed Computing Using Spark WS17/18 20
Limitations – PageRank Requires repeated multiplication of a sparse matrix by a vector Source: http://stanford.edu/~rezab/sparkclass 18.10.2017 Distributed Computing Using Spark WS17/18 21
Limitations – PageRank MapReduce sometimes requires asymptotically more communication or I/O Iterations are handled very poorly Reading from and writing to disk is a bottleneck - In some cases 90% of the time is spent on I/O 18.10.2017 Distributed Computing Using Spark WS17/18 22
Spark Processing Framework Developed in 2009 at UC Berkeley’s AMPLab Open-sourced in 2010, now an Apache project - Most active big data community - Industrial contributions: over 50 companies Written in Scala - Good at serializing closures Clean APIs in Java, Scala, Python, R 18.10.2017 Distributed Computing Using Spark WS17/18 23
Spark Processing Framework Contributors (2014) 18.10.2017 Distributed Computing Using Spark WS17/18 24
Spark – High Level Architecture [Architecture diagram: Spark running on a cluster with HDFS as the storage layer] Source: https://mapr.com/ebooks/spark/ 18.10.2017 Distributed Computing Using Spark WS17/18 25
Spark - Running modes Local mode: for debugging Cluster mode - Standalone mode - Apache Mesos - Hadoop YARN 18.10.2017 Distributed Computing Using Spark WS17/18 26
Spark – Programming model SparkContext: the entry point SparkSession: since Spark 2.0 - New unified entry point; it combines SQLContext, HiveContext and, in future releases, StreamingContext SparkConf: to configure and initialize the context Spark’s interactive shells - Scala: spark-shell - Python: pyspark 18.10.2017 Distributed Computing Using Spark WS17/18 27
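A minimal PySpark sketch of the entry points (the application name and master URL are placeholders; in the interactive shells, spark and sc are already created for you):

from pyspark.sql import SparkSession

# Since Spark 2.0: SparkSession is the unified entry point.
# (Pre-2.0 code typically used SparkConf + SparkContext directly.)
spark = SparkSession.builder \
    .appName("dcs-practical") \
    .master("local[*]") \
    .getOrCreate()

# The underlying SparkContext is still accessible.
sc = spark.sparkContext
print(sc.version)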
Spark – RDDs, the game changer Resilient Distributed Datasets A typed data structure (RDD[T]) that is not language-specific Each element of type T is stored on one machine of the cluster - A single element has to fit in that machine’s memory An RDD can be cached in memory 18.10.2017 Distributed Computing Using Spark WS17/18 28
Resilient Distributed Datasets Immutable collections of objects, spread across a cluster User-controlled partitioning and storage Automatically rebuilt on failure Since Spark 2.0, RDDs are superseded by Datasets, which are strongly typed like RDDs 18.10.2017 Distributed Computing Using Spark WS17/18 29
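A minimal RDD sketch in PySpark (assumes a SparkContext sc, e.g. from the pyspark shell; the numbers and partition count are arbitrary):

# Create an RDD with an explicit number of partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=8)

# Transformations return new RDDs; the original collection is never modified.
squares = numbers.map(lambda x: x * x)

# Cache the RDD in memory so repeated actions reuse it instead of recomputing it.
squares.cache()

print(squares.getNumPartitions())   # 8
print(squares.sum())                # 333833500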
Spark – Wordcount example text_file = sc.textFile("...") counts = text_file.flatMap(lambda line: line.split(" ")) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("...") http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext 18.10.2017 Distributed Computing Using Spark WS17/18 30
Spark – Data manipulation Transformations: always yield a new RDD instance (RDDs are immutable) - filter, map, flatMap, etc. Actions: trigger a computation on the RDD’s elements - count, foreach, etc. Transformations are evaluated lazily 18.10.2017 Distributed Computing Using Spark WS17/18 31
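A small sketch of the distinction (again assuming a SparkContext sc; the data is made up):

lines = sc.parallelize(["spark is fast", "mapreduce writes to disk", "spark caches in memory"])

# Transformations: nothing is computed yet, Spark only records the lineage.
words = lines.flatMap(lambda line: line.split(" "))
spark_words = words.filter(lambda w: w == "spark")

# Actions: trigger the actual computation.
print(spark_words.count())   # 2
print(words.take(3))         # ['spark', 'is', 'fast']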
Spark – DataFrames DataFrame API introduced in Spark 1.3 Handles a table-like representation with named columns and declared column types Not to be confused with Python’s pandas DataFrames DataFrames translate SQL-like operations into low-level RDD operations Since Spark 2.0, DataFrame is implemented as a special case of Dataset (Dataset[Row]) 18.10.2017 Distributed Computing Using Spark WS17/18 32
DataFrames – How to create DFs 1. Converting existing RDDs 2. Running SQL queries 3. Loading external data 18.10.2017 Distributed Computing Using Spark WS17/18 33
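A sketch of the three options in PySpark (assumes a SparkSession spark; the records and the file path are made up):

from pyspark.sql import Row

# 1. Convert an existing RDD (here an RDD of Rows) into a DataFrame.
rdd = spark.sparkContext.parallelize([
    Row(name="Alice", occupation="student"),
    Row(name="Bob", occupation="professor"),
])
people_df = spark.createDataFrame(rdd)

# 2. Run an SQL query against a registered view: the result is again a DataFrame.
people_df.createOrReplaceTempView("people")
students_df = spark.sql("SELECT name FROM people WHERE occupation = 'student'")

# 3. Load external data (JSON, CSV, Parquet, JDBC, ...).
# papers_df = spark.read.json("hdfs://.../papers.json")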
Spark SQL SQLContext # Run SQL statements; returns a DataFrame students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'") http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html 18.10.2017 Distributed Computing Using Spark WS17/18 34
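For the query above to work, a people table has to be registered first; a minimal pre-2.0-style sketch (the JSON file and its columns are made up):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                    # needs an existing SparkContext sc
people = sqlContext.read.json("people.json")   # column names/types inferred from the JSON
people.registerTempTable("people")             # make the DataFrame queryable by name

students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")
students.show()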
Spark – DataFrames Source: Spark in Action (book, see literature) 18.10.2017 Distributed Computing Using Spark WS17/18 35