Distributed Computing Using Spark – Practical / Praktikum WS17/18


1. Distributed Computing Using Spark
   Practical / Praktikum WS17/18
   October 18th, 2017
   Albert-Ludwigs-Universität Freiburg
   Prof. Dr. Georg Lausen, Anas Alzogbi, Victor Anthony Arrascue Ayala

2. Agenda
   - Introduction to Spark
   - Case-study: Recommender system for scientific papers
   - Organization
   - Hands-on session

4. Introduction to Spark
   - Distributed programming
   - MapReduce
   - Spark

5. Distributed programming – the problem
   - Data grows faster than processing capabilities
     - Web 2.0: users generate content
     - Social networks, online communities, etc.
   Source: https://www.flickr.com/photos/will-lion/2595497078

6. Big Data
   [Infographics on open-source Big Data tools and on data growth]
   Sources: http://www.bigdata-startups.com/open-source-tools/, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/

7. Big Data
   - Buzzword
   - Often less structured
   - Requires different techniques, tools, and approaches
     - To solve new problems, or old ones in a better way

8. Network Programming Models
   - Require a communication protocol for programming parallel computers (slow)
     - MPI (wiki)
   - Data and code locality across the network must be managed manually
   - No failure management
   - Network problems (e.g. stragglers) are not addressed

9. Data Flow Models
   - Higher level of abstraction: algorithms are parallelized on large clusters
   - Fault recovery by means of data replication
   - A job is divided into a set of independent tasks
     - Code is shipped to where the data is located
   - Good scalability

10. MapReduce – Key ideas
   1. The problem is split into smaller problems (map step)
   2. The smaller problems are solved in parallel
   3. Finally, the solutions to the smaller problems are combined into a solution to the original problem (reduce step)
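
   A minimal pure-Python sketch of the three steps, using word counting as the problem (the data and variable names are illustrative, not from the slides):

       from collections import defaultdict

       docs = ["google maps charts new territory",
               "google is trying to get its maps used"]

       # Map step: each document is processed independently into (word, 1) pairs.
       mapped = [(word, 1) for doc in docs for word in doc.split()]

       # Shuffle: group the intermediate pairs by key.
       groups = defaultdict(list)
       for word, count in mapped:
           groups[word].append(count)

       # Reduce step: combine the partial results into the final solution.
       counts = {word: sum(vals) for word, vals in groups.items()}
       print(counts["google"], counts["maps"])  # 2 2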

11. MapReduce – Overview
   [Diagram: the input data is divided into splits (split 0, split 1, split 2, …); each split is processed by a Map task emitting <k,v> pairs; Reduce tasks aggregate these pairs into output 0, output 1, …]
   A target problem has to be parallelizable!

12. MapReduce – Wordcount example
   Input (news headlines):
   - Google Maps charts new territory into businesses
   - Google selling new tools for businesses to build their own maps
   - Google promises consumer experience for businesses with Maps Engine Pro
   - Google is trying to get its Maps service used by more businesses
   Output (word counts): Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, …

13. MapReduce – Wordcount's map
   Each Map task emits local counts for its share of the input:
   - Map over the first two headlines: Google 2, Charts 1, Maps 2, Territory 1, …
   - Map over the last two headlines: Google 2, Businesses 2, Maps 2, Service 1, …

15. MapReduce – Wordcount's reduce
   Reduce tasks sum the local counts per word:
   - Google 2 + Google 2 → Google 4; Maps 2 + Maps 2 → Maps 4; Businesses 2 + Businesses 2 → Businesses 4
   - Charts 1, Territory 1, … pass through unchanged

17. MapReduce
   Automatic:
   - Partitioning and distribution of data
   - Parallelization and assignment of tasks
   - Scalability, fault tolerance, scheduling

18. Apache Hadoop
   - Open-source implementation of MapReduce
   Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

19. MapReduce – Parallelizable algorithms
   - Matrix-vector multiplication (see the sketch below)
   - Power iteration (e.g. PageRank)
   - Gradient descent methods
   - Stochastic SVD
   - Matrix factorization (tall-and-skinny QR)
   - etc.
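
   To make the first item concrete, a sketch of y = A·x in map/reduce style (pure Python; the sparse-entry representation and the sample values are assumptions for illustration):

       from collections import defaultdict

       A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]  # sparse entries (row, col, value)
       x = [1.0, 2.0]                               # the vector, known to every mapper

       # Map step: each matrix entry contributes a partial product, keyed by its row.
       partials = [(row, val * x[col]) for (row, col, val) in A]

       # Reduce step: sum the partial products per row to obtain y.
       y = defaultdict(float)
       for row, p in partials:
           y[row] += p
       print(dict(y))  # {0: 4.0, 1: 6.0}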

20. MapReduce – Limitations
   - Inefficient for multi-pass algorithms
   - No efficient primitives for data sharing
   - State between steps is materialized and distributed
   - Slow due to replication and storage
   Source: http://stanford.edu/~rezab/sparkclass

21. Limitations – PageRank
   - Requires repeated multiplications of a sparse matrix and a vector
   Source: http://stanford.edu/~rezab/sparkclass

22. Limitations – PageRank
   - MapReduce sometimes requires asymptotically more communication or I/O
   - Iterations are handled very poorly
   - Reading from and writing to disk is a bottleneck
     - In some cases 90% of the time is spent on I/O

23. Spark Processing Framework
   - Developed in 2009 at UC Berkeley's AMPLab
   - Open-sourced in 2010; now an Apache project
     - Most active big data community
     - Industrial contributions from over 50 companies
   - Written in Scala
     - Good at serializing closures
   - Clean APIs in Java, Scala, Python, and R

24. Spark Processing Framework
   [Chart: contributors to Apache Spark (2014)]

25. Spark – High Level Architecture
   [Diagram: Spark's high-level architecture on top of HDFS]
   Source: https://mapr.com/ebooks/spark/

26. Spark – Running modes
   - Local mode: for debugging
   - Cluster mode
     - Standalone mode
     - Apache Mesos
     - Hadoop YARN

27. Spark – Programming model
   - SparkContext: the entry point
   - SparkSession: since Spark 2.0
     - New unified entry point; combines SQLContext, HiveContext, and a future StreamingContext
   - SparkConf: used to initialize the context
   - Spark's interactive shells
     - Scala: spark-shell
     - Python: pyspark
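
   A minimal sketch of both entry points (the app name and master URL are placeholder values):

       from pyspark import SparkConf, SparkContext
       from pyspark.sql import SparkSession

       # Classic entry point: a SparkContext initialized from a SparkConf.
       conf = SparkConf().setAppName("demo").setMaster("local[*]")  # local mode for debugging;
       sc = SparkContext(conf=conf)                                 # on a cluster e.g. "yarn" or "spark://host:7077"

       # Unified entry point since Spark 2.0 (reuses the existing context here).
       spark = SparkSession.builder.appName("demo").getOrCreate()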

28. Spark – RDDs, the game changer
   - Resilient distributed datasets
   - A typed data structure (RDD[T]) that is not language-specific
   - Each element of type T is stored locally on some machine
     - It has to fit in memory
   - An RDD can be cached in memory

29. Resilient Distributed Datasets
   - Immutable collections of objects, spread across a cluster
   - User-controlled partitioning and storage
   - Automatically rebuilt on failure
   - Since Spark 2.0, RDDs are superseded by Datasets, which are strongly typed like RDDs
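
   A small sketch of creating and caching an RDD in PySpark (the values and names are illustrative):

       from pyspark import SparkContext

       sc = SparkContext("local[*]", "rdd-demo")
       nums = sc.parallelize(range(10), numSlices=4)  # RDD[int] with user-chosen partitioning
       nums.cache()                                   # keep it in memory across computations
       print(nums.getNumPartitions())                 # 4
       print(nums.sum())                              # 45; rebuilt from lineage if a partition is lost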

30. Spark – Wordcount example

       text_file = sc.textFile("...")
       counts = text_file.flatMap(lambda line: line.split(" ")) \
                         .map(lambda word: (word, 1)) \
                         .reduceByKey(lambda a, b: a + b)
       counts.saveAsTextFile("...")

   http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext

31. Spark – Data manipulation
   - Transformations: always yield a new RDD instance (RDDs are immutable)
     - filter, map, flatMap, etc.
   - Actions: trigger a computation on the RDD's elements
     - count, foreach, etc.
   - Transformations are evaluated lazily (see the sketch below)
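
   A short illustration of laziness in the pyspark shell, where sc is predefined (the data is made up):

       # Transformations only record lineage; nothing executes yet.
       words = sc.parallelize(["a", "bb", "ccc", "bb"])
       lengths = words.map(len)                      # transformation: lazy
       long_ones = lengths.filter(lambda n: n > 1)   # transformation: still lazy
       print(long_ones.count())                      # action: the whole chain runs now -> 3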

32. Spark – DataFrames
   - The DataFrame API was introduced in Spark 1.3
   - Handles a table-like representation with named columns and declared column types
   - Not to be confused with Python's pandas DataFrames
   - DataFrames translate SQL code into low-level RDD operations
   - Since Spark 2.0, DataFrame is implemented as a special case of Dataset
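
   A minimal sketch in the pyspark shell, where spark (a SparkSession) is predefined; the column names and rows are made up:

       # A DataFrame with named, typed columns built from local data.
       people = spark.createDataFrame(
           [("Alice", "student"), ("Bob", "professor")],
           ["name", "occupation"])
       people.printSchema()                                   # name: string, occupation: string
       people.filter(people.occupation == "student").show()   # declarative, optimized to RDD ops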

33. DataFrames – How to create DFs
   1. Converting existing RDDs
   2. Running SQL queries
   3. Loading external data
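
   One hedged sketch per method, again in the pyspark shell (paths, names, and data are placeholders):

       # 1. Converting an existing RDD (schema given as column names):
       rdd = sc.parallelize([("Alice", 25), ("Bob", 30)])
       df1 = spark.createDataFrame(rdd, ["name", "age"])

       # 2. Running a SQL query over a registered view:
       df1.createOrReplaceTempView("people")
       df2 = spark.sql("SELECT name FROM people WHERE age > 26")

       # 3. Loading external data:
       df3 = spark.read.json("people.json")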

34. Spark SQL
   SQL context:

       # Run SQL statements; returns a DataFrame
       students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")

   http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
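
   Note that the query presupposes a table or view named people registered with the SQL context beforehand, e.g. as in the previous sketch; the occupation column is part of the slide's example schema.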

35. Spark – DataFrames
   [Figure]
   Source: Spark in Action (book, see literature)
