
Apache Spark: A Unified Engine for Big Data Processing


  1. Apache Spark: A Unified Engine for Big Data Processing Presented by: Huanyi Chen

  2. Apache Spark: A Unified Engine for Big Data Processing
     - Engine?
     - Unified?

  3. Apache Spark: A Unified Engine for Big Data Processing
     - Engine?
       - Converts one form of data into other useful forms
     - Unified?
       - Multiple types of conversions

  4. Apache Spark: A Unified Engine for Big Data Processing
     - What is Apache Spark? (Engine)
     - How can it make multiple types of conversions over big data? (Unified)

  5. What is Apache Spark? A framework like MapReduce
     - Resilient Distributed Datasets (RDDs)
     - RDDs

  6. Resilient Distributed Datasets (RDDs)

  7. Resilient Distributed Datasets (RDDs)

  8. Resilient Distributed Datasets (RDDs) [Figure; labels: I/O, I/O]

  9. Resilient Distributed Datasets (RDDs)

  10. Resilient Distributed Datasets (RDDs)
      An RDD is a read-only, partitioned collection of records.
      - Transformations
        - Create RDDs (map, filter, join, etc.)
      - Actions
        - Return a value to the application
        - Or export data to a storage system
      - Persistence
        - Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
      - Partitioning
        - Users can ask that an RDD's elements be partitioned across machines based on a key in each record.
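
The distinction between lazy transformations, actions, and persistence can be sketched in plain Python. This is a toy model only: `ToyRDD`, `parallelize`, and the `evaluations` counter are hypothetical names invented for illustration and are not Spark's API or implementation.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy read-only, partitioned, lazily computed collection (not Spark's API)."""

    def __init__(self, compute: Callable[[], List[list]]):
        self._compute = compute                   # recipe for building the partitions
        self._cache: Optional[List[list]] = None  # filled only if persist() was requested
        self._persist = False
        self.evaluations = 0                      # how often the recipe actually ran

    @staticmethod
    def parallelize(data, num_partitions: int) -> "ToyRDD":
        items = list(data)
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda: [list(p) for p in parts])

    def _partitions(self) -> List[list]:
        if self._cache is not None:
            return self._cache
        self.evaluations += 1
        parts = self._compute()
        if self._persist:
            self._cache = parts
        return parts

    # Transformations: describe a new ToyRDD, compute nothing yet.
    def map(self, f) -> "ToyRDD":
        return ToyRDD(lambda: [[f(x) for x in p] for p in self._partitions()])

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda: [[x for x in p if pred(x)] for p in self._partitions()])

    # Persistence: ask for the computed partitions to be kept in memory.
    def persist(self) -> "ToyRDD":
        self._persist = True
        return self

    # Actions: trigger the computation and return a value to the caller.
    def collect(self) -> list:
        return [x for p in self._partitions() for x in p]

    def count(self) -> int:
        return sum(len(p) for p in self._partitions())


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()
evaluations_before_action = squares_of_evens.evaluations  # still 0: nothing has run

total = squares_of_evens.count()               # first action computes and caches
values = sorted(squares_of_evens.collect())    # second action reuses the cache
```

Because the result was persisted, the second action does not recompute the filter and map steps; without `persist()`, each action would replay them.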

  11. Resilient Distributed Datasets (RDDs)

  12. Resilient Distributed Datasets (RDDs)
      - Lineage
        - An RDD carries enough information about how it was derived from other datasets to recompute its partitions.
      - Narrow dependencies
        - Each partition of the parent RDD is used by at most one partition of the child RDD.
      - Wide dependencies
        - A partition of the parent RDD may be used by multiple child partitions.
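
As a rough illustration of these definitions, the sketch below classifies a dependency from a mapping of child partitions to the parent partitions they read, and then rebuilds one lost partition by replaying a narrow lineage. The function names and the example mappings are assumptions made for this sketch; Spark tracks dependencies internally and does not expose them this way.

```python
from collections import Counter
from typing import Dict, List


def classify_dependency(child_to_parents: Dict[int, List[int]]) -> str:
    """Return "narrow" if every parent partition feeds at most one child partition."""
    uses = Counter(p for parents in child_to_parents.values() for p in set(parents))
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


# map/filter style: child partition i reads only parent partition i.
map_kind = classify_dependency({0: [0], 1: [1], 2: [2]})
# Merge style: a child reads several parents, but each parent feeds one child.
merge_kind = classify_dependency({0: [0, 1], 1: [2]})
# Shuffle style: every child partition reads every parent partition.
shuffle_kind = classify_dependency({0: [0, 1, 2], 1: [0, 1, 2]})

# Lineage as a chain of per-partition steps. With narrow dependencies only,
# a lost partition is rebuilt from the matching source partition alone.
source = [[1, 2], [3, 4], [5, 6]]
lineage = [
    lambda part: [x * 10 for x in part],      # map
    lambda part: [x for x in part if x > 20], # filter
]


def recompute_partition(index: int) -> list:
    part = source[index]
    for step in lineage:
        part = step(part)
    return part


recovered = recompute_partition(1)
```

A wide dependency would instead require data from every parent partition, which is why recovery after a shuffle is more expensive.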

  13. Resilient Distributed Datasets (RDDs)

  14. Resilient Distributed Datasets (RDDs)
      Example: run an action on RDD G

  15. MapReduce vs Spark
      [Figure panels: MapReduce Ecosystem, Spark Ecosystem]

  16. Higher-Level Libraries

  17. SQL and DataFrames

  18. SQL and DataFrames
      - DataFrames = RDDs + Schema = Tables
      - Spark SQL's DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.
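
The idea of "records plus a schema" with an inline UDF can be sketched without Spark. `ToyDataFrame`, `with_column`, and `select` below are hypothetical names for a toy model; they are not the Spark SQL API, and the sample rows are made up for illustration.

```python
from typing import Callable, List, Sequence


class ToyDataFrame:
    """Toy table: a list of row tuples plus a schema of column names (not Spark's API)."""

    def __init__(self, schema: Sequence[str], rows: List[tuple]):
        self.schema = list(schema)
        self.rows = [tuple(r) for r in rows]

    def with_column(self, name: str, func: Callable, *input_cols: str) -> "ToyDataFrame":
        # The "UDF" is just a function passed inline; no packaging or registration step.
        idx = [self.schema.index(c) for c in input_cols]
        new_rows = [r + (func(*(r[i] for i in idx)),) for r in self.rows]
        return ToyDataFrame(self.schema + [name], new_rows)

    def select(self, *cols: str) -> "ToyDataFrame":
        idx = [self.schema.index(c) for c in cols]
        return ToyDataFrame(cols, [tuple(r[i] for i in idx) for r in self.rows])


people = ToyDataFrame(["name", "age"], [("Ada", 36), ("Linus", 17)])
labelled = people.with_column("is_adult", lambda age: age >= 18, "age")
result = labelled.select("name", "is_adult").rows
```

The schema is what lets columns be addressed by name; the rows themselves remain an ordinary collection of records.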

  19. UDF in MySQL

  20. UDF in Spark SQL

  21. Spark Streaming

  22. Spark Streaming
      - Discretized stream processing model
      - Continuous operator processing model
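
A minimal sketch of the discretized model, assuming events arrive as `(timestamp, value)` pairs sorted by timestamp: the stream is cut into fixed-interval batches and each batch is processed as a small batch job that updates running state. The `discretize` helper, the sample events, and the one-second interval are all assumptions for illustration, not Spark Streaming's API.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def discretize(events: Iterable[Tuple[float, str]], batch_interval: float) -> Iterator[List[str]]:
    """Group time-ordered (timestamp, value) events into fixed-interval batches."""
    batch: List[str] = []
    batch_end = None
    for ts, value in events:
        if batch_end is None:
            # Align the first batch boundary to a multiple of the interval.
            batch_end = (ts // batch_interval + 1) * batch_interval
        while ts >= batch_end:
            yield batch  # emit the finished interval, even if it is empty
            batch, batch_end = [], batch_end + batch_interval
        batch.append(value)
    if batch:
        yield batch


events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (3.5, "c")]
batches = list(discretize(events, batch_interval=1.0))

# Each interval is handled by an ordinary batch computation over that batch.
running = Counter()
snapshots = []
for batch in batches:
    running.update(batch)
    snapshots.append(dict(running))
```

In the continuous operator model, by contrast, long-lived operators would update their state one record at a time instead of one batch at a time.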

  23. GraphX

  24. GraphX

  25. GraphX

  26. GraphX

  27. GraphX

  28. GraphX
      - On its own, it does not beat specialized graph-parallel systems
      - But it outperforms them in graph analytics pipelines

  29. MLlib

  30. MLlib
      - More than 50 common algorithms for distributed model training
      - Supports pipeline construction on Spark
      - Integrates well with other Spark libraries

  31. Why use Apache Spark?
      - Ecosystem
      - Competitive performance
      - Low cost in sharing data
      - Low latency of MapReduce steps
      - Control over bottleneck resources

  32. Apache Spark in 2016
      - Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
      - Between 2010 and 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.

  33. Apache Spark Today

  34. Apache Spark: A Unified Engine for Big Data Processing
      - What is Apache Spark?
        - Apache Spark = MapReduce + RDDs
      - How can it make multiple types of conversions over big data?
        - Higher-level libraries enable Apache Spark to handle different types of big data workloads

  35. “Try Apache Spark if you are new to the big data processing world” (Huanyi Chen)

  36. Q&A
      - What issues arise from persisting data in memory? For example, garbage collection?
      - What are the Parallel Random Access Machine model and the Bulk Synchronous Parallel model? Can these two models express any computation in the distributed world?
      - Will optimizing one library cause other libraries to lose performance?
      - Is memory really the next generation of storage?
