

  1. Applied Spark: From Concepts to Bitcoin Analytics – Andrew F. Hart – ahart@apache.org | @andrewfhart

  2. My Day Job • CTO, Pogoseat • Upgrade technology for live events

  3. Additionally… • Member, Apache Software Foundation

  4. Additionally… • Founder, Data Fluency – a software consultancy specializing in appropriate data solutions for startups and SMBs – Helping clients make good decisions and leverage power traditionally accessible only to big business

  5. Previously… • NASA Jet Propulsion Laboratory – Building data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.)

  6. Apache Spark

  7. Spark is… • General-purpose cluster computing software that maximizes use of cluster memory to process data • Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for MapReduce-style problems

  8. Spark… • Was developed at the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley in 2009 • Is open source software presently under the governance of the Apache Software Foundation

  9. Why Spark Exists • A confluence of three trends: – increased volume of digital data – decreasing cost of computer memory (RAM) – data processing technology liberation

  10. Digital data volume • Early days – low-resolution sensors, comparatively few people on the internet – proprietary data, custom solutions, expensive, custom hardware

  11. Digital data volume • Modern era – ubiquitous, high-resolution cameras – mobile devices packed with sensors – the Internet of Things (IoT) – open source software, cheap commodity hardware

  12. We've gone from this Internet…

  13. To this… (Humanity's global communication platform)

  14. • 1.5 billion connected PCs • 3.2 billion connected people • 6 billion connected mobile devices

  15. What are we doing with all of this? Every minute: • 18,000 votes cast on Reddit • 51,000 apps downloaded by Apple users • 350,000 tweets posted on Twitter • 4,100,000 likes recorded on Facebook

  16. We are awash in data… • Monetizing this data is a core competency for many businesses • We need tools to do this effectively at today's scale

  17. 2. Tool support and technology liberation

  18. How far do you want to go back? We have always used tools to help us cope with data

  19. VisiCalc • An early "big data" tool – Allowed businesses to move from the chalkboard to the digital spreadsheet – A phenomenal increase in productivity when running the numbers for a business

  20. Modern-era Spreadsheet Tech • Microsoft Excel: 1,048,576 rows x 16,384 columns

  21. Open Source Alternatives Exist… • Microsoft Excel: 1,048,576 rows x 16,384 columns • Apache OpenOffice: 1,048,576 rows x 1,024 columns

  22. Relational Database Systems • Support thousands of tables, millions of rows…

  23. Relational Database Systems • Support thousands of tables, millions of rows… • Viable open source alternatives exist for many use cases

  24. Modern Big Data Era • The MapReduce algorithm (2004): parallelizes large-scale computation across clusters of servers

  25. Modern Big Data Era • Hadoop: an open source processing framework that lets MapReduce applications run on large clusters of commodity (unreliable) hardware

  26. 3. Commoditization of computer memory

  27. Early Days… • Main memory was hand-made. • You could see each bit.

  28. Modern Era… • AWS EC2 r3.8xlarge: 244 GiB for US$1.41/hr*

  29. Why talk about memory? • The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantage in many scenarios

  30. How Spark Works

  31. Primary abstraction in Spark: Resilient Distributed Datasets (RDD) • Immutable (read-only), partitioned dataset • Processed in parallel on each cluster node • Fault-tolerant – resilient to node failure

  32. Primary abstraction in Spark: Resilient Distributed Datasets (RDD) • Uses the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (…instead of serializing to disk)

  33. Traditional MapReduce: (diagram) data on disk → Read → Map-1 … Map-n → Write → tuples on disk → Read → Reduce-1 … Reduce-n → Write → data on disk. Every stage reads its input from HDFS and writes its output back to disk.

  34. Spark RDD Architecture: (diagram) data on disk → Read → Map-1 … Map-n → RDD in cluster memory → Reduce-1 … Reduce-n → Write → data on disk. Intermediate results stay in memory as an RDD; disk is touched only at the start and end of the job.

  35. Unified computational model: • Spark unifies the batch and streaming models, which traditionally require different architectures • Sort of like taking a limit in calculus – if you imagine a time series of RDDs over ever-smaller windows of time, you approximate a streaming workflow (see the sketch below)
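
The micro-batch intuition above is how Spark Streaming's DStream API works: the stream is chopped into small batches, and each batch is handed to your code as an ordinary RDD. A minimal PySpark sketch; the socket source on localhost:9999 and the word-count logic are illustrative assumptions, not from the talk:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MicroBatchSketch")   # 2 threads: receiver + processing
    ssc = StreamingContext(sc, batchDuration=1)         # one RDD per 1-second window

    lines = ssc.socketTextStream("localhost", 9999)     # each batch arrives as an RDD
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                     # print each batch's counts

    ssc.start()
    ssc.awaitTermination()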

  36. Two ways to create an RDD in Spark programs: • "Parallelize" an existing collection (e.g. a Python list) in the driver program • Reference a dataset on external storage – text files on disk – anything with a supported Hadoop InputFormat
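
Both creation paths look like this in PySpark. A minimal sketch, assuming a SparkContext named sc (the interactive shell provides one; standalone programs create their own, as sketched under slide 46) and a placeholder HDFS path:

    # 1) Parallelize an existing collection held in the driver program
    nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)   # split into 2 partitions

    # 2) Reference a dataset on external storage (placeholder path)
    lines = sc.textFile("hdfs:///data/transactions.txt")

    print(nums.getNumPartitions())   # partitions are processed in parallel -> 2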

  37. RDDs from RDDs: • RDDs are immutable (read-only) • The result of applying a transformation to an RDD is a new RDD • RDDs can be persisted to memory – reducing the need to re-compute the RDD every time

  38. Writing Spark Programs: Think of Spark programs as a way to describe a sequence of transformations and actions that should be applied to an RDD

  39. Writing Spark Programs: • Transformations create a new dataset from an existing one (e.g.: map) • Actions return a value after running a computation (e.g.: reduce)
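
On a toy RDD the distinction looks like this (a sketch assuming sc; the numbers are made up):

    rdd = sc.parallelize([1, 2, 3, 4])

    squares = rdd.map(lambda x: x * x)           # transformation: yields a new RDD
    total = squares.reduce(lambda a, b: a + b)   # action: returns a plain value
    print(total)                                 # 1 + 4 + 9 + 16 = 30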

  40. Writing Spark Programs: • Spark provides a rich set of transformations: map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce

  41. Writing Spark Programs: • Spark provides a rich set of actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach
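
A quick tour through a few operations from the two lists, sketched on made-up data (assumes sc):

    words = sc.parallelize(["btc", "eth", "btc", "ltc", "btc", "eth"])
    pairs = words.map(lambda w: (w, 1))   # transformation: key/value pairs

    print(words.distinct().count())       # transformation + action -> 3
    print(pairs.sortByKey().take(2))      # -> [('btc', 1), ('btc', 1)]
    print(pairs.countByKey())             # action: counts per key (btc: 3, eth: 2, ltc: 1)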

  42. Writing Spark Programs: • Transformations are lazily evaluated. They are only computed when a subsequent action (which must return a result) is run. • Transformations are recomputed each time an action is run (unless you explicitly persist the resulting RDD to memory)
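
Laziness and persistence in one sketch (assumes sc; the path and filter predicate are placeholders):

    lines = sc.textFile("hdfs:///data/transactions.txt")   # nothing is read yet
    errors = lines.filter(lambda l: "error" in l)          # still nothing computed

    errors.persist()         # keep the filtered RDD in cluster memory once built
    print(errors.count())    # first action: triggers the read and the filter
    print(errors.count())    # second action: served from memory, no recompute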

  43. Structure of Spark Programs: Spark programs have two principal components: • Driver program • Worker function

  44. Structure of Spark Programs: Driver program • Executes on the master node • Establishes context

  45. Structure of Spark Programs: Worker (processing) function • Executes on each worker node • Computes transformations and actions on RDD partitions
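
Annotating a tiny program with where each piece runs makes the split concrete (a sketch, assuming sc):

    data = sc.parallelize(range(1000))   # driver: describes the dataset

    def is_even(x):                      # worker function: shipped to each node
        return x % 2 == 0                # and applied to its RDD partitions

    evens = data.filter(is_even)         # driver: records the transformation
    print(evens.count())                 # driver: the action triggers distributed
                                         # work and collects the result -> 500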

  46. Structure of Spark Programs: SparkContext • Holds all of the information about the cluster • Manages what gets shipped to nodes • Makes life extremely easy for developers
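
In a standalone program, the driver establishes the context explicitly. A minimal sketch; the application name and master URL are placeholders:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("BitcoinAnalytics")      # placeholder app name
            .setMaster("spark://master:7077"))   # placeholder master URL
    sc = SparkContext(conf=conf)   # holds cluster info, ships code to the nodes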

  47. Structure of Spark Programs: Shared Variables • Broadcast variables: efficiently share static data with the cluster nodes • Accumulators: write-only variables that serve as counters
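
Both kinds of shared variable in one sketch (assumes sc; the exchange rates and transactions are made up for illustration):

    rates = sc.broadcast({"BTC": 415.0, "ETH": 11.0})   # static lookup, sent to nodes once
    bad_rows = sc.accumulator(0)                        # write-only counter

    def to_usd(tx):
        symbol, amount = tx
        if symbol not in rates.value:   # workers read the broadcast value
            bad_rows.add(1)             # workers may only add to an accumulator
            return 0.0
        return amount * rates.value[symbol]

    txs = sc.parallelize([("BTC", 2.0), ("ETH", 5.0), ("XYZ", 1.0)])
    print(txs.map(to_usd).sum())   # action forces evaluation -> 885.0
    print(bad_rows.value)          # only the driver reads the total -> 1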

  49. Interacting with Spark

  50. Spark APIs: Spark provides APIs for several languages: • Scala • Java • Python • R • SQL

  51. Using Spark: There are two main ways to leverage Spark • Interactively, through the included command-line shell • Programmatically, via standalone programs submitted as jobs to the master node
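
Concretely, the two entry points look like this (the master URL and script name are placeholders):

    # Interactive: the pyspark shell starts with a ready-made SparkContext, sc
    $ pyspark

    # Programmatic: submit a standalone program as a job to the master node
    $ spark-submit --master spark://master:7077 bitcoin_job.py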
