Applied Spark: From Concepts to Bitcoin Analytics
Andrew F. Hart – ahart@apache.org | @andrewfhart
My Day Job
• CTO, Pogoseat – upgrade technology for live events

Additionally…
• Member, Apache Software Foundation

Additionally…
• Founder, Data Fluency – software consultancy specializing in appropriate data solutions for startups and SMBs
• Help clients make good decisions and leverage power traditionally accessible only to big business

Previously…
• NASA Jet Propulsion Laboratory – building data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.)
Apache Spark
Spark is…
• General-purpose cluster computing software that maximizes use of cluster memory to process data
• Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for map-reduce style problems

Spark…
• Was developed at the Algorithms, Machines, and People Laboratory (AMP Lab) at the University of California, Berkeley in 2009
• Is open source software presently under the governance of the Apache Software Foundation
Why Spark Exists
• Confluence of three trends:
1. Increasing volume of digital data
2. Data processing technology liberation and tool support
3. Decreasing cost of computer memory (RAM)
1. Digital data volume
• Early days: low-resolution sensors, comparatively few people on the internet; proprietary data, custom solutions, expensive custom hardware
• Modern era: ubiquitous high-resolution cameras, mobile devices packed with sensors, the Internet of Things (IoT), open source software, cheap commodity hardware
We've gone from this Internet…

To this… (humanity's global communication platform)
• 1.5 billion connected PCs
• 3.2 billion connected people
• 6 billion connected mobile devices
What are we doing with all of this? Every minute:
• 18,000 votes cast on Reddit
• 51,000 apps downloaded by Apple users
• 350,000 tweets posted on Twitter
• 4,100,000 likes recorded on Facebook

We are awash in data…
• Monetizing this data is a core competency for many businesses
• We need tools to do this effectively at today's scale
2. Tool support and technology liberation
How far do you want to go back? We have always used tools to help us cope with data.

VisiCalc
• Early "big data" tool
• Allowed businesses to move from the chalkboard to the digital spreadsheet
• Phenomenal increase in productivity running numbers for a business

Modern-era spreadsheet tech
• Microsoft Excel: 1,048,576 rows x 16,384 columns

Open source alternatives exist…
• Microsoft Excel: 1,048,576 rows x 16,384 columns
• Apache OpenOffice: 1,048,576 rows x 1,024 columns
Relational Database Systems
• Support thousands of tables, millions of rows…
• Viable open source alternatives exist for many use cases
Modern Big Data Era
• MapReduce algorithm (2004): parallelize large-scale computation across clusters of servers
• Hadoop: open source processing framework for running MapReduce applications on large clusters of commodity (unreliable) hardware
3. Commoditization of computer memory
Early Days…
• Main memory was handmade.
• You could see each bit.

Modern Era…
• AWS EC2 r3.8xlarge: 244 GiB for US$1.41/hr*

Why talk about memory?
• The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantages in many scenarios
How Spark Works
Primary abstraction in Spark: Resilient Distributed Datasets (RDDs)
• Immutable (read-only), partitioned datasets
• Processed in parallel on each cluster node
• Fault-tolerant – resilient to node failure
• Use the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (…instead of serializing to disk)
Traditional MapReduce (diagram): data on disk in HDFS is read by the map tasks (Map-1 … Map-n), intermediate tuples are written to and re-read from disk, then the reduce tasks (Reduce-1 … Reduce-n) write their results back to HDFS.
Spark RDD architecture (diagram): data is read from HDFS once, the map tasks (Map-1 … Map-n) hold intermediate results in cluster memory as an RDD, the reduce tasks (Reduce-1 … Reduce-n) consume that RDD from memory, and results are written back to disk.
Unified computational model:
• Spark unifies the batch and streaming models, which traditionally require different architectures
• Sort of like taking a limit in calculus – if you imagine a time series of RDDs over ever-smaller windows of time, you can approximate a streaming workflow (see the sketch below)
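A minimal sketch of that micro-batch idea using Spark Streaming's Python API; the local master, the 1-second batch interval, and the text server on port 9999 are illustrative assumptions, not details from the talk.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 1)                    # each 1-second window becomes a small RDD

lines = ssc.socketTextStream("localhost", 9999)  # assumes a text source on this port
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```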
Two ways to create an RDD in Spark programs:
• "Parallelize" an existing collection (e.g. a Python list) in the driver program
• Reference a dataset on external storage – text files on disk, or anything with a supported Hadoop InputFormat
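A minimal PySpark sketch of both creation paths; the local master and the "data.txt" path are placeholders, not artifacts from the talk.

```python
from pyspark import SparkContext

sc = SparkContext("local", "RDDCreationSketch")

# 1) Parallelize an existing collection held in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset on external storage (text files, or anything with a
#    supported Hadoop InputFormat); "data.txt" is a placeholder path
lines = sc.textFile("data.txt")
```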
RDDs from RDDs:
• RDDs are immutable (read-only)
• Applying a transformation to an RDD produces a new RDD; the original is untouched
• RDDs can be persisted to memory, reducing the need to recompute them every time
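A small illustration of that immutability; the "data.txt" path is again a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local", "ImmutabilitySketch")
lines = sc.textFile("data.txt")                  # placeholder path, as before

# Transformations never modify 'lines'; each one returns a brand-new RDD
errors = lines.filter(lambda l: "ERROR" in l)
words = lines.flatMap(lambda l: l.split())

# Keep 'errors' in memory so later actions do not recompute it from scratch
errors.persist()
```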
Writing Spark Programs:
• Think of a Spark program as describing a sequence of transformations and actions to be applied to an RDD
• Transformations create a new dataset from an existing one (e.g. map)
• Actions return a value to the driver after running a computation (e.g. reduce)
Writing Spark Programs:
• Spark provides a rich set of transformations: map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce
• …and a rich set of actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach (see the sketch below)
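To make the transformation/action split concrete, here is a small hypothetical example; the (address, amount) pairs are made up, not data from the talk.

```python
from pyspark import SparkContext

sc = SparkContext("local", "TransformActionSketch")

# Hypothetical (address, amount) pairs
txs = sc.parallelize([("a1", 0.5), ("a2", 1.2), ("a1", 0.3), ("a3", 2.0)])

# Transformations: build new RDDs, nothing is computed yet
per_addr = txs.reduceByKey(lambda a, b: a + b)    # total amount per address
large = per_addr.filter(lambda kv: kv[1] > 1.0)

# Actions: trigger the computation and return values to the driver
print(large.collect())    # e.g. [('a2', 1.2), ('a3', 2.0)]
print(per_addr.count())   # 3
```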
Writing Spark Programs:
• Transformations are lazily evaluated: they are only computed when a subsequent action (which must return a result to the driver) is run
• Transformations are recomputed each time an action is run, unless you explicitly persist the resulting RDD to memory
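A sketch of what lazy evaluation means in practice; "events.log" is a placeholder input file.

```python
from pyspark import SparkContext

sc = SparkContext("local", "LazyEvalSketch")

errors = sc.textFile("events.log").filter(lambda l: "ERROR" in l)  # nothing runs yet

print(errors.count())    # action: the file is read and filtered now
print(errors.count())    # without persistence, the filter is recomputed here

errors.cache()           # explicitly persist the RDD in memory
print(errors.count())    # computed (and cached) one more time...
print(errors.count())    # ...then served from memory on subsequent actions
```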
Structure of Spark Programs:
Spark programs have two principal components:
• Driver program
• Worker function

Driver program
• Executes on the master node
• Establishes the SparkContext

Worker (processing) function
• Executes on each worker node
• Computes transformations and actions on RDD partitions

SparkContext
• Holds all of the information about the cluster
• Manages what gets shipped to the nodes
• Makes life extremely easy for developers
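In PySpark the driver builds that context explicitly; the application name and master URL below are placeholders, not values from the talk.

```python
from pyspark import SparkConf, SparkContext

# Driver-side setup: the context describes the cluster and takes care of
# shipping the worker functions (closures) out to the nodes
conf = (SparkConf()
        .setAppName("BitcoinAnalytics")        # placeholder app name
        .setMaster("spark://master:7077"))     # placeholder master URL
sc = SparkContext(conf=conf)
```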
Shared Variables
• Broadcast variables: efficiently share static data with the cluster nodes
• Accumulators: write-only variables that serve as counters
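A minimal sketch of both kinds of shared variable; the exchange rates and records are purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local", "SharedVarsSketch")

# Broadcast variable: a read-only lookup table shipped once to each node
rates = sc.broadcast({"BTC": 416.0, "USD": 1.0})

# Accumulator: workers may only add to it; the driver reads the total
bad_records = sc.accumulator(0)

def to_usd(record):
    symbol, amount = record
    if symbol not in rates.value:
        bad_records.add(1)        # count records we could not convert
        return 0.0
    return amount * rates.value[symbol]

total = sc.parallelize([("BTC", 2.0), ("XYZ", 1.0)]).map(to_usd).sum()
print(total, bad_records.value)   # 832.0 1
```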
Interacting with Spark

Spark APIs:
Spark provides APIs for several languages:
• Scala
• Java
• Python
• R
• SQL

Using Spark:
There are two main ways to leverage Spark:
• Interactively, through the included command-line interface (shell)
• Programmatically, via standalone programs submitted as jobs to the master node
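As a rough sketch, a standalone PySpark job might look like the file below; the file name and paths are placeholders. It would typically be launched with something like `bin/spark-submit wordcount.py`, while the same statements could be typed line by line into the interactive `bin/pyspark` shell.

```python
# wordcount.py – a minimal standalone Spark program
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("data.txt")                  # placeholder input path
                .flatMap(lambda l: l.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("counts")                    # placeholder output path
    sc.stop()
```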