Applied Spark: From Concepts to Bitcoin Analytics
Andrew F. Hart – ahart@apache.org | @andrewfhart
My Day Job
• CTO, Pogoseat – upgrade technology for live events

Additionally…
• Member, Apache Software Foundation

Additionally…
• Founder, Data Fluency – software consultancy specializing in appropriate data solutions for startups and SMBs
• Help clients make good decisions and leverage power traditionally accessible only to big business

Previously…
• NASA Jet Propulsion Laboratory – building data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.)
Apache Spark
Spark is…
• General-purpose cluster computing software that maximizes use of cluster memory to process data
• Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for map-reduce style problems

Spark…
• Was developed at the Algorithms, Machines, and People Laboratory (AMP Lab) at the University of California, Berkeley in 2009
• Is open source software presently under the governance of the Apache Software Foundation
Why Spark Exists
• Confluence of three trends:
1. Increasing volume of digital data
2. Data processing technology liberation and tool support
3. Decreasing cost of computer memory (RAM)
1. Digital data volume
• Early days: low-resolution sensors, comparatively few people on the internet; proprietary data, custom solutions, expensive custom hardware
• Modern era: ubiquitous high-resolution cameras, mobile devices packed with sensors, the Internet of Things (IoT), open source software, cheap commodity hardware
We've gone from this Internet…

To this… (humanity's global communication platform)
• 1.5 billion connected PCs
• 3.2 billion connected people
• 6 billion connected mobile devices
What are we doing with all of this? Every minute:
• 18,000 votes cast on Reddit
• 51,000 apps downloaded by Apple users
• 350,000 tweets posted on Twitter
• 4,100,000 likes recorded on Facebook

We are awash in data…
• Monetizing this data is a core competency for many businesses
• We need tools to do this effectively at today's scale
2. Tool support and technology liberation
How far do you want to go back? We have always used tools to help us cope with data.

VisiCalc
• Early "big data" tool
• Allowed businesses to move from the chalkboard to the digital spreadsheet
• Phenomenal increase in productivity running numbers for a business

Modern-era spreadsheet tech
• Microsoft Excel: 1,048,576 rows x 16,384 columns

Open source alternatives exist…
• Microsoft Excel: 1,048,576 rows x 16,384 columns
• Apache OpenOffice: 1,048,576 rows x 1,024 columns
Relational Database Systems
• Support thousands of tables, millions of rows…
• Viable open source alternatives exist for many use cases
Modern Big Data Era
• MapReduce algorithm (2004): parallelize large-scale computation across clusters of servers
• Hadoop: open source processing framework for running MapReduce applications on large clusters of commodity (unreliable) hardware
3. Commoditization of computer memory
Early Days…
• Main memory was handmade.
• You could see each bit.

Modern Era…
• AWS EC2 r3.8xlarge: 244 GiB for US$1.41/hr*

Why talk about memory?
• The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantages in many scenarios
How Spark Works
Primary abstraction in Spark: Resilient Distributed Datasets (RDDs)
• Immutable (read-only), partitioned datasets
• Processed in parallel on each cluster node
• Fault-tolerant – resilient to node failure
• Use the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (…instead of serializing to disk)
Traditional MapReduce (diagram): data on disk in HDFS is read by the map tasks (Map-1 … Map-n), intermediate tuples are written to and re-read from disk, then the reduce tasks (Reduce-1 … Reduce-n) write their results back to HDFS.
Spark RDD architecture (diagram): data is read from HDFS once, the map tasks (Map-1 … Map-n) hold intermediate results in cluster memory as an RDD, the reduce tasks (Reduce-1 … Reduce-n) consume that RDD from memory, and results are written back to disk.
Unified computational model:
• Spark unifies the batch and streaming models, which traditionally require different architectures
• Sort of like taking a limit in calculus – if you imagine a time series of RDDs over ever-smaller windows of time, you can approximate a streaming workflow (see the sketch below)
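A minimal sketch of that micro-batch idea using Spark Streaming's Python API; the local master, the 1-second batch interval, and the text server on port 9999 are illustrative assumptions, not details from the talk.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 1)                    # each 1-second window becomes a small RDD

lines = ssc.socketTextStream("localhost", 9999)  # assumes a text source on this port
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```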
Two ways to create an RDD in Spark programs:
• "Parallelize" an existing collection (e.g. a Python list) in the driver program
• Reference a dataset on external storage – text files on disk, or anything with a supported Hadoop InputFormat
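A minimal PySpark sketch of both creation paths; the local master and the "data.txt" path are placeholders, not artifacts from the talk.

```python
from pyspark import SparkContext

sc = SparkContext("local", "RDDCreationSketch")

# 1) Parallelize an existing collection held in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset on external storage (text files, or anything with a
#    supported Hadoop InputFormat); "data.txt" is a placeholder path
lines = sc.textFile("data.txt")
```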
RDDs from RDDs:
• RDDs are immutable (read-only)
• Applying a transformation to an RDD produces a new RDD; the original is untouched
• RDDs can be persisted to memory, reducing the need to recompute them every time
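A small illustration of that immutability; the "data.txt" path is again a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local", "ImmutabilitySketch")
lines = sc.textFile("data.txt")                  # placeholder path, as before

# Transformations never modify 'lines'; each one returns a brand-new RDD
errors = lines.filter(lambda l: "ERROR" in l)
words = lines.flatMap(lambda l: l.split())

# Keep 'errors' in memory so later actions do not recompute it from scratch
errors.persist()
```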
Writing Spark Programs:
• Think of a Spark program as describing a sequence of transformations and actions to be applied to an RDD
• Transformations create a new dataset from an existing one (e.g. map)
• Actions return a value to the driver after running a computation (e.g. reduce)
Writing Spark Programs:
• Spark provides a rich set of transformations: map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce
• …and a rich set of actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach (see the sketch below)
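To make the transformation/action split concrete, here is a small hypothetical example; the (address, amount) pairs are made up, not data from the talk.

```python
from pyspark import SparkContext

sc = SparkContext("local", "TransformActionSketch")

# Hypothetical (address, amount) pairs
txs = sc.parallelize([("a1", 0.5), ("a2", 1.2), ("a1", 0.3), ("a3", 2.0)])

# Transformations: build new RDDs, nothing is computed yet
per_addr = txs.reduceByKey(lambda a, b: a + b)    # total amount per address
large = per_addr.filter(lambda kv: kv[1] > 1.0)

# Actions: trigger the computation and return values to the driver
print(large.collect())    # e.g. [('a2', 1.2), ('a3', 2.0)]
print(per_addr.count())   # 3
```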
Writing Spark Programs:
• Transformations are lazily evaluated: they are only computed when a subsequent action (which must return a result to the driver) is run
• Transformations are recomputed each time an action is run, unless you explicitly persist the resulting RDD to memory
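A sketch of what lazy evaluation means in practice; "events.log" is a placeholder input file.

```python
from pyspark import SparkContext

sc = SparkContext("local", "LazyEvalSketch")

errors = sc.textFile("events.log").filter(lambda l: "ERROR" in l)  # nothing runs yet

print(errors.count())    # action: the file is read and filtered now
print(errors.count())    # without persistence, the filter is recomputed here

errors.cache()           # explicitly persist the RDD in memory
print(errors.count())    # computed (and cached) one more time...
print(errors.count())    # ...then served from memory on subsequent actions
```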
Structure of Spark Programs:
Spark programs have two principal components:
• Driver program
• Worker function

Driver program
• Executes on the master node
• Establishes the SparkContext

Worker (processing) function
• Executes on each worker node
• Computes transformations and actions on RDD partitions

SparkContext
• Holds all of the information about the cluster
• Manages what gets shipped to the nodes
• Makes life extremely easy for developers
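In PySpark the driver builds that context explicitly; the application name and master URL below are placeholders, not values from the talk.

```python
from pyspark import SparkConf, SparkContext

# Driver-side setup: the context describes the cluster and takes care of
# shipping the worker functions (closures) out to the nodes
conf = (SparkConf()
        .setAppName("BitcoinAnalytics")        # placeholder app name
        .setMaster("spark://master:7077"))     # placeholder master URL
sc = SparkContext(conf=conf)
```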
Shared Variables
• Broadcast variables: efficiently share static data with the cluster nodes
• Accumulators: write-only variables that serve as counters
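A minimal sketch of both kinds of shared variable; the exchange rates and records are purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local", "SharedVarsSketch")

# Broadcast variable: a read-only lookup table shipped once to each node
rates = sc.broadcast({"BTC": 416.0, "USD": 1.0})

# Accumulator: workers may only add to it; the driver reads the total
bad_records = sc.accumulator(0)

def to_usd(record):
    symbol, amount = record
    if symbol not in rates.value:
        bad_records.add(1)        # count records we could not convert
        return 0.0
    return amount * rates.value[symbol]

total = sc.parallelize([("BTC", 2.0), ("XYZ", 1.0)]).map(to_usd).sum()
print(total, bad_records.value)   # 832.0 1
```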
Interacting with Spark

Spark APIs:
Spark provides APIs for several languages:
• Scala
• Java
• Python
• R
• SQL

Using Spark:
There are two main ways to leverage Spark:
• Interactively, through the included command-line interface (shell)
• Programmatically, via standalone programs submitted as jobs to the master node
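As a rough sketch, a standalone PySpark job might look like the file below; the file name and paths are placeholders. It would typically be launched with something like `bin/spark-submit wordcount.py`, while the same statements could be typed line by line into the interactive `bin/pyspark` shell.

```python
# wordcount.py – a minimal standalone Spark program
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("data.txt")                  # placeholder input path
                .flatMap(lambda l: l.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("counts")                    # placeholder output path
    sc.stop()
```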