Spark: A Coding Joyride
Doug Bateman, Director of Training, NewCircle

Objectives
• Show Spark's ability to rapidly process Big Data
• Extract information with RDDs
• Query data using DataFrames
• Visualize and plot data
• Create a machine-learning pipeline with Spark ML and MLlib
• We'll also discuss the internals that make Spark 10-100 times faster than Hadoop MapReduce and Hive

About Me
Engineer, Architect & Instructor
• Developing with Java since 1995 (Java 1.0)
• 15+ years as a software developer, architect, and consultant
• Director of Training and Curriculum Lead at NewCircle
For Fun
• Sailing
• Rock climbing
• Snowboarding
• Chess

Who are you?
0) I am new to Spark.
1) I have used Spark hands-on before.
2) I have more than 1 year of hands-on experience with Spark.

Goal: a unified engine across data sources, workloads, and environments
[Diagram: the Spark stack, layered into Environments, Workloads, and Data Sources]
Spark – 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators.
[Diagram: the Spark stack — Environments (e.g., YARN); Workloads (DataFrames API, Spark SQL, Spark Streaming, MLlib, GraphX, RDD API, Spark Core); Data Sources (e.g., {JSON})]

Apache Spark: Large user community
[Chart: commits in the past year — Spark (~4000) far ahead of Storm, HDFS, YARN, and MapReduce]

Large-Scale Usage
• Largest cluster: 8000 nodes
• Largest single job: 1 petabyte
• Top streaming intake: 1 TB/hour
• 2014 on-disk 100 TB sort record

On-Disk Sort Record: Time to sort 100 TB
• 2013 record: Hadoop — 2100 machines, 72 minutes
• 2014 record: Spark — 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org

Spark Physical Cluster
[Diagram: a Spark Driver JVM coordinating several Executor JVMs; each Executor exposes a number of task Slots]
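To make the physical-cluster diagram concrete, here is a minimal sketch (not from the slides) of how a Spark application asks the cluster manager for executors and slots. The application name and the specific values are illustrative assumptions; spark.executor.instances applies when running under a manager such as YARN.

    # Hypothetical illustration: sizing the driver/executor topology
    # shown in the "Spark Physical Cluster" diagram.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("coding-joyride")
            .set("spark.executor.instances", "4")   # 4 executor JVMs
            .set("spark.executor.cores", "2")       # 2 task slots per executor
            .set("spark.executor.memory", "4g"))    # heap per executor JVM

    # Creating the SparkContext starts the driver JVM, which schedules
    # tasks onto the executors' slots.
    sc = SparkContext(conf=conf)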
Spark Physical Cluster
[Diagram: the Driver JVM assigns Tasks to the open Slots on each Executor JVM]

Power Plant Demo
Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant.
Steps:
1. ETL
2. Explore + Visualize Data
3. Apply Machine Learning
Schema Definition:
• AT = Atmospheric Temperature (°C)
• V = Exhaust Vacuum Speed
• AP = Atmospheric Pressure
• RH = Relative Humidity
• PE = Power Output (the value we are trying to predict)

About Databricks
Data science made easy
• The Databricks team contributed more than 75% of the code added to Spark in the past year
• Cloud-based integrated workspace for Apache Spark
• From the original Spark team at UC Berkeley

About NewCircle
Software Development Training for the Enterprise
• Courses tailored for your team
• Custom learning pathways & training programs
• Global delivery
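A minimal sketch of what the demo's ETL and machine-learning steps might look like, assuming a modern (2.x+) SparkSession for brevity; the file path is a hypothetical stand-in, and only the column names follow the schema on the slide:

    # Steps 1 and 3 of the Power Plant demo, sketched with Spark ML.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("power-plant").getOrCreate()

    # 1. ETL: load the sensor readings (path/format are assumptions)
    df = spark.read.csv("/data/power_plant.csv", header=True, inferSchema=True)

    # 3. ML: assemble the sensor columns into a feature vector, then fit
    # a linear regression predicting PE (power output)
    assembler = VectorAssembler(inputCols=["AT", "V", "AP", "RH"],
                                outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="PE")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    model.transform(df).select("PE", "prediction").show(5)

Step 2 (Explore + Visualize) would typically happen interactively in a notebook before fitting the model.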
A few of our courses
• Spark Developer Bootcamp
• Android Internals
• Android Testing
• Core AngularJS
• Advanced Python
• Fast Track to Java 8
• Spring & Hibernate Bootcamp
• Apache HTTPD & Tomcat Administration Bootcamp
Learn more at: https://databricks.com/spark/training

Paul - Salesforce
"In all honesty, this is one of the best technical classes I've ever taken (and I've been doing this a very long time)."

Thanks!
• 30-day free trial of Databricks — visit: bit.ly/spark-bootcamp
• 15% off Spark Developer Bootcamp training — visit: https://newcircle.com/spark, promo code: QCON15
Thank you.
Spark Fundamentals
Professor Anthony D. Joseph, UC Berkeley
Strata NYC, September 2015
Materials: http://training.databricks.com/sparkcamp.zip

Transforming RDDs
[Diagram: logLinesRDD (the input/base RDD) holds a mix of Error, Warn, and Info log lines; applying .filter( f(x) ) produces a new errorsRDD containing only the Error lines]

Transformations and Actions
sc.textFile( "hdfs://log" ) creates the base RDD. The transformations .filter( f(x) ) and .coalesce( 2 ) lazily define errorsRDD and cleanedRDD. Calling the action .collect() executes the DAG and returns the results to the Driver.

Lifecycle
[Diagram: logLinesRDD → .filter( f(x) ) → errorsRDD → .coalesce( 2 ) → cleanedRDD → .collect() → Driver. Also shown: .saveAsTextFile() on cleanedRDD, and a further .filter( f(x) ) producing errorMsg1RDD, whose .count() returns 5 and whose .collect() ships the matching lines to the Driver]
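As a sketch, the log pipeline from these diagrams can be written in a few lines of PySpark; the predicate is an assumed stand-in for f(x), and the HDFS path is the placeholder from the slide:

    # The transformations/actions pipeline from the diagrams above.
    from pyspark import SparkContext

    sc = SparkContext(appName="log-pipeline")

    logLinesRDD = sc.textFile("hdfs://log")   # base RDD (lazy)
    errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
    cleanedRDD = errorsRDD.coalesce(2)        # shrink to 2 partitions

    # Nothing has executed yet; transformations are lazy. The action
    # below runs the whole DAG and returns the results to the driver.
    results = cleanedRDD.collect()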
Partition → Task
[Diagram: each partition of logLinesRDD (a HadoopRDD) is processed by its own task — Task-1 through Task-4 — and .filter( f(x) ) yields the corresponding partitions of errorsRDD (a filtered RDD)]

Lifecycle (continued)
[Diagram: an intermediate RDD is cached with .cache() and reused by several actions — .saveAsTextFile(), .count() returning 5, a further .filter( f(x) ) into errorMsg1RDD, and .collect()]

Lifecycle of a Spark Program
• Create input RDDs from external data
  …or parallelize a collection in your driver program
• Use transformations to lazily transform them and create new RDDs
  …using transformations like filter() or map()
• Ask Spark to cache() any intermediate RDDs that will be reused
• Execute actions to kick off a parallel computation
  …such as count() and collect()
  …optimized and executed by Spark
A short sketch of this lifecycle follows the closing slide.

End of Spark Fundamentals Module
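To recap the module, here is a minimal end-to-end sketch of the four lifecycle steps; the sample log lines and the predicate are made up for illustration:

    # The four lifecycle steps from the slide, in order.
    from pyspark import SparkContext

    sc = SparkContext(appName="lifecycle-demo")

    # 1. Create an input RDD -- here by parallelizing a driver-side collection
    logLinesRDD = sc.parallelize([
        "Error, ts, msg1", "Info, ts, msg8", "Error, ts, msg3",
        "Warn, ts, msg2", "Error, ts, msg1",
    ])

    # 2. Lazily transform it into a new RDD
    errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))

    # 3. Cache the intermediate RDD that several actions will reuse
    errorsRDD.cache()

    # 4. Actions kick off the parallel computation
    print(errorsRDD.count())     # 3 for this sample data
    print(errorsRDD.collect())   # ships the error lines back to the driver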