
Optimizing Spark Greg Novak

If you’ve thought about this at all, you won’t learn anything from me today.
If you haven’t thought about this, you’ll learn a few principles to organize your thinking.
Proprietary and confidential

Know what you want to measure
You don’t want to measure run times.
You want to measure the effective performance of some machine characteristic: network bandwidth, file access latency, or CPU operations per second.

You do this with carefully constructed data sets
To measure network bandwidth, construct two data sets with the same number of files (so file access latency is constant) and run the same operation on both (so CPU operations are constant), but force some extra data of variable size to come along for the ride (e.g. random 1-byte ints vs. random 8-byte ints).
Then take the difference of the run times.
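As a rough illustration of the subtraction described on this slide, here is a minimal Python sketch. The function name `effective_bandwidth` and all of the numbers are hypothetical and not from the talk; the sketch assumes the two runs differ only in the width of the values that ride along (1 byte vs. 8 bytes per value).

```python
def effective_bandwidth(extra_bytes: float, t_small_s: float, t_large_s: float) -> float:
    """Bytes per second implied by the run-time difference between two runs.

    Both runs are assumed to touch the same number of files and do the same
    CPU work, so the extra time is attributed to moving the extra bytes.
    """
    dt = t_large_s - t_small_s
    if dt <= 0:
        raise ValueError("the larger data set should take longer; check the measurements")
    return extra_bytes / dt


# Hypothetical measurements: one billion values, stored as 1-byte vs. 8-byte ints.
n_values = 1_000_000_000
extra_bytes = n_values * (8 - 1)  # 7 extra bytes per value
bandwidth = effective_bandwidth(extra_bytes, t_small_s=120.0, t_large_s=190.0)
print(f"{bandwidth / 1e6:.0f} MB/s")  # prints "100 MB/s"
```

The absolute run times (120 s and 190 s here) never appear in the result; only their difference does, which is why a fixed per-file or per-operation overhead cancels out.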

Case Study: Effective Network Bandwidth
Everything seemed to run slowly under Spark 2.0...
Latency and CPU performance looked fine.
But we got terrible network bandwidth from Spark 2.0.
Not necessarily intrinsic to Spark 2.0… it could have been some detail of our setup.
However, Spark 2.1 worked fine, so we just decommissioned our Spark 2.0 setup.

How do you know if you’re getting your money’s worth out of parallelization?

Run time vs. Number of Executors
Probably the first plot you draw… but it doesn’t really tell you what you want to know.

Overall Cost (in dollars if possible) vs. Executors
In a perfect world (linear speed-ups), cost is independent of parallelism.
In the real world, costs generally rise with parallelism.
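To make the two cases concrete, the sketch below uses a toy cost model. Everything in it is an assumption added for illustration and not part of the talk: executors are billed per executor-hour for the full wall time, and the departure from linear speed-up is modeled with an Amdahl-style serial fraction. The prices and run times are hypothetical.

```python
def walltime_hours(executors: int, single_executor_hours: float, serial_fraction: float) -> float:
    """Amdahl-style wall time: the serial part does not shrink with more executors."""
    parallel_fraction = 1.0 - serial_fraction
    return single_executor_hours * (serial_fraction + parallel_fraction / executors)


def cost_dollars(executors: int, single_executor_hours: float,
                 serial_fraction: float, price_per_executor_hour: float) -> float:
    """Assumes every executor is billed for the whole wall time of the job."""
    hours = walltime_hours(executors, single_executor_hours, serial_fraction)
    return executors * hours * price_per_executor_hour


# Hypothetical job: 10 hours on one executor, 0.50 dollars per executor-hour.
for n in (1, 2, 4, 8, 16):
    perfect = cost_dollars(n, 10.0, 0.0, 0.50)   # linear speed-up
    real = cost_dollars(n, 10.0, 0.05, 0.50)     # 5 percent of the work is serial
    print(f"{n:>2} executors: perfect ${perfect:.2f}, with serial part ${real:.2f}")
```

With a serial fraction of zero the cost column stays flat, matching the perfect-world case; any nonzero serial fraction makes the cost climb as executors are added.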

Benefit: 1/walltime = answers per hour
1 hour vs. 2 hours: probably not a big deal.
1 week vs. 2 weeks: probably is a big deal.
1 minute vs. 10 minutes: a huge deal. It is too easy to get distracted if your debug cycle is 10 minutes.

Once you are crisp on the costs and benefits, you will be in a position to say things like:
“If I double the amount of parallelism for this job, my AWS bill will rise by 30 percent and the job will run in 45 minutes instead of 60 minutes. Does that seem worth it to me?”
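The trade-off in that sentence can be written down in a few lines. This is only a sketch: the helper name `answers_per_hour` is made up here, the 30 percent bill increase is taken as given from the quote, and the benefit is measured as 1/walltime in the sense of the previous slide.

```python
def answers_per_hour(walltime_minutes: float) -> float:
    """Benefit measured as 1/walltime, expressed in runs per hour."""
    return 60.0 / walltime_minutes


# Numbers from the quoted example.
before_minutes = 60.0
after_minutes = 45.0
bill_increase = 0.30  # taken as given, not derived from a pricing model

throughput_gain = answers_per_hour(after_minutes) / answers_per_hour(before_minutes) - 1.0
minutes_saved = before_minutes - after_minutes

print(f"cost: +{bill_increase:.0%}, answers per hour: +{throughput_gain:.0%}, "
      f"time saved per run: {minutes_saved:.0f} min")
# prints "cost: +30%, answers per hour: +33%, time saved per run: 15 min"
```

Whether a 33 percent gain in answers per hour is worth a 30 percent larger bill remains a judgment call; the point of the calculation is only to put both sides of the question in front of you.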

Recap
Focus on measuring intrinsic machine characteristics, like network bandwidth, to characterize performance.
Use carefully constructed data sets that change one and only one thing to do it.
Be crisp on the costs (dollars) and benefits (essentially debug cycles per hour) of parallelism to make informed choices about whether you want more or less of it.
