Repartition – repartition() to numPartitions or by columns increases parallelism but will shuffle; coalesce() combines partitions in place (no full shuffle).
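A minimal sketch, assuming a SparkSession named spark and a hypothetical input path and column name:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
val df = spark.read.parquet("/data/events")      // hypothetical input

val byCount  = df.repartition(200)               // increase parallelism; triggers a full shuffle
val byColumn = df.repartition(col("user_id"))    // co-locate rows with the same key; also shuffles
val fewer    = byColumn.coalesce(10)             // merge partitions in place; avoids a full shuffle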
Cache – cache() or persist(). Evicts least-recently-used (LRU) partitions, so make sure there is enough memory! Use MEMORY_AND_DISK to avoid expensive recompute (but spill to disk is slow).
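A minimal caching sketch, assuming the spark session above and a hypothetical reused dataset:

import org.apache.spark.storage.StorageLevel

val lookup = spark.read.parquet("/data/lookup")  // hypothetical dataset reused across jobs
lookup.persist(StorageLevel.MEMORY_AND_DISK)     // evicted partitions spill to disk instead of being recomputed
lookup.count()                                   // first action materializes the cache
// ... several jobs reuse `lookup` here ...
lookup.unpersist()                               // release memory when done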
Streaming – use Structured Streaming (2.1+). If not, and you have reliable messaging (Kafka), use the Direct DStream.
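A minimal Structured Streaming sketch reading from Kafka, assuming the spark session above; the broker, topic, and paths are placeholders:

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()

val query = kafkaStream.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/data/out")                   // hypothetical sink
  .option("checkpointLocation", "/chk/events")   // offsets and state are recovered from here
  .start()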
Metadata checkpointing – configuration, position in the streaming source (aka offset; could get duplicates, i.e. at-least-once), and pending batches.
Data checkpointing – persist stateful transformations (data is lost if not saved) and cut short lineage that could otherwise grow indefinitely.
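A minimal DStream checkpointing sketch with a stateful transformation; the checkpoint directory and source are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///chk/wordcount"      // reliable distributed FS, hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)                  // enables metadata and state checkpointing
  val lines = ssc.socketTextStream("somehost", 9999)   // hypothetical source
  val counts = lines.flatMap(_.split(" ")).map((_, 1))
    .updateStateByKey[Int]((values, state) => Some(values.sum + state.getOrElse(0)))
  counts.print()
  ssc
}

// Recover from the checkpoint if present, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()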
Direct DStream – checkpoint also stores the offset. Turn off auto-commit and commit only when in a good state, for exactly-once.
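A sketch of the Direct DStream pattern with manual offset commits (spark-streaming-kafka-0-10), reusing the StreamingContext ssc from the sketch above; brokers, topic, and group id are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-notes",
  "enable.auto.commit" -> (false: java.lang.Boolean)   // commit manually, only when in a good state
)

val directStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process and write results idempotently here ...
  // commit offsets back to Kafka only after the output has succeeded
  directStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}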
Checkpointing (Streaming/ML/GraphX/SQL) – more efficient for indefinite/iterative jobs and enables recovery. Generally not versioning-safe across Spark releases. Use a reliable distributed file system (caution with "object stores").
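For iterative RDD jobs, checkpointing to a reliable distributed file system truncates the ever-growing lineage; a minimal sketch with placeholder paths and a stand-in update step:

val sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///chk/iterative")     // reliable distributed FS, not a local path

var ranks = sc.textFile("hdfs:///data/links").map(line => (line, 1.0))   // hypothetical input
for (i <- 1 to 100) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)       // stand-in for an iterative update
  if (i % 10 == 0) {
    ranks.checkpoint()                           // write to the checkpoint dir and cut the lineage
    ranks.count()                                // an action forces the checkpoint to materialize
  }
}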
[Architecture diagram: hourly batch pipeline – WebLog and external data sources land in HDFS/Hive (Hive Metastore) via Hadoop, queried through Spark SQL by the FrontEnd and BI tools]
[Architecture diagram: near-real-time ML with Spark – FrontEnd to Kafka to Spark Streaming (end-to-end roundtrip: 8-20 sec), with offline analysis on HDFS]
[Architecture diagram: BI tools over SQL – Spark SQL and Hive alongside an RDBMS / SQL appliance]
Recommendations
More recommendations