Jupyter and Spark on Mesos: Best Practices, June 21st, 2017

Agenda
● About me
● What is Spark & Jupyter
● Demo
● How Spark+Mesos+Jupyter work together
● Experience
● Q & A

About me
● Graduated from EE @ Tsinghua Univ.
● Infrastructure Engineer @ Scrapinghub
● Contributor @ Apache Mesos & Apache Spark

Apache Spark
● Fast, general-purpose cluster computing system
● Provides high-level APIs in Java/Scala/Python/R
● Integration with the Hadoop ecosystem

Why Spark
● Expressive API for distributed computing
● Supports both streaming and batch
● Low-level API (RDD) and high-level DataFrame/SQL
● First-class Python/R/Java/Scala support
● Rich integration with external data sources: JDBC, HBase, Cassandra, etc.

Jupyter Notebook Server
● IPython shell running in the web browser
● Not only code, but also markdown and charts
● Interactive
● Ideal for demos and quick experiments
http://jupyter-notebook.readthedocs.io/en/latest/notebook.html

Jupyter Notebook Server

Recap
Prev:
● Introduction to Spark
● Introduction to Jupyter Notebook Server
Next:
● Why Spark on Mesos
● Why Spark+Mesos+Jupyter

Why Spark on Mesos

Why Spark on Mesos
● Run Spark drivers and executors in Docker containers (avoids Python dependency hell)
● Run any version of Spark
● Make use of our existing Mesos cluster
● Reuse the monitoring system built for Mesos

Why Spark + Jupyter Notebook
● Run on a local computer
  ○ Not enough storage capacity for large datasets
  ○ Not enough compute power to process them
● Run on the company cluster
  ○ Takes too long to set up
  ○ Hard to debug (only through logs)

Why Spark + Jupyter Notebook
● Run in a notebook
  ○ No setup needed: just one click
  ○ Easy to debug
  ○ Full access to the cluster's compute power

Recap
Prev:
● Why Spark on Mesos
● Why Spark+Mesos+Jupyter
Next:
● Demo

DEMO

Recap
Prev:
● Demo
Next:
● How Spark and Mesos work together
● Experience & Caveats

Mesos & Spark: Mesos Architecture

Mesos & Spark: Spark Architecture

Mesos & Spark
● A Spark app/driver = a Mesos framework
● Spark executors = Mesos tasks

Mesos & Spark: Experience
● Single cluster
● Marathon for long-running services
● Constraints to pin Spark tasks to certain nodes

Experience - Single Cluster

Experience - Single Cluster
● Pros & Cons

Experience: Dynamic Allocation is a must
● People tend to leave their Spark executors running even after they finish work for the day
● No resources are available for newly launched Spark apps, even though the cluster is doing no work
● Enable dynamic allocation: idle Spark executors are terminated after a while

Spark Dynamic Allocation
● Spark executors are:
  ○ Killed after being idle for a while
  ○ Launched later when there are tasks waiting in the queue
● Requires a long-running "Spark external shuffle service" on each Mesos node
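
The two rules above can be sketched as a toy scheduling step. This is a simplified model for illustration, not Spark's actual allocation algorithm: the function name `plan`, the 60-second timeout, the cap of 4 executors, and the "one new executor per waiting task" rule are all assumptions made for this sketch.

```python
def plan(executors_idle_for, pending_tasks, idle_timeout=60.0, max_executors=4):
    """One round of a toy dynamic-allocation policy.

    executors_idle_for: seconds each running executor has been idle (0 = busy).
    Returns (survivors, to_launch): the idle times of the executors that keep
    running, and how many new executors to request.
    """
    if pending_tasks > 0:
        # Tasks are waiting in the queue: keep every executor and ask for
        # more, up to the cap (hypothetical rule: one executor per task).
        room = max_executors - len(executors_idle_for)
        return list(executors_idle_for), max(0, min(pending_tasks, room))
    # Nothing is waiting: kill executors that have been idle too long.
    survivors = [t for t in executors_idle_for if t < idle_timeout]
    return survivors, 0


if __name__ == "__main__":
    # Three executors, one idle for two minutes, no backlog.
    print(plan([0, 120, 30], pending_tasks=0))
    # Two busy executors and five waiting tasks.
    print(plan([0, 0], pending_tasks=5))
```

In real Spark the behaviour is driven by the `spark.dynamicAllocation.*` settings; killed executors can go away safely because the external shuffle service keeps serving the shuffle files they wrote.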

Spark Dynamic Allocation - External Shuffle Service

Experience: Batteries-included Docker base image
● Basics:
  ○ libmesos
  ○ Java 8
● Libs:
  ○ Python 2 & Python 3 & libraries
  ○ Hadoop jars for AWS, Kafka jars
● Configuration:
  ○ Resource spec (CPU/RAM) for Spark driver/executors
  ○ Dynamic allocation
  ○ Constraints: pin Spark executors to colocate with HDFS DataNodes
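
As a rough sketch, the configuration items above could be baked into the image as a `spark-defaults.conf` file. The property keys below are standard Spark settings, but the image name, the `hdfs:datanode` agent attribute, and the resource sizes are placeholders invented for this example, not the values used in the talk.

```python
def spark_defaults(image, constraints, executor_memory="4g",
                   executor_cores=2, driver_memory="2g"):
    """Render spark-defaults.conf text: one "key value" pair per line."""
    conf = {
        # Docker image that Mesos uses to run the Spark executors.
        "spark.mesos.executor.docker.image": image,
        # Mesos agent attributes, written as "attr:value;attr2:value2".
        "spark.mesos.constraints": ";".join(
            f"{key}:{value}" for key, value in sorted(constraints.items())
        ),
        # Resource spec for driver and executors.
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
        "spark.driver.memory": driver_memory,
        # Dynamic allocation needs the external shuffle service enabled.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
    }
    return "\n".join(f"{key} {value}" for key, value in conf.items()) + "\n"


if __name__ == "__main__":
    # Placeholder image name and attribute, for illustration only.
    print(spark_defaults("example/spark-base:latest", {"hdfs": "datanode"}))
```

With defaults like these shipped in the base image, a notebook user only needs to create a Spark session and can override individual settings when a job needs more resources.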

Experience: Batteries-included Docker base image

Experience: Save Jupyter notebooks in a database
● Jupyter does not support saving notebooks in databases out of the box
● But it provides a pluggable storage backend API
● pgcontents: Postgres backend, open sourced by Quantopian
● We ported it to support MySQL (straightforward thanks to SQLAlchemy)
https://github.com/quantopian/pgcontents
https://github.com/scrapinghub/pgcontents/tree/mysql
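
To illustrate the idea only: a storage backend keeps each notebook's JSON in a table row instead of a file on disk. The sketch below uses the standard-library `sqlite3` module with an invented `NotebookStore` class and table layout; it is not the pgcontents code or the Jupyter contents API, which have more features (directories, checkpoints, per-user storage).

```python
import json
import sqlite3


class NotebookStore:
    """Toy notebook store: one row per notebook, keyed by its path."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS notebooks ("
            "path TEXT PRIMARY KEY, content TEXT NOT NULL)"
        )

    def save(self, path, notebook):
        # Serialize the notebook dict to JSON; overwrite any earlier version.
        self.conn.execute(
            "INSERT OR REPLACE INTO notebooks (path, content) VALUES (?, ?)",
            (path, json.dumps(notebook)),
        )
        self.conn.commit()

    def get(self, path):
        row = self.conn.execute(
            "SELECT content FROM notebooks WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            raise KeyError(path)
        return json.loads(row[0])


if __name__ == "__main__":
    store = NotebookStore(sqlite3.connect(":memory:"))
    store.save("demo.ipynb", {"cells": [], "nbformat": 4})
    print(store.get("demo.ipynb"))
```

Because notebooks live in the database rather than on the notebook server's local disk, the server container can be restarted or rescheduled on another node without losing them.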

Recap
Prev:
● How Spark and Mesos work together
● Experience & Caveats
  ○ Role & Constraints
  ○ Dynamic Allocation is a must
Next:
● Looking into the future
● Q & A

Looking into the Future
● Resource isolation between notebooks
● Python environment isolation between notebooks

Spark JobServer
● Learning Spark + Python is a bit too much for people in roles like sales and QA
● But almost everyone knows SQL
● So why not just provide a web UI to execute Spark SQL?

Spark JobServer
● Much like AWS Athena, but tailored to our own use cases
