IPython Notebook as a Unified Data Science Interface for Hadoop

Casey Stella (Hortonworks)
Spring, 2015

Table of Contents
• Preliminaries
• Data Science in Hadoop
• Unified Environment
• Demo
• Questions

Introduction
• I’m a Principal Architect at Hortonworks
• I work primarily doing Data Science in the Hadoop Ecosystem
• Prior to this, I’ve spent my time and had a lot of fun
  ◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem
  ◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce
  ◦ Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory

Data Science in Hadoop
Hadoop is a great environment for data transformation, but as a data science environment it poses challenges.
• It is hard to build a single system in which both data transformation and data science algorithms can be expressed naturally.
• The popular languages of data science, the ones with mature external libraries, are largely not JVM languages.
• Systems for presenting the output of data science and analysis (summary analysis and visualizations) are often either limited in capability or require extensive custom coding.
A unified environment for data science is elusive, but we do have a great start with the Python bindings of Spark and IPython Notebook.

Unified Data Science Environment
What are the components of a unified data science environment?
• A single environment supporting mixed-mode local and distributed processing: Apache Spark
• The ability to “reach out” to languages with heavy data science algorithm support: PySpark
• Strong, seamless SQL integration: Spark SQL
• The ability to visualize and report summary data: IPython Notebook

Apache Spark
Apache Spark is an alternative computing system that can run on YARN and provides:
• An elegant, rich and usable Core API
• An expansive set of ecosystem libraries built around the Core API
• Hive compatibility via Spark SQL
• Mature Python support for the Core API as well as the Spark ecosystem projects

Spark: Core Ideas
The Core API lets you express algorithms as transformations of distributed datasets.
• Datasets are distributed and resilient, hence the name RDDs (Resilient Distributed Datasets)
• Datasets are automatically rebuilt on failure
• Datasets have configurable persistence
• Transformations are parallel (e.g. map, reduceByKey, filter)
• Transformations support some relational primitives (e.g. join, cartesian product)
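To make the transformation style concrete, here is a minimal local sketch of what filter, map and reduceByKey compute. It is plain Python with no Spark involved: `reduce_by_key` is a hypothetical helper that only mimics the semantics of `reduceByKey`, and the records are made-up toy values, not real data. The comments name the PySpark RDD call each step stands in for.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Local stand-in (hypothetical helper) for rdd.reduceByKey(func).

    Groups (key, value) pairs by key and folds each group's values with func.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Toy (payer, amount) records, invented for illustration only.
records = [("acme", 120.0), ("globex", 15.5), ("acme", 30.0), ("initech", 0.0)]

# Stands in for: rdd.filter(lambda r: r[1] > 0)
nonzero = [r for r in records if r[1] > 0]

# Stands in for: rdd.map(lambda r: (r[0].upper(), r[1]))
keyed = [(name.upper(), amount) for name, amount in nonzero]

# Stands in for: rdd.reduceByKey(lambda a, b: a + b)
totals = reduce_by_key(keyed, lambda a, b: a + b)

print(totals)
```

On a real cluster each step would run in parallel across partitions, and the recorded chain of transformations is what allows a lost partition to be rebuilt.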

PySpark: Python Bindings
In addition to Java and Scala, Spark has solid integration with Python:
• Supports the standard CPython interpreter
• There is Python support for the Spark core APIs and most ecosystem APIs, such as MLlib
• IPython Notebook support comes out of the box

Spark: SQL Integration
The Spark component that lets you query structured data in Spark using SQL is called Spark SQL.
• Has integrated APIs in Python, Scala and Java
• Allows you to integrate Spark Core APIs with SQL
• Provides Hive metastore integration so that data managed in Hive can be seamlessly processed via Spark
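The idea of mixing SQL with ordinary program logic can be sketched locally. The example below is not Spark SQL: it swaps in Python's built-in `sqlite3` module as a stand-in so it runs anywhere, and the `payments` table, its columns and its rows are hypothetical toy values. It only shows the pattern of doing the relational step in SQL and then continuing in Python on the result.

```python
import sqlite3

# In-memory database used purely as a local stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payer TEXT, kind TEXT, amount REAL)")

# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("acme", "meal", 20.0),
        ("acme", "travel", 300.0),
        ("globex", "meal", 12.5),
        ("globex", "gift", 50.0),
    ],
)

# SQL step: aggregate per payer.
rows = conn.execute(
    "SELECT payer, SUM(amount) FROM payments GROUP BY payer ORDER BY payer"
).fetchall()

# Python step: keep working on the query result with ordinary code.
grand_total = sum(total for _, total in rows)
shares = {payer: total / grand_total for payer, total in rows}
top_payer = max(rows, key=lambda row: row[1])[0]

print(rows)
print(top_payer)
conn.close()
```

In Spark the same pattern applies at cluster scale: the result of a query is itself a distributed dataset that the core transformations can continue to process.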

Open Payments Data
Sometimes, doctors and hospitals have financial relationships with health care manufacturing companies. These relationships can include money for research activities, gifts, speaking fees, meals, or travel. The Social Security Act requires CMS to collect information from applicable manufacturers and group purchasing organizations (GPOs) in order to report information about their financial relationships with physicians and hospitals.
Let’s use Python and Spark via IPython Notebook to explore this dataset on Hadoop.
