Vaex: Out-of-core DataFrames for Python
Maarten A. Breddels & Jovan Veljanoski
Article: A&A 618, A13 (2018) / arXiv: 1801.02638
PyParis, Nov 13, 2018
Maarten Breddels
• Ex-astronomer (working on software for big data and visualization: vaex)
• Now: freelancer / consultant / data scientist for Python / Jupyter
• Core Jupyter-Widgets developer
• Author of vaex and ipyvolume
I live on the internet at:
@maartenbreddels
maartenbreddels@gmail.com
github.com/maartenbreddels
www.maartenbreddels.com

Jovan Veljanoski
• Ex-astronomer (big influence on vaex)
• Data scientist at Xebia Labs
• vaex co-author
I live on the internet at:
@N147185
jovan.veljanoski@gmail.com
github.com/JovanVeljanoski
https://www.linkedin.com/in/jovanvel/
Agenda
• Why does vaex exist?
• What is vaex?
• Why is it so fast?
• Demos
• Summary
Motivation: Gaia
• > 1 billion stars
• Sky positions
• Distance
• Motions
• And many more
• Errors / correlations
• Latest data release:
  • 1.7 billion rows
  • 1.2 TB
  • 94 columns/features
[Figure: the same data shown as a scatter plot vs. a density plot]
How fast can it be done?
• 10^9 rows × 2 columns × 8 bytes ≈ 15 GiB (a double is 8 bytes)
• Memory bandwidth 10-50 GiB/s: ~1 second
• CPU: 3 GHz (but multicore, say 4-8 cores): 12-24 billion cycles/second
• A few cycles per row/object, so a simple algorithm fits the budget (see the sketch below)
• Histograms / density / statistics grids
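As a sanity check, here is the slide's arithmetic as a runnable sketch (the 10-50 GiB/s bandwidth range is the slide's assumption about typical desktop hardware, not a measurement):

```python
# Back-of-envelope estimate for streaming 1 billion (x, y) float64 pairs.
n_rows = 1_000_000_000
n_cols = 2
bytes_per_double = 8

data_size_gib = n_rows * n_cols * bytes_per_double / 2**30
print(f"data size: {data_size_gib:.1f} GiB")  # ~14.9 GiB

# Time to stream it once through memory at typical desktop bandwidths.
for bandwidth_gib_s in (10, 50):
    print(f"at {bandwidth_gib_s} GiB/s: {data_size_gib / bandwidth_gib_s:.2f} s")

# CPU budget: 3 GHz * 4-8 cores = 12-24 billion cycles/s, so for 10^9 rows
# per second you can spend only ~12-24 cycles per row.
```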
[Figure: statistics on grids of increasing dimension: 0d (a scalar over all 330,000 rows, mean: -0.083), 1d, 2d, and 3d grids]
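In plain numpy/scipy terms these grids are just binned statistics; a minimal sketch on synthetic data (the array names, sizes, and bin choices are made up for illustration):

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(42)
x, y, v = rng.normal(size=(3, 330_000))

# 0d "grid": a single scalar statistic over all rows.
print(len(x), v.mean())

# 1d grid: counts per bin, i.e. a histogram.
counts_1d, _ = np.histogram(x, bins=64, range=(-4, 4))

# 2d grid: the mean of v in each (x, y) bin.
mean_2d, _, _, _ = binned_statistic_2d(
    x, y, v, statistic="mean", bins=128, range=[(-4, 4), (-4, 4)])
```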
vaex
• ~1 second (the target from the previous slide)
• Python library (conda/pip installable)
• Pandas-like (familiar API)
• Out-of-core, expression system
• Apache Arrow / HDF5 + memory mapping
• Strong focus on statistics on N-d grids (count/mean/max/std/…)
• > 1 billion rows/sec on a desktop (quad-core, 3 GHz)
  • > 50x faster than scipy.stats.binned_statistic_2d
• Visualisation via matplotlib / bqplot / ipyvolume / ipyleaflet
• More:
  • Machine learning (boosted trees, k-means, PCA, …)
  • Distributed computing (> 10^10 rows)
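A minimal sketch of this workflow using vaex's bundled example dataset (the method names follow the documented vaex API; the limits and grid shapes are arbitrary choices for illustration):

```python
import vaex
import numpy as np

# Opens a memory-mapped dataset lazily; vaex.example() ships a small
# demo dataset if you have no HDF5/Arrow file at hand.
df = vaex.example()

# A virtual column: an expression evaluated lazily per chunk, no extra RAM.
df['r'] = np.sqrt(df.x**2 + df.y**2)

# Statistics on N-d grids: a 1d count (histogram) and a 2d mean grid.
counts = df.count(binby=df.r, limits=[0, 20], shape=64)
mean_E = df.mean(df.E, binby=[df.x, df.y],
                 limits=[[-10, 10], [-10, 10]], shape=(128, 128))
print(counts.shape, mean_E.shape)  # (64,) (128, 128)
```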
What kind of data?
“Never do a live demo” - many people
Demo notebooks at: https://github.com/maartenbreddels/talk-pyparis-2018
Takeaway
• Next-generation dataframe library (vaex?)
• Large datasets should be explored with statistics, not individual points
• Large datasets should be memory mapped: Apache Arrow / HDF5
• Should use expressions (see the sketch below):
  • No memory wasted
  • No information lost: JIT / derivatives
  • ML pipelines are a byproduct
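A hedged sketch of the expression points: a virtual column costs no memory, keeps its formula, and can be JIT-compiled and replayed on new data (jit_numba and state_get/state_set are vaex features; treat the exact names as assumptions about the vaex version and installed backends):

```python
import vaex
import numpy as np

df = vaex.example()

# An expression is stored, not materialized: no memory wasted, and the
# formula itself is never lost.
df['r'] = np.sqrt(df.x**2 + df.y**2)

# Because the expression survives as metadata, it can be JIT-compiled
# (assumption: the numba backend is installed).
df['r_fast'] = df.r.jit_numba()

# The whole chain of transformations can be transferred to new data,
# which is why ML pipelines fall out as a byproduct.
state = df.state_get()
df2 = vaex.example()
df2.state_set(state)  # df2 now has the same virtual columns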
• vaex
  • https://vaex.io
  • https://github.com/maartenbreddels/vaex
  • pip install --pre vaex
  • conda install -c conda-forge vaex
• Demo notebooks: https://github.com/maartenbreddels/talk-pyparis-2018
• maartenbreddels@gmail.com
• jovan.veljanoski@gmail.com