Efficient Data Management and Statistics with Zero-Copy Integration
Jonathan Lajus & Hannes Mühleisen, SSDBM 2014, 2014-06-30
[Diagram: the research pipeline. Collect data → Load data → Filter, transform & aggregate data (Data Management System) → Analyze & Plot (Statistical Toolkit) → Publish paper / Profit. The bottleneck sits at the hand-over between the data management system and the statistical toolkit ("Bottleneck, thanks David!").]
Data Transfer Options
• "Socket-style": JDBC, ODBC, DBI, …
  • Serialization, copy, copy, copy, deserialization
• In-process embedding
  • No sockets (hopefully), but still conversion
• Shared memory
• No transfer at all, by extending the DB or the stats tool
Zero-Copy
• In the end, C arrays of native types are everywhere
• Hardware does not like Java objects (usually)
• Hmm…
• In-process data sharing by passing pointers
• If both systems are based on C arrays, we could get away with metadata management alone
[Diagram: zero-copy hand-over in three steps.
(1) The statistics tool sends a query (SELECT …) to the database.
(2) The database materializes the query result at some address, e.g. 0x10000000.
(3) Instead of copying, the database hands the pointer 0x10000000 back to the statistics tool.]
Challenges
• Compilation & linking: symbol name clashes are likely
• Read/write synchronization
• Memory management (who calls free()?)
• NA/NULL value encoding
• Complex objects
Proof-of-Concept: MonetDB (filter, transform & aggregate) embedded with R (analyze & plot)
https://github.com/lajus/monetinr
[Diagram: memory layouts side by side. An R vector is a SEXP header followed by a C array of values (… 42 43 44). A MonetDB BAT is a descriptor with references to head and tail columns: the head is a C array (… 0 1 2) and the tail a C array of values (… 42 43 44). In both systems, the payload is a plain C array.]
[Diagram: "dressing up" a MonetDB BAT as an R vector. A SEXP header is placed in front of the BAT's tail array (… 42 43 44), so R sees the database column as a native vector. Plus garbage-collection fun.]
Experiments
• data.table: high-performance R data access
• MonetDB.R: DBI/socket-based DB access
  • Equivalent systems, different connection
• RSQLite: embedded SQL database
  • Still needs conversion
Predictions
• Dedicated data management systems should be better at data management than pimped stats tools
• In-process integration should make a big difference once result sets get large
• Column-store performance gains should be visible
Setup
• Typical data management tasks: selection & projection, aggregation, joins
• 10 MB, 100 MB, 1 GB, 10 GB datasets
• Desktop-class machine: 16 GB RAM, 3.4 GHz i7, Fedora Linux
[Figure: selection & projection benchmark. Panels: 1%, 10% and 50% of rows selected; x-axis: dataset size (10 MB to 10 GB, log); y-axis: execution time (10 ms to 10 min, log); systems: data.table, RSQLite, MonetDB.R, prototype. Takeaway: the prototype beats sockets & stats extensions.]
[Figure: aggregation benchmark. Panels: 1 group, 500 groups, 10% of rows as groups; x-axis: dataset size (10 MB to 10 GB, log); y-axis: execution time (log); systems: data.table, RSQLite, MonetDB.R, prototype. Takeaway: for complex queries with small result sets, zero-copy is not worth it.]
[Figure: join benchmark. Panels: 1% and 10% join partner size; x-axis: dataset size (10 MB to 10 GB, log); y-axis: execution time (log); systems: data.table, RSQLite, MonetDB.R, prototype.]
Conclusions
• Zero-copy is possible
• Vast performance benefits
• But:
  • Read/write access?
  • Iterative processes?
  • Optimization?
Special Thanks • Thomas Lumley (R) • Sjoerd Mullender (MonetDB)
R UDFs in MonetDB

  CREATE FUNCTION kmeans(data FLOAT, ncluster INTEGER)
  RETURNS INTEGER
  LANGUAGE R {
      kmeans(data, ncluster)$cluster
  };

Watch the next MonetDB release…
[Figure: quantile(c(.05, .95)) benchmark, time (s, 0 to 40) vs. number of rows (1 K to 100 M); series: sys, sqltime, dumbtime, udftime, vdumbtime, plrtime. R in MonetDB stays far below PL/R as row counts grow.]
Thank You! Questions?