Apache Arrow & TDataFrame
Giulio Eulisse (CERN)
22 Mar 2018

Apache Arrow: the project
“Apache Arrow is a cross-language development platform for in-memory data”
Well established. Top-Level Apache project backed by key developers of a number of open-source projects: Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm.
Very active. 119 contributors, https://github.com/apache/arrow
Why is ALICE looking into it?
• Zero-copy buffer adoption (well suited for our shared-memory-backed message passing).
• Interoperability with other tools (e.g. Pandas, Spark, TensorFlow, ...).
• ROOT interoperability is of course a keystone.

Apache Arrow: technical details
In-memory column-oriented storage. Full description: https://arrow.apache.org/docs/memory_layout.html.
Data is organized in Tables. Tables are made of Columns. Columns are (<metadata>, Array). An Array is backed by one or multiple Buffers.
Nullable fields. An extra bitmap can optionally be provided to tell if a given slot in a column is occupied.
Nested types. Usual basic types (int, float, ...). It’s also possible (via the usual record shredding presented in Google’s Dremel paper) to support nested types. E.g. a String is a List<Char>.
No polymorphism. The type in an array can be nested, but there is no polymorphism available (it can be faked via nullable fields).
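The layout ideas above (a validity bitmap next to a value buffer, and a String stored as a List<Char>) can be sketched with standard C++ containers. This is only an illustration of the memory layout, not the arrow::Array API: the names NullableInt32Column and StringColumn are hypothetical, and plain std::vector stands in for Arrow Buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a nullable column of int32 values.
// It only mimics the layout idea; it is NOT the arrow::Array API.
struct NullableInt32Column {
  std::vector<std::uint8_t> validity;  // one bit per slot, least-significant bit first
  std::vector<std::int32_t> values;    // contiguous value buffer, one entry per slot

  void Append(std::int32_t v) { Push(v, true); }
  void AppendNull() { Push(0, false); }  // the slot exists, but its bit stays 0

  bool IsValid(std::size_t i) const {
    assert(i < values.size());
    return ((validity[i / 8] >> (i % 8)) & 1u) != 0;
  }

 private:
  void Push(std::int32_t v, bool valid) {
    const std::size_t i = values.size();
    if (i % 8 == 0) validity.push_back(0);  // start a new bitmap byte
    if (valid) validity[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    values.push_back(v);
  }
};

// Hypothetical stand-in for a String column stored as List<Char>:
// one character buffer plus an offsets buffer with (length + 1) entries.
struct StringColumn {
  std::vector<std::int32_t> offsets{0};  // offsets[i]..offsets[i+1] delimit string i
  std::vector<char> chars;               // all characters, stored back to back

  void Append(const std::string& s) {
    chars.insert(chars.end(), s.begin(), s.end());
    offsets.push_back(static_cast<std::int32_t>(chars.size()));
  }

  std::size_t Size() const { return offsets.size() - 1; }

  std::string Get(std::size_t i) const {
    assert(i < Size());
    return std::string(chars.begin() + offsets[i], chars.begin() + offsets[i + 1]);
  }
};
```

Appending 7, a null, and 9 to the first column leaves three slots in the value buffer and a single bitmap byte with bits 0 and 2 set; the string column never stores per-string objects, only offsets into one shared character buffer.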

Integrating Arrow and TDataFrame: TArrowDS
TArrowDS. Arrow is a perfect match as a TDataFrame backend, so I wrote TArrowDS, mimicking TCsvDS.
Initial development. Quickly done thanks to the hints from Danilo and Enrico. https://github.com/root-project/root/pull/1712. Roughly 3 full days of development.
Possible extensions. A lazy version of TArrowDS is probably a good idea and I will probably write one at some point.

TDataSource: Feedback
Overall, easy-to-use API. My comments are really on how to improve things further, not complaints. I am already really pleased with the current API.
Bulk API. The current API is entry by entry; however, very often a source can guarantee that entries are at least partially contiguous, so it would be nice to have a bulk API:
    auto [ptr, minEntry, maxEntry, size, stride] =
        SetEntryBulk(slot, attemptedMin, attemptedMax);
Non-rectangular sources. Right now all the columns need to have the same set of entries, and this is imposed at construction time. IMHO, it would be nice to have such a check imposed when performing an action on given columns, not at data source construction time.
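A minimal sketch of how such a bulk call could behave, assuming the source stores a column as several contiguous chunks. Everything here is hypothetical: the BulkRange struct, the ChunkedSource class, and the exact meaning of each returned field are guesses made for illustration, not the TDataSource API or the proposal's actual semantics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical result of a bulk request; field names follow the slide.
struct BulkRange {
  const double* ptr;       // first value of the granted range
  std::uint64_t minEntry;  // first entry granted (inclusive)
  std::uint64_t maxEntry;  // end of the granted range (exclusive)
  std::size_t size;        // number of entries granted
  std::size_t stride;      // distance between consecutive values, in elements
};

// Toy data source whose single column is stored as several contiguous chunks.
class ChunkedSource {
 public:
  explicit ChunkedSource(std::vector<std::vector<double>> chunks)
      : chunks_(std::move(chunks)) {}

  // Grants the largest contiguous part of [attemptedMin, attemptedMax)
  // that starts at attemptedMin. The slot argument is unused in this sketch.
  BulkRange SetEntryBulk(unsigned /*slot*/, std::uint64_t attemptedMin,
                         std::uint64_t attemptedMax) const {
    assert(attemptedMin <= attemptedMax);
    std::uint64_t first = 0;  // global index of the first entry of the chunk
    for (const auto& chunk : chunks_) {
      const std::uint64_t end = first + chunk.size();
      if (attemptedMin < end) {
        const std::uint64_t grantedMax = std::min(attemptedMax, end);
        return {chunk.data() + (attemptedMin - first), attemptedMin, grantedMax,
                static_cast<std::size_t>(grantedMax - attemptedMin), 1};
      }
      first = end;
    }
    return {nullptr, attemptedMin, attemptedMin, 0, 1};  // past the last entry
  }

 private:
  std::vector<std::vector<double>> chunks_;
};

// Sums entries [0, n) by repeatedly asking for as many entries as possible.
double SumAll(const ChunkedSource& source, std::uint64_t n) {
  double sum = 0;
  std::uint64_t entry = 0;
  while (entry < n) {
    auto [ptr, minEntry, maxEntry, size, stride] = source.SetEntryBulk(0, entry, n);
    if (size == 0) break;  // nothing more to read
    for (std::size_t i = 0; i < size; ++i) sum += ptr[i * stride];
    entry = maxEntry;  // continue right after the granted range
  }
  return sum;
}
```

The caller asks for the whole remaining range and accepts whatever contiguous part is granted, so the inner loop runs over a plain pointer instead of going through one call per entry.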

TDataFrame: Feedback
TDataSink. For my use case, I need to pass results downstream to a different device via a memory-mapped region. I can of course have a Foreach() to copy them back to an arrow::Array. It would be nice to be able to write directly to an arrow::Array, or at least to get a pointer I can adopt in an arrow::Buffer.
Support for dynamic partitioning. Given a collection of N objects (say, tracks), spanning M events, I want to be able to say:
• Group tracks by event.
• Apply a reduction function on each group of tracks.
• Store results in a separate column.
Support for emitting more than one result. Given a collection of N objects, I want to be able to emit more than one result, which will end up as consecutive entries in a separate column.
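The two requests above, grouping by event with a per-group reduction and emitting several results per object, can be sketched in plain C++. This is not TDataFrame code: the Track struct, its fields, and the helper names ReduceByEvent, FlatMap and SumPt are hypothetical, and the sketch assumes that the tracks of one event are stored consecutively.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical track record used only for this sketch.
struct Track {
  int eventId;  // event the track belongs to
  double pt;    // some per-track quantity
};

// Groups tracks by event (assuming the tracks of one event are stored
// consecutively) and applies `reduce` to each group: one result per event.
template <typename Reduce>
std::vector<double> ReduceByEvent(const std::vector<Track>& tracks, Reduce reduce) {
  std::vector<double> results;
  std::size_t begin = 0;
  while (begin < tracks.size()) {
    std::size_t end = begin;
    while (end < tracks.size() && tracks[end].eventId == tracks[begin].eventId) ++end;
    results.push_back(reduce(tracks.data() + begin, tracks.data() + end));
    begin = end;  // the next group starts where this one stopped
  }
  return results;
}

// Lets `emit` return zero or more values per track; all values end up
// as consecutive entries of a single output column.
template <typename Emit>
std::vector<double> FlatMap(const std::vector<Track>& tracks, Emit emit) {
  std::vector<double> column;
  for (const Track& t : tracks)
    for (double v : emit(t)) column.push_back(v);
  return column;
}

// Example reduction: sum of pt over the half-open range [first, last).
inline double SumPt(const Track* first, const Track* last) {
  assert(first <= last);
  double sum = 0;
  for (; first != last; ++first) sum += first->pt;
  return sum;
}
```

Calling ReduceByEvent(tracks, SumPt) yields one entry per event, while FlatMap with a function returning two values per track yields a column twice as long as the input; in both cases the output has a different number of entries than the input, which is the point of the request.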

TDataFrame: Feedback
Support for combinations. Given a collection of N objects of type T and M objects of type V, I want to be invoked for all the NxM combinations and emit f(n,m).
Support for joins. Given a collection of associations between elements of a column N and elements of a column M, for each one of the associations calculate f(n,m).
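Both requests can be sketched as small free functions over two in-memory columns. The names Combine and Join are hypothetical and this is not an existing TDataFrame interface; the sketch also assumes that an association is a pair of indices, one into each column.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Calls f(n, m) for every one of the N x M pairs and collects the results,
// with the index into `ms` running fastest.
template <typename T, typename V, typename F>
auto Combine(const std::vector<T>& ns, const std::vector<V>& ms, F f) {
  std::vector<std::decay_t<decltype(f(ns[0], ms[0]))>> out;
  out.reserve(ns.size() * ms.size());
  for (const T& n : ns)
    for (const V& m : ms) out.push_back(f(n, m));
  return out;
}

// Calls f(n, m) only for the associated pairs. Each association is assumed
// to be (index into ns, index into ms).
template <typename T, typename V, typename F>
auto Join(const std::vector<T>& ns, const std::vector<V>& ms,
          const std::vector<std::pair<std::size_t, std::size_t>>& assoc, F f) {
  std::vector<std::decay_t<decltype(f(ns[0], ms[0]))>> out;
  out.reserve(assoc.size());
  for (const auto& [i, j] : assoc) {
    assert(i < ns.size() && j < ms.size());
    out.push_back(f(ns[i], ms[j]));
  }
  return out;
}
```

Combine produces N*M results regardless of any relation between the columns, whereas Join produces exactly one result per association, so the association list decides both the size and the order of the output.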

