Data Access for Data Science
April 17, 2018
Jacques Nadeau

  1. Data Access for Data Science April 17, 2018

  2. Jacques Nadeau
     Co-Founder & CTO, Dremio
     PMC Chair, Apache Arrow
     PMC, Apache Calcite

  3. Agenda
     • Apache Arrow
     • Using Dremio for Self-Service Data Access
     • Data Access Example (notebook + Dremio)
     • Reflections & Caching Overview
     • Caching Impact Example

  4. Getting Data Ready for Analysis Is Hard
     • Data can be hard to find
     • Many modern data systems have poor quality interfaces
     • Data is rarely in a single system
     • Data access is frequently slow
     • Some types of issues can only be solved by IT tickets
     • Doing late-stage data curation makes reproduction and collaboration difficult: “do I copy and edit?”
     There should be a new, self-service data access tier.

  5. Apache Arrow

  6. Apache Arrow
     • Standard for columnar in-memory processing and transport
     • Focused on columnar in-memory analytics:
       1. 10-100x speedup on many workloads
       2. Common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data
     • Consensus driven: developed by contributors leading 13+ key OSS projects

  7. Arrow: Fast Exchange, Fast Processing
     Focus on CPU and GPU efficiency:
     • Cache locality
     • Super-scalar and vectorized operation
     • Minimal structure overhead
     • Constant-time value access
     High-performance sharing & interchange (Arrow memory vs. traditional memory):
     • Zero-overhead encoding
     • Scatter/gather optimized
     • Direct memory definition
     • Designed for RDMA and shared-memory access

  8. Arrow Components
     • Core Libraries
     • Building Blocks
     • Major Integrations

  9. Arrow: Core Libraries
     • Java Library
     • C++ Library
     • Python Library
     • C Library
     • Ruby Library
     • JavaScript Library
     • Rust Library

  10. Arrow Building Blocks (in project)
     • Plasma: shared-memory caching layer, originally created in Ray
     • Arrow RPC*: RPC/IPC interchange library (active development)
     • Arrow Kernels*: common data manipulation components
     • Feather: fast ephemeral format for movement of data between R/Python
     (* coming soon)

  11. Arrow Integrations
     • Dremio: OSS project; the Sabot engine executes entirely on Arrow memory
     • Pandas: move seamlessly to/from Arrow as a means for communication, serialization, and fast processing
     • Parquet: read and write Parquet quickly to/from Arrow; the C++ Parquet library builds directly on Arrow
     • Spark: supports conversion to Pandas via Arrow, constructed using the Arrow Java library
     • GOAI (GPU Open Analytics Initiative): leverages Arrow as its internal representation (including libgdf and the GPU dataframe)

  12. Apache Arrow Adoption
     Arrow downloads have increased 44x since April, currently ~100K per month.
     (chart: monthly PyPI downloads, ~40% of all downloads)

  13. Dremio: a system for self-service data access

  14. About Dremio
     • Apache-licensed
     • Built on Apache Arrow, Apache Calcite, Apache Parquet
     • Launched in July 2017
     • Self-service data platform: make data accessible to whatever tool
     • Easy extension, customization and enterprise flexibility
     • SDKs for sources, functions, file formats, security
     • Execution, input and output are all built on native Arrow
     • The Narwhal’s name is Gnarly

  15. Google Docs for Your Data
     • Powerful & intuitive UX for data: find, manage and share data regardless of size & location
     • Live data curation: AI-powered curation of data without creating a single copy

  16. Self-Service Data Access Platform (SQL)
     • Data Caching: data access at interactive speed, without cubes or BI extracts
     • Data Catalog: data discovery, security and personal data assets
     • Data Access: RDBMS, MongoDB, Elasticsearch, Hadoop, S3, NAS, Excel, JSON
     • Data Curation: wrangle, prepare, enrich any source without making copies of your data

  17. Data Access Example

  18. Leveraging Underlying Source Capabilities Example

  19. Reflections: an advanced form of caching

  20. Access Isn’t Enough: Reducing Distance to Data
     The distance between the raw data and what you want determines:
     • Work to be done
     • Resources required
     • Time to complete

  21. The Basic Concept Behind a Relational Cache
     • Maintain derived data (a reflection) that sits between what you want and the raw data
     • Shortens the distance to data (DTD)
     • Reduces resource requirements & latency
     • The materialization can be derived from the raw data via an arbitrary operator DAG

  22. It doesn’t have to be a trivial relationship…
     (diagram: multiple reflections, Reflection 1 and Reflection 2, sit between the raw data and what you want, each further shortening the distance to data)

  23. You Already Do This Today (Manually)!
     Materializations (manually created):
     • Cleansed
     • Partitioned by region or time
     • Summarized for a particular purpose
     • Custom datasets, summarization and/or extraction for modeling, reports and dashboards
     Users choose depending on need:
     • Data scientists & analysts are trained to use different tables depending on the use case

  24. Dremio Can Make the Decisions So You Don’t Have To
     Copy-and-pick (today):
     • Data engineer designs and maintains physical optimizations (transform, sort, partition, aggregate) on the source table
     • Data scientist picks the best optimization
     Reflections (Dremio):
     • Dremio designs and maintains physical optimizations (reflections) beneath a logical model
     • Dremio picks the best optimization

  25. Cache Matching: Example Scenarios
     (diagrams comparing the user query, the reflection definition, and the alternative plan in each scenario)
     • Aggregation rollup: a query like F(c’ < 10) over A(a, b, sum(c)) on S(t1) is rewritten against a materialization A(a, sum(c) as c’) over S(t1)
     • Join/agg transposition: an aggregate over Join(t1.id = t2.id) on S(t1), S(t2) is matched against a materialized join/aggregate target
     • Costing & partitioning: a filter F(a) over S(t1) is satisfied by a materialization partitioned by a (pruned on a) in preference to one partitioned by b
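To make the aggregation-rollup scenario concrete, here is a minimal, self-contained sketch in plain Python (hypothetical names throughout; this illustrates the rewrite idea, not Dremio's implementation). A materialization that pre-aggregates sum(c) by a can answer a later query that filters on that aggregate, without rescanning the raw rows:

```python
from collections import defaultdict

# Raw rows of a hypothetical table t1 with columns (a, b, c).
raw = [
    ("east", 1, 4), ("east", 2, 3),
    ("west", 1, 9), ("west", 2, 6),
]

# Reflection definition: A(a, sum(c) as c_sum) over S(t1),
# maintained ahead of query time.
materialization = defaultdict(int)
for a, _b, c in raw:
    materialization[a] += c

def query_from_raw(limit):
    """User query on the raw data: groups by a, keeps sum(c) < limit."""
    sums = defaultdict(int)
    for a, _b, c in raw:
        sums[a] += c
    return {a: s for a, s in sums.items() if s < limit}

def query_from_reflection(limit):
    """Same query rewritten against the materialization: no raw scan."""
    return {a: s for a, s in materialization.items() if s < limit}

# Both plans produce the same answer; the rewritten one touches far less data.
assert query_from_raw(10) == query_from_reflection(10) == {"east": 7}
```

The cache matcher's job is exactly this equivalence check: proving that the user's query can be derived from a reflection's definition, then costing the alternatives.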

  26. Reflections
     • A reflection is a materialization designed to accelerate operations
     • Transparent to data consumers
     • Not required on day 1… you can add reflections at any time
     • One reflection can help accelerate queries on thousands of different virtual datasets (logical definitions)
     • Reflections are persisted (S3, HDFS, local disks, etc.), so there’s no memory overhead
     • Columnar on disk (Parquet) and columnar in memory (Arrow)
     • Elastic, scales to 1000+ nodes

  27. Reflection Impact Example

  28. In conclusion

  29. Distribution of Responsibilities
     Data Access Platform:
     • Index, secure, expose, share and curate datasets
     • Expose data from different systems in a standard namespace
     • Allow live cleanup and curation capabilities
     • Data manipulation that should be reproducible and shared
     • Disconnect physical concerns from logical needs
     • Cache intermediate results to accelerate common user patterns
     • Get to an interesting slice of data
     BYO Data Science & BI Solutions:
     • Analyze data
     • Experiment and perform what-if analysis
     • Derive conclusions
     • Build models
     • … and everything else that results in an output that isn’t a dataset

  30. Self-Service Data Access (SQL)
     • Data Caching: data access at interactive speed, without cubes or BI extracts
     • Data Catalog: data discovery, security and personal data assets
     • Data Access: RDBMS, MongoDB, Elasticsearch, Hadoop, S3, NAS, Excel, JSON
     • Data Curation: wrangle, prepare, enrich any source without making copies of your data

  31. Join the Community! Come see me for office hours!
     • Download: dremio.com/download
     • GitHub: github.com/dremio/dremio-oss, github.com/apache/arrow
     • Dremio Community: community.dremio.com
     • Arrow mailing list: dev@arrow.apache.org
     • Twitter: @intjesus, @DremioHQ, @ApacheArrow
