SQL for NoSQL and how Apache Calcite can help FOSDEM 2017
Christian Tzolov Engineer at Pivotal BigData, Hadoop, Spring Cloud Dataflow Apache Committer, PMC member Apache {Crunch, Geode, HAWQ, ...} blog.tzolov.net twitter.com/christzolov nl.linkedin.com/in/tzolov Disclaimer This talk expresses my personal opinions. It is not read or approved by Pivotal and does not necessarily reflect the views and opinions of Pivotal nor does it constitute any official communication of Pivotal. Pivotal does not support any of the code shared here. 2
“It will be interesting to see what happens if an established NoSQL database decides to implement a reasonably standard SQL ; The only predictable outcome for such an eventuality is plenty of argument.” 2012, Martin Fowler, P.J.Sadalage, NoSQL Distilled 3
Data Big Bang Why? 4
NoSQL Driving Forces • Rise of Internet Web, Mobile, IoT – Data Volume, Velocity, Variety challenges ACID & 2PC clash with Distributed architectures. CAP, PAXOS instead.. • Row-based Relational Model. Object-Relational Impedance Mismatch More convenient data models: Datastores, Key/Value, Graph, Columnar, Full-text Search, Schema-on-Load … • Infrastructure Automation and Elasticity (Cloud Computing) Eliminate operational complexity and cost. Shift from Integration to application databases … 5
Data Big Bang Implications • Over 150 commercial NoSQL and BigData systems. • Organizations will have to mix data storage technologies! • How to integrate such multitude of data systems? 6
“Standard” Data Process/Query Language? • Functional - Unified Programming pcollection.apply( Read.from (”in.txt")) Model .apply( FlatMapElements . via ((String word) -> asList(word.split("[^a-zA-Z']+"))) • Apache {Beam, Spark, Flink, .apply( Filter . by ((String word)->!word.isEmpty())) Apex, Crunch}, Cascading .apply( Count .<String> perElement ()) • Converging around Apache Batch & Streaming, OLTP Beam • Declarative - SQL • Adopted by many NoSQL SELECT b."totalPrice", c."firstName” FROM "BookOrder" as b Vendors INNER JOIN "Customer" as c • Most Hadoop tasks: Hive and ON b."customerNumber" = c."customerNumber” SQL-on-Hadoop WHERE b."totalPrice" > 0; • Spark SQL - most used production component for 2016 OLAP, EDW, Exploration • Google F1 7
SQL for NoSQL? • Extended Relational Algebra - already present in most NoSql data system • Relational Expression Optimization – Desirable but hard to implement 8
Organization Data - Integrated View Direct (M:N) Single Federated DB (M:1:N) https://issues.apache.org/jira/browse/HAWQ-1235 Organization Data Tools Organization Data Tools SQL/JDBC SQL/JDBC SQL/JDBC SQL/JDBC HAWQ FDBS Apache HAWQ Optimizer, Columnar (HDFS) Calcite Calcite Calcite … SQLAdapter 1 SQLAdapter 2 SQLAdapter n NoSQL 2 NoSQL 1 NoSQL n PXF 2 PXF n PXF 1 Native Native Native API 2 API n API 1 … NoSQL 1 NoSQL n NoSQL 1 9
Single Federated Database Federated External Tables with Apache HAWQ - MPP, Shared-Noting, SQL- on-Hadoop CREATE EXTERNAL TABLE MyNoSQL ( customer_id TEXT, first_name TEXT, last_name TEXT, gender TEXT ) LOCATION ('pxf://MyNoSQL-URL>? FRAGMENTER=MyFragmenter& ACCESSOR=MyAccessor& RESOLVER=MyResolver&') FORMAT 'custom'(formatter='pxfwritable_import'); 10
Apache Calcite? Java framework that allows SQL interface and advanced query optimization, for virtually any data system • Query Parser, Validator and Optimizer(s) • JDBC drivers - local and remote • Agnostic to data storage and processing
Calcite Application • Apache Apex • Apache Drill • Apache Flink • Apache Hive • Apache Kylin • Apache Phoenix • Apache Samza • Apache Storm • Cascading • Qubole Quark • SQL-Gremlin … • Apache Geode 12
SQL Adapter Design Choices SQL completeness vs. NoSql design integrity Catalog – namespaces accessed in queries Schema - collection of schemas and tables • Data Type Conversion Table - single data set, collection of rows RelDataType – SQL fields types in a Table (simple) Predicate Pushdown: Scan , Filter , Projection • Move Computation to Data (complex) Custom Relational Rules and Operations: Sort, Join, GroupBy ... 13
Geode to Calcite Data Types Mapping Geode Cache is mapped into Calcite Schema Create Column Types Calcite Schema Geode Cache (RelDataType) from Geode Value class (JavaTypeFact ctoryImpl) Region 1 Table 1 Col1 Col2 ColN Row1 V(1,1) V(1,2) V(1,N) Key Val Row2 V(2,1) V(2,2) V(2,N) k1 v1 RowM V(M,1) V(M,2) V(M,N) k2 v2 Geode Key/Value is mapped … … into Table Row Region K Table K Regions are mapped into Tables 14
Geode Adapter - Overview SQL SQL/JDBC/ JDBC/ODBC ODBC Parse S SQL, co conv nverts i int nto relationa nal e expression a n and nd optimizes Apache Calcite Push d down t n the r relationa nal expressions ns s supported b by G Geode OQL a and nd f falls b back ck t to t the C Calci cite Spring ng D Data A API f for Enu numerable A Adapter f for t the r rest Enumerable int nteract cting ng w with G Geode Adapter Conv nvert S SQL r relationa nal expressions ns i int nto O OQL q queries Spring Data Geode Adapter Geode (Geode Client) Geode A API a and nd O OQL Geode Server Geode Server Geode Server Data Data Data
Simple SQ L Adapter Initialize <<SchemaFactory>> <<create>> <<Schema>> MySchemaFactory MySchema ! connect jdbc : calcite :model=path-to- model.json +getTableMap():Map<String, Table>) +create(operands):Schema <<create>> defaultSchema: 'MyNoSQL', schemas: [{ name: ’MyNoSQLAdapter, factory : MySchemaFactory ’, ScannableTable, Uses reflection to builds operand : { myNoSqlUrl: … , } <<ScannableTable>> RelDataType from your }] FilterableTable, MyTable value’s class type ProjectableFilterableTable +getRowType(RelDataTypeFactor) Returns an Enumeration +scan(ctx):Ennumerator<Object[]> over the entire target data store <<on scan() create>> SELECT b."totalPrice” Defined in the Linq4j Query FROM "BookOrder" as b sub-project <<Enummerator>> WHERE b."totalPrice" > 0; MyEnummerator Converts MyNoSQL value response into +moveNext() Calcite row data +convert(Object):E <<Get all Data>> My NoSQL 16
Non-Relational Tables (Simple) Scanned without intermediate relational expression. • ScannableTable - can be scanned Enumerable<Object[]> scan (DataContext root); • FilterableTable - can be scanned, applying supplied filter expressions Enumerable<Object[]> scan (DataContext root, List<RexNode> filters ); • ProjectableFilterableTable - can be scanned, applying supplied filter expressions and projecting a given list of columns Enumerable<Object[]> scan (DataContext root, List<RexNode> filters , int[] projects ); 17
Calcite Ecosystem Several “semi-independent” projects. Local and Remote Port of LINQ (Language-Integrated Query) JDBC driver to Java. Linq4j JDBC and Avatica Method for translating executable code into data Expression (LINQ/MSN port) Converts SQL SQL Parser & AST queries Into AST Tree (SqlNode …) Complies Java code generated from linq4j Relational “Expressions”. Part of the Relational Algebra, Interpreter • Relational Expressions physical plan implementer expression, • Row Expression optimizations … • Optimization Rules • Planner … Default (In-memory) Data Enumerable Store Adapter Adapter implementation. Leverages Linq4j 3 rd party Adapters 18
Calcite SQL Query Execution Flow 1. On new SQL query JD JDBC BC delegates to Pr Prepar epare to prepare the query JDBC execution 2. Parse SQL, convert to rel. 1 expressions. Va Valid lidate te and Opti Optimi mize ze Calcite Framework them 2 Geode Adapter Prepare 3. Start building a physical plan from the relation expressions 3 SQL, Interpreter 5 4. Implement the Geode relations and Relational, Enumerable encode them as Expression t n tree Planner 5. Pass the Expression tree to the 4 6 7 Interpreter to generate Java code 2 7 Geode 6. Generate and Compile a Binder Binder Adapter instance that on ‘bind()’ call runs Geodes’ query method 7 7. JDBC uses the newly compiled Geode Binder to perform the query on the Geode Cluster Cluster 19
Calcite Relational Expressions TableScan Input Column Project Ref Row-level Ro Relationa nal expressions expression Filter Literal * Struct field Aggregate RelNode RexlNode access Project, Sort field fields s * Filter, Join co cond nditions ns Join Function call RelTrait Window Intersect expressions Physica cal attribute of a relation Sort 20
Recommend
More recommend