Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, - PowerPoint PPT Presentation

Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, Rockset

About Rockset Igor ● Search and analytics engine ● Rockset ● Enables data-driven ● Facebook applications ● RocksDB ● GraphQL 2

Overview ● Hardware and people efficiency ● Designing systems for people efficiency ○ Schemaless SQL ○ Converged indexing ○ Serverless architecture

Hardware Efficiency

Hardware Efficiency ● Faster databases ~= less hardware ● How much hardware do I need? ● Important, but not the only thing that matters

People Efficiency

People Efficiency ● How many people do I need? ● How much time do I need?

People Efficiency - Configuration “My query is slow” “Do you have an index?” “What’s your partition key?” “What’s your buffer size?” “You should hire a DBA”

People Efficiency - Organizational Friction ● Pre-cloud era: Application developers blocked on provisioning ● Data scientists blocked on data engineers

People Efficiency - Pipelines

Hardware vs. People Efficiency ● Hardware is frequently cheaper than people ● Increase hardware efficiency - spend less money ● Increase people efficiency - spark creativity

Designing Systems for People Efficiency

Rockset ● Search and analytics engine ● “Shortest path from data to applications” ● Connect to data sources or streams ● Execute fast queries

Schemaless SQL

Choosing the Query Language ● SQL is the obvious choice ● Maximize usefulness ● Existing knowledge ● Ecosystem of tools

Querying existing data sources OLTP Data Lake Social SQL Sensors Logs Email/Docs Files Web/Mobile

Querying existing data sources OLTP Data Lake Social SQL Sensors Logs Email/Docs ...but first, let me define a schema Files Web/Mobile

SQL Schema ● Drag on people efficiency ● Messy data ● Complex ETL jobs

Schemaless SQL ● “Smart schema” ● Frictionless data onboarding ● Data scientists no longer blocked on data engineers ● Performance overhead? https://rockset.com/blog/using-smart-schema-to-accelerate-insights-from-nested-json/

Schemaless SQL - Storage Data Schema name: String age: Int Strict schema John 35 Schemaless “name”: S “John” “age”: I 35 Schemaless name: 0 age: 1 0: S “John” 1: I 35 (with field interning)

Schemaless SQL - Query Execution Rows 1 10 7 4 5 Strict schema Columns a b c d e I 1 I 10 I 7 I 4 I 5 Schemaless Columns S a S b I 3 I 5 S e I 1 10 7 4 5 Schemaless Columns (with type hoisting) M S a S b I 3 I 5 S e

Schemaless SQL ● Superior user experience ● Field interning reduces storage overhead ● Type hoisting reduces query execution overhead

Converged indexing

Converged Indexing ● “Query is slow because of the missing index”

Converged Indexing ● “Query is slow because of the missing index” Index all the fields!

Background on Indexing ● Columnar storage ● Search indexing

Columnar Storage Store each column separately ● Great compression ● Only fetch columns the query needs ● 29

Columnar Storage Store each column separately ● “name” Great compression ● Only fetch columns the query needs 0 Igor ● 1 Dhruba <doc 0> “interests” { “name”: “Igor”, 0.0 databases “interests”: [“databases”, “snowboarding”], 0.1 snowboarding “last_active”: 2019/3/15 } 1.0 cars 1.1 databases <doc 1> { “last_active” “name”: “Dhruba”, 0 2019/3/15 “interests”: [“cars”, “databases”], “last_active”: 2019/3/22 1 2019/3/22 } 30

Columnar Storage Advantages Disadvantages Cost effective High write latency ● ● Narrow queries, wide tables High minimum read latency ● ● Scan queries Not suitable for online ● ● Analytical queries applications ● 31

Search Indexing For each value, store documents containing that value (posting list) ● Quickly retrieve a list of document IDs that match a predicate ● 32

Search Indexing For each value, store documents containing that value (posting list) ● Quickly retrieve a list of document IDs that match a predicate ● “name” <doc 0> Dhruba 1 { “name”: “Igor”, Igor 0 “interests”: [“databases”, “snowboarding”], “interests” “last_active”: 2019/3/15 } databases 0.0; 1.1 cars 1.0 <doc 1> { snowboarding 0.1 “name”: “Dhruba”, “interests”: [“cars”, “databases”], “last_active” “last_active”: 2019/3/22 2019/3/15 0 } 2019/3/22 1 33

Search Indexing Advantages Disadvantages High selectivity queries Slower analytical queries ● ● Low latency queries ● Suitable for online applications ● 34

Converged Indexing Columnar and search indexes in the same system ● Built on top of key-value store abstraction ● Each document maps to many key-value pairs ● 35

Converged Indexing Columnar and search indexes in the same system ● Built on top of key-value store abstraction ● Each document maps to many key-value pairs ● Key Value Row Store R.0.name Igor <doc 0> R.1.name Dhruba { “name”: “Igor” C.name.0 Igor Column Store } C.name.1 Dhruba <doc 1> { Search index S.name.Dhruba.1 “name”: “Dhruba” } S.name.Igor.0 36

Converged Indexing - Queries Fast analytical queries + fast search queries ● Optimizer picks between columnar store or search index ● 37

Converged Indexing - Queries Fast analytical queries + fast search queries ● Optimizer picks between columnar store or search index ● SELECT * SELECT keyword, count(*) FROM search_logs FROM search_logs WHERE keyword = ‘datacouncil’ GROUP BY keyword AND locale = ‘en’ ORDER BY count(*) DESC Search index Columnar store 38

Converged Indexing - Writes One document write results in many ● key-value store writes Use write-optimized key-value store - ● RocksDB 39

Converged Indexing - Writes new keys One document write results in many ● key-value store writes Memory Manager Use write-optimized key-value store - ● Memory Buffer RocksDB Storage SST 1 SST 2 background compaction SST 3 SST 4 40

Converged Indexing ● Fast queries out of the box ● Real-time index writes Less efficient More efficient ● Database configuration ● Storage ● Queries ● Writes 41

Serverless Architecture

Serverless Architecture ● Rockset is a cloud service ● No need to manage hardware ● Seamless autoscale 43

Storage in the Cloud Rockset SQL API Distributed Data is sharded across leaves ● Log Aggregator Aggregator Leaf Leaf Leaf RocksDB RocksDB RocksDB 44

Storage in the Cloud Rockset SQL API Distributed Data is sharded across leaves ● Log Aggregator Aggregator RocksDB-Cloud keeps consistent ● copy in cloud object storage Leaf Leaf Leaf RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB RocksDB RocksDB SST files SST files SST files Object Storage (AWS S3, GCS, Minio, ...) 45

Adding new read replica Rockset SQL API Distributed Copy data to a new leaf ● Log Aggregator Aggregator Leaf Leaf Leaf Leaf RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB RocksDB RocksDB RocksDB SST files SST files SST files Object Storage (AWS S3, GCS, Minio, ...) 46

Adding new read replica Rockset SQL API Distributed Copy data to a new leaf ● Log Aggregator Aggregator Tail new updates from log ● Leaf Leaf Leaf Leaf RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB RocksDB RocksDB RocksDB SST files SST files SST files Object Storage (AWS S3, GCS, Minio, ...) 47

Adding new read replica Rockset SQL API Distributed Copy data to a new leaf ● Log Aggregator Aggregator Tail new updates from log ● Able to serve more queries ● Leaf Leaf Leaf Leaf RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB-Cloud RocksDB RocksDB RocksDB RocksDB SST files SST files SST files Object Storage (AWS S3, GCS, Minio, ...) 48

Conclusion

Conclusion No need to configure... ● Schemaless SQL ...schema ● Converged indexing ...indexes ● Serverless architecture ...servers 50

Conclusion ● Rockset - “shortest path from data to applications” ● Making workflows easy catalyzes creativity 51

Thank you

Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, - PowerPoint PPT Presentation

Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, Rockset About Rockset Igor Search and analytics engine Rockset Enables data-driven Facebook applications RocksDB GraphQL 2 Overview Hardware and people

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

SQL SQL SQL = Structured Query Language Standard query language for relational

Intermezzo: A typical database architecture 136 A typical database architecture SQL SQL SQL

A1 (Part 2): Injection SQL Injection SQL injection is prevalent SQL injection is impactful Why a

What is SQL? SQL stands for Structured Query Language SQL lets you access and manipulate

BASIC SQL CHAPTER 4 (6/E) CHAPTER 8 (5/E) 1 CHAPTER 4 OUTLINE SQL Data Definition and

This Lecture SQL The SQL language SQL, the relational model, and E/R diagrams SQL Data

Basic SQL Lecture 2 1 Outline Data in SQL Simple Queries in SQL Queries with more

The SQL Procedure Language (SQL PL) Tony Andrews Themis Education tandrews@themisinc.com

SQL & MySQL Jeff Siarto - TC 361 Whats the Difference? MySQL is a database SQL is

Outline Background SQL history and terminology Introduction SAS seminar Proc

SQL Developed by IBM (for System R) in the 1970s. Standard used by many vendors.

SQL and JS Pitfalls Assignment 2 Preparation SQL Concepts SQL vs. NoSQL

Advanced SQL 01 The Core of SQL Torsten Grust Universitt Tbingen, Germany 1 The Core

Architecting the Internet of Things Dieter Uckelmann Mark Harrison Florian Michahelles

Architecting Java solutions for CICS Architecting Java solutions for CICS Course introduction

Front End Digitization of tRPCs in the ATLAS Muon Spectrometer Phase-I Upgrade X.T. Meng 1,2 1

Latency: the #1 metric of your cloud Boyan Krosnov Chief of Product Cloud architect

Latency_nice Implementation and Use-case for Scheduler Optimization Parth Shah

but quite a lot is. Coordination among users can help with anonymity. Debajyoti Das 1 Sebastian

Demystifying the Real-Time Linux Scheduling Latency Daniel Bristot de Oliveira , Daniel Casini,

Welcome! This event will start shortly. In the meantime, visit us at OhioFamiliesEngage.osu.edu

Cyber OverVue Product Launch What We Deliver Data Solutions Web Connectivity One of the

Kick-star*ng Kanban ( KB101) Rick Simmons

Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, - PowerPoint PPT Presentation

Architecting a Low-Latency Schemaless SQL Engine Igor Canadi, Rockset About Rockset Igor Search and analytics engine Rockset Enables data-driven Facebook applications RocksDB GraphQL 2 Overview Hardware and people

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

SQL SQL SQL = Structured Query Language Standard query language for relational

Intermezzo: A typical database architecture 136 A typical database architecture SQL SQL SQL

A1 (Part 2): Injection SQL Injection SQL injection is prevalent SQL injection is impactful Why a

What is SQL? SQL stands for Structured Query Language SQL lets you access and manipulate

BASIC SQL CHAPTER 4 (6/E) CHAPTER 8 (5/E) 1 CHAPTER 4 OUTLINE SQL Data Definition and

This Lecture SQL The SQL language SQL, the relational model, and E/R diagrams SQL Data

Basic SQL Lecture 2 1 Outline Data in SQL Simple Queries in SQL Queries with more

The SQL Procedure Language (SQL PL) Tony Andrews Themis Education tandrews@themisinc.com

SQL &amp; MySQL Jeff Siarto - TC 361 Whats the Difference? MySQL is a database SQL is

Outline Background SQL history and terminology Introduction SAS seminar Proc

SQL Developed by IBM (for System R) in the 1970s. Standard used by many vendors.

SQL and JS Pitfalls Assignment 2 Preparation SQL Concepts SQL vs. NoSQL

Advanced SQL 01 The Core of SQL Torsten Grust Universitt Tbingen, Germany 1 The Core

Architecting the Internet of Things Dieter Uckelmann Mark Harrison Florian Michahelles

Architecting Java solutions for CICS Architecting Java solutions for CICS Course introduction

Front End Digitization of tRPCs in the ATLAS Muon Spectrometer Phase-I Upgrade X.T. Meng 1,2 1

Latency: the #1 metric of your cloud Boyan Krosnov Chief of Product Cloud architect

Latency_nice Implementation and Use-case for Scheduler Optimization Parth Shah

but quite a lot is. Coordination among users can help with anonymity. Debajyoti Das 1 Sebastian

Demystifying the Real-Time Linux Scheduling Latency Daniel Bristot de Oliveira , Daniel Casini,

Welcome! This event will start shortly. In the meantime, visit us at OhioFamiliesEngage.osu.edu

Cyber OverVue Product Launch What We Deliver Data Solutions Web Connectivity One of the

Kick-star*ng Kanban ( KB101) Rick Simmons

SQL & MySQL Jeff Siarto - TC 361 Whats the Difference? MySQL is a database SQL is