Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - PowerPoint PPT Presentation

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io mxm@apache.org

2 W H Y A R E W E TA L K I N G A B O U T T H I S ? • Traditional databases are well-researched and there are plenty of them (Postgres, MySQL, Oracle…) • Scalable search using these can be tricky • Search engines are databases optimized for search and scale (Lucene, Solr, Elasticsearch) • You can’t typically use SQL with Search Engines • Why not stick with an mature query language standard which everybody knows?

“A scalable SQL database optimized for search without the NoSQL bullshit.”

4 C R AT E D B I N A N U T S H E L L • Since 2014: https://github.com/crate/crate • Apache 2.0 licensed (community edition) • Built using Elasticsearch, Lucene, Netty, Antlr, … • SQL-99 compatible • REST / Postgres Wire Protocol / JDBC / Python …

5 W H AT T O E X P E C T • What is great about CrateDB • Easy to setup • No funny APIs / SQL • Great scale out - Massive reads / writes • Container aware • Not so great • Transactions • Foreign keys

U S I N G C R AT E D B

7 C R AT E D B I S J U S T L I K E A S Q L D B • SQL is the only query API •CREATE TABLE fosdem.speakers (id int PRIMARY KEY, name string) •CREATE TABLE fosdem.talks (id INT PRIMARY KEY, title STRING, abstract STRING, speaker INT); •INSERT INTO fosdem.speakers (id, name) VALUES (1, ’max’) •INSERT INTO fosdem.talks (id, title, abstract, speaker) VALUES (1, ‘Talk about CrateDB’, ‘bla’, 1) •SELECT * FROM fosdem.talks t1 LEFT JOIN fosdem.speakers t2 ON t1.id = t2.id

8 B U T T H E R E I S M O R E • CrateDB denormalized (no joins necessary) • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • INSERT INTO fosdem.speakers (name, talk) VALUES (‘max’, {title = ‘CrateDB’, abstract = ‘Lorem ipsum’}) • SELECT talk[‘title’] as title FROM fosdem.speakers ORDER BY title

9 C L U S T E R I N G / R E P L I C AT I O N N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • CLUSTERED BY name into 4 shards S H A R D

10 C L U S T E R I N G / R E P L I C AT I O N N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • CLUSTERED BY name into 4 shards • WITH (number_of_replicas = 1) P R I M A RY R E P L I C A

11 PA R T I T I O N E D TA B L E S N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT as (title = STRING, abstract = STRING), year INT) • CLUSTERED BY name into 4 shards • PARTITIONED BY (year, …) • WITH (number_of_replicas = 1) P R I M A RY R E P L I C A

12 M O R E F E AT U R E S • Aggregations • Geo search • Text Analyzers • UDFs • Snapshots • User management • Schema / Table privileges • SSL encryption • MQTT Ingestion

A R C H I T E C T U R E

14 O N T H E S H O U L D E R S O F G I A N T S • CrateDB: Distributed SQL Execution Engine • Antlr: Parsing of SQL statements • Netty: REST, Postgres Wire Protocol, Web interface • Lucene: Storage, Indexing, Queries • Elasticsearch: Transport, Routing, Replication

15 I N T R O D U C T I O N T O • Lucene stores documents which are CrateDB’s rows • Documents have fields • { _id : ‘123’,   name : ‘Bob’,   title : ‘How I Learned to Stop Worrying   and Love the Bomb’,   text : ‘Lorem ipsum…'   } • Fields are indexed for efficient lookup • Fields have column store for efficient aggregation

16 I N T R O D U C T I O N T O E L A S T I C S E A R C H • Elasticsearch core concepts revolve around indices, shards, and replicas • An index is a document store with n parts,   called shards • Each shard has 0 or more replicas which hold copies of the shard data • Replicas are not only useful for fault tolerance but also increase the search performance

17 H O W TA B L E S R E L AT E T O I N D I C E S A N D S H A R D S "properties":{ “name":{"type":"keyword"}, • Each table in CrateDB is "talks":{"dynamic":"true", "properties":{ represented by an ES index "abstract":{"type":"keyword"}, "title":{"type":"keyword"} with a mapping } } } • Each partition in a TA B L E T 1 T 2 T 3 … partitioned table is I N D E X t1 t2.day1 t2.day2 … t3 … represented by an ES index S H A R X X X … X … D 1 • Partition indices are S H A R X X … X X … D 2 created by encoding the S H A R X X … D 3 partition value in the index S H A R X … D 4 name … …

18 F R O M Q U E RY T O E X E C U T I O N SELECT name, count(*) as talks FROM fosdem.speakers   • WHERE room = ‘hpc’ AND year = 2018 GROUP BY name ORDER BY name C L I E N T P S Q L W E B J D B C R U S T C R A S H P Y T H O N N O D E N O D E 1 N O D E 2 N O D E 4 R E S T / P O S T G R E S PA R S E R A N A LY Z E R P L A N N E R E X E C U T O R N O D E 3 N O D E 5 T R A N S P O RT ( E S ) S T O R A G E ( L U C E N E )

19 A R C H I T E C T U R E H I G H L I G H T S • Distributed storage / Distributed query execution N O D E 1 N O D E 3 • Masterless N O D E 2 N O D E 4 • Replication • Only ephemeral storage needed (Container aware) • Optimized for search: Indexing of all fields with Lucene (tuneable)

H A N D S - O N

21 W H AT C A N Y O U D O W I T H C R AT E D B ? • Monitoring (IoT, Industry 4.0, Cyber Security) • Stream Analysis • Text Analysis • Time Series Analysis • Geospatial Queries

CrateDB Web Interface D E M O

CrateDB Web Interface

C O N C L U S I O N

27 W H AT W E H AV E L E A R N E D • Elasticsearch used Lucene and Netty to built a distributed search engine • CrateDB used Elasticsearch, Lucene, and Netty to built a distributed SQL database • CrateDB is perfect when you • want or have to use SQL • store large amounts of structured or unstructured data • have many thousands of queries per second

28 S E E F O R Y O U R S E L F ! • Try out CrateDB • Download from https://crate.io/download/ • or $ curl try.crate.io | bash • or $ docker run crate • or build from source https://github.com/crate/crate • Check out https://crate.io/docs • Contributions welcome • Check out https://github.com/crate/crate/blob/master/devs/docs/index.rst • Check out the issues • Stackoverflow • Join our Slack channel

T H A N K Y O U ! Maximilian Michels @stadtlegende max@crate.io mxm@apache.org

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - PowerPoint PPT Presentation

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io

F R O M Z E R O T O P O RTA B I L I T Y ? Maximilian Michels mxm@apache.org A PA C H E B E

UPAKWESHIP PRESENTS THE U-CRATE 50 A 50 CUBIC FOOT CRATE , PERFECT FOR SMALL MOVES!

UPAKWESHIP PRESENTS THE U-CRATE 100 A 100 CUBIC FOOT CRATE , PERFECT FOR THE ESSENTIALS!

Sergey Beryozkin, T alend Sergey Beryozkin, T alend Apache CXF Apache CXF Practical JOSE

- BODMER EXPEDITION November 2008 Maximilian Bodmer Expedition Prince Maximilian German

Apache Felix Web Console Carsten Ziegeler | cziegeler@apache.org ApacheCon NA 2014 About

The Apache Way The Apache Way Nick Burch Nick Burch CTO, Quanticate CTO, Quanticate The

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian

Anomalies in Data Maximilian Toller KDDM2 Maximilian Toller, Know-Center > www.tugraz.at 1

Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research Apache

Multi-tenant Machine Learning Apache Aurora & Apache Mesos Stephan Erb

Stream Processing with Apache Apex Thomas Weise Apache Apex PMC Chair thw@apache.org @thweise

What's new with Apache Tika? What's new with Apache Tika? What's New with Apache Tika? What's

Apache Gearpump next-gen streaming engine Karol Brejna, Intel (karolbrejna@apache.org) Huafeng

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

CSN09101 Networked Services Week 8: Essential Apache Week 8: Essential Apache Module Leader: Dr

Challenges for numerical relativity and gravitational-wave source modeling Emanuele Berti, Johns

ss t r

Welcome (back) Youve watched intro VUCA, change, friction/confusion/underperformance

Outline ! Program Background ! A typical C2 Architecture, Issues to study ! Multi-modeling

ACR 3413 BASIC STRUCTURAL ENGINEERING 3 Lecture 1 Univers rsit ity y Putra a Malaysia ysia

Engineering Code Obfuscation ISSISP 2017 - Tamperproofing Christian Collberg Department of

Security Analysis of India's Electronic Voting Systems Scott Wolchok, Wustrow, Halderman (UMich),

What is tamperproofing? Ensure that a program executes as intended, even in the presence of an

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - PowerPoint PPT Presentation

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io

F R O M Z E R O T O P O RTA B I L I T Y ? Maximilian Michels mxm@apache.org A PA C H E B E

UPAKWESHIP PRESENTS THE U-CRATE 50 A 50 CUBIC FOOT CRATE , PERFECT FOR SMALL MOVES!

UPAKWESHIP PRESENTS THE U-CRATE 100 A 100 CUBIC FOOT CRATE , PERFECT FOR THE ESSENTIALS!

Sergey Beryozkin, T alend Sergey Beryozkin, T alend Apache CXF Apache CXF Practical JOSE

- BODMER EXPEDITION November 2008 Maximilian Bodmer Expedition Prince Maximilian German

Apache Felix Web Console Carsten Ziegeler | cziegeler@apache.org ApacheCon NA 2014 About

The Apache Way The Apache Way Nick Burch Nick Burch CTO, Quanticate CTO, Quanticate The

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian

Anomalies in Data Maximilian Toller KDDM2 Maximilian Toller, Know-Center &gt; www.tugraz.at 1

Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research Apache

Multi-tenant Machine Learning Apache Aurora &amp; Apache Mesos Stephan Erb

Stream Processing with Apache Apex Thomas Weise Apache Apex PMC Chair thw@apache.org @thweise

What's new with Apache Tika? What's new with Apache Tika? What's New with Apache Tika? What's

Apache Gearpump next-gen streaming engine Karol Brejna, Intel (karolbrejna@apache.org) Huafeng

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

CSN09101 Networked Services Week 8: Essential Apache Week 8: Essential Apache Module Leader: Dr

Challenges for numerical relativity and gravitational-wave source modeling Emanuele Berti, Johns

ss t r

Welcome (back) Youve watched intro VUCA, change, friction/confusion/underperformance

Outline ! Program Background ! A typical C2 Architecture, Issues to study ! Multi-modeling

ACR 3413 BASIC STRUCTURAL ENGINEERING 3 Lecture 1 Univers rsit ity y Putra a Malaysia ysia

Engineering Code Obfuscation ISSISP 2017 - Tamperproofing Christian Collberg Department of

Security Analysis of India's Electronic Voting Systems Scott Wolchok, Wustrow, Halderman (UMich),

What is tamperproofing? Ensure that a program executes as intended, even in the presence of an

Anomalies in Data Maximilian Toller KDDM2 Maximilian Toller, Know-Center > www.tugraz.at 1

Multi-tenant Machine Learning Apache Aurora & Apache Mesos Stephan Erb