Apache Cassandra for Big Data Applications
Christof Roduner, COO and co-founder (christof@scandit.com)
Java User Group Switzerland, January 7, 2014

AGENDA
- Cassandra origins and use
- How we use Cassandra
- Data model and query language
- Cluster organization
- Replication and consistency
- Practical experience

WHAT IS CASSANDRA?
NoSQL: not only SQL

ORIGINS
- Dynamo: distributed storage
- BigTable: data model

USED BY…

SCANDIT
ETH Zurich startup company
- Our mission: provide the best mobile barcode scanning platform
- Customers: Bayer, Coop, Capital One, Saks Fifth Avenue, NASA, …
- Barcode scanning SDKs for:
  - iOS, Android
  - PhoneGap
  - Titanium
  - Xamarin


THE SCANALYTICS PLATFORM
Two purposes:
1. External tool for app publishers:
   - App-specific real-time usage statistics
   - Insights into user behavior
     - What do users scan? Product categories? Groceries, electronics, books, cosmetics, …?
     - Where do users scan? At home? Or while in a retail store?
   - Top products and brands
2. Internal tool for our algorithms team:
   - Improve our image processing algorithms
   - Detect devices and OS versions with camera issues
   - Monitor scan performance of our SDK

BACKEND REQUIREMENTS
- Analysis of scans
- Accept and store high volumes of scans
- Keep history of billions of camera parameters
- Generate statistics over extended time periods
- Provide reports to developers

BACKEND DESIGN GOALS
- Scalability
  - High-volume storage
  - High-volume throughput
  - Support large number of concurrent client requests (mobile devices)
- Availability
- Low maintenance
  - Even as our customer base grows
- Multiple data centers

WHY DID WE CHOOSE CASSANDRA?
- Partitioning (diagram: key ranges A..J, K..R, S..Z)
- Simplicity (diagram labels: Coordinator, Master, Slave)

MORE REASONS…
- Looked very fast
  - Even when data is much larger than RAM
- Performs well in write-heavy environment
- Proven scalability
  - Without downtime
- Tunable replication
- Data model
- YMMV…

WHAT YOU HAVE TO GIVE UP
- Joins
- Referential integrity
- Transactions
- Expressive query language (nested queries, etc.)
- Consistency (tunable, but not by default…)
- Limited support for secondary indices

HELLO CQL

    CREATE TABLE users (
        username TEXT,
        email TEXT,
        web TEXT,
        phone TEXT,
        PRIMARY KEY (username)
    );

HELLO CQL

    INSERT INTO users (username, email, phone)
    VALUES ('alice', 'alice@example.com', '123-456-7890');

    INSERT INTO users (username, email, web)
    VALUES ('bob', 'bob@example.com', 'www.example.com');

HELLO CQL

    cqlsh:demo> SELECT * FROM users;

     username | email             | phone        | web
    ----------+-------------------+--------------+-----------------
     bob      | bob@example.com   | null         | www.example.com
     alice    | alice@example.com | 123-456-7890 | null

FAMILIAR… BUT DIFFERENT

    CREATE TABLE users (
        username TEXT,
        email TEXT,
        web TEXT,
        phone TEXT,
        PRIMARY KEY (username)
    );

- No auto increments (use natural key or UUID instead)
- Primary key always mandatory
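Without auto increments, the client has to produce the key itself. A minimal Python sketch of the UUID option, assuming a random (version 4) UUID is acceptable as the row key; the `new_user_row` helper and the `user_id` field are hypothetical and not part of the talk:

```python
import uuid

def new_user_row(email):
    """Build a row whose key is generated on the client (hypothetical helper)."""
    # uuid4() is random, so clients need no coordination to avoid collisions.
    return {"user_id": uuid.uuid4(), "email": email}

row_a = new_user_row("alice@example.com")
row_b = new_user_row("bob@example.com")
print(row_a["user_id"], row_b["user_id"])
```

A natural key such as the username avoids the extra column, at the cost of making the key hard to change later.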

UNDER THE HOOD: CLUSTER ORGANIZATION
Ring of four nodes, each assigned a token:
- Node 1: token 0
- Node 2: token 64
- Node 3: token 128
- Node 4: token 192
Range 1-64 is stored on node 2; range 65-128 is stored on node 3.

STORING A ROW
1. Calculate the md5 hash for the row key (the "username" field in the example above).
   Example: md5("alice") = 48
2. Determine the data range for the hash.
   Example: 48 lies within range 1-64
3. Store the row on the node responsible for that range.
   Example: store on node 2
Ring: node 1 (token 0), node 2 (token 64), node 3 (token 128), node 4 (token 192). Range 1-64 is stored on node 2, range 65-128 on node 3.
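The three steps can be sketched in Python using the toy ring from the slide. This is an illustration under stated assumptions, not Cassandra's implementation: the token space is shrunk to 0..255, `toy_token` takes only the first byte of the md5 digest, and all function names are made up for this example.

```python
import bisect
import hashlib

# Toy ring from the slide: four nodes with tokens 0, 64, 128 and 192.
# Assumption: a 0..255 token space; a real cluster uses a far larger one.
TOKENS = [0, 64, 128, 192]
NODES = ["node 1", "node 2", "node 3", "node 4"]

def toy_token(row_key):
    """Step 1: hash the row key (first byte of its md5 digest, 0..255)."""
    return hashlib.md5(row_key.encode("utf-8")).digest()[0]

def node_for_token(token):
    """Steps 2 and 3: find the node whose token is the first one >= token."""
    index = bisect.bisect_left(TOKENS, token)
    # Tokens above the highest node token wrap around to the first node.
    return NODES[index % len(NODES)]

def node_for_key(row_key):
    return node_for_token(toy_token(row_key))

print(node_for_token(48))     # the slide's example hash: node 2
print(node_for_key("alice"))  # depends on the real md5 digest of "alice"
```

The slide's value md5("alice") = 48 is itself a simplified example, so `node_for_key("alice")` need not land on node 2.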

IMPLICATIONS
- Cluster automatically balanced
  - Load is shared equally between nodes
  - No hotspots
- Scaling out?
  - Easy
  - Divide data ranges by adding more nodes
  - Cluster rebalances itself automatically
- Range queries not possible
  - You can't retrieve «all rows from A-C»
  - Rows are not stored in their «natural» order
  - Rows are stored in order of their md5 hashes
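The last point can be made concrete with a short Python sketch that sorts a few row keys once alphabetically and once by md5 hash. The usernames are invented for illustration, and `md5_token` only stands in for the partitioner's hash:

```python
import hashlib

def md5_token(row_key):
    """Full md5 digest as an integer (stand-in for the partitioner hash)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")

usernames = ["alice", "bob", "carol", "dave", "erin"]

natural_order = sorted(usernames)
storage_order = sorted(usernames, key=md5_token)

print("natural:", natural_order)
print("by hash:", storage_order)
```

The hash order carries no relation to the alphabetical order, so keys that are neighbors alphabetically are generally scattered across the ring. That is why a scan over «all rows from A-C» has no contiguous range to read.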

UNDER THE HOOD: PHYSICAL STORAGE
- A physical row stores data in name-value pairs ("cells")
  - Cell name is the CQL field name (e.g. "email")
  - Cell value is the field data (e.g. "bob@example.com")
- Cells in a row are automatically sorted by name ("email" < "phone" < "web")
- Cell names can be different in rows
- Up to 2 billion cells per row

Example:

    INSERT INTO users (username, email, phone)
    VALUES ('alice', 'alice@example.com', '123-456-7890');

    INSERT INTO users (username, email, web)
    VALUES ('bob', 'bob@example.com', 'www.example.com');

Resulting physical rows:

    row key | cells
    --------+-------------------------------------------------
    bob     | email: bob@example.com   | web: www.example.com
    alice   | email: alice@example.com | phone: 123-456-7890
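A rough Python model of the two physical rows above, assuming a row can be pictured as a list of (cell name, cell value) pairs sorted by name. The `physical_row` helper is hypothetical and says nothing about the on-disk format:

```python
def physical_row(**fields):
    """Model a physical row: only the supplied cells exist, sorted by cell name."""
    return sorted(fields.items())

rows = {
    "alice": physical_row(email="alice@example.com", phone="123-456-7890"),
    "bob": physical_row(web="www.example.com", email="bob@example.com"),
}

for row_key, cells in rows.items():
    print(row_key, cells)
```

Note that alice has no "web" cell and bob has no "phone" cell: a missing value is simply absent rather than stored as a null.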

TWO BILLION CELLS
Who needs 2 billion fields in a table?!?

    CREATE TABLE users (
        username TEXT,
        email TEXT,
        web TEXT,
        phone TEXT,
        address TEXT,
        spouse TEXT,
        hobbies TEXT,
        …
        hair_color TEXT,
        favorite_dish TEXT,
        pet_name TEXT,
        favorite_bands TEXT,
        …
        two_billionth_field TEXT,
        PRIMARY KEY (username)
    );

2 BILLION CELLS: WIDE ROWS
Use case: track logins of users
Data model:
- One (wide) physical row per user
- User name as row key
- Login details (time, IP address, user agent) in cells
- Cells ordered and grouped ("clustered") by login timestamp
- Cells are now tuple-value pairs
Advantage: range queries!

    row key | cells
    --------+------------------------------------------------------------------------
    alice   | [2014-01-29, ip_address]: 66.249.66.183  | [2014-01-29, agent]: Firefox
            | [2014-01-30, ip_address]: 208.115.113.86 | [2014-01-30, agent]: Firefox | …
    bob     | [2014-01-23, agent]: Chrome | [2014-01-23, ip_address]: 205.29.190.116

2 BILLION CELLS: WIDE ROWS
The same data model expressed in CQL:

    CREATE TABLE logins (
        username TEXT,
        timestamp TIMESTAMP,
        ip_address TEXT,
        agent TEXT,
        PRIMARY KEY (username, timestamp)
    );
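Why clustering makes range queries cheap inside one wide row can be sketched in Python. The model below is an assumption for illustration only: cells are (name, value) pairs whose name is a (login date, column) tuple, kept sorted, and `logins_between` is a made-up helper that finds a date range by binary search. The cell values are the ones shown for alice on the slide.

```python
import bisect
from datetime import date

# One wide row for "alice": cells keyed by (login date, column name), kept sorted.
alice_cells = sorted([
    ((date(2014, 1, 29), "ip_address"), "66.249.66.183"),
    ((date(2014, 1, 29), "agent"), "Firefox"),
    ((date(2014, 1, 30), "ip_address"), "208.115.113.86"),
    ((date(2014, 1, 30), "agent"), "Firefox"),
])

def logins_between(cells, start, end):
    """Return the cells whose login date lies in [start, end]."""
    days = [name[0] for name, _ in cells]
    lo = bisect.bisect_left(days, start)
    hi = bisect.bisect_right(days, end)
    # Sorted cells mean the result is one contiguous slice of the row.
    return cells[lo:hi]

for name, value in logins_between(alice_cells, date(2014, 1, 30), date(2014, 1, 31)):
    print(name, value)
```

This is roughly what a CQL query that fixes `username` and restricts `timestamp` to a range relies on: the row key selects the row, and the clustering order turns the time range into a contiguous read.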
