Apache Cassandra for Big Data Applications


  1. Apache Cassandra for Big Data Applications
      Christof Roduner, COO and co-founder, christof@scandit.com
      Java User Group Switzerland, January 7, 2014

  2. AGENDA
      - Cassandra origins and use
      - How we use Cassandra
      - Data model and query language
      - Cluster organization
      - Replication and consistency
      - Practical experience

  3. WHAT IS CASSANDRA? SQL

  4. WHAT IS CASSANDRA? Not only SQL

  5. ORIGINS
      - Dynamo: distributed storage
      - BigTable: data model

  6. USED BY…

  7. SCANDIT
      ETH Zurich startup company
      - Our mission: provide the best mobile barcode scanning platform
      - Customers: Bayer, Coop, CapitalOne, Saks 5th Avenue, NASA, …
      - Barcode scanning SDKs for:
          - iOS, Android
          - PhoneGap
          - Titanium
          - Xamarin

  8. SCANDIT

  9. THE SCANALYTICS PLATFORM
      Two purposes:
      1. External tool for app publishers
          - App-specific real-time usage statistics
          - Insights into user behavior
              - What do users scan? Product categories? Groceries, electronics, books, cosmetics, …?
              - Where do users scan? At home? Or while in a retail store?
          - Top products and brands
      2. Internal tool for our algorithms team
          - Improve our image processing algorithms
          - Detect devices and OS versions with camera issues
          - Monitor scan performance of our SDK

  10. BACKEND REQUIREMENTS
      - Analysis of scans
      - Accept and store high volumes of scans
      - Keep history of billions of camera parameters
      - Generate statistics over extended time periods
      - Provide reports to developers

  11. BACKEND DESIGN GOALS
      - Scalability
          - High-volume storage
          - High-volume throughput
          - Support large number of concurrent client requests (mobile devices)
      - Availability
      - Low maintenance
          - Even as our customer base grows
      - Multiple data centers

  12. WHY DID WE CHOOSE CASSANDRA?
      Partitioning (diagram: key ranges A..J, K..R, S..Z)

  13. WHY DID WE CHOOSE CASSANDRA?
      Simplicity (diagram labels: Master, Slave, Coordinator)

  14. MORE REASONS…
      - Looked very fast
          - Even when data is much larger than RAM
          - Performs well in write-heavy environment
      - Proven scalability
          - Without downtime
      - Tunable replication
      - Data model
      - YMMV…

  15. WHAT YOU HAVE TO GIVE UP
      - Joins
      - Referential integrity
      - Transactions
      - Expressive query language (nested queries, etc.)
      - Consistency (tunable, but not by default…)
      - Limited support for secondary indices

  16. HELLO CQL

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

  17. HELLO CQL

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

          INSERT INTO users (username, email, phone)
          VALUES ('alice', 'alice@example.com', '123-456-7890');

          INSERT INTO users (username, email, web)
          VALUES ('bob', 'bob@example.com', 'www.example.com');

  18. HELLO CQL

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

          INSERT INTO users (username, email, phone)
          VALUES ('alice', 'alice@example.com', '123-456-7890');

          INSERT INTO users (username, email, web)
          VALUES ('bob', 'bob@example.com', 'www.example.com');

          cqlsh:demo> SELECT * FROM users;

           username | email             | phone        | web
          ----------+-------------------+--------------+-----------------
           bob      | bob@example.com   | null         | www.example.com
           alice    | alice@example.com | 123-456-7890 | null

  19. FAMILIAR… BUT DIFFERENT

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

      - No auto increments (use natural key or UUID instead)
      - Primary key always mandatory
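Since there is no auto-increment, the client has to supply the key value itself when no natural key exists. The following is a minimal Python sketch of that idea, not the speaker's code: the helper name `new_user_key` is hypothetical, and choosing a random (version 4) UUID is an assumption, as the slide only says "UUID".

```python
import uuid


def new_user_key() -> uuid.UUID:
    """Return a random (version 4) UUID to use as a primary key value.

    Hypothetical helper: the slide only recommends "natural key or UUID".
    """
    return uuid.uuid4()


key_a = new_user_key()
key_b = new_user_key()
print(key_a)
print(key_b)
```

Because the key is generated on the client, no coordination between nodes is needed to hand out the next identifier.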

  20. FAMILIAR… BUT DIFFERENT

          cqlsh:demo> SELECT * FROM users;

           username | email             | phone        | web
          ----------+-------------------+--------------+-----------------
           bob      | bob@example.com   | null         | www.example.com
           alice    | alice@example.com | 123-456-7890 | null

  21. FAMILIAR… BUT DIFFERENT
      Sort order?

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

          cqlsh:demo> SELECT * FROM users;

           username | email             | phone        | web
          ----------+-------------------+--------------+-----------------
           bob      | bob@example.com   | null         | www.example.com
           alice    | alice@example.com | 123-456-7890 | null

  22. UNDER THE HOOD: CLUSTER ORGANIZATION
      - Node 1: token 0
      - Node 2: token 64
      - Node 3: token 128
      - Node 4: token 192
      - Range 1-64, stored on node 2
      - Range 65-128, stored on node 3

  23. STORING A ROW
      1. Calculate md5 hash for row key (the "username" field in the example above).
         Example: md5("alice") = 48
      2. Determine data range for hash.
         Example: 48 lies within range 1-64
      3. Store row on node responsible for range.
         Example: store on node 2
      Ring: node 1 (token 0), node 2 (token 64), node 3 (token 128), node 4 (token 192). Range 1-64 is stored on node 2, range 65-128 on node 3.
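The three steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the token space of 0..255 and the modulo folding of the md5 digest are assumptions made to match the slide's small numbers, and the names `RING`, `token_for` and `node_for_token` are hypothetical. The slide's md5("alice") = 48 is illustrative; the real digest will generally fold to a different token.

```python
import bisect
import hashlib

# Token -> node, as drawn on the slide.
RING = {0: "node 1", 64: "node 2", 128: "node 3", 192: "node 4"}
TOKENS = sorted(RING)
TOKEN_SPACE = 256  # assumption: toy token space matching the slide's numbers


def token_for(row_key: str) -> int:
    """Step 1: hash the row key with md5 and fold it into the toy token space."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def node_for_token(token: int) -> str:
    """Steps 2 and 3: find the range containing the token and return its node.

    A node with token T owns the range (previous token, T], so the owner is
    the first node whose token is >= the given token, wrapping around to the
    node with the lowest token.
    """
    index = bisect.bisect_left(TOKENS, token)
    if index == len(TOKENS):
        index = 0  # past the last token: wrap around the ring
    return RING[TOKENS[index]]


print(node_for_token(48))                  # the slide's example token
print(node_for_token(token_for("alice")))  # real md5, folded into 0..255
```

With the slide's example token 48, `node_for_token` selects node 2, in line with step 3 above.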

  24. IMPLICATIONS
      - Cluster automatically balanced
          - Load is shared equally between nodes
          - No hotspots
      - Scaling out?
          - Easy
          - Divide data ranges by adding more nodes
          - Cluster rebalances itself automatically
      - Range queries not possible
          - You can’t retrieve «all rows from A-C»
          - Rows are not stored in their «natural» order
          - Rows are stored in order of their md5 hashes
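The "divide data ranges by adding more nodes" point can be illustrated with the same toy ring. A minimal sketch, assuming a token space of 0..255 and a hypothetical new node placed at token 32; real token assignment and data streaming in Cassandra are more involved than this.

```python
import bisect


def owner(tokens, token):
    """Return the token of the node that owns `token` on a sorted ring."""
    index = bisect.bisect_left(tokens, token)
    return tokens[index % len(tokens)]  # wrap past the last token


before = [0, 64, 128, 192]
after = sorted(before + [32])  # hypothetical new node at token 32

# Tokens whose owning node changes when the new node joins.
moved = [t for t in range(256) if owner(before, t) != owner(after, t)]
print(moved[0], moved[-1], len(moved))  # prints "1 32 32"
```

Only the range 1-32 changes hands, and all of it goes to the new node; the ranges of the other nodes are untouched.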

  25. FAMILIAR… BUT DIFFERENT
      Sort order?

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

          cqlsh:demo> SELECT * FROM users;

           username | email             | phone        | web
          ----------+-------------------+--------------+-----------------
           bob      | bob@example.com   | null         | www.example.com
           alice    | alice@example.com | 123-456-7890 | null

  26. UNDER THE HOOD: PHYSICAL STORAGE
      - A physical row stores data in name-value pairs ("cells")
          - Cell name is CQL field name (e.g. "email")
          - Cell value is field data (e.g. "bob@example.com")
      - Cells in row are automatically sorted by name ("email" < "phone" < "web")
      - Cell names can be different in rows
      - Up to 2 billion cells per row

          INSERT INTO users (username, email, phone)
          VALUES ('alice', 'alice@example.com', '123-456-7890');

          INSERT INTO users (username, email, web)
          VALUES ('bob', 'bob@example.com', 'www.example.com');

      Resulting physical rows (row key, then cells):

          bob   | email: bob@example.com   | web: www.example.com
          alice | email: alice@example.com | phone: 123-456-7890
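The cell layout can be mimicked with plain Python data structures. This is only an in-memory sketch of the idea, not Cassandra's on-disk format; the helper name `physical_row` is hypothetical.

```python
def physical_row(**cells):
    """Return a row's cells as (name, value) pairs sorted by cell name."""
    return sorted(cells.items())


# Row key -> sorted cells. Each row carries only the cells it was given,
# so "alice" has no "web" cell and "bob" has no "phone" cell.
rows = {
    "alice": physical_row(phone="123-456-7890", email="alice@example.com"),
    "bob": physical_row(web="www.example.com", email="bob@example.com"),
}

for row_key, cells in rows.items():
    print(row_key, cells)
```

The `null` values in the cqlsh output correspond to cells that are simply absent from the physical row.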

  27. FAMILIAR… BUT DIFFERENT
      Sort order?

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              PRIMARY KEY (username)
          );

          cqlsh:demo> SELECT * FROM users;

           username | email             | phone        | web
          ----------+-------------------+--------------+-----------------
           bob      | bob@example.com   | null         | www.example.com
           alice    | alice@example.com | 123-456-7890 | null

  28. TWO BILLION CELLS
      Who needs 2 billion fields in a table?!?

          CREATE TABLE users (
              username TEXT,
              email TEXT,
              web TEXT,
              phone TEXT,
              address TEXT,
              spouse TEXT,
              hobbies TEXT,
              …
              hair_color TEXT,
              favorite_dish TEXT,
              pet_name TEXT,
              favorite_bands TEXT,
              …
              two_billionth_field TEXT,
              PRIMARY KEY (username)
          );

  29. 2 BILLION CELLS: WIDE ROWS
      - Use case: track logins of users
      - Data model:
          - One (wide) physical row per user
          - User name as row key
          - Login details (time, IP address, user agent) in cells
          - Cells ordered and grouped ("clustered") by login timestamp
          - Cells are now tuple-value pairs
      - Advantage: range queries!

      Physical rows (row key, then cells):

          alice | [2014-01-29, ip_address]: 66.249.66.183 | [2014-01-29, agent]: Firefox | [2014-01-30, ip_address]: 208.115.113.86 | [2014-01-30, agent]: Firefox | …
          bob   | [2014-01-23, agent]: Chrome | [2014-01-23, ip_address]: 205.29.190.116

  30. 2 BILLION CELLS: WIDE ROWS
      - Use case: track logins of users
      - Data model:
          - One (wide) physical row per user
          - User name as row key
          - Login details (time, IP address, user agent) in cells
          - Cells ordered and grouped ("clustered") by login timestamp
          - Cells are now tuple-value pairs
      - Advantage: range queries!

          CREATE TABLE logins (
              username TEXT,
              timestamp TIMESTAMP,
              ip_address TEXT,
              agent TEXT,
              PRIMARY KEY (username, timestamp)
          );

      Physical rows (row key, then cells):

          alice | [2014-01-29, ip_address]: 66.249.66.183 | [2014-01-29, agent]: Firefox | [2014-01-30, ip_address]: 208.115.113.86 | [2014-01-30, agent]: Firefox | …
          bob   | [2014-01-23, agent]: Chrome | [2014-01-23, ip_address]: 205.29.190.116

  31. QUERYING THE LOGINS

          INSERT INTO logins (username, timestamp, ip_address, agent)
          VALUES ('alice', '2014-01-29 16:22:30 +0100', '208.115.113.86', 'Firefox');

          cqlsh:demo> SELECT * FROM logins;

           username | timestamp                | agent   | ip_address
          ----------+--------------------------+---------+-----------------
           bob      | 2014-01-23 01:12:49+0100 | Chrome  | 205.29.190.116
           alice    | 2014-01-29 16:22:30+0100 | Firefox | 208.115.113.86
           alice    | 2014-01-30 07:48:03+0100 | Firefox | 66.249.66.183
           alice    | 2014-01-30 18:06:55+0100 | Firefox | 208.115.111.70
           alice    | 2014-01-31 12:37:26+0100 | Firefox | 66.249.66.183
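Why clustering by timestamp makes range queries cheap can be sketched with a sorted list and binary search. This is an in-memory illustration under stated assumptions, not Cassandra's storage engine: `alice_row` and `slice_row` are hypothetical names, the cell values are a subset of the query output above, and time zones are dropped for brevity.

```python
import bisect

# One wide row for "alice": cells keyed by (timestamp, field) tuples and
# kept sorted, so all cells of one login sit next to each other.
alice_row = sorted([
    (("2014-01-29 16:22:30", "agent"), "Firefox"),
    (("2014-01-29 16:22:30", "ip_address"), "208.115.113.86"),
    (("2014-01-30 07:48:03", "agent"), "Firefox"),
    (("2014-01-30 07:48:03", "ip_address"), "66.249.66.183"),
    (("2014-01-31 12:37:26", "agent"), "Firefox"),
    (("2014-01-31 12:37:26", "ip_address"), "66.249.66.183"),
])


def slice_row(row, start, end):
    """Return the cells whose timestamp t satisfies start <= t < end.

    Timestamps are ISO-style strings, so string order equals time order.
    Two binary searches locate the slice; no other cell is inspected.
    """
    names = [name for name, _ in row]
    lo = bisect.bisect_left(names, (start,))
    hi = bisect.bisect_left(names, (end,))
    return row[lo:hi]


for name, value in slice_row(alice_row, "2014-01-30", "2014-01-31"):
    print(name, value)
```

The slice is one contiguous run of cells inside a single row, which is why a time-range query for one user is possible even though range queries across row keys are not.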
