Cassandra, a presentation by Jonathan Ellis


  1. Cassandra Jonathan Ellis

  2. Motivation ● Scaling reads to a relational database is hard ● Scaling writes to a relational database is virtually impossible ● … and when you do, it usually isn't relational anymore

  3. The new face of data ● Scale out, not up ● Online load balancing, cluster growth ● Flexible schema ● Key-oriented queries ● CAP-aware

  4. CAP theorem ● Pick two of Consistency, Availability, Partition tolerance

  5. Two famous papers ● Bigtable: A Distributed Storage System for Structured Data, 2006 ● Dynamo: Amazon's Highly Available Key-value Store, 2007

  6. Two approaches ● Bigtable: “How can we build a distributed db on top of GFS?” ● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

  7. 10,000 ft summary ● Dynamo partitioning and replication ● Log-structured ColumnFamily data model similar to Bigtable's

  8. Cassandra highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No single point of failure (SPF)

  9. Dynamo architecture & Lookup

  10. Architecture details ● O(1) node lookup ● Explicit replication ● Eventually consistent
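
The "O(1) node lookup" on slide 10 can be illustrated with a consistent-hashing ring, the partitioning scheme described in the Dynamo paper. This is a minimal sketch, assuming every node holds the full token map so that any node can locate a key's replicas locally and forward the request in a single hop. The `Ring` class, the `token` helper, the node names, and the choice of MD5 are illustrative assumptions, not Cassandra's actual code.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring; each node is assumed to know the whole ring."""

    def __init__(self, nodes, replicas=3):
        self.replicas = min(replicas, len(nodes))
        # Sorted (token, node) pairs describe who owns which arc of the ring.
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, key: str):
        """Return the N nodes responsible for a key, walking clockwise."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replicas)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.replicas_for("keyA")
print(owners)  # three distinct nodes; which ones depends on the hash values
```

The local search is a binary search over the token list; the "O(1)" here refers to network hops, since no intermediate routing nodes are consulted.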

  11. Architecture layers ● Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication ● Commit log, Memtable, SSTable, Indexes, Compaction ● Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools

  12. Writes ● Any node ● Partitioner ● Commitlog, memtable ● SSTable ● Compaction ● Wait for W responses

  13. Memtable / SSTable (diagram labels: Disk, Commit log)
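
The write path on slides 12 and 13 (commit log, then memtable, then SSTable) can be sketched as follows. This is a toy, single-node model under stated assumptions: the `Store` class, its `memtable_limit` threshold, and the in-memory list standing in for the on-disk commit log are all hypothetical names chosen for illustration.

```python
class Store:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for an append-only file on disk
        self.memtable = {}     # in-memory map: key -> {column: (value, timestamp)}
        self.sstables = []     # immutable, key-sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Sequential append for durability; nothing is read back.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the in-memory table.
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key; it is never modified afterwards.
        sstable = tuple(sorted(self.memtable.items(), key=lambda item: item[0]))
        self.sstables.append(sstable)
        self.memtable = {}
        # The flushed data no longer needs the log for recovery.
        self.commit_log.clear()


store = Store(memtable_limit=2)
store.write("keyB", "column1", b"x", 1)
store.write("keyA", "column1", b"y", 2)   # second row triggers a flush
store.write("keyC", "column7", b"z", 3)   # lands in a fresh memtable
print([key for key, _ in store.sstables[0]])
```

Because a write is only an append plus a memory update, it involves no reads and no seeks, which matches the write properties listed on slide 18.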

  14. SSTable format ● Key / data

  15. SSTable indexes ● Bloom filter ● Key ● Column (similar to Hadoop MapFile / TFile)
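
A Bloom filter lets a read skip an SSTable that definitely does not contain a key. The sketch below is a generic Bloom filter rather than Cassandra's implementation; the `BloomFilter` class, the bit-array size, the number of hash functions, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


empty = BloomFilter()
sstable_filter = BloomFilter()
for key in ("keyA", "keyC"):
    sstable_filter.add(key)

print(sstable_filter.might_contain("keyA"))  # True: added keys are always found
print(empty.might_contain("keyA"))           # False: no bits are set
```

A `False` answer lets the read path avoid touching that SSTable at all; a `True` answer still requires checking the key index.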

  16. Compaction ● Merge keys ● Combine columns ● Discard tombstones

  17. Remove ● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction ● Read repair complicates things a little ● Eventual consistency complicates things more ● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
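
Compaction and tombstone handling (slides 16 and 17) can be sketched together. This is a simplified model, assuming each SSTable is a dict of rows, that the column version with the newest timestamp wins, and that a tombstone is dropped only once it is older than a grace period. The `compact` function, the `TOMBSTONE` sentinel, and the `gc_grace` parameter are hypothetical names.

```python
TOMBSTONE = object()  # deletion marker stored in place of a value


def compact(sstables, now, gc_grace):
    """Merge SSTables given as dicts: key -> {column: (value, timestamp)}."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})           # merge keys
            for name, (value, ts) in columns.items():  # combine columns
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)            # newest timestamp wins

    result = {}
    for key in sorted(merged):
        # Discard tombstones only after the grace period has passed.
        live = {
            name: (value, ts)
            for name, (value, ts) in merged[key].items()
            if not (value is TOMBSTONE and now - ts > gc_grace)
        }
        if live:
            result[key] = live
    return result


old = {"keyA": {"column1": ("v1", 1), "column2": ("v2", 1)}}
new = {"keyA": {"column1": (TOMBSTONE, 5)}, "keyC": {"column7": ("v7", 6)}}

recent = compact([old, new], now=10, gc_grace=10)   # tombstone still kept
later = compact([old, new], now=100, gc_grace=10)   # tombstone collected
print(sorted(later["keyA"]))  # ['column2']
```

The tombstone first suppresses the older `column1` value during the merge and is only then eligible for removal, so the deleted data does not reappear.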

  18. Cassandra write properties ● No reads ● No seeks ● Fast ● Atomic within ColumnFamily ● Always writable

  19. Read path ● Any node ● Partitioner ● Wait for R responses ● Wait for N – R responses in the background and perform read repair
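
The read path on slide 19 can be sketched as a foreground read from R replicas followed by a repair pass over all N. This is a minimal sketch, assuming each replica is a dict mapping a key to a `(value, timestamp)` pair and that the newest timestamp wins; the `read` and `newest` functions are hypothetical, and the "background" step runs inline here for simplicity.

```python
def newest(versions):
    """Pick the (value, timestamp) pair with the highest timestamp, if any."""
    present = [version for version in versions if version is not None]
    return max(present, key=lambda version: version[1]) if present else None


def read(replicas, key, r):
    # Foreground: answer using only the first R replicas to respond.
    answer = newest(replica.get(key) for replica in replicas[:r])

    # Background (inline in this sketch): compare all N replicas and
    # push the newest version to any replica that is stale or missing it.
    latest = newest(replica.get(key) for replica in replicas)
    if latest is not None:
        for replica in replicas:
            if replica.get(key) != latest:
                replica[key] = latest
    return answer


replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
first = read(replicas, "k", r=1)    # R=1 may return a stale value
second = read(replicas, "k", r=1)   # read repair has since fixed the replicas
print(first, second)  # ('old', 1) ('new', 2)
```

With R=1 the first read returns stale data, which is the latency-for-consistency tradeoff the deck describes; the repair brings the replicas back into agreement for later reads.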

  20. Cassandra read properties ● Read multiple SSTables ● Slower than writes (but still fast) ● Seeks can be mitigated with more RAM ● Scales to billions of rows

  21. Consistency in a BASE world ● If W + R > N, you will have consistency ● W=1, R=N ● W=N, R=1 ● W=Q, R=Q where Q = N / 2 + 1
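
The rule W + R > N works because any set of W replicas and any set of R replicas must then share at least one node, so every read touches a replica that saw the latest write. The sketch below checks this by brute force; it reads `N / 2 + 1` as integer division, and the function names are illustrative.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1, with integer division."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """The condition from the slide: W + R > N."""
    return w + r > n


def always_overlap(w: int, r: int, n: int) -> bool:
    """Brute force: does every W-subset intersect every R-subset of N nodes?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


n = 3
for w, r in [(1, n), (n, 1), (quorum(n), quorum(n)), (1, 1)]:
    print(w, r, is_consistent(w, r, n), always_overlap(w, r, n))
# 1 3 True True
# 3 1 True True
# 2 2 True True
# 1 1 False False
```

The last row shows the case the slide warns about: with W=1 and R=1 on three replicas, a read can miss the only replica that holds the newest write.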

  22. vs MySQL with 50 GB of data ● MySQL: ~300 ms write, ~350 ms read ● Cassandra: ~0.12 ms write, ~15 ms read ● Achtung!

  23. Data model ● Rows, ColumnFamilies, Columns

  24. ColumnFamilies ● keyA: column1, column2, column3 ● keyC: column1, column7, column11 ● Column: byte[] name, byte[] value, i64 timestamp

  25. Super ColumnFamilies ● keyF: Super1, Super2, each holding columns ● keyJ: Super1, Super5, each holding columns

  26. Types of queries ● Single column ● Slice ● Set of names / range of names ● Simple slice -> columns ● Super slice -> supercolumns ● Key range
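
The data model and the two slice forms (a set of names, or a range of names) can be sketched with plain Python structures. This is an illustration only, assuming a row is a list of columns kept sorted by name in byte order and that range bounds are inclusive; `Column`, `make_row`, `slice_by_names`, and `slice_by_range` are hypothetical names, not the Thrift API.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Column(NamedTuple):
    """Mirrors the slide's column: name, value, timestamp."""
    name: bytes
    value: bytes
    timestamp: int


def make_row(columns):
    """A row is modelled as its columns sorted by name (byte order)."""
    return sorted(columns, key=lambda column: column.name)


def slice_by_names(row, names):
    """Slice by an explicit set of column names."""
    wanted = set(names)
    return [column for column in row if column.name in wanted]


def slice_by_range(row, start, finish):
    """Slice by a name range, inclusive at both ends (an assumption here)."""
    names = [column.name for column in row]
    return row[bisect_left(names, start):bisect_right(names, finish)]


key_a = make_row([
    Column(b"column3", b"c", 1),
    Column(b"column1", b"a", 1),
    Column(b"column2", b"b", 1),
])

print([c.name for c in slice_by_names(key_a, [b"column1", b"column3"])])
# [b'column1', b'column3']
print([c.name for c in slice_by_range(key_a, b"column2", b"column3")])
# [b'column2', b'column3']
```

Keeping columns sorted is what makes a range slice a contiguous read rather than a scan of the whole row.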

  27. Range queries ● Add “master” server ● Implement on top of K/V ● Order-preserving partitioning

  28. Modification ● Insert / update ● Remove ● Single column or batch ● Specify W, number of nodes to wait for

  29. Thrift
          struct Column {
            1: binary name,
            2: binary value,
            3: i64 timestamp,
          }
          struct SuperColumn {
            1: binary name,
            2: list<Column> columns,
          }
          Column get_column(table, key, column_path, block_for=1)
          list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
          void insert(table, key, column_path, value, timestamp, block_for=0)
          void remove(tablename, key, column_path_or_parent, timestamp)

  30. Honestly, Thrift kinda sucks

  31. Example: a multiuser blog ● Two queries: (1) the most recent posts belonging to a given blog, in reverse chronological order; (2) a single post and its comments, in chronological order

  32. First try ● One super ColumnFamily: each row is a blog (JBE, Evan), each supercolumn a post (JBE: "Cassandra is teh awesome", "BASE FTW"; Evan: "I like kittens", "And Ruby"), each column a comment ● <ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

  33. Second try ● Blog: each row is a blog (JBE, Evan), each column a post (JBE: "Cassandra is teh awesome", "BASE FTW"; Evan: "I like kittens", "And Ruby") ● Comment: each row is a post, each column a comment ● <ColumnFamily CompareWith="UUIDType" Name="Blog"/> ● <ColumnFamily CompareWith="UUIDType" Name="Comment"/>

  34. Roadmap

  35. Cassandra 0.3 ● Remove support ● OPP / Range queries ● Test suite ● Workarounds for JDK bugs ● Rudimentary multi-datacenter support

  36. Cassandra 0.4 ● Branched May 18 ● Data file format change to support billions of rows per node instead of millions ● API changes (no more colon delimiters) ● Multi-table (keyspace) support ● LRU key cache ● fsync support ● Bootstrap ● Web interface

  37. Cassandra 0.5 ● Bootstrap ● Load balancing ● Closely related to “bootstrap done right” ● Merkle tree repair ● Millions of columns per row ● This will require another data format change ● Multiget ● Callout support

  38. Users ● Production: Facebook, RocketFuel ● Production RSN: Digg, Rackspace ● No date yet: IBM Research, Twitter ● Evaluating: 50+ in #cassandra on freenode

  39. More ● Eventual consistency: http://www.allthingsdistributed.com/2008/1 ● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059 ● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAn ● #cassandra on irc.freenode.net

  40. Cassandra
