Cassandra Jonathan Ellis
Motivation ● Scaling reads to a relational database is hard ● Scaling writes to a relational database is virtually impossible ● … and when you do, it usually isn't relational anymore
The new face of data ● Scale out, not up ● Online load balancing, cluster growth ● Flexible schema ● Key-oriented queries ● CAP-aware
CAP theorem ● Pick two of Consistency, Availability, Partition tolerance
Two famous papers ● Bigtable: A Distributed Storage System for Structured Data, 2006 ● Dynamo: Amazon's Highly Available Key-value Store, 2007
Two approaches ● Bigtable: “How can we build a distributed db on top of GFS?” ● Dynamo: “How can we build a distributed hash table appropriate for the data center?”
10,000 ft summary ● Dynamo partitioning and replication ● Log-structured ColumnFamily data model similar to Bigtable's
Cassandra highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No single point of failure
Dynamo architecture & Lookup
Architecture details ● O(1) node lookup ● Explicit replication ● Eventually consistent
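The Dynamo-style partitioning and explicit replication named above can be sketched as a consistent-hash ring. This is a minimal illustration, not Cassandra's implementation: the `Ring` class, the `token` helper, the use of MD5, and the "next N nodes clockwise" placement are all assumptions made for the example. The "O(1)" presumably counts network hops: because every node knows the whole ring, it can compute the owners of a key locally instead of routing through other nodes.

```python
import bisect
import hashlib


def token(name: str) -> int:
    """Map a key or node name to a position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring: every node holds the full token list."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((token(n), n) for n in nodes)
        self._positions = [t for t, _ in self.tokens]

    def nodes_for(self, key: str):
        """Return the replicas for key: the first node at or after the
        key's token, then its successors clockwise around the ring."""
        start = bisect.bisect_left(self._positions, token(key)) % len(self.tokens)
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.nodes_for("keyA")  # three distinct nodes, no network hop needed
```

Adding a node only moves the keys between it and its predecessor on the ring, which is what makes incremental cluster growth cheap.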
Architecture layers
Messaging service | Commit log | Tombstones
Gossip | Memtable | Hinted handoff
Failure detection | SSTable | Read repair
Cluster state | Indexes | Bootstrap
Partitioner | Compaction | Monitoring
Replication | | Admin tools
Writes ● Any node ● Partitioner ● Commitlog, memtable ● SSTable ● Compaction ● Wait for W responses
Memtable / SSTable [diagram labels: Disk, Commit log]
SSTable format ● Key / data
SSTable Indexes ● Bloom filter ● Key ● Column (Similar to Hadoop MapFile / TFile)
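The write path (commit log, then memtable, then SSTable) can be sketched in a few lines. This is a toy model under stated assumptions: the `Node` class, the `memtable_limit` threshold, and clearing the log on flush are simplifications invented for the example, and the "disk" structures are plain Python lists.

```python
class Node:
    """Toy single-node write path; all names here are hypothetical."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # stands in for an append-only file on disk
        self.memtable = {}    # key -> {column name: (value, timestamp)}
        self.sstables = []    # immutable sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the in-memory table; no disk read is needed.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write rows in key order as one immutable SSTable.
        sstable = tuple((k, dict(cols)) for k, cols in sorted(self.memtable.items()))
        self.sstables.append(sstable)
        self.memtable = {}
        # Simplification: the flushed data no longer needs the log.
        self.commit_log.clear()


node = Node(memtable_limit=2)
node.write("keyB", "column1", "v1", timestamp=1)
node.write("keyA", "column1", "v2", timestamp=2)  # second row triggers a flush
```

Every step is an append or an in-memory update, which is why writes need no reads and no seeks.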
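A Bloom filter lets a read skip any SSTable that definitely does not contain the key. The sketch below shows the general technique only; the `BloomFilter` class, its sizes, and the SHA-256-based hashing are assumptions for illustration, not the on-disk format.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent", so the SSTable can be skipped.
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter()
bf.add("keyA")
bf.might_contain("keyA")  # always True for a key that was added
```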
Compaction ● Merge keys ● Combine columns ● Discard tombstones
Remove ● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction ● Read repair complicates things a little ● Eventually consistent complicates things more ● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
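Compaction and tombstone GC fit together as sketched below: merge rows by key, keep the newest version of each column, and drop a tombstone only once it is older than the configured delay. The `compact` function, the `TOMBSTONE` sentinel, and the dict-based SSTable representation are hypothetical, chosen to keep the example short.

```python
TOMBSTONE = object()  # hypothetical deletion marker


def compact(sstables, now, gc_grace):
    """Merge SSTables given oldest first.

    Each SSTable is a dict: key -> {column name: (value, timestamp)}.
    """
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})  # merge keys
            for name, (value, ts) in columns.items():
                # Combine columns: the newest timestamp wins.
                if name not in row or ts >= row[name][1]:
                    row[name] = (value, ts)
    result = {}
    for key in sorted(merged):
        # Discard tombstones, but only after the GC delay has passed.
        kept = {
            name: (value, ts)
            for name, (value, ts) in merged[key].items()
            if not (value is TOMBSTONE and now - ts > gc_grace)
        }
        if kept:
            result[key] = kept
    return result


old = {"keyA": {"column1": ("v1", 1), "column2": ("x", 1)}}
new = {"keyA": {"column1": (TOMBSTONE, 5)}, "keyC": {"column7": ("y", 2)}}
compact([old, new], now=100, gc_grace=10)  # column1 and its tombstone are gone
```

Until the delay expires the tombstone must survive compaction, so that a replica that missed the delete can still learn of it.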
Cassandra write properties ● No reads ● No seeks ● Fast ● Atomic within ColumnFamily ● Always writable
Read path ● Any node ● Partitioner ● Wait for R responses ● Wait for N – R responses in the background and perform read repair
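Read repair can be sketched as follows: answer from the first R replicas, then compare all N versions and push the newest one to any replica that is stale. The `read` and `newest` functions and the dict-per-replica model are assumptions for illustration, and the "background" step runs inline here for simplicity.

```python
def newest(versions):
    """Pick the (value, timestamp) pair with the highest timestamp, ignoring misses."""
    present = [v for v in versions if v is not None]
    return max(present, key=lambda v: v[1]) if present else None


def read(replicas, key, r):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    # Foreground: answer as soon as R replicas have responded.
    answer = newest([rep.get(key) for rep in replicas[:r]])
    # Background (inline here): check the remaining N - R replicas too
    # and repair every replica that holds an older version or none.
    latest = newest([rep.get(key) for rep in replicas])
    if latest is not None:
        for rep in replicas:
            current = rep.get(key)
            if current is None or current[1] < latest[1]:
                rep[key] = latest
    return None if answer is None else answer[0]


a, b, c = {"k": ("old", 1)}, {"k": ("new", 2)}, {}
first = read([a, b, c], "k", r=1)   # may be stale when R = 1
second = read([a, b, c], "k", r=1)  # repaired by the previous read
```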
Cassandra read properties ● Read multiple SSTables ● Slower than writes (but still fast) ● Seeks can be mitigated with more RAM ● Scales to billions of rows
Consistency in a BASE world ● If W + R > N, you will have consistency ● W=1, R=N ● W=N, R=1 ● W=Q, R=Q where Q = N / 2 + 1
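The rule works because W + R > N forces every read set to overlap every write set in at least one replica, so at least one of the R responses carries the latest write. A small sketch of the arithmetic, with hypothetical helper names and integer division assumed for Q:

```python
def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def overlaps(w: int, r: int, n: int) -> bool:
    """True when any R replicas must include at least one of the W written ones."""
    return w + r > n


n = 3
q = quorum(n)  # 2
settings = {
    "W=1, R=N": overlaps(1, n, n),
    "W=N, R=1": overlaps(n, 1, n),
    "W=Q, R=Q": overlaps(q, q, n),
    "W=1, R=1": overlaps(1, 1, n),  # fastest, but only eventually consistent
}
```

With N = 3, the first three settings satisfy the rule and W=1, R=1 does not.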
vs MySQL with 50GB of data ● MySQL: ~300ms write, ~350ms read ● Cassandra: ~0.12ms write, ~15ms read ● Achtung!
Data model ● Rows, ColumnFamilies, Columns
ColumnFamilies
keyA: column1, column2, column3
keyC: column1, column7, column11
Column: byte[] name, byte[] value, i64 timestamp
Super ColumnFamilies
keyF: Super1 (column, column, column), Super2 (column, column, column)
keyJ: Super1 (column, column, column), Super5 (column, column, column)
Types of queries ● Single column ● Slice ● Set of names / range of names ● Simple slice -> columns ● Super slice -> supercolumns ● Key range
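The model above maps naturally onto nested dictionaries: a ColumnFamily is a map from row key to columns, and each column carries a name, a value, and a timestamp. In this sketch the `Column` dataclass mirrors the three fields on the slide; the `insert` helper and its "highest timestamp wins" rule are an illustrative assumption about how the timestamp is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # i64 on the slide


def insert(column_family, key, column):
    """Store column under key unless a newer version is already present."""
    row = column_family.setdefault(key, {})
    existing = row.get(column.name)
    if existing is None or column.timestamp > existing.timestamp:
        row[column.name] = column


cf = {}  # row key -> {column name: Column}
insert(cf, b"keyA", Column(b"column1", b"v1", timestamp=1))
insert(cf, b"keyA", Column(b"column1", b"stale", timestamp=0))  # ignored
insert(cf, b"keyC", Column(b"column7", b"v7", timestamp=1))
```

Rows need not share column names, which is where the flexible schema comes from.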
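The two slice forms, by a set of names and by a range of names, can be sketched over a single row. The function names, the inclusive range bounds, and the `count` limit are assumptions for the example and do not reproduce the real API.

```python
def slice_by_names(row, names):
    """Return (name, value) pairs for the requested column names that exist."""
    return [(n, row[n]) for n in names if n in row]


def slice_by_range(row, start, finish, count=100):
    """Return up to count columns with start <= name <= finish, in name order."""
    selected = sorted(n for n in row if start <= n <= finish)
    return [(n, row[n]) for n in selected[:count]]


row = {b"column1": b"a", b"column3": b"c", b"column7": b"g"}
by_name = slice_by_names(row, [b"column7", b"column9"])
by_range = slice_by_range(row, b"column1", b"column5")
```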
Range queries ● Add “master” server ● Implement on top of K/V ● Order-preserving partitioning
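Order-preserving partitioning makes a key range a contiguous run in sorted order, so a range query becomes two binary searches. A hash partitioner scatters adjacent keys and cannot do this. The sketch below borrows the parameter names of `get_key_range` from the Thrift slide, but the body is a hypothetical single-node illustration.

```python
import bisect


def get_key_range(sorted_keys, start_with="", stop_at="", max_results=100):
    """Return keys in [start_with, stop_at]; an empty stop_at means no upper bound."""
    lo = bisect.bisect_left(sorted_keys, start_with)
    hi = bisect.bisect_right(sorted_keys, stop_at) if stop_at else len(sorted_keys)
    return sorted_keys[lo:hi][:max_results]


keys = ["apple", "banana", "cherry", "date"]
middle = get_key_range(keys, start_with="b", stop_at="cz")
first_two = get_key_range(keys, max_results=2)
```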
Modification ● Insert / update ● Remove ● Single column or batch ● Specify W, number of nodes to wait for
Thrift
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}
Column get_column(table, key, column_path, block_for=1)
list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
void insert(table, key, column_path, value, timestamp, block_for=0)
void remove(tablename, key, column_path_or_parent, timestamp)
Honestly, Thrift kinda sucks
Example: a multiuser blog
Two queries
● the most recent posts belonging to a given blog, in reverse chronological order
● a single post and its comments, in chronological order
First try
JBE blog: post "Cassandra is teh awesome" (comment, comment), post "BASE FTW" (comment, comment)
Evan blog: post "I like kittens" (comment, comment), post "And Ruby" (comment, comment)
<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>
Second try
Blog: JBE blog -> "Cassandra is teh awesome", "BASE FTW"; Evan blog -> "I like kittens", "And Ruby"
Comment: "Cassandra is teh awesome" -> comment, comment; "BASE FTW" -> comment, comment; "I like kittens" -> comment, comment; "And Ruby" -> comment, comment
<ColumnFamily CompareWith="UUIDType" Name="Blog"/>
<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
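The two-ColumnFamily layout answers each query with a single row read. In this sketch plain integers stand in for the time-ordered UUID column names, and the post ids, function names, and dict layout are all hypothetical.

```python
# Blog: row key = blog name, column name = time id, value = post id.
blog = {
    "JBE": {1: "post-1", 4: "post-2"},
    "Evan": {2: "post-3", 3: "post-4"},
}
# Comment: row key = post id, column name = time id, value = comment text.
comment = {
    "post-1": {5: "first!", 6: "nice"},
    "post-2": {7: "agreed"},
}


def recent_posts(blog_cf, blog_name, count=10):
    """Most recent posts of one blog, in reverse chronological order."""
    row = blog_cf.get(blog_name, {})
    return [row[t] for t in sorted(row, reverse=True)[:count]]


def comments_for(comment_cf, post_id):
    """Comments on a single post, in chronological order."""
    row = comment_cf.get(post_id, {})
    return [row[t] for t in sorted(row)]


latest = recent_posts(blog, "JBE")
thread = comments_for(comment, "post-1")
```

Each query touches exactly one row in one ColumnFamily, which suits key-oriented access.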
Roadmap
Cassandra 0.3 ● Remove support ● OPP / Range queries ● Test suite ● Workarounds for JDK bugs ● Rudimentary multi-datacenter support
Cassandra 0.4 ● Branched May 18 ● Data file format change to support billions of rows per node instead of millions ● API changes (no more colon delimiters) ● Multi-table (keyspace) support ● LRU key cache ● fsync support ● Bootstrap ● Web interface
Cassandra 0.5 ● Bootstrap ● Load balancing ● Closely related to “bootstrap done right” ● Merkle tree repair ● Millions of columns per row ● This will require another data format change ● Multiget ● Callout support
Users ● Production: Facebook, RocketFuel ● Production RSN: Digg, Rackspace ● No date yet: IBM Research, Twitter ● Evaluating: 50+ in #cassandra on freenode
More ● Eventual consistency: http://www.allthingsdistributed.com/2008/1 ● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059 ● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAn ● #cassandra on irc.freenode.net
Cassandra