BEN COVERSTON DSE ARCHITECT — DATASTAX INC. @BCOVERSTON NOSQL UNDER THE HOOD: THE ANATOMY AND EVOLUTION OF CASSANDRA
Evolution: THE GRADUAL DEVELOPMENT OF SOMETHING, ESPECIALLY FROM A SIMPLE TO A MORE COMPLEX FORM.
Anatomy: A STUDY OF THE STRUCTURE OR INTERNAL WORKINGS OF SOMETHING.
Ex Nihilo: A LATIN PHRASE MEANING "OUT OF NOTHING".
CASSANDRA WAS NOT CREATED EX NIHILO
WITH SO MANY OPTIONS, WHY CASSANDRA?
▸ Bigtable could Scale Out, Very Well
  ▸ And provide a flexible data model
  ▸ But it wasn’t great at High Availability
▸ Dynamo Could also Scale Out
  ▸ But High Availability was its biggest strength
  ▸ Extreme resilience under exceptionally hostile conditions
  ▸ But… mostly a key-value store
Niche: THE POSITION OR FUNCTION OF AN ORGANISM IN A COMMUNITY OF PLANTS AND ANIMALS.
SO DYNAMO AND BIGTABLE HAD A BABY? WHY?
▸ Originally to fill a Niche
  ▸ Facebook Inbox Search
▸ When the project was complete, Facebook open sourced it.
  ▸ July 2008, on Google Code
"I NEED A DISTRIBUTED DATABASE. A REAL DISTRIBUTED DATABASE … SO, I'M WORKING ON CASSANDRA." (Jonathan Ellis, March 27th 2009)
THE SPARK OF LIFE
▸ Rackspace needed a metadata store for CloudFiles
▸ An engineer needed a real distributed database that was both Partition Tolerant and Highly Available
▸ Cassandra entered the Apache Incubator in March 2009
FUNDAMENTAL ANATOMICAL STRENGTHS (CIRCA 2009)
▸ P2P -> No Single Point of Failure
▸ Easy to understand replication model
▸ Focus on Availability and Partition Tolerance
▸ API that allows for more than just Java clients (Thrift)
▸ Raw access to the file system on the individual machines (not abstracted away into HDFS)
▸ Good performance for mixed workloads and larger-than-memory workloads.
"WHAT’S DANGEROUS IS NOT TO EVOLVE" (Jeff Bezos)
MAJOR ANATOMICAL STRUCTURES
▸ Log Structured Storage
▸ Data Versioning
▸ Replication (on Write)
▸ Anti-Entropy Repair
▸ Read Repair
▸ Consistent Hashing
EVOLUTION OF LOG STRUCTURED STORAGE
BTREES / ISAM
▸ Pros
  ▸ Querying is often fast and easy
  ▸ Support for referential integrity
  ▸ You need a master in a distributed system: order matters
▸ Cons
  ▸ Reads and writes are tightly coupled
  ▸ Writes have to seek before an update happens (read before write, RbW)
  ▸ Index Structures (if used) are Expensive
  ▸ Indexes and most of your data need to be in your working set for good performance.
[Diagram: in-place update. Seek for Alice’s record (Alice Liddell), then update the record in place (Alice Carroll). Bob Parr’s record is unchanged.]
WHY LOG STRUCTURED STORAGE
▸ Pros
  ▸ Writes are fast; they never wait for locks
  ▸ No locking means you don’t need a master
▸ Cons
  ▸ Reads may require a merge step
  ▸ Background Compaction
[Diagram: log structured write. The records Alice Liddell and Bob Parr each have a timestamp. A new record, Alice Carroll, is written instead of updating in place, and the records are merged.]
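The append-then-merge idea from the two slides above can be sketched in a few lines of Python. This is a toy illustration, not Cassandra's storage engine: the `LogStore` class, its method names, and the small integer timestamps are assumptions made up for the example.

```python
class LogStore:
    """Toy append-only store: writes never seek, reads merge versions."""

    def __init__(self):
        self.log = []  # (key, value, timestamp) tuples in arrival order

    def write(self, key, value, timestamp):
        # Appending is the whole write path: no seek, no read before write.
        self.log.append((key, value, timestamp))

    def read(self, key):
        # The merge step: look at every version and keep the newest one.
        versions = [(ts, value) for k, value, ts in self.log if k == key]
        return max(versions)[1] if versions else None

    def compact(self):
        # Background compaction: rewrite the log with one record per key.
        newest = {}
        for key, value, ts in self.log:
            if key not in newest or ts > newest[key][1]:
                newest[key] = (value, ts)
        self.log = [(k, v, ts) for k, (v, ts) in newest.items()]


store = LogStore()
store.write("Alice", "Liddell", 1)
store.write("Bob", "Parr", 2)
store.write("Alice", "Carroll", 3)  # the old Alice record stays in the log

records_before = len(store.log)     # all three versions are still on disk
alice = store.read("Alice")         # the read merges and returns the newest
store.compact()
records_after = len(store.log)      # one record per key after compaction
```

The cost listed under Cons shows up directly: `read` has to look at every version, and `compact` exists only to keep that work bounded.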
DATA VERSIONING
IN PLACE UPDATES
▸ Very Common in Single Server or in Master/Slave systems
▸ Requires coordination
  ▸ Coordination is expensive and error prone.
▸ If you choose this, you’re choosing a CP system.
VERSIONED UPDATES
▸ Can be easily distributed
▸ Versions can be chosen by the client or the server
▸ Reconciliation can happen later.
  ▸ This can be expensive
▸ Scales, because there is no need for coordination.
TIMESTAMPS
▸ Assign a timestamp at write time
▸ Client or Server can reconcile at read
  ▸ To varying degrees (Consistency Level)
▸ Relatively Simple and Straightforward
▸ Clocks have to be synchronized
▸ Old data gets overwritten by new data.
▸ Record has a timestamp: Alice Liddell 1234, Bob Parr 2345
▸ Write a new record: Alice Carroll 3456
▸ Merge Records: Alice Carroll 3456, Bob Parr 2345
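Timestamp reconciliation amounts to "highest timestamp wins". The sketch below replays the example above with the same values. The function names `reconcile` and `merge`, and the dictionary layout, are assumptions for illustration only.

```python
def reconcile(versions):
    """Return the (value, timestamp) pair with the highest timestamp."""
    return max(versions, key=lambda version: version[1])


def merge(*record_sets):
    """Merge several {key: (value, timestamp)} maps, newest version wins."""
    merged = {}
    for records in record_sets:
        for key, version in records.items():
            if key in merged:
                merged[key] = reconcile([merged[key], version])
            else:
                merged[key] = version
    return merged


existing = {"Alice": ("Liddell", 1234), "Bob": ("Parr", 2345)}
new_write = {"Alice": ("Carroll", 3456)}
result = merge(existing, new_write)

# A writer whose clock runs behind loses even though its write came later.
skewed = merge(result, {"Alice": ("Hargreaves", 1000)})
```

The last line is why synchronized clocks are on the list: the write stamped 1000 is silently discarded, and nothing in the merge can tell that it was actually the most recent one.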
VECTOR CLOCKS
▸ Data is decorated with two pieces of data:
  ▸ Where the update happened (node)
  ▸ A Sequence Number
▸ Some updates can be reconciled
▸ Some updates must be manually reconciled
▸ This is !FUN!, especially for clients that get multiple versions back for a single request.
▸ Record has a version and a node identifier: Alice Liddell 1 n1, Bob Parr 2 n1
▸ Write a new record on another node: Alice Carroll 1 n2
▸ Merge Records: Alice Liddell 1 n1, Alice Carroll 3 n2, Bob Parr 2 n1
* Reductionist; many details not included.
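For readers who want the details the simplified example leaves out, here is a minimal sketch of the textbook vector clock comparison. It uses the common representation of a clock as a map from node name to sequence number, which is more than the slide shows. The function names are assumptions for illustration.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's sequence number bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent updates: someone has to reconcile


v1 = increment({}, "n1")   # first write, on node n1
v2 = increment(v1, "n1")   # second write on n1, builds on v1
v3 = increment(v1, "n2")   # write on n2 that also builds on v1

ordered = compare(v2, v1)     # v2 saw v1, so it can replace it
concurrent = compare(v2, v3)  # neither saw the other
```

The `conflict` case is the "fun" part: both versions have to be handed back to the client, which must decide how to combine them.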
EVOLUTION OF REPLICATION
[Diagram: synchronous replication. Writes go to the master, which replicates to two slaves.]
MASTER - SLAVE REPLICATION
▸ Pros
  ▸ Consistent
  ▸ Doesn’t require versioning
  ▸ Real-Time Knowledge about Replication Status
▸ Cons
  ▸ Updates happen at the master first
  ▸ Consistent reads happen at the master
  ▸ Single Point of Failure
  ▸ Failover modes are complicated
[Diagram: key ranges A-E, F-J, K-M, N-Q, R-T, U-Z]
MULTI-MASTER REPLICATION
▸ Shard the Data over Multiple Masters
▸ Pros
  ▸ Failures are smaller in scope
▸ Cons
  ▸ Different masters own different ranges
  ▸ Failover Machinery also has to be duplicated
  ▸ SPOF still exists for every range
PEER TO PEER REPLICATION
▸ Pros
  ▸ Failover is simple
  ▸ Write anywhere
  ▸ No master
  ▸ Can choose consistency levels
▸ Cons
  ▸ Writes are not guaranteed to happen at every replica at write time
  ▸ Anti-Entropy can be expensive, and is a major consideration when sizing individual nodes.
EVOLUTION OF ANTI-ENTROPY
WHY ANTI-ENTROPY
▸ Mostly a Peer to Peer Problem
▸ None of the AE systems are perfect
▸ Multiple compensating mechanisms increase the time between updates and a consistent view of the system.
▸ Evolutionary path has been mostly iterative; lessons have been learned along the way.
BACKUP / RESTORE
▸ It works
  ▸ If you have tested it
  ▸ If the backup is not corrupt
▸ Requires operational discipline
▸ Requires time
▸ Most systems have to be down for a restore.
HINTED HANDOFF
▸ Missed Updates Revisited Later
▸ Part of the Dynamo Paper
▸ Problems With Early Versions of Hinted Handoff
  ▸ Stampeding Herd
  ▸ Increased Load on Delivery
  ▸ Storage (How Many and Where?)
HINTED HANDOFF
[Diagram: node N3 is down (X). Writes for Alice, Bob, Chuck, Dave, Eve, Frank, and Gus reach the other replicas, and a hint addressed to N3 is stored for each one.]
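A minimal sketch of the mechanism, assuming the coordinator stores the hints itself and replays them once the target is back. The `Node` class and both function names are hypothetical, and timestamps are left out to keep the example short; a real system would need them so that a replayed hint cannot overwrite newer data.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target name, key, value) for replicas that were down


def write(coordinator, replicas, key, value):
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            # Missed update: remember it so it can be revisited later.
            coordinator.hints.append((replica.name, key, value))


def deliver_hints(coordinator, target):
    remaining = []
    for name, key, value in coordinator.hints:
        if name == target.name and target.up:
            target.data[key] = value
        else:
            remaining.append((name, key, value))
    coordinator.hints = remaining


n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n3.up = False
for name in ["Alice", "Bob", "Chuck"]:
    write(n1, [n1, n2, n3], name, name)
hints_while_down = len(n1.hints)  # one hint per missed write

n3.up = True
deliver_hints(n1, n3)             # replay everything N3 missed
```

Replaying all hints in one burst as soon as a node returns is the load problem the slide lists under early versions; this sketch makes no attempt to throttle delivery.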
READ REPAIR
▸ Easy and cheap when you are already reading from multiple replicas
▸ Probabilistic repair when only reading from a single replica
▸ Hot data will be repaired frequently
1. READ AT CONSISTENCY LEVEL 1: the replicas hold (Bob, 1), (Bob, 1), and (Robert, 2).
2. READ AT CONSISTENCY LEVEL 1: the read returns (Bob, 1).
3. CHECK OTHER REPLICAS AT PROBABILITY 0.1, OR SOME OTHER VALUE: get digests from the replicas holding (Bob, 1) and (Robert, 2).
4. UPDATE COORDINATOR WITH LATEST RECORD FROM DIGEST: (Robert, 2) replaces (Bob, 1).
5. REPAIR OUT OF DATE REPLICAS: update each stale replica to (Robert, 2).
6. READ AT CONSISTENCY LEVEL 1: every replica now holds (Robert, 2), and the read returns it.
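The read repair sequence above can be sketched as follows. Records are `(value, version)` pairs, replicas are plain dictionaries, and the function names, the `rng` parameter, and the choice of SHA-256 for digests are all assumptions for illustration rather than a description of Cassandra's code.

```python
import hashlib
import random


def digest(record):
    # A digest is a short hash of the record, cheaper to ship than the record.
    return hashlib.sha256(repr(record).encode()).hexdigest()


def read_with_repair(replicas, key, chance=0.1, rng=random.random):
    """Read from one replica; with probability `chance`, repair the others."""
    answer = replicas[0][key]  # only the first replica supplies the answer
    if rng() < chance:
        # Compare digests from every replica; a mismatch means one is stale.
        if len({digest(replica[key]) for replica in replicas}) > 1:
            latest = max((r[key] for r in replicas), key=lambda rec: rec[1])
            for replica in replicas:
                replica[key] = latest  # push the newest record everywhere
    return answer


replicas = [{"k": ("Bob", 1)}, {"k": ("Bob", 1)}, {"k": ("Robert", 2)}]
first = read_with_repair(replicas, "k", rng=lambda: 0.0)   # force the check
second = read_with_repair(replicas, "k", rng=lambda: 0.0)
```

The first read still answers with the stale record, because the repair happens as a side effect; the second read sees the repaired value. Passing `rng` explicitly only makes the example deterministic.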
REPAIR
▸ Similar to rsync
▸ Expensive, Slow
▸ Incremental Repair is a huge (late) improvement in manageability.
▸ How Repair Works*
  1. Generate Merkle Trees
  2. Compare Differences
  3. Stream Differences
1. HASHES FOR EACH RANGE, ARRANGE IN A HASH TREE
2. FIND THE HASHES THAT DON’T MATCH
3. LEAF NODES REPRESENT DATA THAT NEEDS TO BE EXCHANGED
[Diagram, repeated on all three slides: two hash trees with node hashes 01 02 03 04 05 06 07 08 09 10 11 and 22 02 03 14 05 06 17 08 09 10 12.]
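The hash tree comparison in these three steps can be sketched like this. It assumes both replicas split their data into the same power-of-two number of ranges, and it hashes with SHA-256; `build_tree`, `diff`, and the sample data are hypothetical names and values chosen for the example.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_tree(ranges):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    levels = [[h(chunk) for chunk in ranges]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each parent is the hash of its two children concatenated.
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def diff(a, b, level=None, index=0):
    """Walk both trees from the root; return indexes of mismatched leaves."""
    if level is None:
        level = len(a) - 1
    if a[level][index] == b[level][index]:
        return []        # whole subtree matches, nothing to exchange
    if level == 0:
        return [index]   # a leaf: this range has to be streamed
    return (diff(a, b, level - 1, 2 * index)
            + diff(a, b, level - 1, 2 * index + 1))


replica_a = [f"range-{i}".encode() for i in range(8)]
replica_b = list(replica_a)
replica_b[2] = b"stale"
replica_b[5] = b"stale"
mismatched = diff(build_tree(replica_a), build_tree(replica_b))
```

Matching subtrees are skipped after a single comparison, so only the differing ranges are streamed. Building the trees still means hashing all the data on both sides, which is where the "Expensive, Slow" label comes from.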
EVOLUTION OF SHARDING
RANGE SHARDING
▸ Pros
  ▸ Generally good for short term problems
  ▸ Range Scanning
▸ Cons
  ▸ Need to track shards, repartitioning
  ▸ Second System problem
  ▸ Data is “lumpy”
RANGE SHARDING
[Diagram: shards by key range: A-E, F-J, K-M, N-Q, R-T, U-Z]
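A small sketch of the range lookup, using the six ranges from the diagram and the names from the hinted handoff slide as sample keys. It assumes every key starts with an ASCII letter; `shard_for` and the two constants are hypothetical names.

```python
import bisect
from collections import Counter

# Inclusive upper bound of each shard's first-letter range, in sorted order.
UPPER_BOUNDS = ["E", "J", "M", "Q", "T", "Z"]
SHARDS = ["A-E", "F-J", "K-M", "N-Q", "R-T", "U-Z"]


def shard_for(key):
    first = key[0].upper()
    # bisect_left finds the first range whose upper bound is >= the letter.
    return SHARDS[bisect.bisect_left(UPPER_BOUNDS, first)]


names = ["Alice", "Bob", "Chuck", "Dave", "Eve", "Frank", "Gus"]
load = Counter(shard_for(name) for name in names)
```

With these seven names, five land on A-E, two on F-J, and the other four shards stay empty. That is the "lumpy" data problem in miniature, and fixing it means moving range boundaries, which is the tracking and repartitioning cost listed under Cons.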