OVERVIEW OF CASSANDRA: WHY WOULD YOU NAME A DATABASE AFTER A GREEK MYTH ABOUT NOT BEING LISTENED TO?
AGENDA • Bio • Basic C* Data Model • Replication vs. Quorum • Failure Recovery • AWS Implications • Monitoring • Version issues and tombstoning • Maintenance Tasks • The Dark Side
QUICK BIO • Programming since 1981 • Four patents • 2010 JavaOne Rock Star and Duke’s Choice winner • Frequent contributor to Pragmatic Programmer magazine, SearchAWS.com and LinkedIn Pulse News • briantarbox.org, log4jfugue.org, BrianTarbox@gmail.com
C* DATA MODEL, COMPARISON • “When all you have is a hammer, everything looks like a thumb.” - Morgan • Relational (tabular) model: PostgreSQL • Relationship (graph) model: Neo4j • Document model: MongoDB • Time-series model: C*
THE REAL DIFFERENCE BETWEEN SQL AND “NO-SQL” • In SQL we’re trained to design around storing the data, ideally in third normal form; queries are bolted on later. • In no-SQL we design around the queries we’ll perform; the “table” structure falls out of that. • Queries should get top billing, because if you just store the data and never read it back, who cares?
C* DATA MODEL, EXAMPLE • Wide rows, wide columns, heterogeneous columns • For example, a row per stock, with each column holding everything we know about that stock for one day • Designed to make it easy to “select” a row and then read thousands of columns sequentially • Not designed for randomly selecting specific columns
CQL, SLICE PREDICATES • In Postgres you might say: SELECT * FROM stock WHERE ticker = 'IBM' AND price > 100 • You simply cannot do that with C* • SQL uses indexes to speed up access to rows; indexes are very problematic in C* • Often the C* answer is denormalization
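A minimal sketch of what "the answer is denormalization" can look like, in plain Python rather than a real C* client. The two dictionaries stand in for two column families; the names `quotes_by_ticker` and `tickers_by_price_bucket`, the bucket width of 100, and the prices are all hypothetical, chosen only to show the same fact being written once per query we intend to run.

```python
from collections import defaultdict

# Two toy "column families", one per query we plan to support.
quotes_by_ticker = defaultdict(dict)         # ticker -> {date: price}
tickers_by_price_bucket = defaultdict(set)   # (date, bucket) -> {ticker, ...}

BUCKET = 100  # hypothetical bucket width, not a C* setting


def record_quote(ticker, date, price):
    """Write the same fact to every table that some query will need."""
    quotes_by_ticker[ticker][date] = price
    tickers_by_price_bucket[(date, int(price // BUCKET))].add(ticker)


def history(ticker):
    """Query 1: all quotes for one ticker, in date order."""
    return sorted(quotes_by_ticker[ticker].items())


def tickers_in_bucket(date, bucket):
    """Query 2: which tickers fell in a given price bucket on a date."""
    return sorted(tickers_by_price_bucket[(date, bucket)])


# Made-up prices, for illustration only.
record_quote("IBM", "2014-03-05", 187.0)
record_quote("IBM", "2014-03-06", 188.0)
record_quote("XYZ", "2014-03-05", 42.0)

print(history("IBM"))                      # [('2014-03-05', 187.0), ('2014-03-06', 188.0)]
print(tickers_in_bucket("2014-03-05", 1))  # ['IBM']
```

Each query is now a single key lookup, at the price of writing everything twice. Note that this sketch never cleans up a stale bucket entry if a quote is later corrected; keeping the copies consistent is the application's job.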
SLICE PREDICATES • Columns have names (e.g. “date”), but a column can also contain many (hundreds of) values. • Slice predicates let you specify which columns to select
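A rough sketch of the idea behind a slice, assuming a row whose column names are kept in sorted order. `WideRow` and its methods are hypothetical names for illustration, not the C* API; the `start`, `finish` and `limit` arguments are only loosely modelled on a range-style slice.

```python
from bisect import bisect_left, bisect_right


class WideRow:
    """Toy model of one wide row: column names are kept sorted."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def put(self, name, value):
        """Insert or overwrite one column, keeping the names sorted."""
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, finish, limit=None):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, finish)
        names = self._names[lo:hi]
        if limit is not None:
            names = names[:limit]
        return [(n, self._values[n]) for n in names]


# One row for one stock; one column per day (values are placeholders).
ibm = WideRow()
for day, close in [("2014-03-07", 3), ("2014-03-05", 1),
                   ("2014-03-06", 2), ("2014-03-10", 4)]:
    ibm.put(day, close)

print(ibm.slice("2014-03-06", "2014-03-07"))
# [('2014-03-06', 2), ('2014-03-07', 3)]
```

Because the columns are sorted, a slice is a contiguous run, which is why reading thousands of adjacent columns is cheap while picking scattered columns is not.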
CLIQUE, INC.: C* ANTIPATTERN • My last company folded, but not before providing a C* antipattern • Collaboration software; many ad-hoc queries (who’s in what context, where was “x” said, etc.) • We ended up with 14 copies of the main data, each in its own column family. • Bad Dog.
REPLICATION VS. READ/WRITE LEVEL • Replication refers to how many distinct copies of the data there are • Read/Write Level refers to how many of the replicas must respond/agree before proceeding
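The interplay between replica count and read/write level comes down to a little arithmetic. The sketch below is illustrative only; the function names are made up for this example, and the levels are expressed as plain counts of replicas that must respond.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(replication_factor, write_level, read_level):
    """True when every read set must share at least one replica with every write set."""
    return write_level + read_level > replication_factor


def tolerated_down_replicas(replication_factor, level):
    """How many replicas of a key can be down while `level` responses are still possible."""
    return replication_factor - level


rf = 3
q = quorum(rf)
print(q)                                # 2
print(overlaps(rf, q, q))               # True: quorum writes + quorum reads overlap
print(overlaps(rf, 1, 1))               # False: a read may miss the latest write
print(tolerated_down_replicas(rf, q))   # 1
```

With three replicas, reading and writing at quorum guarantees a read touches at least one replica that saw the latest write, and still works with one replica down.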
THE WRITE PATH • The client picks a C* node at random; that node becomes the Coordinator (diagram) • The Coordinator sends the write to each of the replica nodes • It waits until ’n’ replicas respond before returning to the client
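The write path can be sketched as a small simulation. This is a toy under stated assumptions, not how C* is implemented: `Node`, `replicas_for` and `coordinate_write` are hypothetical names, the ring placement is a simple CRC-based stand-in for a real partitioner, and the coordinator here contacts replicas one after another instead of returning as soon as enough have answered.

```python
import zlib


class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Store the value and acknowledge; a down node never answers."""
        if not self.up:
            return False
        self.data[key] = (timestamp, value)
        return True


def replicas_for(nodes, key, replication_factor):
    """Pick RF consecutive nodes on a toy ring, starting from a hash of the key."""
    start = zlib.crc32(key.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def coordinate_write(nodes, key, value, timestamp, replication_factor, write_level):
    """Send the write to every replica; succeed if at least write_level acknowledge."""
    acks = sum(replica.write(key, value, timestamp)
               for replica in replicas_for(nodes, key, replication_factor))
    return acks >= write_level, acks


cluster = [Node(f"n{i}") for i in range(5)]
print(coordinate_write(cluster, "IBM", 187.0, 1, 3, 2))   # (True, 3)

# Take down two of the three replicas for this key and try again.
for node in replicas_for(cluster, "IBM", 3)[:2]:
    node.up = False
print(coordinate_write(cluster, "IBM", 188.0, 2, 3, 2))   # (False, 1)
```

In the second call the write is reported as failed, yet in this toy the new value is still stored on the one live replica; nothing rolls it back.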
THE READ PATH • (diagram) The Coordinator sends the read to all nodes holding the data and waits for ’n’ to respond • Read Repair
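A minimal sketch of a read with read repair, assuming each stored value carries a timestamp and the newest timestamp wins. Replicas are plain dictionaries here and `coordinate_read` is a hypothetical name; this toy asks every live replica and repairs all of them, which is a simplification of whatever the real coordinator does.

```python
def coordinate_read(replicas, key, read_level):
    """Toy read: collect answers, return the newest value, repair stale replicas.

    Each replica is a dict mapping key -> (timestamp, value);
    None in the list marks a replica that is down.
    """
    answers = [(r, r.get(key)) for r in replicas if r is not None]
    if len(answers) < read_level:
        raise RuntimeError("not enough replicas responded")

    versions = [v for _, v in answers if v is not None]
    if not versions:
        return None

    newest = max(versions, key=lambda v: v[0])   # highest timestamp wins
    for replica, version in answers:
        if version != newest:
            replica[key] = newest                # read repair
    return newest[1]


a = {"IBM": (2, 188.0)}
b = {"IBM": (1, 187.0)}   # missed the second write
c = {}                     # missed both writes

print(coordinate_read([a, b, c], "IBM", read_level=2))   # 188.0
print(b["IBM"], c["IBM"])                                # (2, 188.0) (2, 188.0)
```

After the read, the stale replicas have been brought up to date as a side effect, which is the point of read repair.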
FAILURE RECOVERY - WHERE C* REALLY SHINES • What happens when a node fails? • How many nodes can fail without data loss depends on the number of nodes and the number of replicas • Auto-recovery vs. backup and restore • With the usual caveats… C* recovery “just works”
RUNNING C* ON AWS • Scale out not up • More spindles is better • Log dir vs. data dir • Selecting the right instance type • You must run with NTP (not an AWS standard)
CONFIGURATION • The main C* config file is 700 lines long • You really need to deeply understand most of it. • cluster_name, listen_address, commitlog_directory, endpoint_snitch, seed_provider, compaction_throughput_mb_per_sec, concurrent_reads, snapshot_before_compaction, phi_convict_threshold, commitlog_sync, partitioner, key_cache_size_in_mb, row_cache_save_period, tombstone_warn_threshold, read_request_timeout_in_ms, cross_node_timeout, internode_compression, inter_dc_tcp_nodelay, dynamic_snitch_badness_threshold, dynamic_snitch_update_interval, hinted_handoff_enabled, max_hints_delivery_threads,…..
MONITORING YOUR C* CLUSTER
VERSION ISSUES AND TOMBSTONES • Life is better if you never delete records • If you delete you can end up with tombstones • To deal with tombstones you need to run Repair… and that is a whole nasty can of worms.
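Why deletes and Repair are tied together can be shown with a toy model. Everything here is a simplified assumption: replicas are dictionaries, `grace` plays a role similar to a tombstone grace period, and `repair` simply copies the newest version of every key everywhere. The point is only that a tombstone purged before it reaches every replica lets the deleted value come back.

```python
TOMBSTONE = object()   # marker meaning "this key was deleted"


def delete(replica, key, timestamp):
    """A delete is really a write of a tombstone."""
    replica[key] = (timestamp, TOMBSTONE)


def purge_tombstones(replica, now, grace):
    """Drop tombstones older than the grace period."""
    expired = [k for k, (ts, v) in replica.items()
               if v is TOMBSTONE and now - ts > grace]
    for key in expired:
        del replica[key]


def repair(replicas):
    """Toy repair: every replica ends up with the newest version of every key."""
    keys = set().union(*(r.keys() for r in replicas))
    for key in keys:
        newest = max((r[key] for r in replicas if key in r), key=lambda v: v[0])
        for r in replicas:
            r[key] = newest


def read(replicas, key):
    """Newest version wins; a tombstone reads as 'no value'."""
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    _, value = max(versions, key=lambda v: v[0])
    return None if value is TOMBSTONE else value


def scenario(repair_in_time):
    a = {"IBM": (1, 187.0)}
    b = {"IBM": (1, 187.0)}      # b is down during the delete and misses it
    delete(a, "IBM", timestamp=2)
    if repair_in_time:
        repair([a, b])           # the tombstone reaches b before it is purged
    for r in (a, b):
        purge_tombstones(r, now=20, grace=10)
    return read([a, b], "IBM")


print(scenario(repair_in_time=True))    # None: the delete sticks
print(scenario(repair_in_time=False))   # 187.0: the deleted value comes back
```

If you never delete, none of this machinery is needed, which is one reading of "life is better if you never delete records".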
MAINTENANCE TASKS • Full and minor compactions • Snapshot your disks if using AWS/EBS
THE DARK SIDE, PART 1 • DataStax maintains three parallel release branches, with vastly different feature sets • New releases are always unstable; never accept an n.0, n.1 or n.2 release
THE DARK SIDE, PART 2 • C* uses a schema-less design • It requires knowledge of slice predicates rather than SQL • DataStax decided to adopt schemas and CQL to gain market share at the expense of their soul. • You can now pretend C* is relational (except no indexes and mostly no WHERE clauses)