Cassandra for Time Series Data - Joris Gillis, June 28, 2017
Joris Gillis
I am a software engineer at TrendMiner and focus on the enterprise scalability of our industrial analytics platform. I studied at the University of Hasselt and the University of Antwerp in the field of Database theory and Data Mining. My interests are:
• Big Data technology
• Functional Programming
• Athletics
Agenda
1. Introduction
2. Why Cassandra?
3. How to model time series in Cassandra?
4. How to configure Cassandra for Time Series Data?
5. Q&A
About TrendMiner
TrendMiner is the leading modelling-free industrial analytics platform to analyze, monitor and predict asset and process performance. It has a proven track record in the (petro-)chemical and oil & gas industries of increasing overall profitability by improving production yield, lowering costs, avoiding unplanned process downtime, increasing overall equipment efficiency and reducing safety risks.
About TrendMiner: About our company
• Started in 2008 as a spin-off of K.U.Leuven
• More than 70 man-years of research behind TrendMiner
• Spin-out idea from Bayer MaterialScience (Covestro)
• Several core technologies patented in the US and EU
• EMEA headquarters: Hasselt, Belgium
• US headquarters: Houston, TX
• 60+ employees and growing
• Global OSIsoft PI ISV & OEM partner
• Platform-agnostic vendor
• Front runner in both Process & Asset Analytics
Industry 4.0: Technologies that enable new ways of working and of doing business
• CONNECTIVITY: Internet of Things, Augmented Reality, Wearables
• BIG DATA & ANALYTICS: Cyber Security, Machine Learning, Optimization & Prediction
• ADVANCED MANUFACTURING: Additive Manufacturing, Advanced Materials, Autonomous Robotics
About TrendMiner: About our software
About TrendMiner: Analyze
About TrendMiner: Monitor
About TrendMiner: Predict
Problem statement
• Time series
  • From thousands to millions
  • Resolution between 5 minutes and 1 second
  • E.g., 10 year history
• Complex analyses
• Plants across the globe
What is a Time Series?
• A time series is a series of timestamped data points
• Sometimes data points are spaced equidistantly
• List<Tuple<Long, Float>>?
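As a minimal sketch of the definition above, a time series can be held as a list of (timestamp, value) pairs, mirroring the List<Tuple<Long, Float>> shape. The timestamps are assumed to be epoch milliseconds, and `is_equidistant` is a hypothetical helper name used only for illustration.

```python
from typing import List, Tuple

# A time series as (timestamp in epoch milliseconds, value) pairs.
TimeSeries = List[Tuple[int, float]]


def is_equidistant(series: TimeSeries) -> bool:
    """Return True when all consecutive timestamps are spaced equally."""
    gaps = {later[0] - earlier[0] for earlier, later in zip(series, series[1:])}
    return len(gaps) <= 1


regular = [(0, 20.5), (1000, 20.7), (2000, 20.6)]
irregular = [(0, 20.5), (1000, 20.7), (2500, 20.6)]

print(is_equidistant(regular))    # spacing is always 1000 ms
print(is_equidistant(irregular))  # spacing is 1000 ms, then 1500 ms
```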
Why Cassandra
• New technology
• Horizontal scaling
• Big Data => Big Index
• In-store analytics too limited for our needs
• Only HTTP interface
Why Cassandra
• ADVANTAGES
  • Proven technology
  • Connectivity
    • JDBC connector (also to Spark)
  • Edge locations
    • Geographic distribution of data
  • Support for storing and querying time series data
    • E.g., KairosDB uses Cassandra as underlying store
• DISADVANTAGES
  • Overhead vs custom optimised format
  • No time series specific optimisations
    • E.g., Gorilla/Beringei
    • Delta-delta encoding for semi-equidistant points
    • Delta encoding for stable values
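To illustrate the kind of time series specific optimisation Cassandra lacks, here is a rough sketch of delta-of-delta encoding for timestamps. It shows only the arithmetic idea: it is not the bit-level format used by Gorilla/Beringei, and the function names are hypothetical.

```python
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """Encode as: first timestamp, first delta, then the change in delta."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    previous_delta = None
    for earlier, later in zip(timestamps, timestamps[1:]):
        delta = later - earlier
        encoded.append(delta if previous_delta is None else delta - previous_delta)
        previous_delta = delta
    return encoded


def decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta by re-accumulating the deltas."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = None
    for value in encoded[1:]:
        delta = value if delta is None else delta + value
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Semi-equidistant points: roughly one sample every 60 seconds.
samples = [1000, 1060, 1120, 1181, 1241]
print(delta_of_delta(samples))  # [1000, 60, 0, 1, -1]
```

For semi-equidistant points most encoded values are zero or close to zero, which is what makes them cheap to store in a custom format.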
How to model time series in Cassandra? Keys
• Primary key
  • One or more columns identifying a row
  • PRIMARY KEY (A)
  • PRIMARY KEY (A, B)
    • Compound primary key
• Partition key
  • First column(s) of primary key
  • E.g., PRIMARY KEY ((A, B), C)
    • A & B form the composite partition key
How to model time series in Cassandra? Partitioning & Clustering
• A partition is mapped to a Cassandra node
  • All rows with same partition key on same node(s)
• Clustering columns
  • Part of compound primary key
  • Define sorting inside partition
• Map<byte[], SortedMap<Clustering, Row>>
  • byte[] = partition key
  • Clustering = clustering columns
  • Row = other columns
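The map-of-sorted-maps view can be sketched in a few lines. This is an illustrative in-memory model, not how Cassandra stores data internally; the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class Partition:
    """Rows of one partition, kept sorted by clustering key."""

    def __init__(self):
        self.keys = []   # sorted clustering keys
        self.rows = {}   # clustering key -> row (the other columns)

    def upsert(self, clustering_key, row):
        if clustering_key not in self.rows:
            insort(self.keys, clustering_key)
        self.rows[clustering_key] = row

    def slice(self, lower, upper):
        """Return rows with lower < clustering key < upper, in sorted order."""
        start = bisect_right(self.keys, lower)
        end = bisect_left(self.keys, upper)
        return [(key, self.rows[key]) for key in self.keys[start:end]]


# Partition key -> partition; the partition key decides which node(s) own the data.
table = defaultdict(Partition)
table["station-a"].upsert(3, {"temperature": 20.9})
table["station-a"].upsert(1, {"temperature": 20.5})
table["station-a"].upsert(2, {"temperature": 20.7})

print(table["station-a"].slice(0, 3))  # rows for clustering keys 1 and 2
```

Because rows are already sorted inside the partition, a range query on the clustering column is a contiguous slice rather than a scan.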
How to model time series in Cassandra? Modelling: Simple
    CREATE TABLE temperature (
        weatherstation_id uuid,
        event_time timestamp,
        temperature float,
        PRIMARY KEY (weatherstation_id, event_time)
    );
https://academy.datastax.com/resources/getting-started-time-series-data-modeling
How to model time series in Cassandra? Modelling: Simple
• Advantage
  • Easy to understand
  • Simple to query
    SELECT temperature FROM temperature
    WHERE weatherstation_id='1234ABCD'
      AND event_time > '2013-04-03 07:01:00'
      AND event_time < '2013-04-03 07:04:00';
• Disadvantage
  • All data for one time series in one partition
  • Max 2 billion cells (column values) per partition
How to model time series in Cassandra? Modelling: Partitioned
    CREATE TABLE temperature_by_day (
        weatherstation_id uuid,
        day date,
        event_time timestamp,
        temperature float,
        PRIMARY KEY ((weatherstation_id, day), event_time)
    );
https://academy.datastax.com/resources/getting-started-time-series-data-modeling
How to model time series in Cassandra? Modelling: Partitioned
• Advantage
  • Virtually no storage limitation
• Disadvantage
  • Crossing a bucket boundary => multiple queries
  • Need to specify both id and day; otherwise unpredictable performance
  • If data comes in bursts => uneven partition sizes
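The "multiple queries" disadvantage can be made concrete with a small sketch that splits a requested time range into one sub-range per day bucket. It assumes naive datetimes that are already in the same time zone as the day buckets; `split_by_day` is a hypothetical helper, and each returned tuple would become one query against a single (weatherstation_id, day) partition.

```python
from datetime import datetime, time, timedelta


def split_by_day(start: datetime, end: datetime):
    """Split [start, end) into (day, lower, upper) ranges, one per day bucket."""
    ranges = []
    lower = start
    while lower < end:
        next_midnight = datetime.combine(lower.date() + timedelta(days=1), time.min)
        upper = min(end, next_midnight)
        ranges.append((lower.date(), lower, upper))
        lower = upper
    return ranges


# A 26 hour window that touches three day buckets.
for day, lower, upper in split_by_day(datetime(2013, 4, 3, 23, 0),
                                      datetime(2013, 4, 5, 1, 0)):
    print(day, lower.time(), "->", upper)
```

The client then issues the per-day queries (possibly in parallel) and concatenates the results in bucket order.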
How to configure Cassandra for Time Series Data?
• Cassandra v3
  • Storage engine refactored compared to v2
• Options to influence read and write performance
  • Compression
  • Compaction
How to configure Cassandra for Time Series Data? How Cassandra Writes Data
• Commit log
  • Durability
• Memtable
  • Cache writes in memory
  • Regularly flushed to disk
• Sorted Strings Table (SSTable)
• Compaction
  • Re-organise
  • Cleanup
[Diagram: a write goes to the memtable (memory) and the commit log (disk); the memtable is flushed to SSTables with their indexes on disk, which are then compacted]
http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
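The write path above can be sketched as a toy model. The assumptions are loud ones: the flush is triggered by a row count (`memtable_limit`, a made-up parameter) whereas Cassandra flushes on memory thresholds, and all class and attribute names are hypothetical.

```python
class Node:
    """Toy write path: commit log for durability, memtable as write cache."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []    # append-only record of writes not yet flushed
        self.memtable = {}      # in-memory cache of recent writes
        self.sstables = []      # immutable, sorted tables "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()  # flushed writes no longer need replaying


node = Node(memtable_limit=3)
for key, value in [("row3", 20.9), ("row1", 20.5), ("row2", 20.7)]:
    node.write(key, value)

print(node.sstables)  # one SSTable, sorted by key
```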
How to configure Cassandra for Time Series Data? How Cassandra Maintains Data
• SSTable = immutable
  • Updates => Timestamped version of row
  • Deletes => Tombstones
• Clean up old versions and tombstones
  • Compaction
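A rough sketch of how timestamped versions and tombstones get cleaned up when SSTables are merged. This is a simplification: Cassandra keeps tombstones around until gc_grace_seconds has passed, which this sketch ignores, and the `TOMBSTONE` marker and `compact` function are illustrative names.

```python
TOMBSTONE = object()  # marker written instead of a value when a row is deleted


def compact(sstables):
    """Merge SSTables: keep the newest version per key, drop deleted keys."""
    latest = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in latest or timestamp > latest[key][0]:
                latest[key] = (timestamp, value)
    return {
        key: version
        for key, version in sorted(latest.items(), key=lambda item: item[0])
        if version[1] is not TOMBSTONE
    }


older = {"row1": (1, 20.0), "row2": (1, 21.0), "row3": (1, 22.0)}
newer = {"row1": (2, 20.5), "row2": (3, TOMBSTONE)}

# row1 keeps its newest version, row2 is deleted, row3 is untouched.
print(compact([older, newer]))
```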
How to configure Cassandra for Time Series Data? How Cassandra Reads Data (Simplified)
• How to get relevant data
  • Partition key => node(s)
  • SSTables
    • Bloom filter
      • Returns list of SSTables that might contain rows for query
    • Partition index on SSTable
      • Keeps offset for each partition key
  • Memtable
• Extract qualifying rows
• Resolve
  • Timestamped rows
  • Tombstones
[Diagram: partition key -> Bloom filter -> index -> rows in SSTables]
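The role of the Bloom filter in the read path can be sketched as follows. SHA-256 is used here only as a convenient stand-in hash, the filter size and hash count are arbitrary, and the `SSTable` and `read` names are hypothetical; the point is that a Bloom filter may return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)          # partition key -> row
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(sstables, partition_key):
    """Consult Bloom filters first, then look only inside candidate SSTables."""
    candidates = [t for t in sstables if t.bloom.might_contain(partition_key)]
    return [t.rows[partition_key] for t in candidates if partition_key in t.rows]


tables = [SSTable({"a": 1.0}), SSTable({"b": 2.0}), SSTable({"a": 3.0})]
print(read(tables, "a"))  # versions of "a" found in the first and third SSTable
```

The versions returned from several SSTables still have to be resolved by timestamp and tombstone, as listed in the last bullets above.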
How to configure Cassandra for Time Series Data? Compaction options
• Size tiered compaction (default)
• Levelled compaction
• Time window compaction
How to configure Cassandra for Time Series Data? Size tiered Compaction
• Default strategy
• Optimised for write heavy workloads
• Compaction
  • Triggered when enough similarly sized SSTables have accumulated (min_threshold, default 4)
  • Merges them into one new file
How to configure Cassandra for Time Series Data? Size tiered Compaction Example
• Before: four similarly sized SSTables T1-T4 (150MB, 155MB, 155MB, 165MB; total size: 625MB)
• Compaction merges them into one SSTable T5 (600MB)
• After: new SSTables T6 (155MB) and T7 (165MB) are flushed next to T5 (600MB)
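The example can be replayed with a simplified bucket selection. The option names `min_threshold`, `bucket_low` and `bucket_high` follow the size tiered strategy's documented options, but the grouping logic here is an approximation for illustration, not Cassandra's actual implementation.

```python
def pick_bucket(sizes_mb, min_threshold=4, bucket_low=0.5, bucket_high=1.5):
    """Group similarly sized SSTables; return the first group large enough to compact."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None


# T1-T4 are similar in size, so they are compacted together (into T5).
print(pick_bucket([150, 155, 155, 165]))

# T5 is much larger than T6 and T7, so nothing is compacted yet.
print(pick_bucket([600, 155, 165]))
```

The merged file in the example (600MB) is smaller than the sum of its inputs (625MB) because overwritten rows and tombstones are dropped during the merge.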
How to configure Cassandra for Time Series Data? Size tiered Compaction
• Advantage
  • Write optimised
• Disadvantage
  • Rows of a partition are spread across multiple SSTables
  • Holds on to stale data for a long time
  • A lot of memory needed as SSTables grow in size
How to configure Cassandra for Time Series Data? Levelled Compaction
• L(0)
  • Flushes from memtable
• L(N > 0)
  • Fixed size SSTables (default: 160MB)
  • Each SSTable has range of partitions => NO OVERLAP!
  • L(1) holds at most 10 SSTables
  • L(N+1) can hold 10x more SSTables than L(N)
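The growth rule works out as follows. This is plain arithmetic on the numbers above (160MB per SSTable, 10 SSTables in L(1), a factor of 10 per level); `level_capacity` is a hypothetical helper name.

```python
def level_capacity(level, sstable_mb=160, fanout=10):
    """Return (max SSTables, max size in MB) for level L(N), N >= 1."""
    if level < 1:
        raise ValueError("L(0) holds memtable flushes and has no fixed capacity")
    count = fanout ** level
    return count, count * sstable_mb


for level in (1, 2, 3):
    count, size_mb = level_capacity(level)
    print(f"L({level}): {count} SSTables, {size_mb} MB")
```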
How to configure Cassandra for Time Series Data? Levelled Compaction Example
• Setup: 2 rows per SSTable, 2 files in L(1), 20 files in L(2)
• Rows (row #partition): r1 #0, r2 #2, r3 #-1, r4 #1, r5 #-2, r6 #6, r7 #3, r8 #6, r9 #-1, r10 #-2
• Step 1: L(0) holds r1-r4. Compaction into L(1) gives two SSTables sorted by partition: [r3 #-1, r1 #0] and [r4 #1, r2 #2]. L(2) is empty.
How to configure Cassandra for Time Series Data? Example ctd.
• Step 2: L(0) holds r5-r8. Compaction merges them with L(1): [r5 #-2, r3 #-1], [r1 #0, r4 #1], [r2 #2, r7 #3], [r6 #6, r8 #6].
How to configure Cassandra for Time Series Data? Example ctd.
• Step 3: L(1) now holds more than 2 files, so [r5 #-2, r3 #-1] and [r1 #0, r4 #1] move to L(2). L(1) keeps [r2 #2, r7 #3] and [r6 #6, r8 #6].
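The levelled compaction example can be replayed with a small simulation. It is a sketch under simplifying assumptions: SSTables are split by row count (2 rows each, as in the example) rather than by size, overflow is promoted starting from the lowest partition range to match the slides, and the function names are hypothetical.

```python
def compact_into(level, incoming, rows_per_sstable=2):
    """Merge incoming SSTables into a level and re-split into sorted, fixed-size SSTables."""
    rows = sorted(row for sstable in level + incoming for row in sstable)
    return [rows[i:i + rows_per_sstable] for i in range(0, len(rows), rows_per_sstable)]


def promote_overflow(level, next_level, max_sstables):
    """Move SSTables that exceed the level's limit into the next level."""
    overflow = len(level) - max_sstables
    if overflow <= 0:
        return level, next_level
    return level[overflow:], compact_into(next_level, level[:overflow])


# Rows are (partition, row name) pairs.
first_flush = [[(0, "r1"), (2, "r2")], [(-1, "r3"), (1, "r4")]]
second_flush = [[(-2, "r5"), (6, "r6")], [(3, "r7"), (6, "r8")]]

l1, l2 = [], []
l1 = compact_into(l1, first_flush)                     # step 1
l1, l2 = promote_overflow(l1, l2, max_sstables=2)
l1 = compact_into(l1, second_flush)                    # step 2
l1, l2 = promote_overflow(l1, l2, max_sstables=2)      # step 3

print("L(1):", l1)
print("L(2):", l2)
```

Within each level the SSTables cover disjoint, ordered partition ranges, which is why a read needs to check at most one SSTable per level.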