SASI, Cassandra on the full text search ride DuyHai DOAN – Apache Cassandra™ Evangelist
1. 5-minute introduction to Apache Cassandra™
2. SASI introduction
3. SASI cluster-wide
4. SASI local read/write path
5. Query planner
6. Some benchmarks
7. Take away
@doanduyhai
Trademark Policy
From now on … Cassandra ⩵ Apache Cassandra™

5-minute introduction to Apache Cassandra™
The tokens
• Random hash of #partition → token = hash(#p)
• Hash: $]-x, x]$
• Hash range: $2^{64}$ values
• $x = 2^{64}/2$
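The mapping from partition key to token can be sketched as follows. This is only an illustration: the function name `token` and the use of MD5 are assumptions chosen for convenience, not the hash the cluster's partitioner actually uses. The sketch only shows how a hash is folded into the interval ]−x, x].

```python
import hashlib

X = 2**64 // 2  # x = 2^64 / 2, as on the slide


def token(partition_key: str) -> int:
    """Map a partition key to a token in ]-X, X] (illustrative hash, an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Read the first 8 bytes as a signed 64-bit integer in [-X, X - 1].
    value = int.from_bytes(digest[:8], byteorder="big", signed=True)
    # Shift by one so the result lies in ]-X, X], matching the slide's interval.
    return value + 1
```

The same key always yields the same token, which is what lets any node work out where a partition lives.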
Token ranges
A: $]-x, -3x/4]$        E: $]0, x/4]$
B: $]-3x/4, -2x/4]$     F: $]x/4, 2x/4]$
C: $]-2x/4, -x/4]$      G: $]2x/4, 3x/4]$
D: $]-x/4, 0]$          H: $]3x/4, x]$
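A minimal sketch of how a token is matched to its owning node, assuming the eight equal ranges listed above. The names `NODES`, `UPPER_BOUNDS` and `owner` are hypothetical, introduced only for this example.

```python
import bisect

X = 2**64 // 2
NODES = "ABCDEFGH"
STEP = X // 4  # each of the 8 nodes owns a range of width x/4

# Inclusive upper bound of each node's range: A ends at -3x/4, ..., H ends at x.
UPPER_BOUNDS = [-X + STEP * (i + 1) for i in range(len(NODES))]


def owner(token: int) -> str:
    """Return the node whose range ]lower, upper] contains the token."""
    if not -X < token <= X:
        raise ValueError("token outside ]-x, x]")
    # bisect_left finds the first upper bound >= token, i.e. the range's inclusive end.
    return NODES[bisect.bisect_left(UPPER_BOUNDS, token)]
```

Because the upper bounds are sorted, the lookup is a binary search rather than a scan of all ranges.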
Distributed tables
CREATE TABLE users(
    user_id int,
    …,
    PRIMARY KEY(user_id)
);
Rows: user_id 1, user_id 2, user_id 3, user_id 4, user_id 5
Distributed tables (continued)
[Diagram: rows user_id 1 to user_id 5 spread across the ring of nodes A to H]
Coordinator node
Responsible for handling requests (read/write)
Every node can be coordinator
• masterless
• no SPOF
• proxy role
[Diagram: a request reaches the coordinator, which forwards it to the other nodes on the ring]
Q & A

SASI introduction
What is SASI?
• SSTable-Attached Secondary Index → new 2nd index implementation that follows the SSTable life-cycle
• Objective: provide a more performant & capable 2nd index
Who created it?
Open-source contribution by an engineering team
Why is it better than the native 2nd index?
• follows the SSTable life-cycle (flush, compaction, rebuild …) → more optimized
• new data structures
• range queries (<, ≤, >, ≥) possible
• full-text search options

Demo

SASI cluster-wide
Distributed index
On cluster level, SASI works exactly like the native 2nd index
[Diagram: each node on the ring holds a local index of its own rows, for example:
UK → user 87, user 176, …, user 987
UK → user 1, user 102, …, user 493
UK → user 17, user 409, …, user 787
US → user 54, user 483, …, user 938]
Distributed search algorithm
• 1st round: concurrency factor = 1
• Not enough results?
• 2nd round: concurrency factor = 2
• Still not enough results?
• 3rd round: concurrency factor = 4
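The rounds above can be simulated with a small sketch. It assumes the concurrency factor simply doubles each round (1, 2, 4, as shown on the slides); the real formula is not reproduced here. The function `distributed_search` and its parameters are hypothetical names used only for illustration.

```python
def distributed_search(ranges, matches_in_range, limit):
    """Query token ranges in rounds until `limit` results are found.

    ranges: token ranges in ring order.
    matches_in_range: callable returning the matching rows of one range.
    Assumption: the concurrency factor doubles every round (1, 2, 4, ...).
    """
    results = []
    concurrency = 1
    position = 0
    rounds = 0
    while position < len(ranges) and len(results) < limit:
        batch = ranges[position:position + concurrency]
        for token_range in batch:  # in a real cluster these run in parallel
            results.extend(matches_in_range(token_range))
        position += len(batch)
        rounds += 1
        concurrency *= 2  # not enough results: widen the next round
    return results[:limit], rounds
```

With a selective filter the loop walks most of the ring before it can stop.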
Concurrency factor formula
• more details at: http://www.doanduyhai.com/blog/?p=13191
Caveat 1: non-restrictive filters
Hit all nodes eventually ☹
Caveat 1 solution: always use LIMIT
SELECT * FROM … WHERE ... LIMIT 1000
Caveat 2: 1-to-1 index (user_email)
WHERE user_email = 'xxx'
• 1st round: not found
• 2nd round: still no result
• In the end: at best 1 user found, at worst 0 users found
Caveat 2 solution: materialized views
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
    SELECT * FROM users
    WHERE user_id IS NOT NULL AND user_email IS NOT NULL
    PRIMARY KEY (user_email, user_id)
But range queries (<, >, ≤, ≥) not possible …
Caveat 3: fetch all rows for analytics use-case
[Diagram: the client sends the query to the coordinator, which fetches rows from every node on the ring]
Caveat 3 solution: use co-located Spark
• Local index query
• Local index filtering in Cassandra
• Aggregation in Spark

SASI local read/write path
SASI Life-cycle: in-memory
1. Write to the commit log (Commit log 1, Commit log 2, …, Commit log n)
2. Write to the table's MemTable (MemTable Table 1, MemTable Table 2, …, MemTable Table N)
3. Write to the table's Index MemTable (Index MemTable 1, Index MemTable 2, …, Index MemTable N)
Then ACK the client
Local write path data structures

Index mode, data type | Data structure              | Usage
PREFIX, text          | Guava ConcurrentRadixTree   | name LIKE 'John%'
CONTAINS, text        | Guava ConcurrentSuffixTree  | name LIKE '%John%'; name LIKE '%ny'
PREFIX, other         | JDK ConcurrentSkipListSet   | age = 20; age >= 20 AND age <= 30
SPARSE, other         | JDK ConcurrentSkipListSet   | age = 20; age >= 20 AND age <= 30; suitable for 1-to-N index with N ≤ 5
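To see why ordered in-memory structures make equality, range and prefix lookups cheap, here is a toy Python analogue. It is not the Java structures named in the table: the class `SortedTermIndex` and its methods are hypothetical, and a sorted list with binary search stands in for the skip list and radix tree.

```python
import bisect
from collections import defaultdict


class SortedTermIndex:
    """Toy in-memory index: sorted terms, each mapped to the keys holding it."""

    def __init__(self):
        self._terms = []               # sorted list of distinct terms
        self._keys = defaultdict(set)  # term -> partition keys

    def add(self, term, key):
        if term not in self._keys:
            bisect.insort(self._terms, term)
        self._keys[term].add(key)

    def equal(self, term):
        """Keys whose term equals `term` (age = 20)."""
        return set(self._keys.get(term, ()))

    def between(self, low, high):
        """Keys whose term t satisfies low <= t <= high (age >= 20 AND age <= 30)."""
        start = bisect.bisect_left(self._terms, low)
        end = bisect.bisect_right(self._terms, high)
        found = set()
        for term in self._terms[start:end]:
            found |= self._keys[term]
        return found

    def prefix(self, p):
        """Keys whose string term starts with p (name LIKE 'John%')."""
        start = bisect.bisect_left(self._terms, p)
        found = set()
        for term in self._terms[start:]:
            if not term.startswith(p):
                break  # terms sharing a prefix are contiguous in sorted order
            found |= self._keys[term]
        return found
```

A suffix query such as LIKE '%ny' cannot use this ordering, which is why the CONTAINS mode needs a different structure.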
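SASI Life-cycle: flush to SSTable
[Diagram: the in-memory tables (Table 1, Table 2, Table 3) are flushed to disk; each flush produces an SSTable (SSTable 1, SSTable 2, SSTable 3) together with its OnDiskIndex (OnDiskIndex 1, OnDiskIndex 2, OnDiskIndex 3), alongside Commit log 1, Commit log 2, …, Commit log n]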
SASI Life-cycle: compaction
SSTable 1 + OnDiskIndex 1, SSTable 2 + OnDiskIndex 2, SSTable 3 + OnDiskIndex 3 → New SSTable + New OnDiskIndex
Local write path summary
Index files are built
• on memtable flush
• on compaction flush
To avoid OOM, index files are split into chunks of
• 1 GB for memtable flush
• max_compaction_flush_memory_in_mb for compaction flush
→ consequence: SASI has an impact on write bandwidth (CPU & disk I/O)
Local read path
• first, optimize the query using the Query Planner (see later)
• then load chunks (4k) of index files from disk into memory
• perform a binary search to find the indexed value(s)
• retrieve the corresponding partition keys and push them into the Partition Key Cache
→ Yes, currently SASI only keeps partition key(s), so on wide partitions it's not very optimized ...
OnDiskIndex files
SSTable 1: user_id 4 → FR, user_id 1 → US, user_id 5 → FR
OnDiskIndex 1: FR, US
SSTable 2: user_id 3 → UK, user_id 2 → DE
OnDiskIndex 2: UK, DE
B+Tree-like data structures
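The one-index-per-SSTable idea can be sketched with the rows from this slide. The functions `build_on_disk_index` and `search` are hypothetical, and plain dictionaries stand in for the B+Tree-like files. The point is only that each SSTable carries its own index and a query consults all of them.

```python
from collections import defaultdict


def build_on_disk_index(sstable_rows):
    """Group partition keys by indexed value, keeping terms in sorted order."""
    index = defaultdict(list)
    for user_id, country in sstable_rows:
        index[country].append(user_id)
    return {term: sorted(keys) for term, keys in sorted(index.items())}


def search(term, indexes):
    """Collect the partition keys for `term` from every per-SSTable index."""
    found = []
    for index in indexes:
        found.extend(index.get(term, []))
    return found


# Rows taken from the slide: (user_id, country).
sstable_1 = [(4, "FR"), (1, "US"), (5, "FR")]
sstable_2 = [(3, "UK"), (2, "DE")]
index_1 = build_on_disk_index(sstable_1)
index_2 = build_on_disk_index(sstable_2)
```

Compaction would merge both SSTables and rebuild a single index for the result, which is why the index follows the SSTable life-cycle.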
OnDiskIndex Layout
• Header Block (4k)
• Data Block (multiple of 4k)
• Pointer Block (multiple of 4k)
• Meta Data Info: Levels Count, Pointer Block Meta, Data Block Meta, Level Index Offset
Header Block layout

Field              | Type
Descriptor Version | variable
Term Size          | short
Min Term           | short
Max Term           | short
Min Pk             | short
Max Pk             | short
Index Mode         | variable
Has Partial        | byte
Data Block layout
Each 4k data block contains, in order: Terms Count, Offset Array, Term Block, Padding, TokenTree Block, Padding
Example offset arrays of successive blocks: [0, 10, 22, …], [0, 23, 35, …], [0, 17, 34, …], …, [0, 12, 28, …]
Pointer Block building
• Data Level: Data Block 1, Data Block 2, …, Data Block N (4k each)
• Pointer Level 1: Pointer Block 1, Pointer Block 2, …, Pointer Block N (4k each), with entries LastTerm 1, LastTerm 2, …, LastTerm N
• Pointer Level 2: Pointer Block N+1, Pointer Block N+2, …, with entries LastTerm M, LastTerm M+1, …, LastTerm O
• …
• Pointer Root Level: Root Pointer Block
Binary search using OnDiskIndex files
[Diagram: the search descends from the Root Pointer Block (Pointer Root Level) through Pointer Level 3, Pointer Level 2 and Pointer Level 1 down to the Data Level (Data Block 1, Data Block 2, Data Block 3, …, Data Block N)]
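A rough sketch of this descent, assuming each pointer entry stores the last term of the block (or group of blocks) beneath it. The `fanout` parameter and the function names are hypothetical; real pointer blocks are 4k pages holding many entries, not two.

```python
import bisect


def build_pointer_levels(data_blocks, fanout=2):
    """Build pointer levels bottom-up; each entry is the last term it covers."""
    levels = [[block[-1] for block in data_blocks]]  # pointer level 1
    while len(levels[-1]) > fanout:
        below = levels[-1]
        levels.append([
            below[min(i + fanout, len(below)) - 1]
            for i in range(0, len(below), fanout)
        ])
    return levels


def find(data_blocks, levels, value, fanout=2):
    """Descend from the root level to a data block, then search inside it."""
    top = levels[-1]
    idx = bisect.bisect_left(top, value)
    if idx == len(top):
        return None  # value is greater than every indexed term
    for level in reversed(levels[:-1]):
        lo = idx * fanout
        hi = min(lo + fanout, len(level))
        # Only the children of the entry chosen one level up are examined.
        idx = bisect.bisect_left(level, value, lo, hi)
    block = data_blocks[idx]
    pos = bisect.bisect_left(block, value)
    if pos < len(block) and block[pos] == value:
        return idx, pos  # (data block number, position in its term block)
    return None
```

Each level narrows the search to one child group, so only a handful of blocks need to be loaded from disk per lookup.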
Term Block Binary Search
1. val < Term 100 ? Candidates: Term 1, Term 25, Term 50, Term 75, Term 100
2. val > Term 50 ? Candidates: Term 50, Term 75, Term 100
3. val < Term 75 ? Candidates: Term 50, Term 63, Term 75
4. …
5. val = Term 57 ? Found: Term 57

Query Planner
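The halving steps of the term block search can be sketched as a textbook binary search over a sorted list of terms. The function `binary_search` is a hypothetical helper that also records which terms were probed; it is not SASI's on-disk code.

```python
def binary_search(terms, value):
    """Return (index or None, probed terms) for a sorted list of terms."""
    low, high = 0, len(terms) - 1
    probes = []
    while low <= high:
        mid = (low + high) // 2
        probes.append(terms[mid])
        if terms[mid] == value:
            return mid, probes
        if terms[mid] < value:
            low = mid + 1   # value can only be in the upper half
        else:
            high = mid - 1  # value can only be in the lower half
    return None, probes
```

For a block of 100 terms, at most 7 probes are needed, since each probe halves the remaining candidates.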
Query planner
• build predicates tree
• predicates push-down & re-ordering
• predicate fusions for != operator
Query optimization example
WHERE age < 100 AND fname LIKE 'p%' AND fname != 'pa%' AND age > 21
1. AND is associative and commutative
2. != transformed to exclusion on range scan
3. AND is associative and commutative
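One way to picture the re-ordering and fusion steps is to group predicates by column and fold them into a single range per column. The function `plan` and its output format are assumptions made for illustration, not the planner's real data structures.

```python
def plan(predicates):
    """Merge (column, operator, value) predicates into one entry per column.

    Relies on AND being associative and commutative: predicates can be
    regrouped by column in any order without changing the result.
    """
    merged = {}
    for column, op, value in predicates:
        entry = merged.setdefault(
            column, {"lower": None, "upper": None, "prefix": None, "exclude": []}
        )
        if op == ">":
            entry["lower"] = value if entry["lower"] is None else max(entry["lower"], value)
        elif op == "<":
            entry["upper"] = value if entry["upper"] is None else min(entry["upper"], value)
        elif op == "LIKE":
            entry["prefix"] = value
        elif op == "!=":
            # != becomes an exclusion applied while scanning the column's range.
            entry["exclude"].append(value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    return merged


# The WHERE clause from the slide, one tuple per predicate.
example = plan([
    ("age", "<", 100),
    ("fname", "LIKE", "p%"),
    ("fname", "!=", "pa%"),
    ("age", ">", 21),
])
```

After merging, each column needs a single index scan: one range on age, and one prefix scan on fname that skips the excluded values.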
Some benchmarks