
SASI, Cassandra on the full text search ride

  1. SASI, Cassandra on the full text search ride DuyHai DOAN – Apache Cassandra™ Evangelist

  2. Agenda: (1) 5 minutes introduction to Apache Cassandra™ (2) SASI introduction (3) SASI cluster-wide (4) SASI local read/write path (5) Query planner (6) Some benchmarks (7) Take away. @doanduyhai

  3. Trademark Policy: from now on … Cassandra ⩵ Apache Cassandra™

  4. 5 minutes introduction to Apache Cassandra™

  5. The tokens: random hash of #partition → token = hash(#p). Hash: ]−x, x]. Hash range: 2^64 values, x = 2^64 / 2. [Diagram: ring of eight C* nodes]

  6. Token ranges: A: ]−x, −3x/4], B: ]−3x/4, −2x/4], C: ]−2x/4, −x/4], D: ]−x/4, 0], E: ]0, x/4], F: ]x/4, 2x/4], G: ]2x/4, 3x/4], H: ]3x/4, x]. [Diagram: ring of nodes A to H]
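
The token and range-ownership idea from slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides say "random hash" without naming it, so MD5 truncated to 64 bits is used here as a stand-in, and the names `token` and `owner` are hypothetical, not Cassandra APIs.

```python
import hashlib

X = 2**64 // 2        # x = 2^64 / 2, as on slide 5
NODES = "ABCDEFGH"    # the eight nodes of the ring on slide 6
WIDTH = X // 4        # every node owns a range of width x/4


def token(partition_key: str) -> int:
    """Map a partition key to a token in ]-x, x] (stand-in hash, an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    unsigned = int.from_bytes(digest[:8], "big")   # 0 .. 2^64 - 1
    return unsigned - X + 1                        # shift into -x+1 .. x


def owner(tok: int) -> str:
    """Return the node whose range ]low, high] contains the token."""
    # Ranges are open on the left and closed on the right, hence the -1.
    return NODES[(tok + X - 1) // WIDTH]


if __name__ == "__main__":
    for key in ("user_id 1", "user_id 2", "user_id 3"):
        print(key, "->", owner(token(key)))
```

Because the hash spreads keys uniformly over the 2^64 values, each of the eight nodes ends up with roughly one eighth of the partitions.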

  7. Distributed tables: CREATE TABLE users(user_id int, …, PRIMARY KEY(user_id)). [Diagram: rows user_id 1, user_id 2, user_id 3, user_id 4, user_id 5 next to the ring of nodes A to H]

  8. Distributed tables. [Diagram: rows user_id 1 to user_id 5 spread over different nodes of the ring A to H]

  9. Coordinator node: responsible for handling requests (read/write). Every node can be coordinator • masterless • no SPOF • proxy role. [Diagram: a request reaches the coordinator, which forwards it to other nodes of the ring A to H]

  10. Q & A

  11. SASI introduction

  12. What is SASI? • SSTable-Attached Secondary Index → new 2nd index implementation that follows the SSTable life-cycle • Objective: provide a more performant & capable 2nd index

  13. Who created it? Open-source contribution by a team of engineers

  14. Why is it better than the native 2nd index? • follows the SSTable life-cycle (flush, compaction, rebuild …) → more optimized • new data structures • range queries (<, ≤, >, ≥) possible • full text search options

  15. Demo

  16. SASI cluster-wide

  17. Distributed index: at cluster level, SASI works exactly like the native 2nd index. [Diagram: index entries held on different nodes of the ring A to H: UK → user 87, user 176, … user 987; UK → user 1, user 102, … user 493; UK → user 17, user 409, … user 787; US → user 54, user 483, … user 938]

  18. Distributed search algorithm: 1st round, concurrency factor = 1. [Diagram: coordinator on the ring A to H]

  19. Distributed search algorithm: not enough results?

  20. Distributed search algorithm: 2nd round, concurrency factor = 2.

  21. Distributed search algorithm: still not enough results?

  22. Distributed search algorithm: 3rd round, concurrency factor = 4.

  23. Concurrency factor formula • more details at: http://www.doanduyhai.com/blog/?p=13191
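
The round-based search of slides 18 to 22 can be sketched as follows. This is a simplified model, not Cassandra's implementation: the concurrency factors 1, 2, 4 are simply the values shown on the slides (the actual formula is described at the link above), and `query_range` is a hypothetical callback standing in for a request sent to one node.

```python
from typing import Callable, Iterable, List


def distributed_search(
    ranges: List[str],
    query_range: Callable[[str], List[str]],
    limit: int,
    concurrency_factors: Iterable[int] = (1, 2, 4),  # values shown on the slides
) -> List[str]:
    """Query token ranges round by round until `limit` rows are collected."""
    results: List[str] = []
    position = 0
    factors = iter(concurrency_factors)
    factor = 1
    while position < len(ranges) and len(results) < limit:
        # Take the next factor; keep the last one once the sequence is exhausted.
        factor = next(factors, factor)
        batch = ranges[position:position + factor]
        for token_range in batch:  # sequential stand-in for parallel requests
            results.extend(query_range(token_range))
        position += len(batch)
    return results[:limit]


if __name__ == "__main__":
    rows = {"A": ["u1"], "B": [], "C": ["u2", "u3"], "D": ["u4"]}
    print(distributed_search(list(rows), lambda r: rows[r], limit=2))
```

With a small `limit` the loop stops after a few rounds; without one, or with a non-restrictive filter, every range is eventually queried.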

  24. Caveat 1: non-restrictive filters → hit all nodes eventually :( [Diagram: coordinator querying every node of the ring A to H]

  25. Caveat 1 solution: always use LIMIT. SELECT * FROM … WHERE ... LIMIT 1000

  26. Caveat 2: 1-to-1 index (user_email). WHERE user_email = 'xxx' → not found. [Diagram: coordinator on the ring A to H]

  27. Caveat 2: 1-to-1 index (user_email). WHERE user_email = 'xxx' → still no result.

  28. Caveat 2: 1-to-1 index (user_email). WHERE user_email = 'xxx' → at best 1 user found, at worst 0 user found.

  29. Caveat 2 solution: materialized views. For a 1-to-1 index/relationship, use materialized views instead: CREATE MATERIALIZED VIEW user_by_email AS SELECT * FROM users WHERE user_id IS NOT NULL AND user_email IS NOT NULL PRIMARY KEY (user_email, user_id). But range queries (<, >, ≤, ≥) are not possible …

  30. Caveat 3: fetch all rows for analytics use-case. [Diagram: client, coordinator and the ring of nodes A to H]

  31. Caveat 3 solution: use co-located Spark. Local index query • local index filtering in Cassandra • aggregation in Spark

  32. SASI local read/write path

  33. SASI Life-cycle: in-memory. (1) Commit log (Commit log 1, Commit log 2, . . . Commit log n) (2) MemTable (MemTable Table 1, MemTable Table 2, . . . MemTable Table N), in memory (3) Index MemTable (Index MemTable 1, Index MemTable 2, . . . Index MemTable N), in memory. ACK the client.

  34. Local write path data structures

      | Index mode, data type | Data structure | Usage |
      |---|---|---|
      | PREFIX, text | Guava ConcurrentRadixTree | name LIKE 'John%' |
      | CONTAINS, text | Guava ConcurrentSuffixTree | name LIKE '%John%'; name LIKE '%ny' |
      | PREFIX, other | JDK ConcurrentSkipListSet | age = 20; age >= 20 AND age <= 30 |
      | SPARSE, other | JDK ConcurrentSkipListSet | age = 20; age >= 20 AND age <= 30; suitable for 1-to-N index with N ≤ 5 |
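
To make the "Usage" column more concrete, here is a small Python sketch of a sorted in-memory term index. It is only a stand-in: a sorted list with `bisect` replaces the concurrent radix tree and skip list named on the slide, and the class name `SortedTermIndex` and its methods are hypothetical.

```python
import bisect
from typing import Dict, List, Set


class SortedTermIndex:
    """Sorted term -> partition keys map (stand-in for a skip list / radix tree)."""

    def __init__(self) -> None:
        self._terms: List = []   # terms kept in sorted order
        self._keys: Dict = {}    # term -> set of partition keys

    def add(self, term, partition_key: str) -> None:
        if term not in self._keys:
            bisect.insort(self._terms, term)
            self._keys[term] = set()
        self._keys[term].add(partition_key)

    def equals(self, term) -> Set[str]:
        """age = 20"""
        return set(self._keys.get(term, ()))

    def between(self, low, high) -> Set[str]:
        """age >= low AND age <= high"""
        start = bisect.bisect_left(self._terms, low)
        end = bisect.bisect_right(self._terms, high)
        found: Set[str] = set()
        for term in self._terms[start:end]:
            found |= self._keys[term]
        return found

    def prefix(self, prefix: str) -> Set[str]:
        """name LIKE 'John%' (text terms only)"""
        found: Set[str] = set()
        for term in self._terms[bisect.bisect_left(self._terms, prefix):]:
            if not term.startswith(prefix):
                break  # sorted order: no later term can match
            found |= self._keys[term]
        return found
```

A CONTAINS query such as name LIKE '%John%' cannot use this ordering, which is why the slide lists a suffix tree for that mode.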

  35. SASI Life-cycle: flush to SSTable. [Diagram: in-memory Table 1, Table 2, Table 3 and Commit log 1, Commit log 2, . . . Commit log n; on flush, SSTable 1, SSTable 2, SSTable 3 are written together with OnDiskIndex 1, OnDiskIndex 2, OnDiskIndex 3]

  36. SASI Life-cycle: compaction. SSTable 1, SSTable 2, SSTable 3 with OnDiskIndex 1, OnDiskIndex 2, OnDiskIndex 3 → new SSTable with new OnDiskIndex

  37. Local write path summary. Index files are built: • on memtable flush • on compaction flush. To avoid OOM, index files are split into chunks of: • 1 GB for memtable flush • max_compaction_flush_memory_in_mb for compaction flush. → Consequence: SASI has an impact on write bandwidth (CPU & disk I/O)

  38. Local read path • first, optimize the query using the Query Planner (see later) • then load chunks (4k) of index files from disk into memory • perform binary search to find the indexed value(s) • retrieve the corresponding partition keys and push them into the Partition Key Cache → Yes, currently SASI only keeps partition key(s), so on wide partitions it's not very optimized ...

  39. OnDiskIndex files. SSTable 1: user_id 4 → FR, user_id 1 → US, user_id 5 → FR; OnDiskIndex 1: FR, US. SSTable 2: user_id 3 → UK, user_id 2 → DE; OnDiskIndex 2: UK, DE. B+Tree-like data structures
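
The one-index-per-SSTable idea of slide 39 can be illustrated with plain dictionaries. This sketch ignores the real B+Tree-like on-disk format; the function names `build_index` and `search` are hypothetical, and sorting the terms is an assumption made for readability.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(sstable_rows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build term -> partition keys for the rows of a single SSTable."""
    index: Dict[str, List[str]] = defaultdict(list)
    for partition_key, country in sstable_rows:
        index[country].append(partition_key)
    return dict(sorted(index.items()))  # keep terms in sorted order


def search(indexes: List[Dict[str, List[str]]], term: str) -> List[str]:
    """A local query has to consult the index attached to every SSTable."""
    hits: List[str] = []
    for index in indexes:
        hits.extend(index.get(term, []))
    return hits


if __name__ == "__main__":
    sstable_1 = [("user_id 4", "FR"), ("user_id 1", "US"), ("user_id 5", "FR")]
    sstable_2 = [("user_id 3", "UK"), ("user_id 2", "DE")]
    indexes = [build_index(sstable_1), build_index(sstable_2)]
    print(search(indexes, "FR"))
```

Since each index is tied to exactly one SSTable, compaction can rebuild a single new index for the merged SSTable and discard the old ones.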

  40. OnDiskIndex Layout: Header Block (4k) • Data Block (multiple of 4k) • Pointer Block (multiple of 4k) • Meta Data Info: Levels Count, Pointer Block Meta, Data Block Meta, Level Index Offset

  41. Header Block layout: Descriptor Version (variable) • Term Size (short) • Min Term (short) • Max Term (short) • Min Pk (short) • Max Pk (short) • Index Mode (variable) • Has Partial (byte)

  42. OnDiskIndex Layout: same diagram as slide 40

  43. Data Block layout: a sequence of 4k blocks, each made of Terms Count • Offset Array • Term Block • Padding • TokenTree Block • Padding. Offset Array examples in successive blocks: [0, 10, 22, …], [0, 23, 35, …], [0, 17, 34, …], … [0, 12, 28, …]
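
The role of the offset array can be shown with a simplified packing scheme. The byte layout below (a 2-byte length prefix followed by the term bytes) is an assumption made for this sketch and is not the actual SASI encoding; `pack_terms` and `read_term` are hypothetical names.

```python
import struct
from typing import List, Tuple


def pack_terms(terms: List[bytes]) -> Tuple[List[int], bytes]:
    """Pack variable-length terms into one block and record where each starts."""
    offsets: List[int] = []
    block = bytearray()
    for term in terms:
        offsets.append(len(block))                    # start of this term
        block += struct.pack(">H", len(term)) + term  # length prefix + bytes
    return offsets, bytes(block)


def read_term(offsets: List[int], block: bytes, index: int) -> bytes:
    """Jump straight to the index-th term without scanning the previous ones."""
    start = offsets[index]
    (length,) = struct.unpack_from(">H", block, start)
    return block[start + 2:start + 2 + length]


if __name__ == "__main__":
    offsets, block = pack_terms([b"DE", b"FR", b"UK"])
    print(offsets, read_term(offsets, block, 1))
```

Constant-time access to the i-th term is what makes a binary search over variable-length terms inside a block possible.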

  44. OnDiskIndex Layout: same diagram as slide 40

  45. Pointer Block building, bottom-up. Data Level: Data Block 1, Data Block 2, … Data Block N (4k each). Pointer Level 1: Pointer Block 1, Pointer Block 2, … Pointer Block N (4k each), holding LastTerm 1, LastTerm 2, … LastTerm N. Pointer Level 2: Pointer Block N+1, Pointer Block N+2, …, holding LastTerm M, LastTerm M+1, … LastTerm O. Pointer Root Level: Root Pointer Block

  46. Binary search using OnDiskIndex files: descend from the Pointer Root Level (Root Pointer Block) through Pointer Level 3, Pointer Level 2 and Pointer Level 1 (each made of Pointer Blocks) down to the Data Level (Data Block 1, Data Block 2, Data Block 3, … Data Block N)
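
The descent from the root pointer block to a data block can be sketched as below. It assumes, following the "LastTerm" labels of slide 45, that each pointer block stores the last term of every child block it covers; the fixed `fanout` parameter and the function names are hypothetical simplifications.

```python
import bisect
from typing import List, Optional

Level = List[List[str]]  # one level = list of blocks, one block = sorted terms


def build_levels(data_blocks: Level, fanout: int) -> List[Level]:
    """Build pointer levels bottom-up; the last level built is the root level."""
    levels: List[Level] = []
    children = data_blocks
    while len(children) > 1:
        last_terms = [child[-1] for child in children]
        level = [last_terms[i:i + fanout] for i in range(0, len(last_terms), fanout)]
        levels.append(level)
        children = level
    return levels


def find_data_block(levels: List[Level], fanout: int, term: str) -> Optional[int]:
    """Return the index of the only data block that may contain `term`."""
    block_index = 0  # block to inspect on the current level
    for level in reversed(levels):  # root level first
        pointer_block = level[block_index]
        slot = bisect.bisect_left(pointer_block, term)  # first last-term >= term
        if slot == len(pointer_block):
            return None  # term is greater than every indexed term
        block_index = block_index * fanout + slot
    return block_index


if __name__ == "__main__":
    blocks = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"], ["i", "j"]]
    levels = build_levels(blocks, fanout=2)
    print(find_data_block(levels, 2, "e"))
```

Only one block per level is read, so the number of blocks touched grows with the number of levels rather than with the number of data blocks.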

  47. Term Block Binary Search: val < Term 100? (Term 1, Term 25, Term 50, Term 75, Term 100) → val > Term 50? (Term 50, Term 75, Term 100) → val < Term 75? (Term 50, Term 63, Term 75) → … → val = Term 57? (Term 57)
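
The narrowing shown on slide 47 is an ordinary binary search over sorted terms. A minimal Python version that also records which terms were compared (the function name `binary_search` and the integer terms are illustrative only):

```python
from typing import List, Optional, Tuple


def binary_search(terms: List[int], value: int) -> Tuple[Optional[int], List[int]]:
    """Return (position of value or None, list of terms that were compared)."""
    low, high = 0, len(terms) - 1
    probes: List[int] = []
    while low <= high:
        middle = (low + high) // 2
        probes.append(terms[middle])
        if terms[middle] == value:
            return middle, probes
        if terms[middle] < value:
            low = middle + 1   # value can only be in the upper half
        else:
            high = middle - 1  # value can only be in the lower half
    return None, probes


if __name__ == "__main__":
    position, probes = binary_search(list(range(1, 101)), 57)
    print(position, probes)
```

The exact terms probed depend on how the midpoint is chosen, so they need not match the ones drawn on the slide; what matters is that the candidate range halves at every step.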

  48. Query Planner

  49. Query planner • build predicates tree • predicates push-down & re-ordering • predicate fusions for != operator

  50. Query optimization example: WHERE age < 100 AND fname LIKE 'p%' AND fname != 'pa%' AND age > 21

  51. Query optimization example: AND is associative and commutative

  52. Query optimization example: != transformed to exclusion on range scan

  53. Query optimization example: AND is associative and commutative
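
The re-ordering and fusion steps of slides 50 to 53 can be approximated in Python. This is a rough sketch and not SASI's query planner: predicates are modelled as hypothetical `(column, operator, value)` tuples, grouped per column (allowed because AND is associative and commutative), range bounds are merged, and != is kept as an exclusion applied during the scan.

```python
from typing import Any, Dict, List, Tuple

Predicate = Tuple[str, str, Any]  # (column, operator, value), illustrative only


def plan(predicates: List[Predicate]) -> Dict[str, Dict[str, Any]]:
    """Group AND-ed predicates per column and fuse them into one scan each."""
    scans: Dict[str, Dict[str, Any]] = {}
    for column, operator, value in predicates:
        scan = scans.setdefault(
            column, {"lower": None, "upper": None, "like": None, "exclude": []}
        )
        if operator == ">":
            # Keep the tightest lower bound.
            scan["lower"] = value if scan["lower"] is None else max(scan["lower"], value)
        elif operator == "<":
            # Keep the tightest upper bound.
            scan["upper"] = value if scan["upper"] is None else min(scan["upper"], value)
        elif operator == "LIKE":
            scan["like"] = value
        elif operator == "!=":
            scan["exclude"].append(value)  # exclusion applied during the scan
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return scans


if __name__ == "__main__":
    where = [
        ("age", "<", 100),
        ("fname", "LIKE", "p%"),
        ("fname", "!=", "pa%"),
        ("age", ">", 21),
    ]
    print(plan(where))
```

For the example query this yields one range scan on age (21 < age < 100) and one prefix scan on fname ('p%') that skips 'pa%', instead of four independent lookups.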

  54. Some benchmarks
