SLIDE 1

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Pandian Raju¹, Rohan Kadekodi¹, Vijay Chidambaram¹,², Ittai Abraham²

¹The University of Texas at Austin   ²VMware Research

SLIDE 2

What is a key-value store?

  • Store any arbitrary value for a given key

[Diagram: keys 123 and 124 map to the values {"name": "John Doe", "age": 25} and {"name": "Ross Gel", "age": 28}]

SLIDE 3

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions:
  • Point lookups:
  • Range Queries:

SLIDE 4

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups:
  • Range Queries:

SLIDE 5

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups: get(key)
  • Range Queries:

SLIDE 6

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups: get(key)
  • Range Queries: get_range(key1, key2)
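
A toy in-memory store illustrating these three operations (illustration only; the class and names below are made up, not PebblesDB code):

```python
# A minimal sketch of the key-value store interface described above.
class TinyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insertion: store an arbitrary value under the given key.
        self._data[key] = value

    def get(self, key):
        # Point lookup: return the value for one key (None if absent).
        return self._data.get(key)

    def get_range(self, key1, key2):
        # Range query: all pairs with key1 <= key <= key2, in key order.
        return [(k, self._data[k]) for k in sorted(self._data) if key1 <= k <= key2]

store = TinyKVStore()
store.put(123, {"name": "John Doe", "age": 25})
store.put(124, {"name": "Ross Gel", "age": 28})
print(store.get(123))              # {'name': 'John Doe', 'age': 25}
print(store.get_range(123, 124))   # both pairs, in key order
```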

SLIDE 7

Key-Value Stores - widely used

  • Google’s BigTable powers Search, Analytics, Maps and Gmail
  • Facebook’s RocksDB is used as the storage engine in the production systems of many companies

SLIDE 8

Write-optimized data structures

  • Log-Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
  • Provides high write throughput with good read throughput, but suffers from high write amplification

SLIDE 9

Write-optimized data structures

  • Log-Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
  • Provides high write throughput with good read throughput, but suffers from high write amplification
  • Write amplification: ratio of the amount of write I/O to the amount of user data

[Diagram: a client writes 10 GB of user data to the KV store; if the total write I/O is 200 GB, the write amplification is 20]

SLIDE 10
  • Inserted 500M key-value pairs
  • Key: 16 bytes, Value: 128 bytes
  • Total user data: ~45 GB

[Bar chart: write amplification in LSM-based KV stores — total write I/O (GB) for RocksDB, LevelDB and PebblesDB against ~45 GB of user data]

SLIDE 11
  • Inserted 500M key-value pairs
  • Key: 16 bytes, Value: 128 bytes
  • Total user data: ~45 GB

[Bar chart: write amplification in LSM-based KV stores — for ~45 GB of user data, total write I/O is 1868 GB (42x) for RocksDB, 1222 GB (27x) for LevelDB and 756 GB (17x) for PebblesDB]

Write amplification in LSM based KV stores

SLIDE 12

Why is write amplification bad?

  • Reduces the write throughput
  • Flash devices wear out after a limited number of write cycles

(Intel SSD DC P4600: can last ~5 years assuming ~5 TB of writes per day)

With RocksDB, writing ~500 GB of user data per day wears the SSD out in about 1.25 years

Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html

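A back-of-envelope check of that lifetime claim, using only the numbers quoted above (the arithmetic below is my own, not taken from the slides):

```python
# Endurance implied by the P4600 figures, divided by the device writes RocksDB
# generates for ~500 GB/day of user data.
rated_years, rated_tb_per_day = 5, 5                   # Intel P4600 figures cited above
endurance_tb = rated_years * 365 * rated_tb_per_day    # ~9125 TB of total device writes

write_amplification = 42                               # RocksDB, from the earlier chart
user_data_tb_per_day = 0.5                             # ~500 GB of user data per day
device_writes_tb_per_day = user_data_tb_per_day * write_amplification

lifetime_years = endurance_tb / device_writes_tb_per_day / 365
print(f"~{lifetime_years:.2f} years")                  # ~1.19 years, i.e. roughly 1.25
```
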
SLIDE 13

PebblesDB

High-performance, write-optimized key-value store built using a new data structure, the Fragmented Log-Structured Merge Tree

Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB

Achieves the highest write throughput and the lowest write amplification when used as a backend store for MongoDB

SLIDE 14

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 16

Log-Structured Merge Tree (LSM)

Data is stored both in memory and storage

[Diagram: an in-memory table in memory and File 1 on storage]

SLIDE 17

Log-Structured Merge Tree (LSM)

Writes are directly put to memory

[Diagram: Write(key, value) goes to the in-memory table; File 1 is on storage]

SLIDE 18

Log-Structured Merge Tree (LSM)

In-memory data is periodically written as files to storage (sequential I/O)

[Diagram: the in-memory table in memory; File 1 and File 2 on storage]

SLIDE 19

Log-Structured Merge Tree (LSM)

Files on storage are logically arranged in different levels

[Diagram: the in-memory table in memory; Level 0, Level 1, …, Level n on storage]

SLIDE 20

Log-Structured Merge Tree (LSM)

Compaction pushes data to higher-numbered levels

[Diagram: compaction moves data from the in-memory table down through Level 0, Level 1, …, Level n]

SLIDE 21

Log-Structured Merge Tree (LSM)

Files within a level are sorted and have non-overlapping key ranges, so a level can be searched using binary search

[Diagram: a level with files covering key ranges 1–12, 15–19, 25–75 and 79–99]

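A small sketch of that search (for illustration only, not LevelDB/PebblesDB code): because the files are sorted and non-overlapping, one binary search over the files' smallest keys finds the only file that can contain a key:

```python
import bisect

# Each file is summarized by its (smallest, largest) key; files in a level are
# sorted and non-overlapping, so one binary search finds the single candidate.
level = [(1, 12), (15, 19), (25, 75), (79, 99)]

def file_for_key(level, key):
    smallest_keys = [lo for lo, _ in level]
    i = bisect.bisect_right(smallest_keys, key) - 1    # last file starting <= key
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return i                                       # index of the candidate file
    return None                                        # key falls in a gap

print(file_for_key(level, 30))   # 2  -> the file covering 25-75
print(file_for_key(level, 13))   # None -> no file can contain key 13
```
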
SLIDE 22

Log-Structured Merge Tree (LSM)

Level 0 can have files with overlapping (but internally sorted) key ranges; there is a limit on the number of level-0 files

[Diagram: Level 0 holds overlapping files with key ranges 2–57 and 23–78]

SLIDE 23

Write amplification: Illustration

Max files in level 0 is configured to be 2

[Diagram: Level 0 holds files 2–37 and 23–48, and a new file 58–68 is flushed from memory; Level 1 holds files 1–12, 15–25, 39–62 and 77–95]

Level 1 re-write counter: 1

SLIDE 24

Write amplification: Illustration

Level 0 has 3 files (> 2), which triggers a compaction

[Diagram: same state as the previous slide]

Level 1 re-write counter: 1

SLIDE 25

Write amplification: Illustration

Files are immutable; each level keeps sorted, non-overlapping files

[Diagram: same state as the previous slide]

Level 1 re-write counter: 1

SLIDE 26

Write amplification: Illustration

Set of overlapping files between levels 0 and 1

[Diagram: the overlapping files across levels 0 and 1 are highlighted]

Level 1 re-write counter: 1

SLIDE 29

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: the level-0 files (2–37, 23–48, 58–68) and the overlapping level-1 files (1–12, 15–25, 39–62) are merged into a run covering keys 1–68 and rewritten as new level-1 files 1–23, 24–46 and 47–68]

Level 1 re-write counter: 1 → 2

SLIDE 30

Write amplification: Illustration

Level 0 is compacted

[Diagram: Level 0 is now empty; Level 1 holds files 1–23, 24–46, 47–68 and 77–95]

Level 1 re-write counter: 2

SLIDE 31

Write amplification: Illustration

Data is flushed as level-0 files after some further write operations

[Diagram: Level 0 now holds files 10–33, 17–53 and 1–121; Level 1 holds files 1–23, 24–46, 47–68 and 77–95]

Level 1 re-write counter: 2

SLIDE 32

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: same state as the previous slide; the level-0 and level-1 files are selected for compaction]

Level 1 re-write counter: 2

SLIDE 33

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: all level-0 and level-1 files are merged into a run covering keys 1–121 and rewritten as new level-1 files 1–30, 31–60, 62–90 and 92–121]

Level 1 re-write counter: 2 → 3

SLIDE 34

Write amplification: Illustration

Existing data is re-written to the same level (1) 3 times

[Diagram: Level 0 is empty; Level 1 holds files 1–30, 31–60, 62–90 and 92–121]

Level 1 re-write counter: 3

SLIDE 35

Root cause of write amplification: rewriting data to the same level multiple times, in order to maintain sorted, non-overlapping files in each level

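A toy sketch of that root cause (my own illustration, not LevelDB's compaction code): merging level-0 files into level 1 rewrites every overlapping level-1 file, so data already on disk is written again:

```python
# Each "file" is a sorted list of keys. Compacting level 0 into level 1 merges
# the new data with the overlapping level-1 files and writes everything again,
# purely to keep level 1 sorted and non-overlapping.
def compact(level0_files, level1_files):
    merged = sorted({k for f in level0_files + level1_files for k in f})
    rewritten = sum(len(f) for f in level1_files)    # level-1 data written again
    new_data = sum(len(f) for f in level0_files)     # data written for the first time
    new_level1 = [merged[i:i + 4] for i in range(0, len(merged), 4)]   # disjoint files
    return new_level1, new_data, rewritten

level0 = [[2, 23, 37], [23, 40, 48]]
level1 = [[1, 5, 12], [15, 20, 25], [39, 50, 62]]
new_level1, new, rewritten = compact(level0, level1)
print(rewritten, "existing keys rewritten while adding", new, "new keys")
```
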
SLIDE 36

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 37

Naïve approach to reduce write amplification

  • Just append the file to the end of next level
  • Many (possibly all) overlapping files within a level
  • Affects the read performance

[Diagram: Level i with files 1–89, 6–91, 5–65, 9–99, 1–102, 1–271 and 8–95; all files have overlapping key ranges]

SLIDE 38

Partially sorted levels

  • Hybrid between all non-overlapping files and all overlapping files
  • Inspired by the skip list data structure
  • Concrete boundaries (guards) group together overlapping files

[Diagram: Level i with guards 13, 35 and 70 and files 1–12, 18–31, 13–34, 42–65, 72–87, 45–56 and 40–47; files within the same guard can have overlapping key ranges]

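A sketch of how guards partition a level (for illustration; the file-to-guard rule below is assumed, not taken verbatim from the paper): a file is placed in the guard whose range contains its smallest key, and files within a guard may overlap:

```python
import bisect

# Guards split the key space of a level; index 0 is the sentinel guard that
# precedes the first guard key.
guards = [13, 35, 70]

def guard_index(guards, smallest_key):
    return bisect.bisect_right(guards, smallest_key)

level_files = [(1, 12), (18, 31), (13, 34), (42, 65), (72, 87), (45, 56), (40, 47)]
by_guard = {}
for f in level_files:
    by_guard.setdefault(guard_index(guards, f[0]), []).append(f)

for g in sorted(by_guard):
    print("guard", g, "->", by_guard[g])
# guard 0 -> [(1, 12)]
# guard 1 -> [(18, 31), (13, 34)]
# guard 2 -> [(42, 65), (45, 56), (40, 47)]
# guard 3 -> [(72, 87)]
```
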
SLIDE 39

Fragmented Log-Structured Merge Tree

Novel modification of the LSM data structure: uses guards to maintain partially sorted levels, and writes data only once per level in most cases

SLIDE 40

FLSM structure

Note how files are logically grouped within guards

[Diagram: Level 0 holds files 2–37 and 23–48; Level 1 (guards 15, 70) holds files 1–12, 15–59, 77–87 and 82–95; Level 2 (guards 15, 40, 70, 95) holds files 2–8, 15–23, 16–32, 45–65, 70–90 and 96–99]

SLIDE 41

FLSM structure

Guards get more fine-grained deeper into the tree

[Diagram: the same FLSM state as the previous slide; Level 1 has 2 guards, Level 2 has 4]

SLIDE 42

How does FLSM reduce write amplification?

SLIDE 43

How does FLSM reduce write amplification?

Max files in level 0 is configured to be 2

[Diagram: the FLSM state from before, with a new file 30–68 flushed to Level 0, which already has 2 files]

SLIDE 44

How does FLSM reduce write amplification?

Compacting level 0

[Diagram: the level-0 files (2–37, 23–48, 30–68) are merged into a run covering keys 2–68, then fragmented at guard 15 of Level 1 into pieces 2–14 and 15–68]

SLIDE 45

How does FLSM reduce write amplification?

Fragmented files are just appended to the next level

[Diagram: the fragments 2–14 and 15–68 are appended to Level 1 without rewriting its existing files; guard 15 of Level 1 now holds both 15–59 and 15–68]

SLIDE 46

How does FLSM reduce write amplification?

Guard 15 in Level 1 is to be compacted

[Diagram: guard 15 of Level 1 currently holds files 15–59 and 15–68]

SLIDE 47

How does FLSM reduce write amplification?

Files are combined, sorted and fragmented

[Diagram: the files under guard 15 of Level 1 are merged and fragmented at guard 40 of Level 2 into pieces 15–39 and 40–68]

SLIDE 48

How does FLSM reduce write amplification?

Fragmented files are just appended to the next level

[Diagram: the fragments 15–39 and 40–68 are appended under guards 15 and 40 of Level 2 without rewriting the existing level-2 files]

SLIDE 49

How does FLSM reduce write amplification?

FLSM doesn’t re-write data to the same level in most cases

How does FLSM maintain read performance?

FLSM maintains partially sorted levels to efficiently reduce the search space

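A sketch of FLSM-style compaction of one guard (for illustration only, not PebblesDB code): the guard's files are merged once, cut at the next level's guard boundaries, and the fragments are appended there without touching the next level's existing files:

```python
import bisect

# Compact one guard: merge and sort its files, fragment the result at the next
# level's guard keys, and append the fragments; nothing in the next level is
# read or rewritten.
def compact_guard(guard_files, next_level_guards, next_level):
    merged = sorted(k for f in guard_files for k in f)
    cuts = [bisect.bisect_left(merged, g) for g in next_level_guards]
    pieces = [merged[a:b] for a, b in zip([0] + cuts, cuts + [len(merged)]) if merged[a:b]]
    for piece in pieces:
        g = bisect.bisect_right(next_level_guards, piece[0])
        next_level.setdefault(g, []).append(piece)     # append; nothing is rewritten
    return next_level

guard15_files = [[15, 20, 59], [15, 33, 68]]           # two overlapping files under guard 15
level2_guards = [15, 40, 70, 95]
level2 = {1: [[15, 23], [16, 32]], 2: [[45, 65]]}      # existing level-2 files stay untouched
print(compact_guard(guard15_files, level2_guards, level2))
```
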
SLIDE 50

Selecting Guards

  • Guards are chosen randomly and dynamically
  • Dependent on the distribution of data

[Diagram: guards are selected at points across the keyspace from 1 to 1e+9]

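One plausible way to pick guards probabilistically, in the spirit of skip lists (the slides only say guards are chosen randomly and dynamically; the hashing scheme and parameters below are assumptions, not necessarily PebblesDB's exact rule):

```python
import hashlib

# A key becomes a guard for a level if its hash ends in enough zero bits;
# deeper levels require fewer bits, so they get exponentially more guards,
# and a guard for level i is by construction a guard for every deeper level.
def is_guard(key, level, max_level=5, bits_per_level=3):
    h = int(hashlib.sha256(str(key).encode()).hexdigest(), 16)
    required_zero_bits = (max_level - level + 1) * bits_per_level
    return h % (1 << required_zero_bits) == 0

sample = range(0, 1_000_000, 997)          # a sample of the inserted keys
counts = {lvl: sum(is_guard(k, lvl) for k in sample) for lvl in range(1, 6)}
print(counts)                              # few guards in level 1, ~8x more per deeper level
```
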
SLIDE 54

Operations: Write

Write(key, value), e.g. Put(1, “abc”), goes directly to the in-memory table

[Diagram: the FLSM structure from before; the new entry is added in memory]

SLIDE 55

Operations: Get

Get(23) on the FLSM structure

[Diagram: the FLSM structure from before]

SLIDE 56

Operations: Get

Get(23): search level by level, starting from memory

[Diagram: the FLSM structure from before]

SLIDE 57

Operations: Get

Get(23): all level-0 files need to be searched

[Diagram: both level-0 files, 2–37 and 23–48, overlap key 23]

SLIDE 58

Operations: Get

Get(23): in Level 1, the file under guard 15 is searched

[Diagram: guard 15 of Level 1 holds the file 15–59]

SLIDE 59

Operations: Get

Get(23): in Level 2, both files under guard 15 are searched

[Diagram: guard 15 of Level 2 holds the files 15–23 and 16–32]

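A sketch of the whole lookup (for illustration only, not PebblesDB code): levels are searched from newest to oldest; within a level, binary search over the guard keys picks the one guard that can hold the key, and every file inside that guard is checked:

```python
import bisect

# levels = [(guard_keys, {guard_index: [file_dict, ...]}), ...], newest first.
def flsm_get(key, memtable, levels):
    if key in memtable:                        # memory is checked first
        return memtable[key]
    for guards, files_by_guard in levels:      # then each level on storage
        g = bisect.bisect_right(guards, key)   # the only guard that can hold `key`
        for f in files_by_guard.get(g, []):    # but every file under that guard
            if key in f:
                return f[key]
    return None

memtable = {}
level1 = ([15, 70], {0: [{1: "a", 12: "b"}], 1: [{23: "v1", 59: "c"}], 2: [{77: "d"}]})
level2 = ([15, 40, 70, 95], {1: [{16: "e"}, {23: "v0"}], 2: [{45: "f"}]})
print(flsm_get(23, memtable, [level1, level2]))   # "v1" -- the newer level wins
```
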
SLIDE 61

High write throughput in FLSM

  • If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction
  • When compaction from memory to level 0 stalls, writes to memory also stall

[Diagram: an incoming Write(key, value) waits while in-memory tables are compacted to level 0]

FLSM has faster compaction because of lower I/O, and hence higher write throughput

SLIDE 62

Challenges in FLSM

  • Every read/range query operation needs to examine multiple files per level
  • For example, if every guard has 5 files, read latency is increased by 5x (assuming no cache hits)

Trade-off between write I/O and read performance

SLIDE 63

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 64

PebblesDB

  • Built by modifying HyperLevelDB (~9100 LOC) to use FLSM
  • HyperLevelDB builds on LevelDB to provide improved parallelism and compaction
  • API-compatible with LevelDB, but not with RocksDB

SLIDE 66

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters

[Diagram: a Bloom filter answers “Is key 25 present?” with either “definitely not” or “possibly yes”]

SLIDE 68

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters

[Diagram: Level 1 with guards 15 and 70 and files 1–12, 15–39, 77–97 and 82–95, each with its own in-memory Bloom filter]

With file-level Bloom filters, PebblesDB reads the same number of files as any LSM-based store

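A sketch of how a per-file Bloom filter keeps the number of file reads down when a guard holds several files (the tiny Bloom filter below is hand-rolled for illustration and is not PebblesDB's implementation):

```python
import hashlib

# A tiny Bloom filter: a few hash positions per key are set in a bit array; a
# lookup that finds any of them unset proves the key is absent.
class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array & (1 << p) for p in self._positions(key))

# One filter per file in a guard; get() only opens files whose filter says "maybe".
guard_files = [{15: "a", 23: "b", 39: "c"}, {16: "d", 32: "e"}]
filters = []
for f in guard_files:
    bf = BloomFilter()
    for k in f:
        bf.add(k)
    filters.append(bf)

def get_in_guard(key):
    for bf, f in zip(filters, guard_files):
        if bf.might_contain(key) and key in f:   # usually at most one file is read
            return f[key]
    return None

print(get_in_guard(23))   # "b" -- found without opening the second file
```
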
SLIDE 69

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters
  • Range query performance is improved using parallel threads and better compaction

SLIDE 70

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 71

Evaluation

  • Micro-benchmarks
  • Low memory
  • Small dataset
  • Crash recovery
  • CPU and memory usage
  • Aged file system
  • Real-world workloads - YCSB
  • NoSQL applications

SLIDE 74

Real-world workloads - YCSB

  • Yahoo! Cloud Serving Benchmark - industry-standard macro-benchmark
  • Insertions: 50M, operations: 10M, key size: 16 bytes, value size: 1 KB

[Bar chart: throughput ratio w.r.t. HyperLevelDB for each workload, plus total write I/O; annotated values: Load A 35.08 Kops/s, Run A 25.8 Kops/s, Run B 33.98 Kops/s, Run C 22.41 Kops/s, Run D 57.87 Kops/s, Load E 34.06 Kops/s, Run E 5.8 Kops/s, Run F 32.09 Kops/s, Total IO 952.93 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

SLIDE 84

NoSQL stores - MongoDB

  • YCSB on MongoDB, a widely used NoSQL store
  • Inserted 20M key-value pairs with 1 KB value size; 10M operations

[Bar chart: throughput ratio w.r.t. WiredTiger for each workload, plus total write I/O; annotated values: Load A 20.73 Kops/s, Run A 9.95 Kops/s, Run B 15.52 Kops/s, Run C 19.69 Kops/s, Run D 23.53 Kops/s, Load E 20.68 Kops/s, Run E 0.65 Kops/s, Run F 9.78 Kops/s, Total IO 426.33 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB

SLIDE 85

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 86

Conclusion

  • PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees
  • Increases write throughput and reduces write IO at the same time
  • Obtains up to 6x the write throughput of RocksDB
  • As key-value stores become more widely used, there have been several attempts to optimize them
  • PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

SLIDE 87

https://github.com/utsaslab/pebblesdb

SLIDE 89

Backup slides

SLIDE 90

Operations: Seek

  • Seek(target): returns the smallest key in the database which is >= target
  • Used for range queries (for example, return all entries between 5 and 18)

Get(1)

Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500

SLIDE 91

Operations: Seek

  • Seek(target): returns the smallest key in the database which is >= target
  • Used for range queries (for example, return all entries between 5 and 18)

Seek(200)

Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500

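A sketch of Seek over multiple levels (my own illustration): every level is positioned at its first key >= target and the smallest candidate wins, which is why each level has to be consulted:

```python
import bisect

levels = [
    [1, 2, 100, 1000],    # Level 0
    [1, 5, 10, 2000],     # Level 1
    [5, 300, 500],        # Level 2
]

def seek(target):
    # Position every level at its first key >= target; the smallest wins.
    candidates = []
    for level in levels:
        i = bisect.bisect_left(level, target)
        if i < len(level):
            candidates.append(level[i])
    return min(candidates) if candidates else None

print(seek(200))   # 300 -- no level holds a key in [200, 300)
print(seek(5))     # 5
```
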
SLIDE 94

Operations: Seek

Seek(23): all levels and the memtable need to be searched

[Diagram: the FLSM structure from before]

SLIDE 95

Optimizations in PebblesDB

  • Challenge with reads: multiple sstable reads per level
  • Optimized using sstable-level Bloom filters
  • Bloom filter: determines whether an element is in a set

[Diagram: a Bloom filter answers “Is key 25 present?” with either “definitely not” or “possibly yes”]

SLIDE 97

Optimizations in PebblesDB

  • Challenge with reads: multiple sstable reads per level
  • Optimized using sstable-level Bloom filters
  • Bloom filter: determines whether an element is in a set

[Diagram: Level 1 with guards 15 and 70 and files 1–12, 15–39, 77–97 and 82–95, each with an in-memory Bloom filter; for Get(97), the filters under guard 70 return False for file 82–95 and True for file 77–97, so only one file is read]

SLIDE 98

Optimizations in PebblesDB

With sstable-level Bloom filters, PebblesDB reads at most one file per guard with high probability

SLIDE 99

Optimizations in PebblesDB

  • Challenge with seeks: Multiple sstable reads per level
  • Parallel seeks: Parallel threads to seek() on files in a guard

[Diagram: for Seek(85), the two files under guard 70 of Level 1 (77–97 and 82–95) are seeked by Thread 1 and Thread 2 in parallel]

SLIDE 100

Optimizations in PebblesDB

  • Challenge with seeks: multiple sstable reads per level
  • Parallel seeks: parallel threads to seek() on the files in a guard
  • Seek-based compaction: triggers compaction for a level during a seek-heavy workload
  • Reduces the average number of sstables per guard
  • Reduces the number of active levels

Seek-based compaction increases write I/O as a trade-off to improve seek performance

SLIDE 101

Tuning PebblesDB

  • PebblesDB characteristics (the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations) all depend on one parameter, maxFilesPerGuard (default 2 in PebblesDB)
  • Setting it to a very high value favors write throughput
  • Setting it to a very low value favors read throughput

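A sketch of how this single knob trades write I/O against read cost (an assumed policy for illustration, not PebblesDB's actual trigger logic): a guard is compacted only once it holds more than the allowed number of files:

```python
# A larger limit means fewer rewrites (better write throughput and lower write
# amplification) but more files to examine per guard (costlier reads/seeks).
def needs_compaction(files_in_guard, max_files_per_guard=2):
    return len(files_in_guard) > max_files_per_guard

guard = [["file-a"], ["file-b"]]
print(needs_compaction(guard))                           # False: within the default limit
guard.append(["file-c"])
print(needs_compaction(guard))                           # True: exceeding it triggers compaction
print(needs_compaction(guard, max_files_per_guard=4))    # False: a higher limit defers write I/O
```
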
SLIDE 102

Horizontal compaction

  • Files are compacted within the same level for the last two levels in PebblesDB
  • Some optimizations prevent a huge increase in write IO

SLIDE 103

Experimental setup

  • Intel Xeon 2.8 GHz processor
  • 16 GB RAM
  • Running Ubuntu 16.04 LTS with the Linux 4.4 kernel
  • Software RAID0 over 2 Intel 750 SSDs (1.2 TB each)
  • Datasets in experiments 3x bigger than DRAM size

SLIDE 104

Write amplification

  • Inserted different numbers of keys with key size 16 bytes and value size 128 bytes

[Bar chart: write IO ratio w.r.t. PebblesDB for 10M, 100M and 500M keys inserted; PebblesDB's write IO is 7.2 GB, 100.7 GB and 756 GB respectively]

SLIDE 106

Micro-benchmarks

  • Used the db_bench tool that ships with LevelDB
  • Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB
  • Number of read/seek operations: 10M

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Seq-Writes, Random-Writes, Reads, Range-Queries and Deletes; annotated values: 239.05, 11.72, 6.89, 7.5 and 126.2 Kops/s respectively]

SLIDE 107

Multi-threaded micro-benchmarks

  • Writes – 4 threads, each writing 10M
  • Reads – 4 threads, each reading 10M
  • Mixed – 2 threads writing and 2 threads reading (10M each)

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Mixed; annotated values: 44.4, 40.2 and 38.8 Kops/s respectively]

SLIDE 108

Small cached dataset

  • Insert 1M key-value pairs with 16 bytes key and 1 KB value
  • Total data set (~1 GB) fits within memory
  • PebblesDB-1: with maximum one file per guard

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 45.25, 205.76 and 205.34 Kops/s respectively]

SLIDE 109

Small key-value pairs

  • Inserted 300M key-value pairs
  • Key: 16 bytes, value: 128 bytes

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 44.48, 6.34 and 6.31 Kops/s respectively]

SLIDE 110

Aged FS and KV store

  • File system aging: fill up 89% of the file system
  • KV store aging: insert 50M, delete 20M and update 20M key-value pairs in random order

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 17.37, 5.65 and 6.29 Kops/s respectively]

SLIDE 111

Low memory micro-benchmark

  • 100M key-value pairs with 1 KB values (~65 GB data set)
  • DRAM was limited to 4 GB

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 27.78, 2.86 and 4.37 Kops/s respectively]

SLIDE 112

Impact of empty guards

  • Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes
  • Incrementally inserted 20M new keys after deleting the older keys

  • Around 9000 empty guards at the start of the last iteration
  • Read latency did not increase with the increase in empty guards

SLIDE 113

NoSQL stores - HyperDex

  • HyperDex – a distributed key-value store from Cornell
  • Inserted 20M key-value pairs with 1 KB value size; 10M operations

[Bar chart: throughput ratio w.r.t. HyperLevelDB for each workload, plus total write I/O; annotated values: Load A 22.08 Kops/s, Run A 21.85 Kops/s, Run B 31.17 Kops/s, Run C 32.75 Kops/s, Run D 38.02 Kops/s, Load E 7.62 Kops/s, Run E 0.37 Kops/s, Run F 19.11 Kops/s, Total IO 1349.5 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

SLIDE 114

CPU usage

  • Median CPU usage while inserting 30M keys and reading 10M keys
  • PebblesDB: ~171%
  • Other key-value stores: 98-110%
  • Higher due to aggressive compaction and the CPU work of merging multiple files in a guard

SLIDE 115

Memory usage

  • 100M records (16 bytes key, 1 KB value) – 106 GB data set
  • ~300 MB of memory, i.e. 0.3% of the data set size
  • Worst case: 100M records (16 bytes key, 16 bytes value) – ~3.2 GB data set
  • Memory is then 9% of the data set size

SLIDE 116

Bloom filter calculation cost

  • 1.2 sec per GB of sstable
  • 3200 files – 52 GB – 62 seconds

SLIDE 117

Impact of different optimizations

  • Sstable-level Bloom filters improve read performance by 63%
  • PebblesDB without the optimizations for seek – 66%

SLIDE 118

Thank you!

Questions?
