Leveraging bloom filters on Redis
Cristian Castiblanco me@cristian.io | cristian@scopely.com https://cristian.io
Stream processing at Scopely
Idempotence
An operation is said to be idempotent when applying it multiple times has the same effect as applying it once.
Simplest approach to idempotence
Idempotence with Redis sets
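A minimal sketch of set-based deduplication, assuming each record carries a unique ID. Redis `SADD` replies with the number of members that were newly added, so a reply of 0 marks a duplicate. To keep the example runnable without a server, a hypothetical `InMemorySetStore` stands in for the Redis client; `process_once` and the key `seen:example` are illustrative names, not taken from the talk.

```python
class InMemorySetStore:
    """Hypothetical stand-in for a Redis client; mimics SADD's reply."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present: nothing was added
        members.add(member)
        return 1  # newly added


def process_once(store, key, records, handler):
    """Call handler only for records whose ID has not been seen under key."""
    handled = 0
    for record_id, payload in records:
        if store.sadd(key, record_id) == 1:
            handler(payload)
            handled += 1
    return handled


store = InMemorySetStore()
seen_payloads = []
# "a1" is delivered twice, as can happen with at-least-once delivery.
stream = [("a1", "login"), ("b2", "purchase"), ("a1", "login")]
handled = process_once(store, "seen:example", stream, seen_payloads.append)
```

The cost of this approach is that every ID is stored in full, which is what drives the memory figures on the next slide.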
Memory usage per idempotence store: 320 million records/day ≈ 70GB of memory
Is there a better way?
• Space-efficient
• Cost-effective
• More performant
• Awesome
Enter bloom filters: a probabilistic data structure for checking whether an item is a member of a set
Bloom filters query
• Definitely not in the set
• Probably in the set
• Configurable error rate
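The two possible answers follow from how the structure works: adding an item sets a handful of bits, and a query checks whether all of those bits are set. A minimal sketch, assuming string items and SHA-256 with double hashing to derive the bit positions (the class and method names are hypothetical, and real implementations use faster hashes):

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Any unset bit proves the item was never added ("definitely not").
        # All bits set only means "probably", since other items may have set them.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=10_000, num_hashes=7)
bf.add("foo")
```

An added item is always reported as present, so false negatives cannot happen; only false positives can, at a rate governed by the number of bits and hash functions.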
Bloom filters space efficiency
Given 10,000,000 UUIDs...
• Redis set: 1GB
• Plain text: ~300 MB
• gzip: ~150 MB
• Bloom filter with 1e-05 error rate: ~30MB (i.e., 1 in 100,000)
• Bloom filter with 1e-11 error rate: ~60MB (i.e., 1 in 100 billion)
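These sizes can be sanity-checked with the standard formula for an optimally sized bloom filter, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. A small sketch, assuming that formula and ignoring any per-implementation overhead (`bloom_size_bytes` is a hypothetical helper):

```python
import math


def bloom_size_bytes(capacity, error_rate):
    """Theoretical size of an optimally sized bloom filter, in bytes."""
    bits = -capacity * math.log(error_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


for rate in (1e-05, 1e-11):
    size_mb = bloom_size_bytes(10_000_000, rate) / 1e6
    print(f"error rate {rate:g}: ~{size_mb:.0f} MB")
```

For 10,000,000 items this gives about 30 MB at 1e-05, and a figure in the same range as the ~60MB quoted for 1e-11. Note that the size does not depend on how long each item is, only on how many there are.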
Memory usage comparison: Redis sets 70GB vs. bloom filters 7GB
Latency comparison: Redis sets vs. bloom filters
Bloom filters example
False positive == dropped data (a new record is mistaken for a duplicate and discarded)
Bloom filters characteristics
• Capacity
• Error rate (false positive probability)
Scaling bloom filters
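The slides in this part were diagrams, so the following is an assumption about the technique rather than a transcript of them. One common way to grow a bloom filter past its initial capacity is the scalable bloom filter: when the current layer is full, a larger layer with a tighter error rate is stacked on top, and queries check every layer. A minimal sketch with hypothetical names and illustrative defaults (`growth=2`, `tightening=0.5`):

```python
import hashlib
import math


class _Layer:
    """One fixed-size bloom filter sized for a capacity and error rate."""

    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.error_rate = error_rate
        self.count = 0
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class ScalableBloomFilter:
    def __init__(self, initial_capacity, error_rate, growth=2, tightening=0.5):
        self.growth = growth
        self.tightening = tightening
        # Layer rates form a geometric series whose sum stays below error_rate.
        self.layers = [_Layer(initial_capacity, error_rate * (1 - tightening))]

    def might_contain(self, item):
        return any(layer.might_contain(item) for layer in self.layers)

    def add(self, item):
        """Add item; return False if it was (probably) already present."""
        if self.might_contain(item):
            return False
        last = self.layers[-1]
        if last.count >= last.capacity:
            last = _Layer(last.capacity * self.growth, last.error_rate * self.tightening)
            self.layers.append(last)
        last.add(item)
        return True


sbf = ScalableBloomFilter(initial_capacity=100, error_rate=1e-05)
for i in range(1000):
    sbf.add(f"id-{i}")
```

Each new layer costs memory and adds one more lookup per query, which is why the initial capacity still matters even when the filter can grow.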
Tuning bloom filters
Size depends on capacity and error probability
• False positive probability:
  • Depends on your use case
• Initial capacity:
  • Can't be too generous
  • Can't be too conservative
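To see why a conservative capacity hurts, the expected false positive rate of a fixed-size filter can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions and n the number of items inserted. A sketch under that assumption, with hypothetical helper names and an illustrative design point of 1,000,000 items at 1e-05:

```python
import math


def optimal_parameters(capacity, error_rate):
    """Bits and hash-function count for an optimally sized filter."""
    num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


def expected_false_positive_rate(num_bits, num_hashes, items_inserted):
    """Approximate false positive rate after items_inserted insertions."""
    return (1 - math.exp(-num_hashes * items_inserted / num_bits)) ** num_hashes


num_bits, num_hashes = optimal_parameters(1_000_000, 1e-05)
at_capacity = expected_false_positive_rate(num_bits, num_hashes, 1_000_000)
overfilled = expected_false_positive_rate(num_bits, num_hashes, 2_000_000)
```

Under these assumptions, loading the filter with twice its design capacity raises the rate from about 1e-05 to roughly 1%. Sizing too generously avoids that but wastes memory, since the bits are allocated whether or not they are used.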
First attempt: Lua scripts
Second attempt: bloomd github.com/armon/bloomd
bloomd drawbacks
• Lack of High Availability
• No clustering support
• Maintenance
• Rigid API
• Feels like abandonware
ReBloom: bloom filters as a Redis module
ReBloom example
> BF.RESERVE your_filter 0.00001 50000000
OK
> BF.ADD your_filter foo
1
> BF.EXISTS your_filter foo
1
> BF.EXISTS your_filter bar
0
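From application code the same commands can be sent through any client that allows raw commands. A minimal sketch, assuming a client object with an `execute_command` method (redis-py's `Redis` class provides one); `seen_before` is a hypothetical helper, and `FakeBloomClient` exists only so the example runs without a Redis server:

```python
def seen_before(client, filter_key, record_id):
    """Return True when record_id was probably already added to the filter."""
    # BF.ADD replies 1 when the item was newly added, 0 when it may already exist.
    return client.execute_command("BF.ADD", filter_key, record_id) == 0


class FakeBloomClient:
    """Hypothetical stand-in for a Redis connection.

    It keeps an exact set, so unlike a real filter it never yields false positives.
    """

    def __init__(self):
        self._items = set()

    def execute_command(self, command, key, item):
        assert command == "BF.ADD"
        if (key, item) in self._items:
            return 0
        self._items.add((key, item))
        return 1


client = FakeBloomClient()
first = seen_before(client, "your_filter", "foo")   # newly added
second = seen_before(client, "your_filter", "foo")  # reported as already seen
```

Because `BF.ADD` both inserts and reports prior membership in one round trip, the check-then-add step needs no separate `BF.EXISTS` call.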
ReBloom
• Clustering
• Redundancy/replication
• Lower cognitive overhead
• Powerful API
• No maintenance
Summary
• Bloom filters significantly reduce memory usage and latency
• Redis modules allow your custom data structures to scale
github.com/casidiablo cristian.io