Project Voldemort
Jay Kreps
19/11/09
The Plan
1. Motivation
2. Core Concepts
3. Implementation
4. In Practice
5. Results
Motivation
The Team
• LinkedIn's Search, Network, and Analytics Team
  • Project Voldemort
  • Search infrastructure: Zoie, Bobo, etc.
  • LinkedIn's Hadoop system
  • Recommendation engine
• Data-intensive features
  • People you may know
  • Who's viewed my profile
  • User history service
The Idea of the Relational Database
The Reality of a Modern Web Site
Why did this happen?
• The internet centralizes computation
• Specialized systems are efficient (10-100x)
  • Search: inverted index
  • Offline: Hadoop, Teradata, Oracle DWH
  • Memcached
  • In-memory systems (social graph)
• Specialized systems are scalable
• New data and problems
  • Graphs, sequences, and text
Services and Scale Break Relational DBs
• No joins
• Lots of denormalization
• ORM is less helpful
• No constraints, triggers, etc.
• Caching => key/value model
• Latency is key
Two Cheers For Relational Databases
• The relational model is a triumph of computer science:
  • General
  • Concise
  • Well understood
• But then again:
  • SQL is a pain
  • Hard to build re-usable data structures
  • Don't hide the memory hierarchy!
    • Good: filesystem API
    • Bad: SQL, some RPCs
Other Considerations
• Who is responsible for performance (engineers? DBAs? site operations?)
• Can you do capacity planning?
• Can you simulate the problem early in the design phase?
• How do you do upgrades?
• Can you mock your database?
Some motivating factors
• This is a latency-oriented system
• Data set is large and persistent
  • Cannot be all in memory
• Performance considerations
  • Partition data
  • Delay writes
  • Eliminate network hops
• 80% of caching tiers are fixing problems that shouldn't exist
• Need control over system availability and data durability
  • Must replicate data on multiple machines
• Cost of scalability can't be too high
Inspired By Amazon Dynamo & Memcached
• Amazon's Dynamo storage system
  • Works across data centers
  • Eventual consistency
  • Commodity hardware
  • Not too hard to build
• Memcached
  • Actually works
  • Really fast
  • Really simple
• Decisions:
  • Multiple reads/writes
  • Consistent hashing for data distribution
  • Key-value model
  • Data versioning
Priorities
1. Performance and scalability
2. Actually works
3. Community
4. Data consistency
5. Flexible & extensible
6. Everything else
Why Is This Hard?
• Failures in a distributed system are much more complicated
  • A can talk to B does not imply B can talk to A
  • A can talk to B does not imply C can talk to B
• Getting a consistent view of the cluster is as hard as getting a consistent view of the data
• Nodes will fail and come back to life with stale data
• I/O has high request latency variance
  • I/O on commodity disks is even worse
• Intermittent failures are common
• User must be isolated from these problems
• There are fundamental trade-offs between availability and consistency
Core Concepts
Core Concepts - I
• ACID
  • Great for a single centralized server
• CAP Theorem
  • Consistency (strict), Availability, Partition tolerance
  • Impossible to achieve all three at the same time in a distributed system
  • Can choose 2 out of 3
  • Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
• Consistency models
  • Strict consistency
    • 2-phase commit
    • PAXOS: distributed algorithm to ensure quorum for consistency
  • Eventual consistency
    • Different nodes can have different views of a value
    • In a steady state the system will return the last written value
    • BUT can have much stronger guarantees
Core Concepts - II
• Consistent hashing (sketched below)
  • Key space is partitioned into many small partitions
  • Partitions never change
    • Partition ownership can change
• Replication
  • Each partition is stored by N nodes
• Node failures
  • Transient (short term)
  • Long term
    • Needs faster bootstrapping
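To make the partitioned key space concrete, here is a minimal consistent-hashing sketch under the assumptions above: a fixed set of partitions, each mapped to the N nodes that replicate it. The class and method names are illustrative, not Voldemort's actual routing code.

```java
import java.util.*;

// Minimal consistent-hashing sketch: a fixed ring of partitions,
// each owned (in order) by the nodes that replicate it.
public class ConsistentHashSketch {
    private final int numPartitions;                              // partitions never change
    private final Map<Integer, List<Integer>> partitionToNodes;   // ownership can change

    public ConsistentHashSketch(int numPartitions, Map<Integer, List<Integer>> partitionToNodes) {
        this.numPartitions = numPartitions;
        this.partitionToNodes = partitionToNodes;
    }

    // Map a key to its partition on the ring.
    public int partitionFor(byte[] key) {
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    // The N nodes responsible for a key = the owners of its partition.
    public List<Integer> preferenceList(byte[] key) {
        return partitionToNodes.get(partitionFor(key));
    }

    public static void main(String[] args) {
        // Toy layout: 8 partitions, 4 nodes, replication factor N = 2.
        Map<Integer, List<Integer>> ownership = new HashMap<>();
        for (int p = 0; p < 8; p++) {
            ownership.put(p, Arrays.asList(p % 4, (p + 1) % 4));
        }
        ConsistentHashSketch ring = new ConsistentHashSketch(8, ownership);
        System.out.println(ring.preferenceList("member:42".getBytes()));
    }
}
```

Because only the partition-to-node mapping changes when nodes join or fail, rebalancing moves whole partitions instead of rehashing every key.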
Core Concepts - III
• N - the replication factor
• R - the number of blocking reads
• W - the number of blocking writes
• If R + W > N then we have a quorum-like algorithm (sketched below)
  • Guarantees that we will read the latest write OR fail
• R, W, N can be tuned for different use cases
  • W = 1: highly available writes
  • R = 1: read-intensive workloads
  • Knobs to tune performance, durability and availability
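A small sketch of the quorum rule above; the parameter names are illustrative. The check R + W > N says whether every read set must overlap every acknowledged write set, which is what lets a read return the latest write or fail.

```java
// Sketch of the N/R/W quorum rule (parameter names are illustrative).
public class QuorumConfig {
    final int n; // replication factor: copies of each key
    final int r; // reads that must succeed before returning
    final int w; // writes that must succeed before acknowledging

    QuorumConfig(int n, int r, int w) {
        this.n = n; this.r = r; this.w = w;
    }

    // If r + w > n, any read set overlaps any successful write set,
    // so a read sees the latest acknowledged write (or the call fails).
    boolean isQuorumLike() {
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(new QuorumConfig(3, 2, 2).isQuorumLike()); // true: classic quorum
        System.out.println(new QuorumConfig(3, 1, 1).isQuorumLike()); // false: fast, but reads may be stale
    }
}
```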
Core Concepts - IV
• Vector clocks [Lamport] provide a way to order events in a distributed system
• A vector clock is a tuple {t1, t2, ..., tn} of counters
• Each value update has a master node
  • When data is written with master node i, it increments ti
  • All the replicas will receive the same version
  • Helps resolve consistency between writes on multiple replicas
• If you get network partitions
  • You can have a case where two vector clocks are not comparable (comparison sketched below)
  • In this case Voldemort returns both values to clients for conflict resolution
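A minimal vector-clock comparison sketch, with illustrative names rather than Voldemort's own versioning classes: the clocks are compared entry by entry, and if neither dominates the other the versions are concurrent, which is the case where both values go back to the client.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of vector-clock comparison (names illustrative).
public class VectorClockSketch {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    // A clock maps nodeId -> counter; missing entries count as 0.
    static Order compare(Map<Integer, Long> a, Map<Integer, Long> b) {
        boolean aBigger = false, bBigger = false;
        Map<Integer, Long> all = new HashMap<>(a);
        b.forEach((k, v) -> all.merge(k, v, Math::max)); // union of node ids
        for (Integer node : all.keySet()) {
            long ta = a.getOrDefault(node, 0L);
            long tb = b.getOrDefault(node, 0L);
            if (ta > tb) aBigger = true;
            if (tb > ta) bBigger = true;
        }
        if (aBigger && bBigger) return Order.CONCURRENT; // not comparable: client must resolve
        if (aBigger) return Order.AFTER;
        if (bBigger) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<Integer, Long> v1 = Map.of(1, 2L, 2, 1L); // written via masters 1 and 2
        Map<Integer, Long> v2 = Map.of(1, 1L, 3, 1L); // written via masters 1 and 3
        System.out.println(compare(v1, v2));          // CONCURRENT -> return both values
    }
}
```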
Implementation
Voldemort Design
Client API
• Data is organized into "stores", i.e. tables
• Key-value only
  • But values can be arbitrarily rich or complex
  • Maps, lists, nested combinations ...
• Four operations (usage sketched below)
  • PUT (K, V)
  • GET (K)
  • MULTI-GET (Keys)
  • DELETE (K, Version)
• No range scans
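As a usage sketch based on Voldemort's published Java client: a client is bootstrapped from a cluster URL and then used per store. The bootstrap URL and store name here are examples, and exact class names and method signatures may differ across versions.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientExample {
    public static void main(String[] args) {
        // Bootstrap from any node in the cluster (URL is an example).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // One client per "store" (the key-value equivalent of a table).
        StoreClient<String, String> client = factory.getStoreClient("test");

        // GET returns the value together with its vector-clock version.
        Versioned<String> value = client.get("some_key");

        // PUT a new version; the client carries the version along.
        value.setObject("some_value");
        client.put("some_key", value);

        // DELETE removes the key (the slide's versioned delete is the same idea).
        client.delete("some_key");
    }
}
```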
Versioning & Conflict Resolution
• Eventual consistency allows multiple versions of a value
  • Need a way to understand which value is latest
  • Need a way to say values are not comparable
• Solutions
  • Timestamps
  • Vector clocks
    • Provides global ordering
    • No locking or blocking necessary
Serialization
• Really important
• A few considerations
  • Schema free?
  • Backward/forward compatible?
  • Real-life data structures
  • Bytes <=> objects <=> strings?
  • Size (no XML)
• Many ways to do it -- we allow anything (serializer boundary sketched below)
  • Compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization
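The "allow anything" approach can be pictured as a small serializer boundary that turns objects into bytes and back, so any of the formats above can slot in. This is a sketch; Voldemort's own serialization package differs in detail.

```java
// Sketch of a pluggable serializer boundary (illustrative, not Voldemort's exact interface).
interface Serializer<T> {
    byte[] toBytes(T object);   // object -> bytes stored on disk / sent on the wire
    T toObject(byte[] bytes);   // bytes -> object handed back to the application
}

// Example plugin: UTF-8 strings; JSON, Protocol Buffers, Thrift, etc. fit the same shape.
class StringSerializer implements Serializer<String> {
    public byte[] toBytes(String object) {
        return object.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
    public String toObject(byte[] bytes) {
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }
}
```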
Routing
• Routing layer hides a lot of complexity
  • Hashing schema
  • Replication (N, R, W)
  • Failures
  • Read repair (online repair mechanism)
  • Hinted handoff (long-term recovery mechanism)
• Easy to add domain-specific strategies (see the sketch below)
  • E.g. only do synchronous operations on nodes in the local data center
• Client side / server side / hybrid
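A hedged sketch of what a pluggable routing strategy could look like; the interface, record, and class names are invented for illustration. Given a key, a strategy returns the ordered list of nodes to contact, so a data-center-aware variant can put local nodes first.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative routing-strategy boundary (not Voldemort's actual interface).
interface RoutingStrategy {
    List<Node> routeRequest(byte[] key); // ordered preference list for this key
}

// Minimal node description for the sketch.
record Node(int id, String dataCenter) {}

// Domain-specific strategy: same hash-based preference list, but nodes in the
// local data center are contacted first; remote nodes are the fallback.
class LocalDataCenterFirst implements RoutingStrategy {
    private final RoutingStrategy delegate;
    private final String localDataCenter;

    LocalDataCenterFirst(RoutingStrategy delegate, String localDataCenter) {
        this.delegate = delegate;
        this.localDataCenter = localDataCenter;
    }

    public List<Node> routeRequest(byte[] key) {
        List<Node> nodes = delegate.routeRequest(key);
        List<Node> local = nodes.stream()
                .filter(n -> n.dataCenter().equals(localDataCenter))
                .collect(Collectors.toList());
        List<Node> remote = nodes.stream()
                .filter(n -> !n.dataCenter().equals(localDataCenter))
                .collect(Collectors.toList());
        local.addAll(remote); // local nodes first, remote nodes after
        return local;
    }
}
```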
Voldemort Physical Deployment
Routing With Failures
• Failure detection
  • Requirements
    • Needs to be very, very fast
  • View of server state may be inconsistent
    • A can talk to B but C cannot
    • A can talk to C, B can talk to A but not to C
• Currently done by routing layer (request timeouts; sketched below)
  • Periodically retries failed nodes
  • All requests must have hard SLAs
• Other possible solutions
  • Central server
  • Gossip protocol
  • Need to look more into this
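A minimal sketch of the timeout-driven approach described above; the class name and back-off scheme are illustrative, not the actual Voldemort failure detector. A node is marked unavailable when a request fails and is retried after a back-off window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of timeout-driven failure detection with periodic retry (illustrative).
public class SimpleFailureDetector {
    private final long retryAfterMs;                                          // back-off before retrying a failed node
    private final Map<Integer, Long> failedAt = new ConcurrentHashMap<>();    // nodeId -> time of last failure

    public SimpleFailureDetector(long retryAfterMs) {
        this.retryAfterMs = retryAfterMs;
    }

    // Called by the routing layer when a request times out or errors.
    public void recordFailure(int nodeId) {
        failedAt.put(nodeId, System.currentTimeMillis());
    }

    // Called when a request succeeds; the node is healthy again.
    public void recordSuccess(int nodeId) {
        failedAt.remove(nodeId);
    }

    // A node is usable if it never failed, or if its back-off window has passed
    // (in which case we give it another chance).
    public boolean isAvailable(int nodeId) {
        Long t = failedAt.get(nodeId);
        return t == null || System.currentTimeMillis() - t > retryAfterMs;
    }
}
```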
Repair Mechanism
• Read repair - online repair mechanism (sketched below)
  • Routing client receives values from multiple nodes
  • Notify a node if you see an old value
  • Only works for keys which are read after failures
• Hinted handoff
  • If a write fails, write it to any random node
  • Just mark the write as a special write
  • Each node periodically tries to get rid of all special entries
• Bootstrapping mechanism (we don't have it yet)
  • If a node was down for a long time
    • Hinted handoff can generate a ton of traffic
  • Need a better way to bootstrap and clear hinted handoff tables
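A hedged read-repair sketch, not Voldemort's implementation: a read fans out to the replicas in the preference list, the newest version wins, and any replica that answered with an older (or missing) value gets the newer value written back. A plain counter stands in for the vector clock to keep the sketch short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Read-repair sketch (illustrative). Real Voldemort compares vector clocks;
// here a plain counter stands in for the version.
public class ReadRepairSketch {
    record Versioned(long version, String value) {}

    interface Replica {
        Versioned get(String key);
        void put(String key, Versioned value);
    }

    // Read from all replicas, return the newest value, and push it back to
    // any replica that answered with an older version.
    static Versioned readWithRepair(String key, List<Replica> replicas) {
        Map<Replica, Versioned> responses = new HashMap<>();
        Versioned newest = null;
        for (Replica r : replicas) {
            Versioned v = r.get(key);
            responses.put(r, v);
            if (v != null && (newest == null || v.version() > newest.version())) {
                newest = v;
            }
        }
        // Repair phase: notify replicas that hold an old (or missing) value.
        for (Map.Entry<Replica, Versioned> e : responses.entrySet()) {
            Versioned seen = e.getValue();
            if (newest != null && (seen == null || seen.version() < newest.version())) {
                e.getKey().put(key, newest);
            }
        }
        return newest;
    }
}
```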
Network Layer
• Network is the major bottleneck in many uses
• Client performance turns out to be harder than server performance (the client must wait!)
• Lots of issues with socket buffer size / socket pools
• Server is also a client
• Two implementations
  • HTTP + servlet container
  • Simple socket protocol + custom server
• HTTP server is great, but the HTTP client is 5-10x slower
• Socket protocol is what we use in production
• Recently added a non-blocking version of the server
Persistence
• Single-machine key-value storage is a commodity
• Plugins are better than tying yourself to a single strategy (plugin boundary sketched below)
  • Different use cases
    • Optimize reads
    • Optimize writes
    • Large vs small values
  • SSDs may completely change this layer
  • Better filesystems may completely change this layer
• A couple of different options
  • BDB, MySQL and mmap'd file implementations
  • Berkeley DB is the most popular
  • In-memory plugin for testing
• B-trees are still the best all-purpose structure
• No flush on write is a huge, huge win
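To show the plugin boundary the slide argues for, here is a minimal sketch of a storage-engine interface plus an in-memory implementation for testing. The interface and names are illustrative, not Voldemort's actual storage API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative storage-engine plugin boundary: BDB, MySQL, mmap'd files,
// or an in-memory map can all sit behind the same byte[]-oriented interface.
interface StorageEngineSketch {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
    void delete(byte[] key);
}

// In-memory implementation, useful for unit tests.
class InMemoryStorageEngine implements StorageEngineSketch {
    // Wrap keys in ByteBuffer so equality is by content, not array identity.
    private final Map<java.nio.ByteBuffer, byte[]> map = new ConcurrentHashMap<>();

    public byte[] get(byte[] key) {
        return map.get(java.nio.ByteBuffer.wrap(key));
    }
    public void put(byte[] key, byte[] value) {
        map.put(java.nio.ByteBuffer.wrap(key), value);
    }
    public void delete(byte[] key) {
        map.remove(java.nio.ByteBuffer.wrap(key));
    }
}
```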
In Practice
LinkedIn problems we wanted to solve
• Application examples
  • People You May Know
  • Item-item recommendations
  • Member and company derived data
  • User's network statistics
  • Who Viewed My Profile?
  • Abuse detection
  • User's history service
  • Relevance data
  • Crawler detection
• Many others have come up since
• Some data is batch computed and served as read only
• Some data has a very high write load
• Latency is key
Recommendation feature screenshots: "Recommend", "More recommend"