Project Voldemort
Jay Kreps
19/11/09
The Plan
1. Motivation
2. Core Concepts
3. Implementation
4. In Practice
5. Results
Motivation
The Team
• LinkedIn's Search, Network, and Analytics Team
  • Project Voldemort
  • Search infrastructure: Zoie, Bobo, etc.
  • LinkedIn's Hadoop system
  • Recommendation engine
• Data-intensive features
  • People you may know
  • Who's viewed my profile
  • User history service
The Idea of the Relational Database
The Reality of a Modern Web Site
Why did this happen?
• The internet centralizes computation
• Specialized systems are efficient (10-100x)
  • Search: inverted index
  • Offline: Hadoop, Teradata, Oracle DWH
  • Memcached
  • In-memory systems (social graph)
• Specialized systems are scalable
• New data and problems
  • Graphs, sequences, and text
Services and Scale Break Relational DBs
• No joins
• Lots of denormalization
• ORM is less helpful
• No constraints, triggers, etc.
• Caching => key/value model
• Latency is key
Two Cheers For Relational Databases
• The relational model is a triumph of computer science:
  • General
  • Concise
  • Well understood
• But then again:
  • SQL is a pain
  • Hard to build re-usable data structures
  • Don't hide the memory hierarchy!
    • Good: filesystem API
    • Bad: SQL, some RPCs
Other Considerations
• Who is responsible for performance (engineers? DBAs? site operations?)
• Can you do capacity planning?
• Can you simulate the problem early in the design phase?
• How do you do upgrades?
• Can you mock your database?
Some motivating factors
• This is a latency-oriented system
• Data set is large and persistent
  • Cannot be all in memory
• Performance considerations
  • Partition data
  • Delay writes
  • Eliminate network hops
• 80% of caching tiers are fixing problems that shouldn't exist
• Need control over system availability and data durability
  • Must replicate data on multiple machines
• Cost of scalability can't be too high
Inspired By Amazon Dynamo & Memcached
• Amazon's Dynamo storage system
  • Works across data centers
  • Eventual consistency
  • Commodity hardware
  • Not too hard to build
• Memcached
  • Actually works
  • Really fast
  • Really simple
• Decisions:
  • Multiple reads/writes
  • Consistent hashing for data distribution
  • Key-value model
  • Data versioning
Priorities
1. Performance and scalability
2. Actually works
3. Community
4. Data consistency
5. Flexible & extensible
6. Everything else
Why Is This Hard?
• Failures in a distributed system are much more complicated
  • A can talk to B does not imply B can talk to A
  • A can talk to B does not imply C can talk to B
• Getting a consistent view of the cluster is as hard as getting a consistent view of the data
• Nodes will fail and come back to life with stale data
• I/O has high request latency variance
  • I/O on commodity disks is even worse
• Intermittent failures are common
• User must be isolated from these problems
• There are fundamental trade-offs between availability and consistency
Core Concepts
Core Concepts - I
• ACID
  • Great for a single centralized server
• CAP Theorem
  • Consistency (strict), Availability, Partition tolerance
  • Impossible to achieve all three at the same time in a distributed system
  • Can choose 2 out of 3
  • Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
• Consistency models
  • Strict consistency
    • 2-phase commit
    • PAXOS: distributed algorithm to ensure quorum for consistency
  • Eventual consistency
    • Different nodes can have different views of a value
    • In a steady state the system will return the last written value
    • BUT can have much stronger guarantees
Core Concepts - II
• Consistent hashing (sketched below)
  • Key space is partitioned into many small partitions
  • Partitions never change
    • Partition ownership can change
• Replication
  • Each partition is stored by N nodes
• Node failures
  • Transient (short term)
  • Long term
    • Needs faster bootstrapping
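To make the partitioned key space concrete, here is a minimal consistent-hashing sketch under the assumptions above: a fixed set of partitions, each mapped to the N nodes that replicate it. The class and method names are illustrative, not Voldemort's actual routing code.

```java
import java.util.*;

// Minimal consistent-hashing sketch: a fixed ring of partitions,
// each owned (in order) by the nodes that replicate it.
public class ConsistentHashSketch {
    private final int numPartitions;                              // partitions never change
    private final Map<Integer, List<Integer>> partitionToNodes;   // ownership can change

    public ConsistentHashSketch(int numPartitions, Map<Integer, List<Integer>> partitionToNodes) {
        this.numPartitions = numPartitions;
        this.partitionToNodes = partitionToNodes;
    }

    // Map a key to its partition on the ring.
    public int partitionFor(byte[] key) {
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    // The N nodes responsible for a key = the owners of its partition.
    public List<Integer> preferenceList(byte[] key) {
        return partitionToNodes.get(partitionFor(key));
    }

    public static void main(String[] args) {
        // Toy layout: 8 partitions, 4 nodes, replication factor N = 2.
        Map<Integer, List<Integer>> ownership = new HashMap<>();
        for (int p = 0; p < 8; p++) {
            ownership.put(p, Arrays.asList(p % 4, (p + 1) % 4));
        }
        ConsistentHashSketch ring = new ConsistentHashSketch(8, ownership);
        System.out.println(ring.preferenceList("member:42".getBytes()));
    }
}
```

Because only the partition-to-node mapping changes when nodes join or fail, rebalancing moves whole partitions instead of rehashing every key.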
Core Concepts - III
• N - the replication factor
• R - the number of blocking reads
• W - the number of blocking writes
• If R + W > N then we have a quorum-like algorithm (sketched below)
  • Guarantees that we will read the latest write OR fail
• R, W, N can be tuned for different use cases
  • W = 1: highly available writes
  • R = 1: read-intensive workloads
  • Knobs to tune performance, durability and availability
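A small sketch of the quorum rule above; the parameter names are illustrative. The check R + W > N says whether every read set must overlap every acknowledged write set, which is what lets a read return the latest write or fail.

```java
// Sketch of the N/R/W quorum rule (parameter names are illustrative).
public class QuorumConfig {
    final int n; // replication factor: copies of each key
    final int r; // reads that must succeed before returning
    final int w; // writes that must succeed before acknowledging

    QuorumConfig(int n, int r, int w) {
        this.n = n; this.r = r; this.w = w;
    }

    // If r + w > n, any read set overlaps any successful write set,
    // so a read sees the latest acknowledged write (or the call fails).
    boolean isQuorumLike() {
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(new QuorumConfig(3, 2, 2).isQuorumLike()); // true: classic quorum
        System.out.println(new QuorumConfig(3, 1, 1).isQuorumLike()); // false: fast, but reads may be stale
    }
}
```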
Core Concepts - IV
• Vector clocks [Lamport] provide a way to order events in a distributed system
• A vector clock is a tuple {t1, t2, ..., tn} of counters
• Each value update has a master node
  • When data is written with master node i, it increments ti
  • All the replicas will receive the same version
  • Helps resolve consistency between writes on multiple replicas
• If you get network partitions
  • You can have a case where two vector clocks are not comparable (comparison sketched below)
  • In this case Voldemort returns both values to clients for conflict resolution
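A minimal vector-clock comparison sketch, with illustrative names rather than Voldemort's own versioning classes: the clocks are compared entry by entry, and if neither dominates the other the versions are concurrent, which is the case where both values go back to the client.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of vector-clock comparison (names illustrative).
public class VectorClockSketch {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    // A clock maps nodeId -> counter; missing entries count as 0.
    static Order compare(Map<Integer, Long> a, Map<Integer, Long> b) {
        boolean aBigger = false, bBigger = false;
        Map<Integer, Long> all = new HashMap<>(a);
        b.forEach((k, v) -> all.merge(k, v, Math::max)); // union of node ids
        for (Integer node : all.keySet()) {
            long ta = a.getOrDefault(node, 0L);
            long tb = b.getOrDefault(node, 0L);
            if (ta > tb) aBigger = true;
            if (tb > ta) bBigger = true;
        }
        if (aBigger && bBigger) return Order.CONCURRENT; // not comparable: client must resolve
        if (aBigger) return Order.AFTER;
        if (bBigger) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<Integer, Long> v1 = Map.of(1, 2L, 2, 1L); // written via masters 1 and 2
        Map<Integer, Long> v2 = Map.of(1, 1L, 3, 1L); // written via masters 1 and 3
        System.out.println(compare(v1, v2));          // CONCURRENT -> return both values
    }
}
```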
Implementation
Voldemort Design
Client API
• Data is organized into "stores", i.e. tables
• Key-value only
  • But values can be arbitrarily rich or complex
  • Maps, lists, nested combinations ...
• Four operations (usage sketched below)
  • PUT (K, V)
  • GET (K)
  • MULTI-GET (Keys)
  • DELETE (K, Version)
• No range scans
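As a usage sketch based on Voldemort's published Java client: a client is bootstrapped from a cluster URL and then used per store. The bootstrap URL and store name here are examples, and exact class names and method signatures may differ across versions.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientExample {
    public static void main(String[] args) {
        // Bootstrap from any node in the cluster (URL is an example).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // One client per "store" (the key-value equivalent of a table).
        StoreClient<String, String> client = factory.getStoreClient("test");

        // GET returns the value together with its vector-clock version.
        Versioned<String> value = client.get("some_key");

        // PUT a new version; the client carries the version along.
        value.setObject("some_value");
        client.put("some_key", value);

        // DELETE removes the key (the slide's versioned delete is the same idea).
        client.delete("some_key");
    }
}
```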
Versioning & Conflict Resolution
• Eventual consistency allows multiple versions of a value
  • Need a way to understand which value is latest
  • Need a way to say values are not comparable
• Solutions
  • Timestamps
  • Vector clocks
    • Provides global ordering
    • No locking or blocking necessary
Serialization
• Really important
• A few considerations
  • Schema free?
  • Backward/forward compatible?
  • Real-life data structures
  • Bytes <=> objects <=> strings?
  • Size (no XML)
• Many ways to do it -- we allow anything (serializer boundary sketched below)
  • Compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization
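The "allow anything" approach can be pictured as a small serializer boundary that turns objects into bytes and back, so any of the formats above can slot in. This is a sketch; Voldemort's own serialization package differs in detail.

```java
// Sketch of a pluggable serializer boundary (illustrative, not Voldemort's exact interface).
interface Serializer<T> {
    byte[] toBytes(T object);   // object -> bytes stored on disk / sent on the wire
    T toObject(byte[] bytes);   // bytes -> object handed back to the application
}

// Example plugin: UTF-8 strings; JSON, Protocol Buffers, Thrift, etc. fit the same shape.
class StringSerializer implements Serializer<String> {
    public byte[] toBytes(String object) {
        return object.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
    public String toObject(byte[] bytes) {
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }
}
```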
Routing
• Routing layer hides a lot of complexity
  • Hashing schema
  • Replication (N, R, W)
  • Failures
  • Read repair (online repair mechanism)
  • Hinted handoff (long-term recovery mechanism)
• Easy to add domain-specific strategies (see the sketch below)
  • E.g. only do synchronous operations on nodes in the local data center
• Client side / server side / hybrid
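A hedged sketch of what a pluggable routing strategy could look like; the interface, record, and class names are invented for illustration. Given a key, a strategy returns the ordered list of nodes to contact, so a data-center-aware variant can put local nodes first.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative routing-strategy boundary (not Voldemort's actual interface).
interface RoutingStrategy {
    List<Node> routeRequest(byte[] key); // ordered preference list for this key
}

// Minimal node description for the sketch.
record Node(int id, String dataCenter) {}

// Domain-specific strategy: same hash-based preference list, but nodes in the
// local data center are contacted first; remote nodes are the fallback.
class LocalDataCenterFirst implements RoutingStrategy {
    private final RoutingStrategy delegate;
    private final String localDataCenter;

    LocalDataCenterFirst(RoutingStrategy delegate, String localDataCenter) {
        this.delegate = delegate;
        this.localDataCenter = localDataCenter;
    }

    public List<Node> routeRequest(byte[] key) {
        List<Node> nodes = delegate.routeRequest(key);
        List<Node> local = nodes.stream()
                .filter(n -> n.dataCenter().equals(localDataCenter))
                .collect(Collectors.toList());
        List<Node> remote = nodes.stream()
                .filter(n -> !n.dataCenter().equals(localDataCenter))
                .collect(Collectors.toList());
        local.addAll(remote); // local nodes first, remote nodes after
        return local;
    }
}
```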
Voldemort Physical Deployment
Routing With Failures
• Failure detection
  • Requirements
    • Needs to be very, very fast
  • View of server state may be inconsistent
    • A can talk to B but C cannot
    • A can talk to C, B can talk to A but not to C
• Currently done by routing layer (request timeouts; sketched below)
  • Periodically retries failed nodes
  • All requests must have hard SLAs
• Other possible solutions
  • Central server
  • Gossip protocol
  • Need to look more into this
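A minimal sketch of the timeout-driven approach described above; the class name and back-off scheme are illustrative, not the actual Voldemort failure detector. A node is marked unavailable when a request fails and is retried after a back-off window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of timeout-driven failure detection with periodic retry (illustrative).
public class SimpleFailureDetector {
    private final long retryAfterMs;                                          // back-off before retrying a failed node
    private final Map<Integer, Long> failedAt = new ConcurrentHashMap<>();    // nodeId -> time of last failure

    public SimpleFailureDetector(long retryAfterMs) {
        this.retryAfterMs = retryAfterMs;
    }

    // Called by the routing layer when a request times out or errors.
    public void recordFailure(int nodeId) {
        failedAt.put(nodeId, System.currentTimeMillis());
    }

    // Called when a request succeeds; the node is healthy again.
    public void recordSuccess(int nodeId) {
        failedAt.remove(nodeId);
    }

    // A node is usable if it never failed, or if its back-off window has passed
    // (in which case we give it another chance).
    public boolean isAvailable(int nodeId) {
        Long t = failedAt.get(nodeId);
        return t == null || System.currentTimeMillis() - t > retryAfterMs;
    }
}
```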
Repair Mechanism
• Read repair - online repair mechanism (sketched below)
  • Routing client receives values from multiple nodes
  • Notify a node if you see an old value
  • Only works for keys which are read after failures
• Hinted handoff
  • If a write fails, write it to any random node
  • Just mark the write as a special write
  • Each node periodically tries to get rid of all special entries
• Bootstrapping mechanism (we don't have it yet)
  • If a node was down for a long time
    • Hinted handoff can generate a ton of traffic
  • Need a better way to bootstrap and clear hinted handoff tables
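A hedged read-repair sketch, not Voldemort's implementation: a read fans out to the replicas in the preference list, the newest version wins, and any replica that answered with an older (or missing) value gets the newer value written back. A plain counter stands in for the vector clock to keep the sketch short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Read-repair sketch (illustrative). Real Voldemort compares vector clocks;
// here a plain counter stands in for the version.
public class ReadRepairSketch {
    record Versioned(long version, String value) {}

    interface Replica {
        Versioned get(String key);
        void put(String key, Versioned value);
    }

    // Read from all replicas, return the newest value, and push it back to
    // any replica that answered with an older version.
    static Versioned readWithRepair(String key, List<Replica> replicas) {
        Map<Replica, Versioned> responses = new HashMap<>();
        Versioned newest = null;
        for (Replica r : replicas) {
            Versioned v = r.get(key);
            responses.put(r, v);
            if (v != null && (newest == null || v.version() > newest.version())) {
                newest = v;
            }
        }
        // Repair phase: notify replicas that hold an old (or missing) value.
        for (Map.Entry<Replica, Versioned> e : responses.entrySet()) {
            Versioned seen = e.getValue();
            if (newest != null && (seen == null || seen.version() < newest.version())) {
                e.getKey().put(key, newest);
            }
        }
        return newest;
    }
}
```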
Network Layer
• Network is the major bottleneck in many uses
• Client performance turns out to be harder than server performance (the client must wait!)
• Lots of issues with socket buffer size / socket pools
• Server is also a client
• Two implementations
  • HTTP + servlet container
  • Simple socket protocol + custom server
• HTTP server is great, but the HTTP client is 5-10x slower
• Socket protocol is what we use in production
• Recently added a non-blocking version of the server
Persistence
• Single-machine key-value storage is a commodity
• Plugins are better than tying yourself to a single strategy (plugin boundary sketched below)
  • Different use cases
    • Optimize reads
    • Optimize writes
    • Large vs small values
  • SSDs may completely change this layer
  • Better filesystems may completely change this layer
• A couple of different options
  • BDB, MySQL and mmap'd file implementations
  • Berkeley DB is the most popular
  • In-memory plugin for testing
• B-trees are still the best all-purpose structure
• No flush on write is a huge, huge win
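To show the plugin boundary the slide argues for, here is a minimal sketch of a storage-engine interface plus an in-memory implementation for testing. The interface and names are illustrative, not Voldemort's actual storage API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative storage-engine plugin boundary: BDB, MySQL, mmap'd files,
// or an in-memory map can all sit behind the same byte[]-oriented interface.
interface StorageEngineSketch {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
    void delete(byte[] key);
}

// In-memory implementation, useful for unit tests.
class InMemoryStorageEngine implements StorageEngineSketch {
    // Wrap keys in ByteBuffer so equality is by content, not array identity.
    private final Map<java.nio.ByteBuffer, byte[]> map = new ConcurrentHashMap<>();

    public byte[] get(byte[] key) {
        return map.get(java.nio.ByteBuffer.wrap(key));
    }
    public void put(byte[] key, byte[] value) {
        map.put(java.nio.ByteBuffer.wrap(key), value);
    }
    public void delete(byte[] key) {
        map.remove(java.nio.ByteBuffer.wrap(key));
    }
}
```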
In Practice
LinkedIn problems we wanted to solve
• Application examples
  • People You May Know
  • Item-item recommendations
  • Member and company derived data
  • User's network statistics
  • Who Viewed My Profile?
  • Abuse detection
  • User's history service
  • Relevance data
  • Crawler detection
• Many others have come up since
• Some data is batch computed and served as read only
• Some data has a very high write load
• Latency is key
Recommendation feature screenshots: "Recommend", "More recommend"