Taming Aggressive Replication in the Pangaea Wide-area File System
Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam
Presented by Jason Waddle

Pangaea: Wide-area File System
- Support the daily storage needs of distributed users.
- Enable ad-hoc data sharing.

Pangaea Design Goals
I. Speed
   - Hide wide-area latency; file access time ~ local file system
II. Availability & autonomy
   - Avoid single points of failure
   - Adapt to churn
III. Network economy
   - Minimize use of the wide-area network
   - Exploit physical locality

Pangaea Assumptions (Non-goals)
- Servers are trusted
- Weak data consistency is sufficient (consistency in seconds)
Symbiotic Design
- Autonomous: each server operates even when disconnected from the network.
- Cooperative: when connected, servers cooperate to enhance overall performance and availability.

Pervasive Replication
- Replicate at the file/directory level
- Aggressively create replicas: whenever a file or directory is accessed
- No single “master” replica
- A replica may be read or written at any time
- Replicas exchange updates in a peer-to-peer fashion
Graph-based Replica Management
- Replicas are connected in a sparse, strongly connected, random graph
- Updates propagate along graph edges (see the flooding sketch below)
- Edges are also used for replica discovery and removal

Benefits of the Graph-based Approach
- Inexpensive: the graph is sparse, so adding or removing a replica is O(1)
- Available update distribution: as long as the graph is connected, updates reach every replica
- Network economy: high connectivity among close replicas; build a spanning tree along fast edges

Optimistic Replica Coordination
- Aim for maximum availability over strong data consistency
- Any node issues updates at any time
- Update transmission and conflict resolution happen in the background
- “Eventual consistency” (~5 s in tests)
- No strong consistency guarantees: no support for locks, lock-files, etc.
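The flooding behavior above can be summarized in a few lines. This is a minimal illustrative sketch, not Pangaea's actual code; `flood_update` and the example graph are made-up names, and the sketch only tracks which replicas an update reaches.

```python
# Minimal sketch: flooding an update along the peer edges of a replica graph.
# As long as the graph stays connected, the update reaches every replica,
# regardless of which node issued it.
from collections import deque

def flood_update(graph, origin):
    """graph: dict mapping node -> set of neighbor nodes (peer edges)."""
    delivered = {origin}             # the origin applies the update locally first
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for peer in graph[node]:     # push the update over each peer edge
            if peer not in delivered:
                delivered.add(peer)  # the peer applies the update, then forwards it
                queue.append(peer)
    return delivered

# Example: a sparse, strongly connected replica graph over five servers.
replicas = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "E"},
            "D": {"B", "E"}, "E": {"C", "D"}}
assert flood_update(replicas, "A") == set(replicas)
```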
Pangaea Structure
[Figure: servers (nodes) grouped into regions; nodes within a region are within <5 ms RTT of each other]

Server Structure
[Figure: an application's I/O requests reach the kernel NFS client and are served by the user-space Pangaea server, whose NFS protocol handler, replication engine, log, and membership modules handle them; servers talk to each other via inter-node communication]

Server Modules
- NFS protocol handler
  - Receives requests from apps, updates local replicas, generates requests to the replication engine
- Replication engine (see the sketch below)
  - Accepts local and remote requests
  - Modifies replicas
  - Forwards requests to other nodes
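A minimal sketch of this module split, with hypothetical class and method names (ReplicationEngine, NfsProtocolHandler, and the send callback are illustrative, not Pangaea's interfaces): the handler serves reads from the local replica and hands writes to the engine, which applies updates and forwards them to peer nodes.

```python
class ReplicationEngine:
    """Applies local and remote updates to replicas and forwards them to peers."""
    def __init__(self, node_id, send):
        self.node_id = node_id       # this server's identity
        self.send = send             # callback: send(peer_id, message)
        self.replicas = {}           # fid -> file contents
        self.peers = {}              # fid -> set of peer node ids for that replica

    def apply_local(self, fid, data):
        """Called by the NFS protocol handler for a local write."""
        self.replicas[fid] = data
        for peer in self.peers.get(fid, set()):            # forward to graph neighbors
            self.send(peer, ("update", fid, data, self.node_id))

    def apply_remote(self, fid, data, sender):
        """Called when an update arrives from another node."""
        if self.replicas.get(fid) == data:
            return                                         # naive duplicate suppression
        self.replicas[fid] = data
        for peer in self.peers.get(fid, set()) - {sender}:
            self.send(peer, ("update", fid, data, self.node_id))

class NfsProtocolHandler:
    """Serves application requests from local replicas; delegates writes."""
    def __init__(self, engine):
        self.engine = engine

    def read(self, fid):
        return self.engine.replicas.get(fid)               # reads hit the local replica

    def write(self, fid, data):
        self.engine.apply_local(fid, data)                 # update locally, then replicate
```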
Server Modules (continued)
- Membership module maintains:
  - The list of regions, their members, and estimated RTTs between regions
  - The location of root-directory replicas
  - This information is coordinated by gossiping (sketched below)
  - “Landmark” nodes bootstrap newly joining nodes
  - Maintaining the RTT information is the main scalability bottleneck
- Log module
  - Transaction-like semantics for local updates
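A sketch of gossip-style membership maintenance, under the assumption that each node periodically merges its view with a randomly chosen peer; MembershipView, its fields, and the merge rule (newest entry wins, RTTs averaged) are illustrative, not Pangaea's actual data structures.

```python
import random

class MembershipView:
    def __init__(self):
        self.members = {}   # node_id -> (region, version of that node's entry)
        self.rtt = {}       # (region_a, region_b) -> estimated RTT in ms

    def merge(self, other):
        """Keep the newer entry per node; smooth RTT estimates by averaging."""
        for node, (region, version) in other.members.items():
            if node not in self.members or self.members[node][1] < version:
                self.members[node] = (region, version)
        for pair, value in other.rtt.items():
            mine = self.rtt.get(pair)
            self.rtt[pair] = value if mine is None else (mine + value) / 2

def gossip_round(views):
    """Each node exchanges views with one random peer per round."""
    nodes = list(views)
    for node in nodes:
        peer = random.choice(nodes)          # may occasionally pick itself; harmless
        views[node].merge(views[peer])
        views[peer].merge(views[node])
```

Since every node tracks an RTT estimate for every pair of regions, this state grows quadratically with the number of regions, which is consistent with the note above that RTT maintenance is the main scalability bottleneck.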
File System Structure
- Gold replicas
  - Listed in directory entries
  - Form a clique in the replica graph
  - Fixed number (e.g., 3)
- All replicas (gold and bronze)
  - Unidirectional edges to all gold replicas
  - Bidirectional peer edges
  - Backpointer to the parent directory
[Figure: replica graph for /joe and /joe/foo]

  struct Replica
    fid:       FileID
    ts:        TimeStamp
    vv:        VersionVector
    goldPeers: Set(NodeID)
    peers:     Set(NodeID)
    backptrs:  Set(FileID, String)

  struct DirEntry
    fname:     String
    fid:       FileID
    downlinks: Set(NodeID)
    ts:        TimeStamp

File Creation
- Select locations for g gold replicas (e.g., g = 3)
  - One on the current server
  - The others on random servers from different regions
- Create an entry in the parent directory
- Flood updates
  - The directory update to the parent directory's replicas
  - The (empty) file contents to the gold replicas

Replica Creation
- Recursively get replicas for the ancestor directories
- Find a close replica (shortcutting)
  - Send the request to the closest gold replica
  - The gold replica forwards the request to its neighbor closest to the requester, who then sends the replica contents
- Select m peer edges (e.g., m = 4; see the sketch below)
  - Include a gold replica (for future shortcutting)
  - Include the closest neighbor of a random gold replica
  - Get the remaining nodes from random walks starting at a random gold replica
  - Create m bidirectional peer edges
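A sketch of the peer-edge selection just described. The helper names (random_walk, choose_peer_edges, closest_neighbor_of) are hypothetical, and the walk length and retry bound are arbitrary choices for the sketch.

```python
import random

def random_walk(graph, start, steps=3):
    """Take a short random walk over peer edges and return the endpoint."""
    node = start
    for _ in range(steps):
        neighbors = list(graph.get(node, ()))
        if not neighbors:
            break
        node = random.choice(neighbors)
    return node

def choose_peer_edges(graph, gold_replicas, closest_neighbor_of, m=4):
    """Pick m peers for a newly created bronze replica."""
    golds = list(gold_replicas)
    peers = {random.choice(golds)}                        # one gold replica, for shortcutting
    peers.add(closest_neighbor_of(random.choice(golds)))  # a physically close node
    attempts = 0
    while len(peers) < m and attempts < 10 * m:           # fill the rest via random walks
        peers.add(random_walk(graph, random.choice(golds)))
        attempts += 1
    return peers                                          # the caller makes each edge bidirectional
```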
Bronze Replica Removal
- To recover disk space
  - Using a GD-Size algorithm, evict the largest, least-accessed replicas
- Drop useless replicas
  - e.g., too many updates (say, 4) arrive without an intervening access
- Must notify peer edges of the removal; the peers use a random walk to choose a new edge

Replica Updates
- Flood the entire file to the replica-graph neighbors
- Updates reach all replicas as long as the graph is strongly connected
- Optional: the user can block on an update until all neighbors reply (red-button mode)
- Network economy??? (flooding full contents is expensive, motivating the optimizations below)

Optimized Replica Updates
- Don't send large (e.g., > 1 KB) updates to each of the m neighbors
- Instead, use harbingers to dynamically build a spanning-tree update graph
  - Harbinger: a small message carrying the update's timestamps
  - Send update bodies along the spanning-tree edges
  - Happens in two phases
- Send only differences (deltas; sketched below)
  - Include the old timestamp and the new timestamp
  - Apply the delta only if the replica's timestamp matches the old timestamp
  - Revert to full-content transfer if necessary
  - Merge deltas when possible
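A minimal sketch of the delta rule above. The Delta structure, apply_patch, and request_full_copy are illustrative placeholders, not Pangaea's wire format.

```python
from dataclasses import dataclass

@dataclass
class Delta:
    old_ts: int      # timestamp the delta was computed against
    new_ts: int      # timestamp after applying the delta
    patch: bytes     # the difference itself (opaque in this sketch)

def apply_delta(replica, delta, apply_patch, request_full_copy):
    """Apply a delta only if the replica is at exactly the expected old timestamp."""
    if replica["ts"] == delta.old_ts:
        replica["data"] = apply_patch(replica["data"], delta.patch)
        replica["ts"] = delta.new_ts
        return True
    # Timestamp mismatch: the delta cannot be applied safely,
    # so fall back to transferring the full file contents.
    replica["data"], replica["ts"] = request_full_copy()
    return False
```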
Optimized Replica Updates (continued)
- Exploit the physical topology
  - Before pushing a harbinger to a neighbor, add a random delay proportional to the link RTT (e.g., up to 10 × RTT)
  - Harbingers therefore propagate down the fastest links first
  - This dynamically builds an update spanning tree out of fast edges (sketched below)
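A simplified simulation of the phase-1 race described above (illustrative names; the 10 × RTT scale follows the slide's example): each push is held back by a random delay proportional to the link RTT, each node keeps only the first harbinger it receives, and the pushes that win form a spanning tree biased toward fast links.

```python
import heapq
import random

def propagate_harbinger(links, origin, scale=10.0):
    """links: dict mapping node -> {neighbor: link RTT}.
    Returns the spanning-tree edges along which harbingers were first delivered."""
    arrival = {origin: 0.0}                   # earliest time each node sees the harbinger
    parent = {}                               # who delivered it first
    heap = [(0.0, origin)]
    while heap:
        now, node = heapq.heappop(heap)
        if now > arrival[node]:
            continue                          # stale heap entry; a faster harbinger won
        for neighbor, rtt in links[node].items():
            t = now + random.uniform(0, scale * rtt) + rtt   # random hold-back plus transit
            if t < arrival.get(neighbor, float("inf")):
                arrival[neighbor] = t
                parent[neighbor] = node
                heapq.heappush(heap, (t, neighbor))
    return [(p, n) for n, p in parent.items()]
```

Because the hold-back grows with the RTT, slow edges rarely deliver a harbinger first, so the resulting tree tends to use fast edges; in phase 2 the update bodies would then travel along these tree edges.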
Update Example (Phases 1 and 2)
[Figures: step-by-step harbinger and update propagation among replicas A-F]
Regular File Conflict
- Use a combination of version vectors and last-writer-wins to resolve conflicts (sketched below)
- If the timestamps mismatch, the full content is transferred
- Missing update: just overwrite the replica

Conflict Resolution (Three Solutions)
1) Last-writer-wins, using update timestamps
   - Requires server clock synchronization
2) Concatenate both updates
   - Make the user fix it
3) Possibly application-specific resolution
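A sketch of the combined rule, with illustrative structures (each update carries a version vector and a wall-clock timestamp): the version vector detects whether one update already subsumes the other, and last-writer-wins breaks ties between truly concurrent updates.

```python
def dominates(vv_a, vv_b):
    """True if version vector vv_a includes every update that vv_b has seen."""
    return all(vv_a.get(node, 0) >= counter for node, counter in vv_b.items())

def resolve(update_a, update_b):
    """Each update is a dict: {'vv': {node: counter}, 'ts': wall_clock, 'data': ...}."""
    if dominates(update_a["vv"], update_b["vv"]):
        return update_a                        # a already subsumes b: not a real conflict
    if dominates(update_b["vv"], update_a["vv"]):
        return update_b
    # Concurrent updates: fall back to last-writer-wins on the timestamp,
    # which is why server clocks need to be loosely synchronized.
    return update_a if update_a["ts"] >= update_b["ts"] else update_b
```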
Directory Conflict
  alice$ mv /foo /alice/foo
  bob$   mv /foo /bob/foo
- The two renames are applied to different directory replica sets (/alice's and /bob's)
- Let the child (foo) decide! (sketched below)
  - Implement mv as a change to the file's backpointer
  - The single file resolves the conflicting updates
  - The file then updates the affected directories
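A sketch of "let the child decide", using hypothetical structures: a rename becomes a timestamped update to the file's own backpointer, so two conflicting renames are resolved at the file (here simply by last-writer-wins), after which the affected directories are brought into line.

```python
def apply_rename(file_replica, new_parent, new_name, ts):
    """Record a rename as a timestamped backpointer update on the file itself."""
    if ts >= file_replica["backptr_ts"]:                  # last writer wins at the file
        file_replica["backptr"] = (new_parent, new_name)
        file_replica["backptr_ts"] = ts

def reconcile_directories(file_replica, directories):
    """Once the file has picked a winner, fix up every affected directory."""
    for entries in directories.values():
        entries.pop(file_replica["fid"], None)            # drop stale entries everywhere
    parent, name = file_replica["backptr"]
    directories[parent][file_replica["fid"]] = name

# Example: alice and bob rename /foo concurrently.
foo = {"fid": "foo-id", "backptr": ("/", "foo"), "backptr_ts": 0}
dirs = {"/": {"foo-id": "foo"}, "/alice": {}, "/bob": {}}
apply_rename(foo, "/alice", "foo", ts=10)   # alice: mv /foo /alice/foo
apply_rename(foo, "/bob", "foo", ts=12)     # bob:   mv /foo /bob/foo (later timestamp)
reconcile_directories(foo, dirs)            # bob's rename wins; /alice stays empty
```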
Temporary Failure Recovery
- Log outstanding remote operations
  - Updates, random walks, edge additions, etc.
- Retry the logged operations
  - On reboot
  - On recovery of another node
- May create superfluous edges
  - But retains m-connectedness

Permanent Failures
- A garbage collector (GC) scans for failed nodes
- Bronze replica on a failed node
  - The GC causes the replica's neighbors to replace the dead link with a new peer, chosen by random walk
- Gold replica on a failed node (see the sketch below)
  - Discovered by another gold replica (the gold replicas form a clique)
  - Choose a new gold replica by random walk
  - Flood the choice to all replicas
  - Update the parent directory to list the new gold replica nodes
  - Resolve conflicts with last-writer-wins
  - Expensive!
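A sketch of the gold-replica replacement steps, with illustrative names (pick_replacement, flood, and the parent_entry dict are placeholders; parent_entry mirrors the DirEntry fields shown earlier).

```python
import random

def pick_replacement(replica_graph, gold_set, max_steps=16):
    """Bounded random walk from a surviving gold replica to find a non-gold node."""
    node = random.choice(list(gold_set))
    for _ in range(max_steps):
        node = random.choice(list(replica_graph[node]))
        if node not in gold_set:
            return node
    return node                                            # give up; the caller may retry later

def replace_failed_gold(gold_set, failed, replica_graph, parent_entry, ts, flood):
    """Replace a dead gold replica and record the new gold set in the parent directory."""
    gold_set.discard(failed)
    candidate = pick_replacement(replica_graph, gold_set)
    gold_set.add(candidate)
    flood(("new-gold", candidate, ts))                     # announce the choice to all replicas
    if ts >= parent_entry["ts"]:                           # last-writer-wins on the directory entry
        parent_entry["downlinks"] = set(gold_set)          # cf. DirEntry.downlinks above
        parent_entry["ts"] = ts
    return candidate
```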
Performance – LAN
[Table: Andrew-Tcl benchmarks, time in seconds]

Performance – Slow Link
[Figure: the importance of local replicas]

Performance – Roaming
[Figure: compile on C1, then time the compile on C2]

Performance: Non-uniform Net
[Figure: a model of HP's corporate network; Pangaea utilizes fast links to a peer's replicas]

Performance: Update Propagation
[Figure: harbinger time is the window of inconsistency]
Performance: Large Scale
- Configurations: HP = 3000-node, 7-region HP network; U = 500 regions, 6 nodes per region, 200 ms RTT, 5 Mb/s links
- Latency improves with more replicas.
- Network economy improves with more replicas.

Performance: Availability
[Figure: numbers in parentheses are relative storage overhead]