Taming Aggressive Replication in the Pangaea Wide-area File System
Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam
Presented by Jason Waddle

Pangaea: Wide-area File System
- Support the daily storage needs of distributed users.
- Enable ad-hoc data sharing.

Pangaea Design Goals
I. Speed
   - Hide wide-area latency; file access time ~ local file system
II. Availability & autonomy
   - Avoid single points of failure
   - Adapt to churn
III. Network economy
   - Minimize use of the wide-area network
   - Exploit physical locality

Pangaea Assumptions (Non-goals)
- Servers are trusted
- Weak data consistency is sufficient (consistency in seconds)
Symbiotic Design
- Autonomous: each server operates even when disconnected from the network.
- Cooperative: when connected, servers cooperate to enhance overall performance and availability.

Pervasive Replication
- Replicate at the file/directory level
- Aggressively create replicas: whenever a file or directory is accessed
- No single “master” replica
- A replica may be read or written at any time
- Replicas exchange updates in a peer-to-peer fashion
Graph-based Replica Management
- Replicas are connected in a sparse, strongly connected, random graph
- Updates propagate along graph edges (see the flooding sketch below)
- Edges are also used for replica discovery and removal

Benefits of the Graph-based Approach
- Inexpensive: the graph is sparse, so adding or removing a replica is O(1)
- Available update distribution: as long as the graph is connected, updates reach every replica
- Network economy: high connectivity among close replicas; build a spanning tree along fast edges

Optimistic Replica Coordination
- Aim for maximum availability over strong data consistency
- Any node issues updates at any time
- Update transmission and conflict resolution happen in the background
- “Eventual consistency” (~5 s in tests)
- No strong consistency guarantees: no support for locks, lock-files, etc.
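The flooding behavior above can be summarized in a few lines. This is a minimal illustrative sketch, not Pangaea's actual code; `flood_update` and the example graph are made-up names, and the sketch only tracks which replicas an update reaches.

```python
# Minimal sketch: flooding an update along the peer edges of a replica graph.
# As long as the graph stays connected, the update reaches every replica,
# regardless of which node issued it.
from collections import deque

def flood_update(graph, origin):
    """graph: dict mapping node -> set of neighbor nodes (peer edges)."""
    delivered = {origin}             # the origin applies the update locally first
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for peer in graph[node]:     # push the update over each peer edge
            if peer not in delivered:
                delivered.add(peer)  # the peer applies the update, then forwards it
                queue.append(peer)
    return delivered

# Example: a sparse, strongly connected replica graph over five servers.
replicas = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "E"},
            "D": {"B", "E"}, "E": {"C", "D"}}
assert flood_update(replicas, "A") == set(replicas)
```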
Pangaea Structure
[Figure: servers (nodes) grouped into regions; nodes within a region are within <5 ms RTT of each other]

Server Structure
[Figure: an application's I/O requests reach the kernel NFS client and are served by the user-space Pangaea server, whose NFS protocol handler, replication engine, log, and membership modules handle them; servers talk to each other via inter-node communication]

Server Modules
- NFS protocol handler
  - Receives requests from apps, updates local replicas, generates requests to the replication engine
- Replication engine (see the sketch below)
  - Accepts local and remote requests
  - Modifies replicas
  - Forwards requests to other nodes
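A minimal sketch of this module split, with hypothetical class and method names (ReplicationEngine, NfsProtocolHandler, and the send callback are illustrative, not Pangaea's interfaces): the handler serves reads from the local replica and hands writes to the engine, which applies updates and forwards them to peer nodes.

```python
class ReplicationEngine:
    """Applies local and remote updates to replicas and forwards them to peers."""
    def __init__(self, node_id, send):
        self.node_id = node_id       # this server's identity
        self.send = send             # callback: send(peer_id, message)
        self.replicas = {}           # fid -> file contents
        self.peers = {}              # fid -> set of peer node ids for that replica

    def apply_local(self, fid, data):
        """Called by the NFS protocol handler for a local write."""
        self.replicas[fid] = data
        for peer in self.peers.get(fid, set()):            # forward to graph neighbors
            self.send(peer, ("update", fid, data, self.node_id))

    def apply_remote(self, fid, data, sender):
        """Called when an update arrives from another node."""
        if self.replicas.get(fid) == data:
            return                                         # naive duplicate suppression
        self.replicas[fid] = data
        for peer in self.peers.get(fid, set()) - {sender}:
            self.send(peer, ("update", fid, data, self.node_id))

class NfsProtocolHandler:
    """Serves application requests from local replicas; delegates writes."""
    def __init__(self, engine):
        self.engine = engine

    def read(self, fid):
        return self.engine.replicas.get(fid)               # reads hit the local replica

    def write(self, fid, data):
        self.engine.apply_local(fid, data)                 # update locally, then replicate
```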
Server Modules (continued)
- Membership module maintains:
  - The list of regions, their members, and estimated RTTs between regions
  - The location of root-directory replicas
  - This information is coordinated by gossiping (sketched below)
  - “Landmark” nodes bootstrap newly joining nodes
  - Maintaining the RTT information is the main scalability bottleneck
- Log module
  - Transaction-like semantics for local updates
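A sketch of gossip-style membership maintenance, under the assumption that each node periodically merges its view with a randomly chosen peer; MembershipView, its fields, and the merge rule (newest entry wins, RTTs averaged) are illustrative, not Pangaea's actual data structures.

```python
import random

class MembershipView:
    def __init__(self):
        self.members = {}   # node_id -> (region, version of that node's entry)
        self.rtt = {}       # (region_a, region_b) -> estimated RTT in ms

    def merge(self, other):
        """Keep the newer entry per node; smooth RTT estimates by averaging."""
        for node, (region, version) in other.members.items():
            if node not in self.members or self.members[node][1] < version:
                self.members[node] = (region, version)
        for pair, value in other.rtt.items():
            mine = self.rtt.get(pair)
            self.rtt[pair] = value if mine is None else (mine + value) / 2

def gossip_round(views):
    """Each node exchanges views with one random peer per round."""
    nodes = list(views)
    for node in nodes:
        peer = random.choice(nodes)          # may occasionally pick itself; harmless
        views[node].merge(views[peer])
        views[peer].merge(views[node])
```

Since every node tracks an RTT estimate for every pair of regions, this state grows quadratically with the number of regions, which is consistent with the note above that RTT maintenance is the main scalability bottleneck.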
File System Structure
- Gold replicas
  - Listed in directory entries
  - Form a clique in the replica graph
  - Fixed number (e.g., 3)
- All replicas (gold and bronze)
  - Unidirectional edges to all gold replicas
  - Bidirectional peer edges
  - Backpointer to the parent directory
[Figure: replica graph for /joe and /joe/foo]

  struct Replica
    fid:       FileID
    ts:        TimeStamp
    vv:        VersionVector
    goldPeers: Set(NodeID)
    peers:     Set(NodeID)
    backptrs:  Set(FileID, String)

  struct DirEntry
    fname:     String
    fid:       FileID
    downlinks: Set(NodeID)
    ts:        TimeStamp

File Creation
- Select locations for g gold replicas (e.g., g = 3)
  - One on the current server
  - The others on random servers from different regions
- Create an entry in the parent directory
- Flood updates
  - The directory update to the parent directory's replicas
  - The (empty) file contents to the gold replicas

Replica Creation
- Recursively get replicas for the ancestor directories
- Find a close replica (shortcutting)
  - Send the request to the closest gold replica
  - The gold replica forwards the request to its neighbor closest to the requester, who then sends the replica contents
- Select m peer edges (e.g., m = 4; see the sketch below)
  - Include a gold replica (for future shortcutting)
  - Include the closest neighbor of a random gold replica
  - Get the remaining nodes from random walks starting at a random gold replica
  - Create m bidirectional peer edges
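A sketch of the peer-edge selection just described. The helper names (random_walk, choose_peer_edges, closest_neighbor_of) are hypothetical, and the walk length and retry bound are arbitrary choices for the sketch.

```python
import random

def random_walk(graph, start, steps=3):
    """Take a short random walk over peer edges and return the endpoint."""
    node = start
    for _ in range(steps):
        neighbors = list(graph.get(node, ()))
        if not neighbors:
            break
        node = random.choice(neighbors)
    return node

def choose_peer_edges(graph, gold_replicas, closest_neighbor_of, m=4):
    """Pick m peers for a newly created bronze replica."""
    golds = list(gold_replicas)
    peers = {random.choice(golds)}                        # one gold replica, for shortcutting
    peers.add(closest_neighbor_of(random.choice(golds)))  # a physically close node
    attempts = 0
    while len(peers) < m and attempts < 10 * m:           # fill the rest via random walks
        peers.add(random_walk(graph, random.choice(golds)))
        attempts += 1
    return peers                                          # the caller makes each edge bidirectional
```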
Bronze Replica Removal
- To recover disk space
  - Using a GD-Size algorithm, evict the largest, least-accessed replicas
- Drop useless replicas
  - e.g., too many updates (say, 4) arrive without an intervening access
- Must notify peer edges of the removal; the peers use a random walk to choose a new edge

Replica Updates
- Flood the entire file to the replica-graph neighbors
- Updates reach all replicas as long as the graph is strongly connected
- Optional: the user can block on an update until all neighbors reply (red-button mode)
- Network economy??? (flooding full contents is expensive, motivating the optimizations below)

Optimized Replica Updates
- Don't send large (e.g., > 1 KB) updates to each of the m neighbors
- Instead, use harbingers to dynamically build a spanning-tree update graph
  - Harbinger: a small message carrying the update's timestamps
  - Send update bodies along the spanning-tree edges
  - Happens in two phases
- Send only differences (deltas; sketched below)
  - Include the old timestamp and the new timestamp
  - Apply the delta only if the replica's timestamp matches the old timestamp
  - Revert to full-content transfer if necessary
  - Merge deltas when possible
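A minimal sketch of the delta rule above. The Delta structure, apply_patch, and request_full_copy are illustrative placeholders, not Pangaea's wire format.

```python
from dataclasses import dataclass

@dataclass
class Delta:
    old_ts: int      # timestamp the delta was computed against
    new_ts: int      # timestamp after applying the delta
    patch: bytes     # the difference itself (opaque in this sketch)

def apply_delta(replica, delta, apply_patch, request_full_copy):
    """Apply a delta only if the replica is at exactly the expected old timestamp."""
    if replica["ts"] == delta.old_ts:
        replica["data"] = apply_patch(replica["data"], delta.patch)
        replica["ts"] = delta.new_ts
        return True
    # Timestamp mismatch: the delta cannot be applied safely,
    # so fall back to transferring the full file contents.
    replica["data"], replica["ts"] = request_full_copy()
    return False
```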
Optimized Replica Updates (continued)
- Exploit the physical topology
  - Before pushing a harbinger to a neighbor, add a random delay proportional to the link RTT (e.g., up to 10 × RTT)
  - Harbingers therefore propagate down the fastest links first
  - This dynamically builds an update spanning tree out of fast edges (sketched below)
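A simplified simulation of the phase-1 race described above (illustrative names; the 10 × RTT scale follows the slide's example): each push is held back by a random delay proportional to the link RTT, each node keeps only the first harbinger it receives, and the pushes that win form a spanning tree biased toward fast links.

```python
import heapq
import random

def propagate_harbinger(links, origin, scale=10.0):
    """links: dict mapping node -> {neighbor: link RTT}.
    Returns the spanning-tree edges along which harbingers were first delivered."""
    arrival = {origin: 0.0}                   # earliest time each node sees the harbinger
    parent = {}                               # who delivered it first
    heap = [(0.0, origin)]
    while heap:
        now, node = heapq.heappop(heap)
        if now > arrival[node]:
            continue                          # stale heap entry; a faster harbinger won
        for neighbor, rtt in links[node].items():
            t = now + random.uniform(0, scale * rtt) + rtt   # random hold-back plus transit
            if t < arrival.get(neighbor, float("inf")):
                arrival[neighbor] = t
                parent[neighbor] = node
                heapq.heappush(heap, (t, neighbor))
    return [(p, n) for n, p in parent.items()]
```

Because the hold-back grows with the RTT, slow edges rarely deliver a harbinger first, so the resulting tree tends to use fast edges; in phase 2 the update bodies would then travel along these tree edges.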
Update Example (Phases 1 and 2)
[Figures: step-by-step harbinger and update propagation among replicas A-F]
Regular File Conflict
- Use a combination of version vectors and last-writer-wins to resolve conflicts (sketched below)
- If the timestamps mismatch, the full content is transferred
- Missing update: just overwrite the replica

Conflict Resolution (Three Solutions)
1) Last-writer-wins, using update timestamps
   - Requires server clock synchronization
2) Concatenate both updates
   - Make the user fix it
3) Possibly application-specific resolution
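A sketch of the combined rule, with illustrative structures (each update carries a version vector and a wall-clock timestamp): the version vector detects whether one update already subsumes the other, and last-writer-wins breaks ties between truly concurrent updates.

```python
def dominates(vv_a, vv_b):
    """True if version vector vv_a includes every update that vv_b has seen."""
    return all(vv_a.get(node, 0) >= counter for node, counter in vv_b.items())

def resolve(update_a, update_b):
    """Each update is a dict: {'vv': {node: counter}, 'ts': wall_clock, 'data': ...}."""
    if dominates(update_a["vv"], update_b["vv"]):
        return update_a                        # a already subsumes b: not a real conflict
    if dominates(update_b["vv"], update_a["vv"]):
        return update_b
    # Concurrent updates: fall back to last-writer-wins on the timestamp,
    # which is why server clocks need to be loosely synchronized.
    return update_a if update_a["ts"] >= update_b["ts"] else update_b
```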
Directory Conflict
  alice$ mv /foo /alice/foo
  bob$   mv /foo /bob/foo
- The two renames are applied to different directory replica sets (/alice's and /bob's)
- Let the child (foo) decide! (sketched below)
  - Implement mv as a change to the file's backpointer
  - The single file resolves the conflicting updates
  - The file then updates the affected directories
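A sketch of "let the child decide", using hypothetical structures: a rename becomes a timestamped update to the file's own backpointer, so two conflicting renames are resolved at the file (here simply by last-writer-wins), after which the affected directories are brought into line.

```python
def apply_rename(file_replica, new_parent, new_name, ts):
    """Record a rename as a timestamped backpointer update on the file itself."""
    if ts >= file_replica["backptr_ts"]:                  # last writer wins at the file
        file_replica["backptr"] = (new_parent, new_name)
        file_replica["backptr_ts"] = ts

def reconcile_directories(file_replica, directories):
    """Once the file has picked a winner, fix up every affected directory."""
    for entries in directories.values():
        entries.pop(file_replica["fid"], None)            # drop stale entries everywhere
    parent, name = file_replica["backptr"]
    directories[parent][file_replica["fid"]] = name

# Example: alice and bob rename /foo concurrently.
foo = {"fid": "foo-id", "backptr": ("/", "foo"), "backptr_ts": 0}
dirs = {"/": {"foo-id": "foo"}, "/alice": {}, "/bob": {}}
apply_rename(foo, "/alice", "foo", ts=10)   # alice: mv /foo /alice/foo
apply_rename(foo, "/bob", "foo", ts=12)     # bob:   mv /foo /bob/foo (later timestamp)
reconcile_directories(foo, dirs)            # bob's rename wins; /alice stays empty
```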
Temporary Failure Recovery
- Log outstanding remote operations
  - Updates, random walks, edge additions, etc.
- Retry the logged operations
  - On reboot
  - On recovery of another node
- May create superfluous edges
  - But retains m-connectedness

Permanent Failures
- A garbage collector (GC) scans for failed nodes
- Bronze replica on a failed node
  - The GC causes the replica's neighbors to replace the dead link with a new peer, chosen by random walk
- Gold replica on a failed node (see the sketch below)
  - Discovered by another gold replica (the gold replicas form a clique)
  - Choose a new gold replica by random walk
  - Flood the choice to all replicas
  - Update the parent directory to list the new gold replica nodes
  - Resolve conflicts with last-writer-wins
  - Expensive!
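A sketch of the gold-replica replacement steps, with illustrative names (pick_replacement, flood, and the parent_entry dict are placeholders; parent_entry mirrors the DirEntry fields shown earlier).

```python
import random

def pick_replacement(replica_graph, gold_set, max_steps=16):
    """Bounded random walk from a surviving gold replica to find a non-gold node."""
    node = random.choice(list(gold_set))
    for _ in range(max_steps):
        node = random.choice(list(replica_graph[node]))
        if node not in gold_set:
            return node
    return node                                            # give up; the caller may retry later

def replace_failed_gold(gold_set, failed, replica_graph, parent_entry, ts, flood):
    """Replace a dead gold replica and record the new gold set in the parent directory."""
    gold_set.discard(failed)
    candidate = pick_replacement(replica_graph, gold_set)
    gold_set.add(candidate)
    flood(("new-gold", candidate, ts))                     # announce the choice to all replicas
    if ts >= parent_entry["ts"]:                           # last-writer-wins on the directory entry
        parent_entry["downlinks"] = set(gold_set)          # cf. DirEntry.downlinks above
        parent_entry["ts"] = ts
    return candidate
```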
Performance – LAN
[Table: Andrew-Tcl benchmarks, time in seconds]

Performance – Slow Link
[Figure: the importance of local replicas]

Performance – Roaming
[Figure: compile on C1, then time the compile on C2]

Performance: Non-uniform Net
[Figure: a model of HP's corporate network; Pangaea utilizes fast links to a peer's replicas]

Performance: Update Propagation
[Figure: harbinger time is the window of inconsistency]
Performance: Large Scale
- Configurations: HP = 3000-node, 7-region HP network; U = 500 regions, 6 nodes per region, 200 ms RTT, 5 Mb/s links
- Latency improves with more replicas.
- Network economy improves with more replicas.

Performance: Availability
[Figure: numbers in parentheses are relative storage overhead]