Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance
Hakim Weatherspoon
Joint with Lakshmi Ganesh, Tudor Marian, Mahesh Balakrishnan, and Ken Birman
File and Storage Technologies (FAST), San Francisco, California, February 26th, 2009
U.S. Department of the Treasury study
• Financial sector vulnerable to significant data loss in a disaster
• Need new technical options
Risks are real and the technology is available, so why is the problem not solved?
Conundrum: there is no middle ground between async and sync
Primary site, remote mirror: Local-sync vs. Remote-sync
• Want asynchronous performance to the local data center
• And want the synchronous guarantee of the remote mirror
How can we increase the reliability of local-sync protocols?
• Given that many enterprises use local-sync mirroring anyway
Different levels of local-sync reliability:
• Send each update to the mirror immediately
• Delay sending updates to the mirror; deduplication reduces bandwidth (see the sketch below)
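The delayed, deduplicating variant can be pictured as a small write-back buffer keyed by content hash. The Python sketch below is illustrative only: DedupMirrorSender and its methods are invented names, not part of SMFS.

    import hashlib

    class DedupMirrorSender:
        # Sketch of the delayed local-sync variant: local writes return
        # immediately, updates are buffered and deduplicated by content
        # hash, and only unique blocks cross the WAN when flush() runs.
        def __init__(self, send_to_mirror):
            self.send_to_mirror = send_to_mirror  # assumed transport callback
            self.pending = {}                     # content hash -> block

        def write(self, block):
            # Local write path: remember the block for later mirroring.
            self.pending[hashlib.sha1(block).digest()] = block

        def flush(self):
            # In practice this runs on a timer or when the buffer fills.
            for block in self.pending.values():
                self.send_to_mirror(block)
            self.pending.clear()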
Introduction Enterprise Continuity • How data loss occurs • How we prevent it • A possible solution Evaluation Discussion and Future Work Conclusion
Rather, where do failures occur? Between the primary site and the remote mirror: packet loss, network partition, site failure, power outage, rolling disasters.
Network-sync, alongside Remote-sync and Local-sync: primary site connected to the remote mirror over a wide-area network.
Primary site sends data packets and repair packets to the remote mirror; a network-level ack is returned before the storage-level ack.
Use network-level redundancy and exposure
• Reduces the probability that data is lost due to network failure (see the sketch below)
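A minimal sketch of the repair-packet idea, assuming fixed-size packets and a single XOR parity per group (Maelstrom actually layers several interleaved XOR channels; the function names here are mine):

    def xor_repair(packets):
        # XOR a group of equal-length packets into one repair packet; if
        # exactly one packet of the group is lost, XOR-ing the survivors
        # with this repair packet reconstructs it at the receiving side.
        out = bytearray(len(packets[0]))
        for p in packets:
            for i, b in enumerate(p):
                out[i] ^= b
        return bytes(out)

    def network_sync_send(send, data_packets, r=8):
        # Forward each data packet immediately; after every r packets,
        # inject one repair packet (c = 1 here).
        group = []
        for pkt in data_packets:
            send(pkt)
            group.append(pkt)
            if len(group) == r:
                send(xor_repair(group))
                group.clear()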
Network-sync increases data reliability
• Reduces data loss failure modes: can prevent data loss even if the primary site fails and, at the same time, the network drops packets
• Ensures data is not lost in send buffers and local queues
Data loss can still occur
• Split second(s) before/after primary site fails…
• Network partitions
• Disk controller fails at mirror
• Power outage at mirror
Existing mirroring solutions can use network-sync
SMFS: a file system constructed over network-sync
• Transparently mirrors files over the wide area
• Embraces the concept that a file is in transit (on the WAN link) but carries enough recovery data to ensure that loss rates are as low as in the remote-disk case
• Group mirroring consistency
[Figure: log-structured append example with data blocks B1–B4, inode blocks I1, I2, and version/root blocks V1, R1; shows append(B1, B2) followed by append(V1..).]
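One way to read the figure, as a hedged sketch: data-block appends are issued (and mirrored) before the metadata append that references them, so the mirror's log never contains a dangling pointer. The names and the mirror_send placeholder below are mine, not SMFS's API.

    log = []

    def mirror_send(blocks):
        pass  # placeholder for shipping blocks over the network-sync channel

    def append(*blocks):
        log.extend(blocks)   # local log-structured write
        mirror_send(blocks)  # shipped to the mirror in log order

    append("B1", "B2")              # data blocks first
    append("B3", "B4")
    append("I1", "I2", "R1", "V1")  # then the metadata that references them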
Introduction Enterprise Continuity Evaluation Conclusion
Demonstrate SMFS performance over Maelstrom
• In the event of disaster, how much data is lost?
• What is system and application throughput as link loss increases?
• How much are the primary and mirror sites allowed to diverge?
Emulab setup
• 1 Gbps, 25 ms to 100 ms link connects two data centers
• Eight primary and eight mirror storage nodes
• 64 testers submit 512 kB appends to separate logs; each tester submits only one append at a time
Data loss with Local-sync, Network-sync, and Remote-sync (primary site to remote mirror; 50 ms one-way latency; FEC(r, c) = (8, 3))
• Local-sync: unable to recover data dropped by the network
• Local-sync+FEC: lost data that was not yet in transit
• Network-sync: did not lose any data
• Represents a new tradeoff in the design space
[Figure: number of messages (log scale) vs. value of c, for FEC(r, c) = (8, varies) with 50 ms one-way latency and 1% link loss; curves show Local-sync+FEC total msgs sent, Network-sync total msgs sent, and unrecoverable lost msgs.]
• c = 0, no recovery packets: data loss due to packet loss
• c = 1: not sufficient to mask packet loss either
• c > 2: can mask most packet loss
Network-sync can prevent loss in local buffers (see the sketch below)
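As a rough sanity check on why small values of c are insufficient, the sketch below computes the probability that a group of r packets cannot be recovered, assuming independent 1% packet loss and an ideal erasure code where any r of the r + c packets suffice (Maelstrom's layered XOR code is weaker than this ideal, so treat the numbers as optimistic):

    from math import comb

    def group_loss_prob(p, r=8, c=3):
        # P(more than c of the r + c packets are lost), i.e. the group
        # cannot be recovered even by an ideal (r, c) erasure code.
        n = r + c
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(c + 1, n + 1))

    for c in (0, 1, 2, 3):
        print(c, group_loss_prob(0.01, r=8, c=c))  # 1% link loss, as in the figure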
Application throughput measures application-perceived performance. Network-sync and Local-sync+FEC throughput is significantly greater than Remote-sync(+FEC).
Introduction Enterprise Continuity Evaluation Discussion and Future Work Conclusion
Do (semi-)private lambda networks drop packets?
• E.g., TeraGrid; Cornell National Lambda Rail (NLR) Rings testbed
• Up to 0.5% loss
Scale the network-sync solution to 10 Gbps and beyond
• Commodity (multi-core) hardware
Introduction Enterprise Continuity Evaluation Discussion and Future Work Conclusion
Technology response to critical infrastructure needs
When does the file system return to the application?
• Fast: return after sending to the mirror
• Safe: return after the ACK from the mirror
• SMFS: return to the user after sending enough FEC (see the sketch below)
Network-sync: a lossy network behaves like a lossless network plus disk
Result: fast, safe mirroring independent of link length!
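The three return points can be summarized in a short sketch; all helper names below are placeholders standing in for the transport and ack machinery, not SMFS's actual interface.

    def send_to_mirror(block): pass        # placeholder transport
    def send_repair_packets(block): pass   # placeholder FEC injection
    def wait_for_network_ack(): pass       # placeholder ack machinery
    def wait_for_storage_ack(): pass

    def write_local_sync(block):
        send_to_mirror(block)        # fire-and-forget over the WAN
        return                       # fast: returns as soon as the send is issued

    def write_remote_sync(block):
        send_to_mirror(block)
        wait_for_storage_ack()       # safe: mirror has the block on disk

    def write_network_sync(block):
        send_to_mirror(block)
        send_repair_packets(block)   # enough FEC that the lossy link behaves losslessly
        wait_for_network_ack()       # SMFS returns here: fast and safe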
Questions? Email: hweather@cs.cornell.edu
Network-sync code available: http://fireless.cs.cornell.edu/~tudorm/maelstrom
Cornell National Lambda Rail (NLR) Rings testbed: http://www.cs.cornell.edu/~hweather/nlr