Rethinking End-to-End Reliability in Cloud Storage Systems



  1. RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS
     Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
     To appear in USENIX ATC 2019

  2. Denser flash → shorter lifetime
     [Figure: error rate vs. number of writes for SLC, MLC, and TLC flash, with a horizontal "acceptable error rate" threshold. Each curve's lifetime (SLC lifetime, MLC lifetime, TLC lifetime) is the number of writes at which it reaches the threshold.]
     Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

  3. Shorter flash lifetimes are a problem
     • Datacenter operators must closely monitor flash writes
     • Memory : flash cost ratio is increasing → workloads moving from DRAM to flash → increases pressure on flash
     • Datacenters struggling to adopt future generations of flash (e.g., QLC)
     Callout: How can we increase flash lifetimes?

  4. Increasing acceptable error rate → increase lifetimes
     [Figure: error rate vs. number of writes for SLC, MLC, and TLC flash. Raising the acceptable error rate threshold extends the TLC lifetime.]
     Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.
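The relationship on this slide can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the error rate grows exponentially with the number of writes; the function name `lifetime_in_writes` and the parameters `base_rate` and `growth` are hypothetical and are not taken from the talk or from measured flash data.

```python
import math


def lifetime_in_writes(acceptable_error_rate, base_rate=1e-8, growth=1e-3):
    """Number of writes until the modeled error rate reaches the threshold.

    Assumed toy model (not measured data):
        error_rate(n) = base_rate * exp(growth * n)
    """
    if acceptable_error_rate <= base_rate:
        return 0.0
    return math.log(acceptable_error_rate / base_rate) / growth


# Raising the acceptable error rate moves the crossing point to the right.
strict = lifetime_in_writes(1e-6)
relaxed = lifetime_in_writes(1e-4)
print(f"strict threshold:  {strict:.0f} writes")
print(f"relaxed threshold: {relaxed:.0f} writes")
```

Under any model in which the error rate rises monotonically with writes, a higher threshold is crossed later, which is the point the figure makes.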

  5. But... hardware is expected to have low error rates
     • Software is designed on the assumption that bit errors are rare
     • Bit errors cause failed operations and reduced availability
     • Error-handling path is not performant

  6. Distributed error Isolation and RECovery Techniques (DIRECT)
     1. Use distributed redundancy to fix local bit errors
        • Distributed systems need redundant copies for availability
     2. Optimize error-recovery performance
     → flash devices can expose high error rates
     → flash devices have longer lifetimes
     → cheaper flash devices (QLC and beyond)

  7. Bit errors in the storage stack…
     [Diagram: a Distributed Coordination / Replication Layer spans several nodes. Each node stacks a local data store on a hardened file system (e.g., ZFS) on unreliable flash.]

  8. … can manifest in the file system
     [Diagram: storage stack from slide 7: Distributed Coordination / Replication Layer over local data stores, hardened file systems (e.g., ZFS), and unreliable flash.]
     Errors in File System:
     • File system metadata (inodes, etc.)
     • File system data (data blocks)

  9. …or in the local data store
     [Diagram: storage stack from slide 7, with one local data store labeled RocksDB.]
     Errors in File System:
     • File system metadata (inodes, etc.)
     • File system data (data blocks)
     → Application metadata or data

  10. …and need to be dealt with in the coordination layer
      [Diagram: storage stack from slide 7, with the Distributed Coordination / Replication Layer labeled Paxos / ZooKeeper.]
      Errors in File System:
      • File system metadata (inodes, etc.)
      • File system data (data blocks)
      → Application metadata or data
      • Correct recovery

  11. DIRECT corrects bit errors in the local data store
      [Diagram: storage stack from slide 7 (local data stores on hardened file systems (e.g., ZFS) on unreliable flash), with DIRECT added at the Distributed Coordination / Replication Layer.]

  12. Local data store errors: metadata
      [Diagram: Distributed Coordination / Replication Layer with DIRECT, over several local data stores. Each local data store holds local metadata (version number, server ID, index, etc.) and data objects. An X marks an error in one node's local metadata.]

  13. DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
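Local replication of metadata can be sketched as several checksummed copies kept on the same node. The example below is a minimal illustration under stated assumptions, not the implementation from the talk: the class `ReplicatedMetadata`, the choice of three copies, and the use of CRC32 are all hypothetical.

```python
import zlib


class ReplicatedMetadata:
    """Keeps several checksummed local copies of one metadata record."""

    def __init__(self, payload: bytes, copies: int = 3):
        self.copies = [self._seal(payload) for _ in range(copies)]

    @staticmethod
    def _seal(payload: bytes) -> bytearray:
        # Record layout (assumed): 4-byte big-endian CRC32, then the payload.
        return bytearray(zlib.crc32(payload).to_bytes(4, "big") + payload)

    @staticmethod
    def _verify(record: bytearray):
        stored = int.from_bytes(record[:4], "big")
        payload = bytes(record[4:])
        return payload if zlib.crc32(payload) == stored else None

    def read(self) -> bytes:
        # Return the first copy that verifies, and rewrite every copy from it.
        for record in self.copies:
            payload = self._verify(record)
            if payload is not None:
                self.copies = [self._seal(payload) for _ in self.copies]
                return payload
        raise IOError("all local metadata copies are corrupt")


meta = ReplicatedMetadata(b"version=7;server_id=2")
meta.copies[0][5] ^= 0x01   # simulate a single bit error in the first copy
recovered = meta.read()     # served from an intact copy; the bad copy is repaired
print(recovered)
```

Metadata is small relative to data objects, so keeping extra local copies costs little space while avoiding a remote fetch for the records the node needs in order to interpret everything else.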

  14. Local data store errors: data
      [Diagram: Distributed Coordination / Replication Layer with DIRECT, over several local data stores. Each local data store holds local metadata (version number, server ID, index, etc.) and data objects. An X marks an error in one node's data objects.]

  15. DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas

  16. Optimizing error recovery: strawman treats bit errors as unavailability events
      [Diagram: Distributed Coordination / Replication Layer with DIRECT, over several local data stores holding local metadata (version number, server ID, index, etc.) and data objects. An X marks an error in one node's data objects.]
      Copy entire node: prohibitively slow

  17. Optimizing error recovery: strawman treats bit errors as unavailability events
      [Diagram: Distributed Coordination / Replication Layer (PAR) with DIRECT, over several local data stores holding local metadata (version number, server ID, index, etc.) and data objects. An X marks an error in one node's data objects.]
      Copy entire node: prohibitively slow
      Callout: How to isolate data necessary for recovery?

  18. DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas
         → Minimize amount of data required from other replicas
         → Challenging in logically-replicated systems
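The contrast with the "copy entire node" strawman can be sketched as a repair routine that fetches only the objects whose checksums fail. This is a simplified illustration: the `Replica` class and `repair` function are hypothetical, and the sketch assumes every replica stores the same objects under the same keys, which sidesteps the difficulty the slide notes for logically-replicated systems.

```python
import zlib


class Replica:
    """One node's local data store: key -> [value, checksum at write time]."""

    def __init__(self, objects):
        self.store = {k: [v, zlib.crc32(v)] for k, v in objects.items()}

    def is_corrupt(self, key) -> bool:
        value, crc = self.store[key]
        return zlib.crc32(value) != crc

    def fetch(self, key) -> bytes:
        if self.is_corrupt(key):
            raise IOError(f"object {key!r} is corrupt on this replica")
        return self.store[key][0]


def repair(local, peers) -> int:
    """Repair corrupt local objects from peers; return the bytes fetched."""
    fetched = 0
    for key in list(local.store):
        if not local.is_corrupt(key):
            continue
        for peer in peers:
            try:
                value = peer.fetch(key)
            except (IOError, KeyError):
                continue  # this peer cannot help; try the next one
            local.store[key] = [value, zlib.crc32(value)]
            fetched += len(value)
            break
    return fetched


objects = {"a": b"alpha", "b": b"bravo", "c": b"charlie"}
local, peer = Replica(objects), Replica(objects)
local.store["b"][0] = b"brav0"          # simulate corruption of one object
bytes_fetched = repair(local, [peer])   # only object "b" crosses the network
print(bytes_fetched, local.fetch("b"))
```

Here the transfer is the size of one object rather than the size of the node, which is the goal stated in the second bullet.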

  19. DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas
         → Minimize amount of data required from other replicas
         → Challenging in logically-replicated systems
      3. Safe recovery

  20. Naïve recovery protocol
      [Diagram: Distributed Coordination / Replication Layer with DIRECT, over several nodes holding local metadata (version number, server ID, index, etc.). Three object replicas hold A; an X marks one replica as corrupt.]

  21. Naïve recovery protocol
      [Diagram: same layout as slide 20. A recovery request and a write operation are both in flight; the replica marked X now shows A’, the other two still show A.]

  22. Naïve recovery protocol
      [Diagram: same layout as slide 20. The write operation has reached every replica, so all three show A’; the X is still drawn on the first replica.]

  23. Naïve recovery protocol: inconsistency
      [Diagram: same layout as slide 20. One replica shows A while the other two show A’.]

  24. DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas
         → Minimize amount of data required from other replicas
         → Challenging in logically-replicated systems
      3. Safe recovery
         → With respect to system’s consistency guarantees
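The race on the naïve-recovery slides can be replayed in a few lines. The sketch below is an illustration only: it uses a simple version comparison as a stand-in guard, which is an assumption of this example and not necessarily how DIRECT enforces safety. The names `Replica` and `run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    version: int
    value: str


def run(check_versions: bool):
    """Replay: recovery request, concurrent write, then the stale response."""
    replicas = [Replica(1, "A") for _ in range(3)]
    replicas[0].value = "<corrupt>"  # bit error detected on replica 0

    # Replica 1 answers the recovery request with its current state.
    response = (replicas[1].version, replicas[1].value)

    # Before the response is applied, a write of A' reaches every replica.
    for replica in replicas:
        replica.version, replica.value = 2, "A'"

    # The now-stale recovery response arrives at replica 0.
    version, value = response
    if not check_versions or version >= replicas[0].version:
        replicas[0].version, replicas[0].value = version, value

    return [replica.value for replica in replicas]


print(run(check_versions=False))  # ['A', "A'", "A'"]: replica 0 diverges
print(run(check_versions=True))   # ["A'", "A'", "A'"]: stale response dropped
```

Without the guard, the recovery response overwrites a newer write and leaves one replica behind the others, which is the inconsistency shown on slide 23.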

  25. Implementations of DIRECT
      • ZippyDB/RocksDB
        • RocksDB: KV store backed by log-structured merge tree
        • ZippyDB: distributed KV store backed by RocksDB
      • HDFS: Block-level distributed file system

  26. ZippyDB Overview
      [Diagram: ZippyDB over three RocksDB instances, one primary and two secondaries; a write request is shown at each instance.]

  27. ZippyDB Overview
      [Diagram: ZippyDB, labeled as the coordination layer, over three RocksDB instances, one primary and two secondaries; a write request is shown at each instance. RocksDB = local data store.]

  28. How ZippyDB handles corruptions
      • User reads: retry from another server
      • Background reads (compaction): crash server
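The two policies on this slide can be sketched as follows. This is a schematic model, not ZippyDB code: `CorruptionError`, `user_read`, and `background_read` are hypothetical names, servers are modeled as plain callables, and the "crash" is modeled as an unhandled exception.

```python
class CorruptionError(Exception):
    """Raised by a server when a checksum mismatch is detected on read."""


def user_read(key, servers):
    """User reads: on corruption, retry the read on another server."""
    for read in servers:
        try:
            return read(key)
        except CorruptionError:
            continue
    raise CorruptionError(f"all replicas returned corrupt data for {key!r}")


def background_read(key, read):
    """Background reads (e.g. compaction): no handling, so the error
    propagates and, in this model, takes the server down."""
    return read(key)


def healthy(key):
    return f"value-of-{key}"


def corrupted(key):
    raise CorruptionError(key)


result = user_read("k1", [corrupted, healthy])
print(result)
```

The retry path hides the error from the user but leaves the corrupt data in place, and the background path turns a single bit error into a whole-server failure, which is why both become costly when error rates rise.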

  29. ZippyDB-DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas
         → Minimize amount of data required from other replicas
         → Challenging in logically-replicated systems
      3. Safe recovery
         → With respect to system’s consistency guarantees

  30. RocksDB SST file layout
      Data block 1
      . . .
      Data block N
      Metadata block 1
      Metadata block 2
      . . .
      Index block
      Footer

  31. ZippyDB-DIRECT
      1. Protect and fix errors in local metadata
         → With local replication of metadata
      2. Fix errors in data objects with replicas
         → Minimize amount of data required from other replicas
         → Challenging in logically-replicated systems
      3. Safe recovery
         → With respect to system’s consistency guarantees

  32. Identifying corrupt data
      [Diagram: SST file layout (Data block 1 … Data block N, Metadata block 1, Metadata block 2, …, Index block, footer), with an X marking a corrupt data block.]
      No way of knowing the exact key-value pair!
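Why a checksum failure identifies a block but not a key-value pair can be seen in a simplified model. The sketch below does not use the real RocksDB SST format: `build_blocks` and `corrupt_key_ranges` are hypothetical helpers, blocks are serialized with `repr` for brevity, and the index is assumed to be intact (for example because metadata is protected separately).

```python
import zlib


def build_blocks(pairs, per_block=2):
    """Pack sorted (key, value) pairs into checksummed blocks plus an index
    recording the first and last key stored in each block."""
    blocks, index = [], []
    for i in range(0, len(pairs), per_block):
        chunk = pairs[i:i + per_block]
        payload = repr(chunk).encode()
        blocks.append((payload, zlib.crc32(payload)))
        index.append((chunk[0][0], chunk[-1][0]))
    return blocks, index


def corrupt_key_ranges(blocks, index):
    """A block checksum only says that *some* byte in the block is wrong, so
    the finest recoverable unit is the block's key range from the index."""
    return [index[i] for i, (payload, crc) in enumerate(blocks)
            if zlib.crc32(payload) != crc]


pairs = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")]
blocks, index = build_blocks(pairs)

payload, crc = blocks[1]
damaged = bytearray(payload)
damaged[0] ^= 0x01                  # flip one bit in the second block
blocks[1] = (bytes(damaged), crc)

ranges = corrupt_key_ranges(blocks, index)
print(ranges)                       # key range to re-fetch from another replica
```

In this model the recovery request has to cover every key in the damaged block's range, since the corrupt bytes themselves cannot be trusted to say which pair was hit.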
