RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS
Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
To appear in USENIX ATC 2019
Denser flash, shorter lifetime
[Figure: error rate vs. number of writes for SLC, MLC, and TLC flash. A horizontal "acceptable error rate" threshold marks the SLC, MLC, and TLC lifetimes.]
Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.
Shorter flash lifetimes are a problem
• Datacenter operators must closely monitor flash writes
• Memory : flash cost ratio is increasing
  • Workloads moving from DRAM to flash
  • Increases pressure on flash
• Datacenters struggling to adopt future generations of flash (e.g., QLC)
How can we increase flash lifetimes?
Increasing the acceptable error rate increases lifetimes
[Figure: error rate vs. number of writes for SLC, MLC, and TLC flash. With the "acceptable error rate" threshold raised, the TLC lifetime is marked further along the number-of-writes axis.]
Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.
But… hardware is expected to have low error rates
• Software is designed on the assumption that bit errors are rare
• Bit errors cause failed operations and reduced availability
• Error-handling path is not performant
Distributed error Isolation and RECovery Techniques (DIRECT)
1. Use distributed redundancy to fix local bit errors
   • Distributed systems need redundant copies for availability
2. Optimize error-recovery performance
Flash devices can expose high error rates → flash devices have longer lifetimes → cheaper flash devices (QLC and beyond)
Bit errors in the storage stack…
[Diagram: a Distributed Coordination / Replication Layer spans several nodes. Each node stacks a local data store on a hardened file system (e.g., ZFS) on unreliable flash.]

… can manifest in the file system
Errors in the file system:
• File system metadata (inodes, etc.)
• File system data (data blocks)

… or in the local data store
Example local data store: RocksDB
• Application metadata or data

… and need to be dealt with in the coordination layer
Example coordination layer: Paxos / ZooKeeper
• Correct recovery

DIRECT corrects bit errors in the local data store
[Diagram: the same stack, now with DIRECT.]
Local data store errors: metadata
[Diagram: under the Distributed Coordination / Replication Layer and DIRECT, each local data store holds local metadata (version number, server ID, index, etc.) and data objects. An X marks corrupted local metadata on one node.]

DIRECT
1. Protect and fix errors in local metadata
   • With local replication of metadata
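Local replication of metadata can be illustrated with a minimal sketch. It assumes each metadata record is stored as several checksummed copies on the same node; the helper names (`encode`, `decode`, `write_metadata`, `read_metadata`), the CRC32 framing, and the choice of two copies are hypothetical and are not taken from DIRECT's implementation.

```python
import zlib


def encode(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption can be detected."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def decode(record: bytes):
    """Return the payload if its checksum verifies, otherwise None."""
    stored, payload = int.from_bytes(record[:4], "big"), record[4:]
    return payload if zlib.crc32(payload) == stored else None


def write_metadata(payload: bytes, copies: int = 2):
    """Keep several local copies of the same metadata record (assumed count)."""
    return [bytearray(encode(payload)) for _ in range(copies)]


def read_metadata(local_copies):
    """Return the first copy that verifies and rewrite the others from it."""
    for record in local_copies:
        payload = decode(bytes(record))
        if payload is not None:
            good = encode(payload)
            for i in range(len(local_copies)):
                local_copies[i] = bytearray(good)  # repair any corrupt copy
            return payload
    raise IOError("all local metadata copies are corrupt")


# Usage: flip one bit in the first copy, then read and repair locally.
copies = write_metadata(b"version=7;server=2")
copies[0][5] ^= 0x01
recovered = read_metadata(copies)
```

Because the repair uses only local copies, no data has to cross the network in this sketch; that property is the motivation for replicating small metadata locally.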
Local data store errors: data
[Diagram: under the Distributed Coordination / Replication Layer and DIRECT, each local data store holds local metadata (version number, server ID, index, etc.) and data objects. An X marks corrupted data objects on one node.]

DIRECT
1. Protect and fix errors in local metadata
   • With local replication of metadata
2. Fix errors in data objects with replicas
Optimizing error recovery: strawman treats bit errors as unavailability events
[Diagram: the Distributed Coordination / Replication Layer (PAR) and DIRECT sit above three local data stores, each holding local metadata (version number, server ID, index, etc.) and data objects. An X marks corrupted data objects on one node.]
• Copy entire node: prohibitively slow
• How to isolate data necessary for recovery?

DIRECT
1. Protect and fix errors in local metadata
   • With local replication of metadata
2. Fix errors in data objects with replicas
   • Minimize amount of data required from other replicas
   • Challenging in logically-replicated systems
3. Safe recovery
Naïve recovery protocol
[Diagram sequence: three object replicas hold A, and an X marks the corrupted one. A recovery request and a write operation (A to A′) are in flight at the same time. In the final state two replicas hold A′ and one holds A.]

Naïve recovery protocol: inconsistency

DIRECT
1. Protect and fix errors in local metadata
   • With local replication of metadata
2. Fix errors in data objects with replicas
   • Minimize amount of data required from other replicas
   • Challenging in logically-replicated systems
3. Safe recovery
   • With respect to system's consistency guarantees
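The race behind the naïve protocol can be reproduced with a small simulation. This is a sketch under stated assumptions: the `Replica` class, the interleaving in `run`, and the version-number guard in `guarded_install` are hypothetical illustrations of one possible safeguard, not the recovery protocol DIRECT uses.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    value: str
    version: int
    corrupt: bool = False

    def snapshot(self):
        """What this replica would send in reply to a recovery request."""
        return (self.value, self.version)

    def write(self, value, version):
        """Apply a replicated write, overwriting the whole object."""
        self.value, self.version, self.corrupt = value, version, False


def naive_install(target, snap):
    """Naive recovery: install the reply with no ordering check."""
    target.value, target.version = snap
    target.corrupt = False
    return True


def guarded_install(target, snap):
    """Hypothetical guard: refuse a reply older than the local version."""
    value, version = snap
    if version < target.version:
        return False  # stale reply; a real system would retry recovery
    target.value, target.version, target.corrupt = value, version, False
    return True


def run(install):
    replicas = [Replica("A", 1, corrupt=True), Replica("A", 1), Replica("A", 1)]
    snap = replicas[1].snapshot()  # recovery reply captured before the write
    for r in replicas:             # concurrent write reaches every replica
        r.write("A'", 2)
    install(replicas[0], snap)     # delayed recovery reply arrives last
    return [r.value for r in replicas]


naive_result = run(naive_install)      # recovered replica ends up stale
guarded_result = run(guarded_install)  # stale reply is rejected
```

With the naïve installer the recovered replica ends with the old value A while the other two hold A′, matching the inconsistency in the slide. The guard only shows that recovery must be ordered with respect to concurrent writes; how that ordering is enforced depends on the system's consistency guarantees.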
Implementations of DIRECT
• ZippyDB/RocksDB
  • RocksDB: KV store backed by log-structured merge tree
  • ZippyDB: distributed KV store backed by RocksDB
• HDFS: block-level distributed file system
ZippyDB Overview
[Diagram: write requests pass through ZippyDB to three RocksDB instances, one primary and two secondaries. ZippyDB is labeled as the coordination layer; RocksDB = local data store.]
How ZippyDB handles corruptions
• User reads: retry from another server
• Background reads (compaction): crash server
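The user-read path described above can be sketched as a retry loop. The `Server` class, `CorruptionError`, and `read_with_retry` are hypothetical stand-ins for illustration and do not reflect ZippyDB's actual interfaces.

```python
class CorruptionError(Exception):
    """Raised when a server detects a corrupt block while serving a read."""


class Server:
    """Toy server: a dict of values plus a set of keys whose blocks are corrupt."""

    def __init__(self, data, corrupt_keys=()):
        self.data = dict(data)
        self.corrupt_keys = set(corrupt_keys)

    def get(self, key):
        if key in self.corrupt_keys:
            raise CorruptionError(f"corrupt block while reading {key!r}")
        return self.data[key]


def read_with_retry(key, servers):
    """User read: on corruption, retry the read on the next server."""
    for server in servers:
        try:
            return server.get(key)
        except CorruptionError:
            continue  # the corrupt copy is left as-is; only the read is redirected
    raise CorruptionError(f"every replica of {key!r} is corrupt")


# Usage: the first server holds a corrupt copy, the second one is healthy.
servers = [Server({"k": "v"}, corrupt_keys={"k"}), Server({"k": "v"})]
value = read_with_retry("k", servers)
```

The retry masks the error from the user but does not repair the corrupt copy. Per the slide, when a background read such as compaction hits the same corruption, the server crashes instead.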
ZippyDB-DIRECT
1. Protect and fix errors in local metadata
   • With local replication of metadata
2. Fix errors in data objects with replicas
   • Minimize amount of data required from other replicas
   • Challenging in logically-replicated systems
3. Safe recovery
   • With respect to system's consistency guarantees
RocksDB SST file layout
[Diagram: two SST file layouts side by side. The first holds data blocks 1 to N, followed by metadata blocks 1, 2, …, an index block, and a footer. The second holds the same data blocks, with the metadata blocks, index block, and footer each appearing twice.]
Identifying corrupt data
[Diagram: an SST file with data blocks 1 to N, followed by two copies each of the metadata blocks, index block, and footer. An X marks corruption in a data block.]
No way of knowing the exact key-value pair!
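A short sketch shows why a block-level checksum cannot point at a specific key-value pair. It assumes one checksum covers a whole data block, as in RocksDB's block-based format; the JSON encoding, the CRC32 choice, and the helper names `build_block` and `verify_block` are illustrative assumptions, not the real SST encoding.

```python
import json
import zlib


def build_block(pairs):
    """Pack several key-value pairs into one block with a single trailing CRC32."""
    payload = json.dumps(pairs).encode()
    return bytearray(payload + zlib.crc32(payload).to_bytes(4, "big"))


def verify_block(block):
    """True if the trailing checksum matches the block contents."""
    payload, stored = bytes(block[:-4]), int.from_bytes(block[-4:], "big")
    return zlib.crc32(payload) == stored


# Usage: a single flipped bit invalidates the whole block.
block = build_block([["a", "1"], ["b", "2"], ["c", "3"]])
ok_before = verify_block(block)
block[3] ^= 0x04                 # bit error somewhere inside the block
ok_after = verify_block(block)   # the check says "bad block", not "bad pair"
```

The verification result is a single boolean per block, so the reader learns that some pair in the block is damaged but not which one. Recovery therefore has to treat the whole block as suspect.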