A mathematical theory of distributed storage
Dagstuhl workshop 16321: Coding in the time of Big Data
Michael Luby
August 8, 2016
Cloud storage state of affairs
− Storage clusters contain thousands of storage nodes, with e.g. 500 TB capacity per node
− Clusters are built on commodity HW; failures are very frequent
− Durability of data achieved via replication (3 copies → 3x storage), too costly
[Figure: daily failed nodes in a 3000-node FB production cluster over one month. Rashmi et al., "A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster"]
Cloud storage
• Triplication
− High storage overhead (67%) and cost
− Limited durability (tolerates only 2 failures)
− Reactive repair → high repair bandwidth
• Erasure codes (RS)
− RS(9,6,3) → 33% storage overhead (MS)
− RS(14,10,4) → 29% storage overhead (FB)
− Better overhead and durability than triplication, but:
− High repair bandwidth
− Degraded access
[Figure: trade-off chart placing triplication and small erasure codes relative to ideal cloud storage]
The overhead arithmetic is checked in the sketch below.
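As a quick sanity check of these percentages, here is a minimal sketch (mine, not from the talk) that computes the storage overhead r/n for each scheme; triplication is treated as the trivial (3,1) replication code.

```python
# Storage overhead beta = r/n = 1 - k/n for an (n, k) code with r = n - k
# redundant fragments. Triplication is the trivial (3, 1) code.

def storage_overhead(n: int, k: int) -> float:
    """Fraction of raw capacity spent on redundancy."""
    return 1 - k / n

schemes = {
    "Triplication (3,1)": (3, 1),
    "RS(9,6,3) (MS)":     (9, 6),
    "RS(14,10,4) (FB)":   (14, 10),
}

for name, (n, k) in schemes.items():
    print(f"{name}: overhead = {storage_overhead(n, k):.0%}")
# Prints 67%, 33%, 29% -- matching the figures above.
```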
Cloud storage
[Figure: the same trade-off chart with Liquid cloud storage added, sitting closer to ideal cloud storage than either triplication or small erasure codes]
Quantitative comparison
Liquid advantages:
− Lower storage overhead
− Lower repair costs
− Better durability
− Superior trade-offs
− Customize to infrastructure
[Figure: peak repair BW per node (Mbps, log scale) versus storage overhead β = r/n, for triplication (worse), RS, and Liquid with (k = 268, r = 134), (k = 335, r = 67), (k = 382, r = 20)]
MTTDL (durability): 10^7 years for Liquid (better), 10^6 years for Reed-Solomon, 10^5 years for triplication (worse)
Overview
Mathematical model of distributed storage:
− Based on an understanding of deployed systems
− Models a distributed storage system
− Models maintaining recoverability of the message when nodes can fail
− Enables analysis of storage overhead & repair bandwidth trade-offs
Information-theoretic lower bounds:
− Fundamental lower bounds on the trade-offs
Algorithmic upper bounds:
− Matching algorithmic upper bounds on the trade-offs
− Using standard erasure codes
A mathematical model of communication – Shannon
[Diagram: Source → Transmitter → Receiver → Destination. The source emits message x, the transmitter turns it into a signal, noise perturbs the signal, and the receiver turns the received signal y into the recovered message z]
A mathematical model of distributed storage
[Diagram, mirroring Shannon's model: Source → Storer → Nodes → Accessor → Destination. The source emits message x; the storer writes data to the nodes; failures damage the stored data; time T passes between storage and access; the accessor reads data y from the nodes and produces the recovered message z]
Storage nodes model
[Diagram: a Repairer with local memory reads data from and writes data to storage nodes 1, 2, 3, …, n; nodes can fail]
Node failure model
When a node fails:
− All data stored at the node is immediately lost
− The failed node is immediately replaced by a new node
− Bits at the new node are initialized to zeroes
Failure process:
− Determines when nodes fail
− Determines which nodes fail
System overview
Storer:
− Writes data to the nodes, generated from the message x received from a source
Repairer (a continual repair loop; sketched below):
− Aware of when and which nodes fail
− Reads data from nodes
− Generates new data from the read data
− Writes the new data to nodes
Accessor:
− Reads data y from the nodes
− Generates z from y and provides z to a destination
Goal: recovered message z = original message x
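To fix the interfaces, here is a minimal sketch of the repairer's continual loop as code; the class, method, and parameter names are mine, and the erasure-code details are abstracted behind an assumed regenerate callback.

```python
# Sketch of one iteration of the repairer's continual loop.
# 'nodes' holds per-node data (None for a freshly replaced, zeroed node);
# 'regenerate' is an assumed callback that rebuilds one node's data from
# data read off the surviving nodes.

class Repairer:
    def __init__(self, nodes, regenerate):
        self.nodes = nodes
        self.regenerate = regenerate

    def repair_step(self, failed_index: int):
        # 1. Read data from the surviving nodes.
        read = {i: d for i, d in enumerate(self.nodes)
                if i != failed_index and d is not None}
        # 2. Generate new data from the read data.
        new_data = self.regenerate(read, failed_index)
        # 3. Write the new data to the replacement node.
        self.nodes[failed_index] = new_data
```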
Bounds
Lower bounds:
− There is a failure process such that, for any repairer, the average repair traffic is above a threshold function of the storage overhead
− Information-theoretic
Upper bounds:
− There is a repairer such that, for any failure process, the peak repair traffic is below a threshold function of the storage overhead
− Algorithmic, based on the Liquid cloud storage algorithms:
− Large erasure codes
− Lazy repair strategy
− Flow data organization
Failure process
Failure timing – determines when nodes fail:
− T_Fixed = fixed timing, i.e., Δ duration between failures
− T_Random = random timing, i.e., Poisson with average duration Δ between failures
Failure pattern – determines which nodes fail:
− P_Random = random pattern, i.e., a uniformly random node fails
− P_Adversarial = adversarial pattern, i.e., the failed node is chosen based on all available information
Four combinations: (T_Random, P_Adversarial), (T_Random, P_Random), (T_Fixed, P_Adversarial), (T_Fixed, P_Random); the random components are sketched in code below
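A small simulation sketch of the two random components defined above; the adversarial pattern is deliberately left unimplemented, since it depends on the repairer's state. Names are mine.

```python
import random

DELTA = 1.0  # fixed (or mean) time between failures
N = 3000     # number of storage nodes

def next_failure_time(now: float, random_timing: bool) -> float:
    """T_Random: Poisson process, exponential gaps with mean DELTA.
    T_Fixed: a failure exactly every DELTA time units."""
    if random_timing:
        return now + random.expovariate(1.0 / DELTA)
    return now + DELTA

def failed_node(random_pattern: bool) -> int:
    """P_Random: a uniformly random node fails."""
    if random_pattern:
        return random.randrange(N)
    raise NotImplementedError("P_Adversarial depends on all available information")

# Example: the first five failures under (T_Random, P_Random)
t = 0.0
for _ in range(5):
    t = next_failure_time(t, random_timing=True)
    print(f"t = {t:.2f}: node {failed_node(True)} fails")
```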
Repairers
Deterministic repairer:
− Previous failure-process actions determine the next repairer action
Randomized repairer:
− The repairer can use a source of random bits to determine its actions
− The random bits are private to the repairer (not available to the failure process)
(T_Fixed, P_Adversarial)-failures bounds
Bounds on storage overhead versus repairer traffic, for (T_Fixed, P_Adversarial)-failures and a deterministic repairer:
− Upper bound
− Lower bound
− The bounds are equal (asymptotically, as the storage overhead goes to zero)
Main bounds
− Main upper bound: (T_Random, P_Adversarial)-failures, deterministic repairer
− Main lower bound: (T_Fixed, P_Random)-failures, randomized repairer
[Diagram: the four failure processes (T_Random, P_Adversarial), (T_Random, P_Random), (T_Fixed, P_Adversarial), (T_Fixed, P_Random) ordered between the two main bounds]
− Both main bounds apply to random failures
− The bounds are equal (asymptotically, as the storage overhead goes to zero)
Definitions for bounds
Storage overhead:
− β = 1 − m/c = storage overhead
− m = size of message x
− c = n·s = total storage capacity
− n = number of storage nodes
− s = storage capacity per node
Repairer read rate:
− R_AVG = lower bound on the average repair read rate
− R_PEAK = upper bound on the peak repair read rate
Durability:
− MTTDL = mean time to data loss: mean time until at least some of the message is unrecoverable
A small numeric example follows.
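To make the definitions concrete, a tiny numeric example (the numbers are mine, chosen to echo the cluster on the earlier slide):

```python
n = 3000            # storage nodes
s = 500e12 * 8      # bits per node (500 TB)
c = n * s           # total storage capacity, c = n*s
m = 0.75 * c        # message size: fill 75% of raw capacity

beta = 1 - m / c    # storage overhead
print(f"beta = {beta:.2f}")  # 0.25: a quarter of capacity is redundancy
```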
(T_Fixed, P_Adversarial)-failures bounds
Lower bound when β = 0.25:
− R_AVG ≥ 0.815 · s/(2β·Δ) is necessary to guarantee message recovery
Upper bound when β = 0.25:
− R_PEAK ≤ 1.31 · s/(2β·Δ) is sufficient to guarantee message recovery
Asymptotically as β → 0:
− R_AVG → s/(2β·Δ) ← R_PEAK
Main results as storage overhead β → 0
Lower bound:
− (T_Fixed, P_Random)-failures interacting with a randomized repairer
− R_AVG ≥ s/(2β·Δ) is necessary to achieve a large MTTDL
Upper bound:
− (T_Random, P_Adversarial)-failures interacting with a deterministic repairer
− R_PEAK ≤ s/(2β·Δ) is sufficient to achieve a large MTTDL
These thresholds are evaluated numerically in the sketch below.
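A quick numeric evaluation of the threshold s/(2β·Δ), using the normalization Δ = 1, s = 1 from the plots that follow; the β = 0.25 constants from the previous slide are included as a check.

```python
DELTA, s = 1.0, 1.0

def threshold(beta: float) -> float:
    """Asymptotic repair read rate s / (2 * beta * DELTA)."""
    return s / (2 * beta * DELTA)

for beta in (0.05, 0.10, 0.25):
    print(f"beta = {beta:.2f}: rate threshold = {threshold(beta):.2f}")

# (T_Fixed, P_Adversarial) constants at beta = 0.25, from the previous slide:
t = threshold(0.25)                               # = 2.0
print(f"lower bound: R_AVG  >= {0.815 * t:.2f}")  # ~1.63
print(f"upper bound: R_PEAK <= {1.31 * t:.2f}")   # ~2.62
```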
Visualization of bounds trade-offs
[Plot: repairer read rate versus storage overhead β, for β from 0.05 to 0.49, with Δ = 1 and s = 1. Curves: the upper bound (R_PEAK), the (T_Fixed, P_Adversarial) lower bound (R_AVG), the (T_Fixed, P_Random) lower bound (R_AVG), and the 1/(2β) asymptote]
Visualization of ratio: upper and lower bounds to the asymptote
[Plot: normalized repairer read rate versus storage overhead β, for β from 0.01 to 0.25. Curves: Upper·2β, (T_Fixed, P_Adversarial)-Lower·2β, and (T_Fixed, P_Random)-Lower·2β, i.e. the bounds normalized by the asymptote s/(2β·Δ)]
The (lower bound) game
Repairer–failure process game:
− The repairer tries to ensure the message is recoverable
− The failure process tries to make the message unrecoverable
Transcript:
− Record of the interactions between the repairer and the failure process:
− When nodes fail
− Which nodes fail
− Which bits are read by the repairer, etc.
Snapshot interval (t, t′)
Snapshot at time t:
− The c stored bits at the nodes at time t
At time t′ > t:
− r(t, t′) = # snapshot bits read between t and t′
− ℓ(t, t′) = # snapshot bits lost (erased before being read) between t and t′
− u(t, t′) = # snapshot bits unmodified between t and t′
− c = r(t, t′) + ℓ(t, t′) + u(t, t′) is an invariant
− Initially r(t, t) = ℓ(t, t) = 0 and u(t, t) = c
Claim (bookkeeping sketch below):
− If r(t, t′) + u(t, t′) < m, then the message is unrecoverable at time t′
− Equivalently, if ℓ(t, t′) > c − m = β·c, then the message is unrecoverable at time t′
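A bookkeeping sketch of the snapshot counters and the two (equivalent) unrecoverability tests; the class and method names are mine.

```python
# Track r (read), l (lost), u (unmodified) snapshot bits over (t, t'),
# maintaining the invariant c = r + l + u.

class Snapshot:
    def __init__(self, c: int, m: int):
        self.c, self.m = c, m
        self.r, self.l, self.u = 0, 0, c  # initially all bits unmodified

    def read(self, k: int):
        """Repairer reads k not-yet-touched snapshot bits; read bits survive."""
        self.u -= k
        self.r += k

    def erase(self, k: int):
        """Failure process erases k unread snapshot bits; they are lost."""
        self.u -= k
        self.l += k

    def unrecoverable(self) -> bool:
        assert self.r + self.l + self.u == self.c  # the invariant
        # r + u < m  is equivalent to  l > c - m = beta*c, given the invariant.
        return self.r + self.u < self.m
```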
Intuition
− Message size m is the capacity of 6 nodes
− Storage overhead c − m is the capacity of 3 nodes (β = 1/3)
[Diagram: 9 nodes, contrasting a state where the message may be recoverable with one where the message is not recoverable]
Intuition analysis
Suppose the erased and read bits are disjoint:
− Then all erased bits are lost
− Necessary condition for message recoverability: the repairer must read m bits before the failure process erases c − m bits
− The failure process erases bits at rate s/Δ
− So the repairer must read bits at a rate of at least m·s/((c − m)·Δ) = (1 − β)·s/(β·Δ) (derivation below)
Generally the erased and read bits are not disjoint:
− The repairer can read bits from a node before the node fails
− Bits that have been read are not lost when the node fails
− So the number of bits lost when a node fails can be less than s
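For completeness, the rate arithmetic for the disjoint case written out as a short derivation, using m = (1 − β)·c and c − m = β·c from the definitions slide:

```latex
% Erasing c - m bits at rate s/\Delta takes \frac{c-m}{s}\,\Delta time,
% so reading m bits within that window needs a read rate of at least
\[
  \frac{m}{\tfrac{c-m}{s}\,\Delta}
  = \frac{m\,s}{(c-m)\,\Delta}
  = \frac{(1-\beta)\,c\,s}{\beta\,c\,\Delta}
  = \frac{(1-\beta)\,s}{\beta\,\Delta}
  \approx \frac{s}{\beta\,\Delta} \quad (\beta \to 0).
\]
% This is about twice the main asymptotic bound s/(2\beta\Delta),
% consistent with the non-disjoint discussion above: reading bits
% from a node before it fails is what buys back the factor of 2.
```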
Snapshot interval evolution
− Message size m is the capacity of 6 nodes
− Storage overhead c − m is the capacity of 3 nodes (β = 1/3)
[Diagram: evolution of the snapshot bits over the 9 nodes across the interval, as the repairer reads bits and nodes fail]