
15-853: Algorithms in the Real World: Error Correcting Codes (cont.)



  1. 15-853: Algorithms in the Real World. Error Correcting Codes (cont.). Scribe volunteers: ? Announcement: Scribe notes template and instructions are on the course webpage.

  2. General Model
The channel pipeline: message (m) → encoder → codeword (c) → noisy channel → codeword' (c') → decoder → message (or error).
"Noise" introduced by the channel:
• Changed fields in the codeword vector (e.g. a flipped bit), called errors.
• Missing fields in the codeword vector (e.g. a lost byte), called erasures.
How does the decoder deal with errors and/or erasures?
• Detection (only needed for errors)
• Correction
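To make the two noise types concrete, here is a minimal Python sketch (purely illustrative, not from the lecture) of a channel that introduces errors and erasures at assumed rates:

import random

def noisy_channel(codeword, p_error=0.1, p_erasure=0.1):
    # For each symbol, independently: erase it, flip it, or pass it through.
    out = []
    for bit in codeword:
        r = random.random()
        if r < p_erasure:
            out.append(None)        # erasure: the position is known to be missing
        elif r < p_erasure + p_error:
            out.append(1 - bit)     # error: the bit is silently changed
        else:
            out.append(bit)
    return out

print(noisy_channel([0, 1, 1, 0, 1]))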

  3. Block Codes
Each message and codeword is of fixed size. Pipeline: message (m) → coder → codeword (c) → noisy channel → codeword' (c') → decoder → message (or error).
• Σ = codeword alphabet, q = |Σ|
• k = |m|, n = |c|
• C = "code" = set of codewords, C ⊆ Σ^n
• Δ(x, y) = number of positions i such that x_i ≠ y_i
• d = min{ Δ(x, y) : x, y ∈ C, x ≠ y }
A code is described as (n, k, d)_q.
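As a small illustration of these definitions (a sketch I am adding, not part of the slides), the distance Δ and the minimum distance d of a code can be computed directly:

from itertools import combinations

def hamming_distance(x, y):
    # Delta(x, y): number of positions where the two words differ
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def minimum_distance(code):
    # d = min Delta(x, y) over all pairs of distinct codewords in C
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

# Example: the (3, 1, 3)_2 repetition code {000, 111} has d = 3
print(minimum_distance(["000", "111"]))   # 3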

  4. Role of Minimum Distance
Theorem: A code C with minimum distance d can:
1. detect any (d-1) errors
2. recover any (d-1) erasures
3. correct any ⌊(d-1)/2⌋ errors
Stated another way:
• For s-bit error detection: d ≥ s + 1
• For s-bit error correction: d ≥ 2s + 1
• To correct a erasures and b errors: d ≥ a + 2b + 1
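A hedged sketch of how the correction bound plays out in practice: brute-force nearest-codeword decoding succeeds as long as at most ⌊(d-1)/2⌋ positions are flipped (the helper and example code below are my own, not from the slides):

def decode_nearest(word, code):
    # Return the codeword closest in Hamming distance to the received word.
    dist = lambda x, y: sum(a != b for a, b in zip(x, y))
    return min(code, key=lambda c: dist(word, c))

code = ["000", "111"]                  # minimum distance d = 3
print(decode_nearest("010", code))     # "000": one flipped bit is corrected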

  5. Next we will see an application of erasure codes in today's large-scale data storage systems.

  6. Large-scale distributed storage systems
• 1000s of interconnected servers, 100s of petabytes of data
• Commodity components
• Software issues, power failures, maintenance shutdowns

  7. Large-scale distributed storage systems
• 1000s of interconnected servers, 100s of petabytes of data
• Commodity components; software issues, power failures, maintenance shutdowns
Unavailabilities are the norm rather than the exception.

  8. Facebook analytics cluster in production: unavailability statistics
• Multiple thousands of servers
• Unavailability event: server unresponsive for > 15 min
[Chart: number of unavailability events per day over a 30-day period; median: 52 events/day]
[Rashmi, Shah, Gu, Kuang, Borthakur, Ramchandran, USENIX HotStorage 2013 and ACM SIGCOMM 2014]

  9. Facebook analytics cluster in production: unavailability statistics
• Multiple thousands of servers
• Unavailability event: server unresponsive for > 15 min
• Daily server unavailability = 0.5 - 1%
[Chart: number of unavailability events per day over a 30-day period; median: 52 events/day]
[Rashmi, Shah, Gu, Kuang, Borthakur, Ramchandran, USENIX HotStorage 2013 and ACM SIGCOMM 2014]

  10. Servers unavailable ⇒ data inaccessible. Applications cannot wait, and data cannot be lost. Data needs to be stored in a redundant fashion.

  11. Traditional approach: Replication
• Storing multiple copies of data: typically 3x-replication
[Diagram: data blocks a, b, c, d; 3 replicas of each block distributed on servers across the network]

  12. Traditional approach: Replication
• Storing multiple copies of data: typically 3x-replication
[Diagram: data blocks a, b, c, d; 3 replicas of each block distributed on servers across the network]
Too expensive for large-scale data. Better alternative: sophisticated codes.

  13. Two data blocks to be stored: a and b. Goal: tolerate any 2 failures.
• 3-replication: store a, a, a, b, b, b (blocks 1-6). Storage overhead = 3x.
• Erasure code: store a, b, a+b, a+2b (blocks 1-4), where a+b and a+2b are "parity blocks". Storage overhead = 2x.

  14. Two data blocks to be stored: a and b. Goal: tolerate any 2 failures.
• 3-replication: store a, a, a, b, b, b (blocks 1-6). Storage overhead = 3x.
• Erasure code: store a, b, a+b, a+2b (blocks 1-4), where a+b and a+2b are "parity blocks". Storage overhead = 2x.
Much less storage for the desired fault tolerance (see the sketch below).
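A minimal sketch of the toy erasure code above, assuming arithmetic modulo a small prime (real systems typically use GF(2^8)); any 2 of the 4 stored blocks recover a and b:

P = 257                                      # small prime field; an assumption for this sketch
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]    # block i stores c0*a + c1*b: a, b, a+b, a+2b

def encode(a, b):
    return [(c0 * a + c1 * b) % P for c0, c1 in COEFFS]

def decode(surviving):
    # surviving: dict {block_index: value} with at least 2 entries
    (i, vi), (j, vj) = list(surviving.items())[:2]
    (a0, b0), (a1, b1) = COEFFS[i], COEFFS[j]
    det = (a0 * b1 - a1 * b0) % P
    det_inv = pow(det, P - 2, P)             # modular inverse via Fermat's little theorem
    a = ((vi * b1 - vj * b0) * det_inv) % P
    b = ((vj * a0 - vi * a1) * det_inv) % P
    return a, b

blocks = encode(42, 7)
print(decode({2: blocks[2], 3: blocks[3]}))  # (42, 7) recovered from the two parity blocks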

  15. Erasure codes: how are they used in distributed storage systems?
Example: 10 data blocks (a, b, c, d, e, f, g, h, i, j) and 4 parity blocks (P1, P2, P3, P4), distributed to servers across the network.

  16. Almost all large-scale storage systems today employ erasure codes: Facebook, Google, Amazon, Microsoft, ...
"Considering trends in data growth & datacenter hardware, we foresee HDFS erasure coding being an important feature in years to come." - Cloudera Engineering (September 2016)

  17. Error Correcting Multibit Messages
We will first discuss Hamming Codes, named after Richard Hamming (1915-1998), a pioneer in error-correcting codes and computing in general.

  18. Error Correcting Multibit Messages
We will first discuss Hamming Codes.
Codes are of the form (2^r - 1, 2^r - 1 - r, 3) for any r > 1, e.g. (3,1,3), (7,4,3), (15,11,3), (31,26,3), ..., which correspond to 2, 3, 4, 5, ... "parity bits" (i.e. n - k).
Question: Error detection and correction capability? (Can detect 2-bit errors, or correct 1-bit errors.)
The high-level idea is to "localize" the error.

  19. Hamming Codes: Encoding (r = 4)
Codeword layout (positions 15 down to 1): m15 m14 m13 m12 m11 m10 m9 p8 m7 m6 m5 p4 m3 p2 p1, plus an optional overall parity bit p0.
• Localizing the error to the top or bottom half (1xxx or 0xxx):
p8 = m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m11 ⊕ m10 ⊕ m9
• Localizing the error to x1xx or x0xx:
p4 = m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m7 ⊕ m6 ⊕ m5
• Localizing the error to xx1x or xx0x:
p2 = m15 ⊕ m14 ⊕ m11 ⊕ m10 ⊕ m7 ⊕ m6 ⊕ m3
• Localizing the error to xxx1 or xxx0:
p1 = m15 ⊕ m13 ⊕ m11 ⊕ m9 ⊕ m7 ⊕ m5 ⊕ m3
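The parity equations above can be summarized compactly: p_j (for j = 8, 4, 2, 1) is the XOR of the data bits at all positions whose index has bit j set. A hedged Python sketch of that rule follows; the assignment of the 11 data bits to positions runs low-to-high, which is my own convention (the slide lists m15 first):

def hamming15_encode(data_bits):
    # data_bits: 11 bits, placed at positions 3, 5, 6, 7, 9, ..., 15 (non powers of two)
    assert len(data_bits) == 11
    code = [0] * 16                                    # indices 1..15 used; index 0 unused
    data_positions = [p for p in range(3, 16) if p & (p - 1) != 0]
    for pos, bit in zip(data_positions, data_bits):
        code[pos] = bit
    for j in (1, 2, 4, 8):                             # parity positions p1, p2, p4, p8
        for p in range(1, 16):
            if (p & j) and p != j:                     # data positions covered by this check
                code[j] ^= code[p]
    return code[1:]                                    # 15-bit codeword, positions 1..15

codeword = hamming15_encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
print(codeword)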

  20. Hamming Codes: Decoding
Codeword layout: m15 m14 m13 m12 m11 m10 m9 p8 m7 m6 m5 p4 m3 p2 p1 p0.
We don't need p0, so we have a (15,11,?) code.
After transmission, we compute:
b8 = p8 ⊕ m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m11 ⊕ m10 ⊕ m9
b4 = p4 ⊕ m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m7 ⊕ m6 ⊕ m5
b2 = p2 ⊕ m15 ⊕ m14 ⊕ m11 ⊕ m10 ⊕ m7 ⊕ m6 ⊕ m3
b1 = p1 ⊕ m15 ⊕ m13 ⊕ m11 ⊕ m9 ⊕ m7 ⊕ m5 ⊕ m3
With no errors, these will all be zero. With one error, b8 b4 b2 b1 gives the error location: e.g. 0100 would tell us that p4 is wrong, and 1100 would tell us that m12 is wrong.
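A matching sketch of the decoder: recompute each check; the non-zero checks, read together as a binary position index b8 b4 b2 b1, point at the single erroneous bit. This sketch reuses the hamming15_encode helper from the previous sketch, and the specific test bits are just an example:

def hamming15_decode(received):
    # received: 15 bits for positions 1..15 (received[0] is position 1)
    code = [0] + list(received)
    syndrome = 0
    for j in (1, 2, 4, 8):
        b = 0
        for p in range(1, 16):
            if p & j:
                b ^= code[p]          # includes the parity bit p_j itself
        if b:
            syndrome |= j
    if syndrome:                      # a single error: flip the indicated position
        code[syndrome] ^= 1
    return code[1:]

# Usage: flip one bit of a valid codeword and check that it is repaired.
sent = hamming15_encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
corrupted = list(sent)
corrupted[6] ^= 1                     # inject an error at position 7
assert hamming15_decode(corrupted) == sent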

  21. Hamming Codes
Can be generalized to any power of 2:
– n = 2^r - 1 (15 in the example)
– (n - k) = r (4 in the example)
– Can correct one error
– d ≥ 3 (since we can correct one error)
– Gives a (2^r - 1, 2^r - 1 - r, 3) code
(We will later see an easy way to prove the minimum distance.)
Extended Hamming code:
– Add back the parity bit at the end
– Gives a (2^r, 2^r - 1 - r, 4) code
– Can still correct one error, but now can detect up to 3

  22. A lower bound on parity bits: the Hamming bound
How many nodes in the hypercube do we need so that d = 3?
Each of the 2^k codewords eliminates its n neighbors plus itself, i.e. n + 1 nodes:
2^k (n + 1) ≤ 2^n
⇒ n - k ≥ log2(n + 1)
⇒ n ≥ k + ⌈log2(n + 1)⌉
For the Hamming code above: 15 ≥ 11 + ⌈log2(15 + 1)⌉ = 15.
Hamming codes are called perfect codes since they match the lower bound exactly.
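A small numerical check (my own, under the bound just stated) that the (2^r - 1, 2^r - 1 - r, 3) Hamming codes meet n - k ≥ ⌈log2(n + 1)⌉ with equality:

import math

def min_parity_bits(n):
    # Hamming bound for single-error correction: n - k >= ceil(log2(n + 1))
    return math.ceil(math.log2(n + 1))

for r in range(2, 6):
    n = 2**r - 1
    k = n - r
    print((n, k), "bound:", min_parity_bits(n), "perfect:", n - k == min_parity_bits(n))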

  23. A lower bound on parity bits: the Hamming bound
What about fixing 2 errors (i.e. d = 5)?
Each of the 2^k codewords eliminates itself, its neighbors, and its neighbors' neighbors, giving:
2^k (1 + (n choose 1) + (n choose 2)) ≤ 2^n
Generally, to correct s errors:
n - k ≥ ⌈log2( 1 + (n choose 1) + (n choose 2) + ... + (n choose s) )⌉
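The general bound can be evaluated directly; a hedged sketch (the function name and parameters are my own):

import math

def parity_bits_needed(n, s):
    # Volume of a Hamming ball of radius s around a codeword
    ball = sum(math.comb(n, i) for i in range(s + 1))
    return math.ceil(math.log2(ball))   # lower bound on n - k

print(parity_bits_needed(15, 1))   # 4: matches the (15, 11, 3) Hamming code
print(parity_bits_needed(15, 2))   # 7: correcting 2 errors in 15 bits needs >= 7 parity bits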

  24. Lower Bounds: a side note
The lower bounds assume arbitrary placement of bit errors. In practice, errors are likely to have patterns: maybe evenly spaced, or clustered.
[Illustration: evenly spaced errors vs. a clustered burst of errors]
Can we do better if we assume regular errors?
We will come back to this later when we talk about Reed-Solomon codes. This is a big reason why Reed-Solomon codes are used much more than Hamming codes.

  25. Q: If there is no structure in the code, how would one perform encoding? <board>
With a gigantic lookup table! If there is no structure in the code, encoding is highly inefficient.
A common kind of structure to add is linearity.

  26. Linear Codes
If Σ is a field, then Σ^n is a vector space.
Definition: C is a linear code if it is a linear subspace of Σ^n of dimension k.
This means that there is a set of k independent vectors v_i ∈ Σ^n (1 ≤ i ≤ k) that span the subspace, i.e. every codeword can be written as
c = a_1 v_1 + a_2 v_2 + ... + a_k v_k, where a_i ∈ Σ.
The v_i are the "basis (or spanning) vectors".

  27. Some Properties of Linear Codes
1. A linear combination of two codewords is a codeword. <board>
2. The minimum distance d equals the weight of the least-weight (non-zero) codeword. <Write proof>
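A quick sanity check of property 2 on a tiny binary linear code (the code below is my own example, closed under coordinatewise XOR):

from itertools import combinations

code = [(0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]   # a small linear code over GF(2)

min_dist   = min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(code, 2))
min_weight = min(sum(c) for c in code if any(c))
print(min_dist, min_weight)   # 2 2: minimum distance equals minimum non-zero weight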

  28. Generator and Parity Check Matrices
3. Every linear code has two matrices associated with it.
1. Generator matrix: a k x n matrix G such that
C = { xG : x ∈ Σ^k }.
G is made by stacking the spanning vectors as rows, so a length-k message x maps to a length-n codeword xG.
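A minimal sketch of encoding with a generator matrix over GF(2); the specific (6, 3) matrix is a hypothetical example in systematic [I | P] form:

import numpy as np

G = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])   # k x n generator matrix, k = 3, n = 6

def encode(x):
    # Codeword = xG, with arithmetic over GF(2)
    return (np.array(x) @ G) % 2

print(encode([1, 0, 1]))   # message 101 -> codeword 101101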
