A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers

K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran
Need for Redundant Storage in Data Centers

• Frequent unavailability events in data centers
  – unreliable components
  – software glitches, maintenance shutdowns, power failures, etc.
• Redundancy necessary for reliability and availability
Popular Approach for Redundant Storage: Replication

• Distributed file systems used in data centers store multiple copies of data on different machines
• Machines typically chosen on different racks
  – to tolerate rack failures

E.g., Hadoop Distributed File System (HDFS) stores 3 replicas by default
HDFS

[Figure: a FILE is divided into blocks (a–j); redundancy is introduced by making three copies of each block; the copies are stored distributed across the network, under an AS/Router and Top-of-Rack (TOR) switches.]
Massive Data Sizes: Need Alternative to Replication

• Small to moderately sized data: disk storage is inexpensive
  – replication viable
• No longer true for massive scales of operation
  – e.g., Facebook data warehouse cluster stores multiple tens of Petabytes (PBs)

"Erasure codes" are an alternative
Erasure Codes in Data Centers

• Facebook data warehouse cluster
  – uses Reed-Solomon (RS) codes instead of 3-replication on a portion of the data
  – savings of multiple Petabytes of storage space
Erasure Codes

                   Replication    Reed-Solomon (RS) code
  block 1:         a              a        (data block)
  block 2:         b              b        (data block)
  block 3:         a              a+b      (parity block)
  block 4:         b              a+2b     (parity block)
  Overhead:        2x             2x
  Fault tolerance: any one        any two
                   failure        failures

In general, erasure codes provide orders of magnitude higher reliability at much smaller storage overheads
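The toy code on this slide can be decoded from any two surviving blocks. A minimal sketch of why, using arithmetic over the small prime field GF(257) in place of the GF(2^8) arithmetic that real RS implementations use (all names here are illustrative, not from any actual codebase):

```python
P = 257  # small prime field; real RS codes typically work over GF(2^8)

# Block i stores c0*a + c1*b for these coefficients, matching the slide:
# a, b, a+b, a+2b. Any two rows are linearly independent, so any two
# surviving blocks determine (a, b).
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]

def encode(a, b):
    """Return the four stored blocks [a, b, a+b, a+2b] (mod P)."""
    return [(c0 * a + c1 * b) % P for c0, c1 in COEFFS]

def decode(surviving):
    """Recover (a, b) from any two (block_index, value) pairs.

    Solves the 2x2 linear system over GF(P) by Cramer's rule.
    """
    (i, v1), (j, v2) = surviving
    (a1, b1), (a2, b2) = COEFFS[i], COEFFS[j]
    det = (a1 * b2 - a2 * b1) % P
    inv = pow(det, -1, P)  # modular inverse (Python 3.8+)
    a = (v1 * b2 - v2 * b1) * inv % P
    b = (a1 * v2 - a2 * v1) * inv % P
    return a, b

blocks = encode(42, 99)
# Lose blocks 0 and 3; decode from the two that remain.
print(decode([(1, blocks[1]), (2, blocks[2])]))  # -> (42, 99)
```

Replication of (a, b) cannot do this: if both copies of a fail, no combination of the surviving b copies recovers it, which is why the table credits replication with tolerating only one arbitrary failure.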
Outline

• Erasure Codes in Data Centers
  – HDFS
• Impact on the data center network
  – Problem description
• Our system: "Hitchhiker"
• Implementation and evaluation
  – Facebook data warehouse cluster
• Literature
Erasure Codes in Data Centers: HDFS-RAID

[Figure: ten blocks a–j stored with 3-replication (overhead 3x), versus the same blocks encoded with a (10, 4) Reed-Solomon code into data blocks a–j plus parity blocks P1–P4 (overhead 1.4x).]

Borthakur, "HDFS and Erasure Codes (HDFS-RAID)"
Fan, Tantisiriroj, Xiao and Gibson, "DiskReduce: RAID for Data-Intensive Scalable Computing", PDSW '09
Erasure Codes in Data Centers: HDFS-RAID

• 3-replication (overhead 3x): cannot tolerate many 3-failures
• (10, 4) Reed-Solomon code (overhead 1.4x):
  – any 10 blocks are sufficient to recover the data
  – can tolerate any 4-failures
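A quick combinatorial check of the two claims above (illustrative back-of-the-envelope arithmetic, not figures from the deck): under 3-replication some triple failures are fatal, while (10, 4) RS survives every quadruple failure.

```python
from math import comb

# 3-replication of 10 blocks stores 30 blocks; data is lost exactly when
# all 3 copies of some block fail, i.e. 10 of the C(30, 3) triples.
total_3_failures = comb(30, 3)              # 4060 possible triples
fatal_3_failures = 10                       # one fatal triple per data block
print(fatal_3_failures / total_3_failures)  # small but nonzero

# (10, 4) RS stores 14 blocks and any 10 suffice, so every one of the
# C(14, 4) possible 4-failure patterns is recoverable.
print(comb(14, 4))  # -> 1001
```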
Impact on Data Center Network

Reconstruction operations move data from the storage layer through the network layer:
• Degraded reads
  – requesting currently unavailable data
  – on-the-fly reconstruction
• Recovery
  – periodically replace unavailable blocks
  – to ensure desired level of reliability
Impact on Data Center Network

RS codes significantly increase network usage during reconstruction
Impact on Data Center Network

• Replication: reconstructing block a requires transferring one block (the surviving copy of a)
  – network transfer & disk IO = 1x
• Reed-Solomon code: reconstructing block a requires transferring two blocks (e.g., b and a+b)
  – network transfer & disk IO = 2x

Network transfer & disk IO = (#data blocks) × (size of data to be reconstructed)
In (10, 4) RS, it is 10x
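The cost rule above can be turned into a one-line estimate. A sketch, with hypothetical function names and data sizes (only the multipliers 1x and 10x come from the slide):

```python
def reconstruction_traffic_gb(lost_gb, scheme):
    """Network transfer (and disk IO) needed to rebuild `lost_gb` of data,
    per the rule: traffic = (#data blocks read) x (size of lost data)."""
    if scheme == "replication":
        return 1 * lost_gb   # read the one surviving copy
    if scheme == "rs_10_4":
        return 10 * lost_gb  # must read k = 10 blocks to decode one
    raise ValueError(f"unknown scheme: {scheme}")

lost = 256  # GB of unavailable data (hypothetical)
print(reconstruction_traffic_gb(lost, "replication"))  # -> 256
print(reconstruction_traffic_gb(lost, "rs_10_4"))      # -> 2560
```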
Impact on Data Center Network

[Figure: to reconstruct block a, blocks such as b, a+b, and a+2b are fetched from machines 2–4 on other racks and combined, crossing the Top-of-Rack (TOR) switches and the router.]

Burdens the already oversubscribed Top-of-Rack and higher-level switches
Impact on Data Center Network: Facebook Data Warehouse Cluster

• Multiple PB of Reed-Solomon encoded data
• Median of 180 TB transferred across racks per day for RS reconstruction ≈ 5 times that under 3-replication

Rashmi et al., "A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster", USENIX HotStorage Workshop 2013
RS Codes: The Good and The Bad

• Maximum possible fault tolerance for given storage overhead
  – storage-capacity optimal
  – ("maximum distance separable" in coding-theory parlance)
• Flexibility in choice of parameters
  – supports any number of data and parity blocks
• Not designed to handle reconstruction operations efficiently
  – negative impact on the network
Goal

• Maintain the good:
  – maximum possible fault tolerance for given storage overhead (storage-capacity optimal; "maximum distance separable")
  – flexibility in choice of parameters (any number of data and parity blocks)
• Improve the bad:
  – efficiency of reconstruction operations (reduce the negative impact on the network)