Towards In-network Acceleration of Erasure Coding
Yi Qiao, Xiao Kong, Menghao Zhang, Yu Zhou, Mingwei Xu, Jun Bi
Tsinghua University
Erasure Coding (EC)
• In data centers, machine failures happen very frequently. Facebook reports up to 50 machine failures per day in its data warehouses.
• EC provides fault tolerance with much lower storage overhead (~1.4x) than replication (3x), at a similar degree of availability.
• EC reconstructs missing data from the remaining data and pre-calculated parities.
• For example:
  • XOR (RAID 5)
  • Reed-Solomon codes
EC Examples
XOR (RAID 5)
• Data symbols $a$, $b$, $c$ with parity $p = a \oplus b \oplus c$
• Reconstruct $b$ with $b = a \oplus c \oplus p$
Reed-Solomon Code (Conceptual)
• Data symbols $a$, $b$, $c$ with parities $p_1 = a + b + c$ and $p_2 = a + 2b + 2c$ (linear combinations)
• Reconstruct $a$ with $a = 2p_1 - p_2$
• Reconstruct $c$ with $c = p_2 - p_1 - b$
• These are Galois Field arithmetic operations. For simplicity, read them as integer arithmetic.
Conclusion: EC reconstruction can be modelled with
  $m = \sum_{i=1}^{k} a_i x_i$
• $m$: reconstructed symbol
• $x_i$: symbols from the remaining machines
• $a_i$: pre-computed coefficients
• Addition refers to XOR
• Multiplication is on the Galois Field
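The two examples on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the byte values and the helper name `xor_blocks` are made up for the example, and the Reed-Solomon part uses plain integers, as the slide suggests, rather than real Galois Field arithmetic.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda u, v: u ^ v, column) for column in zip(*blocks))


# XOR (RAID 5): three data blocks and one parity block.
a, b, c = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
p = xor_blocks(a, b, c)

# Lose b, then rebuild it from the surviving blocks and the parity.
rebuilt_b = xor_blocks(a, c, p)

# Conceptual Reed-Solomon with integer arithmetic (not real GF arithmetic).
ia, ib, ic = 3, 5, 7
p1 = ia + ib + ic
p2 = ia + 2 * ib + 2 * ic
rebuilt_a = 2 * p1 - p2        # recovers ia from the two parities alone
rebuilt_c = p2 - p1 - ib       # recovers ic from the parities and ib
```

In both cases the reconstructed symbol is a fixed linear combination of the surviving symbols, which is the form the conclusion on this slide generalizes.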
EC Problems
• Low reconstruction rate
  • Several hours to reconstruct a disk
  • Several seconds for degraded reads
• EC is mostly used for storing “cold” data in data warehouses.
• Why so slow?
Motivation
[Figure: two topologies showing A, B_1, B_2, B_3, the ToR switch, NIC, CPU, and disk. Line width represents throughput.]
• Multiplexed forwarding: disk reconstruction rate = 1/3 of the available NIC capacity.
• NetEC: no NIC sharing/multiplexing; near 100% of the available NIC capacity.
NetEC
• We present NetEC, which offloads EC reconstruction to programmable switches.
• It improves the reconstruction rate by a factor of k, where k is the number of machines to download from.
• It also entirely removes CPU usage.
Brief Overview of NetEC Data Plane
[Figure: on-switch decoding. Packets P1, P2, P3 from B_1, B_2, B_3 carry symbols ($x_1, y_1, \dots$), ($x_2, y_2, \dots$), ($x_3, y_3, \dots$), which are extracted into PHVs. Each symbol goes through GF multiplication ($a_i x_i$, $a_i y_i$, ...) and is accumulated in the partial XOR sum buffer, which is held in stateful registers. The progress tracker advances 000 → 100 → 110 → 111 as P1, P2, P3 arrive. P1 and P2 are dropped after accumulation; when P3 arrives, the complete sums $a_1 x_1 + a_2 x_2 + a_3 x_3$, $a_1 y_1 + a_2 y_2 + a_3 y_3$, ... are sent to A. Steps are numbered ① to ⑧.]
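The aggregation behaviour in the figure can be mimicked in software. The sketch below is a toy model, not the switch program: the class name `SwitchModel`, the per-sequence-number slot layout, the duplicate handling, and the GF(2^8) reduction polynomial 0x11d are all assumptions made for illustration.

```python
def gf_mul(x: int, y: int, poly: int = 0x11D) -> int:
    """Bitwise GF(2^8) multiplication; the polynomial is an assumption."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return result


class SwitchModel:
    """Toy model of the progress tracker and partial XOR sum buffer."""

    def __init__(self, coeffs):
        self.coeffs = coeffs                   # pre-computed a_i, one per source
        self.full = (1 << len(coeffs)) - 1     # bitmap value when all have arrived
        self.slots = {}                        # seq -> (bitmap, partial sum)

    def on_packet(self, src: int, seq: int, payload: bytes):
        """Return the finished sum on the last arrival, else None (drop)."""
        bitmap, partial = self.slots.get(seq, (0, bytes(len(payload))))
        if bitmap & (1 << src):
            return None                        # duplicate: ignore in this sketch
        scaled = bytes(gf_mul(self.coeffs[src], s) for s in payload)
        partial = bytes(u ^ v for u, v in zip(partial, scaled))
        bitmap |= 1 << src
        if bitmap == self.full:
            self.slots.pop(seq, None)          # free the slot for reuse
            return partial                     # forwarded to the requester
        self.slots[seq] = (bitmap, partial)
        return None                            # packet is dropped


switch = SwitchModel(coeffs=[1, 2, 3])
payloads = [b"\x05\x06", b"\x07\x08", b"\x09\x0a"]
outputs = [switch.on_packet(i, 0, data) for i, data in enumerate(payloads)]
```

Only the last of the three arrivals produces an output, so the requester receives one packet per group of k incoming packets.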
Challenges and Design
• Galois Field Multiplication Offloading
• Rate Synchronization
• Deep Payload Inspection/assembly
Challenges and Design (1)
• Galois Field Multiplication Offloading
  • We convert it to addition, logarithms, and exponents.
  • To calculate $a_1 x_1$:
    • Look up $\log(x_1)$ in the logarithm table.
    • Add the pre-known $\log(a_1)$: $\log(a_1 x_1) = \log(a_1) + \log(x_1)$
    • Look up $a_1 x_1$ in the exponent table: $a_1 x_1 = e^{\log(a_1 x_1)}$
  • Note that the logarithms and exponents are also on the Galois Field, where this method is valid.
• Rate Synchronization
• Deep Payload Inspection/assembly
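The table-lookup multiplication described above can be sketched as follows. The field GF(2^8), the primitive polynomial 0x11d, and the generator 2 are assumptions chosen for the example (the slides do not name them), and `gf_mul_table` is a hypothetical name. Zero has no logarithm, so it is handled as a special case.

```python
PRIM_POLY = 0x11D            # assumed primitive polynomial for GF(2^8)

EXP = [0] * 510              # exponent table, doubled so index sums need no modulo
LOG = [0] * 256              # logarithm table; LOG[0] is unused

value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1              # multiply by the generator 2
    if value & 0x100:
        value ^= PRIM_POLY   # reduce back into 8 bits
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul_table(a: int, x: int) -> int:
    """Multiply in GF(2^8) with two log lookups, one add, one exp lookup."""
    if a == 0 or x == 0:
        return 0             # log(0) is undefined, so short-circuit
    return EXP[LOG[a] + LOG[x]]
```

Because the coefficient is known in advance, its logarithm can be stored once, leaving one logarithm lookup, one addition, and one exponent lookup per symbol.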
Challenges and Design (2)
• Computation Offloading
• Rate Synchronization
  • The switch has to temporarily buffer partial XOR sums from the time the first packet arrives until the last packet leaves.
  • One-to-many TCP
  • The switch only needs to buffer partial XOR sums for in-flight packets, whose total size is bounded by the BDP (bandwidth-delay product).
    • SSD peak write speed: 1 GB/s
    • DC RTT: 250 µs
    • BDP = 1 GB/s × 250 µs = 250 KB
• Deep Payload Inspection/assembly
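The buffer bound on this slide is simple arithmetic, reproduced here as a small helper. The function name `bdp_bytes` is hypothetical, and decimal units are assumed (1 GB = 10^9 bytes, 1 KB = 10^3 bytes).

```python
def bdp_bytes(rate_bytes_per_s: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    return rate_bytes_per_s * rtt_us // 1_000_000


# Figures from the slide: 1 GB/s SSD peak write speed, 250 µs data center RTT.
buffer_bound = bdp_bytes(1_000_000_000, 250)   # bytes of partial sums in flight
```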
Challenges and Design (3)
• Computation Offloading
• Rate Synchronization
• Deep Payload Inspection/assembly
  • Many switch constraints limit the number of bytes that can be processed per packet, while small packets reduce throughput.
  • Use recirculation, inspired by PPS (SOSR 19).
  • Redesign L4 checksum updates.
Discussions and Limitations
• Will NetEC cause incast?
  • NetEC actually prevents incast.
  • Most incoming packets are dropped in the ingress pipeline.
  • Outbound PPS ≈ Inbound PPS
• Is NetEC scalable?
  • The number of machines to download from: 3, 6, 10
  • The number of concurrent tasks
• Problem:
  • Currently, a table or register can be accessed only once per packet, so we need multiple logarithm/exponent tables.
  • Limited number of registers per stage.
Implementation and Evaluation
• We implement a prototype of NetEC on commodity switches and integrate it with HDFS-EC.
Conclusion
• EC's low reconstruction rate is due to multiplexed NIC capacity.
• In-network computation resolves this problem, leading to a large performance improvement.
• We design and implement NetEC, addressing three challenges, and conduct preliminary evaluations to show its effectiveness.
Thank you!