Octopus: an RDMA-enabled Distributed Persistent Memory File System
Youyou Lu 1, Jiwu Shu 1, Youmin Chen 1, Tao Li 2
1 Tsinghua University   2 University of Florida
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
NVMM & RDMA
• NVMM (PCM, ReRAM, etc.)
  • Data persistency
  • Byte-addressability
  • Low latency
• RDMA
  • Remote direct access
  • Bypasses the remote kernel (see the registration sketch below)
  • Low latency and high throughput
[Figure: a client and a server exchange data directly between registered memory regions through their HCAs, bypassing the remote CPU]
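The "bypass the remote kernel" property comes from memory registration: once a region is registered with the RDMA device, the HCA can serve remote reads and writes to it without involving the host's CPU or kernel. A minimal libibverbs sketch (generic RDMA usage, not Octopus code; error handling omitted):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Register a buffer so remote peers can access it with one-sided verbs.
     * The returned mr->rkey is handed to peers; mr->lkey is used locally. */
    struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len)
    {
        int access = IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;
        return ibv_reg_mr(pd, buf, len, access);
    }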
Modular-Designed Distributed File System
Latency (1 KB write + sync)
• DiskGluster (disk for data storage, GigE for communication)
  • Overall latency: 18 ms
  • The HDD accounts for 98%; network and software account for only 2%
• MemGluster (memory for data storage, RDMA for communication)
  • Overall latency: 324 us
  • Software accounts for 99.7%; memory and RDMA account for the rest
Modular-Designed Distributed File System
Bandwidth (1 MB write)
• DiskGluster (disk for data storage, GigE for communication)
  • HDD: 88 MB/s, network: 118 MB/s, file system: 83 MB/s
  • The file system delivers 94% of the raw storage bandwidth
• MemGluster (memory for data storage, RDMA for communication)
  • Memory: 6509 MB/s, RDMA: 6350 MB/s, file system: 1779 MB/s
  • The file system delivers only 27% of the raw storage bandwidth
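As a back-of-the-envelope check (assuming the utilization percentages are computed against the raw storage-device bandwidth), the two figures follow directly from the measurements above:

    DiskGluster:   83 MB/s / 88 MB/s   ≈ 94%   (the HDD is the bottleneck)
    MemGluster:  1779 MB/s / 6509 MB/s ≈ 27%   (software is the bottleneck)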
RDMA-enabled Distributed File System
• More than fast hardware
  • It is suboptimal to simply replace the network/storage modules
• Opportunities and challenges
  • NVM
    • Byte-addressability
    • Significant overhead of data copies
  • RDMA
    • Flexible programming verbs (message/memory semantics)
    • Imbalanced CPU processing capacity vs. network I/Os
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
RDMA-enabled Distributed File System

    Opportunity                                       Approach
    Byte-addressability of NVM                        Shared data management
    One-sided RDMA verbs                              New data flow strategies
    Flexible RDMA verbs (CPU is the new bottleneck)   Efficient RPC primitive

• It is necessary to rethink the design of DFSs over NVM and RDMA
Octopus Architecture
• Clients (e.g., read("/home/lyy") on Client A, create("/home/cym") on Client B) reach the servers through RDMA-based data I/O and self-identified RPC
• The servers' NVMM is organized into a shared persistent memory pool
• Octopus performs remote direct data access just like an octopus uses its eight legs
[Figure: clients A and B connect over RDMA to server nodes N1 ... N3; each node contributes its NVMM to the shared persistent memory pool through its HCA]
1. Shared Persistent Memory Pool
• Existing DFSs: redundant data copies
  • GlusterFS: 7 copies per request
[Figure: the data path crosses the client's user-space buffer and mbuf, the NICs on both sides, and the server's mbuf, user-space buffer, page cache, and FS image]
1. Shared Persistent Memory Pool
• Existing DFSs: redundant data copies
  • GlusterFS + DAX: 6 copies per request
[Figure: the same data path with the server's page-cache copy removed]
1. Shared Persistent Memory Pool
• Octopus with SPMP
  • Introduces the shared persistent memory pool
  • Global view of the data layout
  • Only 4 copies per request (a sketch of the idea follows)
[Figure: the data path is reduced to the client's user-space buffer, the NICs, and the server's message pool and FS image]
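A hedged sketch of how a shared persistent memory pool could be exposed to the NIC (not the Octopus implementation): map an NVMM region, here assumed to be a DAX-backed file, and register the mapping itself with the RDMA device, so that remote I/O lands directly in the file-system image without mbuf or page-cache copies. Error handling is omitted.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <infiniband/verbs.h>

    /* Map an NVMM region and register it with the HCA. The returned rkey
     * (mr->rkey) is what clients later use for remote direct access. */
    void *map_spmp(struct ibv_pd *pd, const char *dax_path, size_t len,
                   struct ibv_mr **mr_out)
    {
        int fd = open(dax_path, O_RDWR);
        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping stays valid */
        *mr_out = ibv_reg_mr(pd, base, len,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_READ |
                             IBV_ACCESS_REMOTE_WRITE);
        return base;
    }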
2. Client-Active Data I/O
• Server-Active I/O
  • Server threads look up the file data and send it back to the client
  • Works well for slow Ethernet
  • Server CPUs easily become the bottleneck with fast hardware
• Client-Active I/O
  • The server only looks up the file data and returns its address
  • Clients then read/write the data directly from/to the SPMP (see the sketch below)
[Figure: timelines of the server's NIC, CPU, and memory serving clients C1 and C2 under server-active and client-active I/O]
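A hedged sketch of the client side of a client-active read (illustrative names, not the Octopus API): the client first obtains the extent's remote address and rkey through a metadata RPC, then pulls the data itself with a one-sided RDMA READ, so the server CPU never touches the payload.

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Pull `len` bytes from the server's SPMP at (remote_addr, rkey)
     * into a locally registered buffer; completion is polled elsewhere. */
    int client_active_read(struct ibv_qp *qp,
                           void *local_buf, uint32_t lkey,      /* registered locally */
                           uint64_t remote_addr, uint32_t rkey, /* returned by the RPC */
                           uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }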
3. Self-Identified Metadata RPC
• Message-based RPC
  • Easy to implement, but lower throughput
  • Examples: DaRPC [SoCC'14], FaSST [OSDI'16]
• Memory-based RPC
  • CPU cores scan the message buffers
  • Example: FaRM [NSDI'14]
• Self-identified RPC: use rdma_write_with_imm
  • The server scans by polling the HCA's completion queue
  • The immediate data carries the sender's identity (see the sketch below)
[Figure: per-thread message pools behind the HCA; the immediate data identifies which slot holds the incoming request]
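A hedged sketch of the receive path of such a self-identified RPC (not Octopus's exact code): the client writes its request into its own slot of the server's message pool with RDMA_WRITE_WITH_IMM and puts its sender id in the immediate field; the server polls the completion queue instead of scanning the whole pool, and the immediate tells it which slot to read. The slot layout and the handle_request() dispatch are assumptions.

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>

    /* Poll the CQ for incoming RDMA_WRITE_WITH_IMM completions and
     * dispatch each request from the slot named by the immediate data. */
    void poll_rpc_requests(struct ibv_cq *cq, char *msg_pool, size_t slot_size)
    {
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status == IBV_WC_SUCCESS &&
                wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
                uint32_t sender = ntohl(wc.imm_data);   /* self-identification */
                char *req = msg_pool + (size_t)sender * slot_size;
                /* handle_request(sender, req);  -- hypothetical dispatch */
                (void)req;
            }
            /* a fresh receive WR must be re-posted for the next message */
        }
    }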
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
Evaluation Setup
• Evaluation platform

    Cluster   CPU           Memory   ConnectX-3 FDR   Number
    A         E5-2680 * 2   384 GB   Yes              5
    B         E5-2620       16 GB    Yes              7

• Nodes are connected with a Mellanox SX1012 switch
• Evaluated distributed file systems
  • memGluster: runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM]: optimized to run on RDMA
  • memHDFS, Alluxio: for big data comparison
Overall Efficiency
[Figure: latency breakdown for getattr and readdir, and bandwidth for write and read, each split into software, memory, and network components]
• Software latency is reduced from 326 us to 6 us
• Octopus achieves read/write bandwidth that approaches the raw storage and network bandwidth
Metadata Operation Performance
[Figure: MKNOD, GETATTR, and RMNOD throughput from 1 to 5 nodes, comparing glusterfs, nvfs, crail, crail-poll, and dmfs]
• Octopus provides metadata IOPS on the order of 10^5 to 10^6
• Octopus scales linearly with the number of nodes
Big Data Evaluation
[Figure: TestDFSIO write/read bandwidth (MB/s) and normalized execution time for Teragen and Wordcount, comparing memHDFS, Alluxio, NVFS, Crail, and Octopus]
• Octopus also provides better performance for big data applications than existing file systems
Conclusion
• It is necessary to rethink DFS designs over emerging hardware
• Octopus's internal mechanisms
  • Simplify the data management layer by reducing data copies
  • Rebalance network and server loads with client-active I/O
  • Redesign the metadata RPC and the distributed transaction with RDMA primitives
• Evaluations show that Octopus significantly outperforms existing file systems
Q&A
Thanks