Octopus: an RDMA-enabled Distributed Persistent Memory File System
Youyou Lu 1, Jiwu Shu 1, Youmin Chen 1, Tao Li 2
1 Tsinghua University   2 University of Florida
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
NVMM & RDMA
• NVMM (PCM, ReRAM, etc.)
  • Data persistency
  • Byte-addressability
  • Low latency
• RDMA
  • Remote direct access
  • Bypasses the remote kernel (see the registration sketch below)
  • Low latency and high throughput
[Figure: a client and a server exchange data directly between registered memory regions through their HCAs, bypassing the remote CPU]
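The "bypass the remote kernel" property comes from memory registration: once a region is registered with the RDMA device, the HCA can serve remote reads and writes to it without involving the host's CPU or kernel. A minimal libibverbs sketch (generic RDMA usage, not Octopus code; error handling omitted):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Register a buffer so remote peers can access it with one-sided verbs.
     * The returned mr->rkey is handed to peers; mr->lkey is used locally. */
    struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len)
    {
        int access = IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;
        return ibv_reg_mr(pd, buf, len, access);
    }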
Modular-Designed Distributed File System
Latency (1 KB write + sync)
• DiskGluster (disk for data storage, GigE for communication)
  • Overall latency: 18 ms
  • The HDD accounts for 98%; network and software account for only 2%
• MemGluster (memory for data storage, RDMA for communication)
  • Overall latency: 324 us
  • Software accounts for 99.7%; memory and RDMA account for the rest
Modular-Designed Distributed File System
Bandwidth (1 MB write)
• DiskGluster (disk for data storage, GigE for communication)
  • HDD: 88 MB/s, network: 118 MB/s, file system: 83 MB/s
  • The file system delivers 94% of the raw storage bandwidth
• MemGluster (memory for data storage, RDMA for communication)
  • Memory: 6509 MB/s, RDMA: 6350 MB/s, file system: 1779 MB/s
  • The file system delivers only 27% of the raw storage bandwidth
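As a back-of-the-envelope check (assuming the utilization percentages are computed against the raw storage-device bandwidth), the two figures follow directly from the measurements above:

    DiskGluster:   83 MB/s / 88 MB/s   ≈ 94%   (the HDD is the bottleneck)
    MemGluster:  1779 MB/s / 6509 MB/s ≈ 27%   (software is the bottleneck)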
RDMA-enabled Distributed File System
• More than fast hardware
  • It is suboptimal to simply replace the network/storage modules
• Opportunities and challenges
  • NVM
    • Byte-addressability
    • Significant overhead of data copies
  • RDMA
    • Flexible programming verbs (message/memory semantics)
    • Imbalanced CPU processing capacity vs. network I/Os
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
RDMA-enabled Distributed File System

    Opportunity                                       Approach
    Byte-addressability of NVM                        Shared data management
    One-sided RDMA verbs                              New data flow strategies
    Flexible RDMA verbs (CPU is the new bottleneck)   Efficient RPC primitive

• It is necessary to rethink the design of DFSs over NVM and RDMA
Octopus Architecture
• Clients (e.g., read("/home/lyy") on Client A, create("/home/cym") on Client B) reach the servers through RDMA-based data I/O and self-identified RPC
• The servers' NVMM is organized into a shared persistent memory pool
• Octopus performs remote direct data access just like an octopus uses its eight legs
[Figure: clients A and B connect over RDMA to server nodes N1 ... N3; each node contributes its NVMM to the shared persistent memory pool through its HCA]
1. Shared Persistent Memory Pool
• Existing DFSs: redundant data copies
  • GlusterFS: 7 copies per request
[Figure: the data path crosses the client's user-space buffer and mbuf, the NICs on both sides, and the server's mbuf, user-space buffer, page cache, and FS image]
1. Shared Persistent Memory Pool
• Existing DFSs: redundant data copies
  • GlusterFS + DAX: 6 copies per request
[Figure: the same data path with the server's page-cache copy removed]
1. Shared Persistent Memory Pool
• Octopus with SPMP
  • Introduces the shared persistent memory pool
  • Global view of the data layout
  • Only 4 copies per request (a sketch of the idea follows)
[Figure: the data path is reduced to the client's user-space buffer, the NICs, and the server's message pool and FS image]
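A hedged sketch of how a shared persistent memory pool could be exposed to the NIC (not the Octopus implementation): map an NVMM region, here assumed to be a DAX-backed file, and register the mapping itself with the RDMA device, so that remote I/O lands directly in the file-system image without mbuf or page-cache copies. Error handling is omitted.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <infiniband/verbs.h>

    /* Map an NVMM region and register it with the HCA. The returned rkey
     * (mr->rkey) is what clients later use for remote direct access. */
    void *map_spmp(struct ibv_pd *pd, const char *dax_path, size_t len,
                   struct ibv_mr **mr_out)
    {
        int fd = open(dax_path, O_RDWR);
        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping stays valid */
        *mr_out = ibv_reg_mr(pd, base, len,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_READ |
                             IBV_ACCESS_REMOTE_WRITE);
        return base;
    }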
2. Client-Active Data I/O
• Server-Active I/O
  • Server threads look up the file data and send it back to the client
  • Works well for slow Ethernet
  • Server CPUs easily become the bottleneck with fast hardware
• Client-Active I/O
  • The server only looks up the file data and returns its address
  • Clients then read/write the data directly from/to the SPMP (see the sketch below)
[Figure: timelines of the server's NIC, CPU, and memory serving clients C1 and C2 under server-active and client-active I/O]
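A hedged sketch of the client side of a client-active read (illustrative names, not the Octopus API): the client first obtains the extent's remote address and rkey through a metadata RPC, then pulls the data itself with a one-sided RDMA READ, so the server CPU never touches the payload.

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Pull `len` bytes from the server's SPMP at (remote_addr, rkey)
     * into a locally registered buffer; completion is polled elsewhere. */
    int client_active_read(struct ibv_qp *qp,
                           void *local_buf, uint32_t lkey,      /* registered locally */
                           uint64_t remote_addr, uint32_t rkey, /* returned by the RPC */
                           uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }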
3. Self-Identified Metadata RPC
• Message-based RPC
  • Easy to implement, but lower throughput
  • Examples: DaRPC [SoCC'14], FaSST [OSDI'16]
• Memory-based RPC
  • CPU cores scan the message buffers
  • Example: FaRM [NSDI'14]
• Self-identified RPC: use rdma_write_with_imm
  • The server scans by polling the HCA's completion queue
  • The immediate data carries the sender's identity (see the sketch below)
[Figure: per-thread message pools behind the HCA; the immediate data identifies which slot holds the incoming request]
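A hedged sketch of the receive path of such a self-identified RPC (not Octopus's exact code): the client writes its request into its own slot of the server's message pool with RDMA_WRITE_WITH_IMM and puts its sender id in the immediate field; the server polls the completion queue instead of scanning the whole pool, and the immediate tells it which slot to read. The slot layout and the handle_request() dispatch are assumptions.

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>

    /* Poll the CQ for incoming RDMA_WRITE_WITH_IMM completions and
     * dispatch each request from the slot named by the immediate data. */
    void poll_rpc_requests(struct ibv_cq *cq, char *msg_pool, size_t slot_size)
    {
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status == IBV_WC_SUCCESS &&
                wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
                uint32_t sender = ntohl(wc.imm_data);   /* self-identification */
                char *req = msg_pool + (size_t)sender * slot_size;
                /* handle_request(sender, req);  -- hypothetical dispatch */
                (void)req;
            }
            /* a fresh receive WR must be re-posted for the next message */
        }
    }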
Outline
• Background and Motivation
• Octopus Design
• Evaluation
• Conclusion
Evaluation Setup
• Evaluation platform

    Cluster   CPU           Memory   ConnectX-3 FDR   Number
    A         E5-2680 * 2   384 GB   Yes              5
    B         E5-2620       16 GB    Yes              7

• Nodes are connected with a Mellanox SX1012 switch
• Evaluated distributed file systems
  • memGluster: runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM]: optimized to run on RDMA
  • memHDFS, Alluxio: for big data comparison
Overall Efficiency
[Figure: latency breakdown for getattr and readdir, and bandwidth for write and read, each split into software, memory, and network components]
• Software latency is reduced from 326 us to 6 us
• Octopus achieves read/write bandwidth that approaches the raw storage and network bandwidth
Metadata Operation Performance
[Figure: MKNOD, GETATTR, and RMNOD throughput from 1 to 5 nodes, comparing glusterfs, nvfs, crail, crail-poll, and dmfs]
• Octopus provides metadata IOPS on the order of 10^5 to 10^6
• Octopus scales linearly with the number of nodes
Big Data Evaluation
[Figure: TestDFSIO write/read bandwidth (MB/s) and normalized execution time for Teragen and Wordcount, comparing memHDFS, Alluxio, NVFS, Crail, and Octopus]
• Octopus also provides better performance for big data applications than existing file systems
Conclusion
• It is necessary to rethink DFS designs over emerging hardware
• Octopus's internal mechanisms
  • Simplify the data management layer by reducing data copies
  • Rebalance network and server loads with client-active I/O
  • Redesign the metadata RPC and the distributed transaction with RDMA primitives
• Evaluations show that Octopus significantly outperforms existing file systems
Q&A
Thanks