

  1. Jae Woo Choi, Dong In Shin, Young Jin Yu, Hyunsang Eom, Heon Young Yeom
     Seoul National Univ. / TAEJIN INFOTECH

  2. SAN with a high-speed network + fast storage on the host (initiator) should equal a fast SAN environment, but the existing solution shows performance degradation: about a 65% reduction.
     [Diagram: host computer (initiator) connected over a high-speed network (InfiniBand) to a storage server (target) exporting high-speed virtual storage; the fast storage replaces the HDD that was the bottleneck in a conventional SAN.]

  3.  Found performance degradation in the existing SAN solution with a fast storage device
      Proposed three optimizations for a fast SAN solution
     ▪ Mitigate software overheads in the SAN I/O path
     ▪ Increase parallelism on the target side
     ▪ Temporal merge for RDMA data transfer
      Implemented the new SAN solution as a prototype

  4.  DRAM-SSD (provided by TAEJIN Infotech)
      7 usec to read or write a 4 KB page
      Peak device throughput: 700 MB/s
      DDR2, 64 GB, PCI-Express card

  5.  FIO micro-benchmark, 16 threads
     [Chart: local DRAM-SSD throughput (MB/s) across wr/r_wr/rd/r_rd for 4 KB and 1 MB requests, buffered and direct I/O; annotated "uniform throughput"]

  6.  Generic SCSI Target Subsystem for Linux (SCST)
      Open-source framework for implementing SAN environments
      Supports Ethernet, FC, InfiniBand, and so on
      Uses SRP (SCSI RDMA Protocol) for InfiniBand

  7. SPEC

                       TARGET                          INITIATOR
      CPU              Intel Xeon E5630 (8 cores)      Intel Xeon E5630 (8 cores)
      Memory           16 GB                           8 GB
      InfiniBand card  MHQH19B-XTC, 1 port (40 Gb/s)   MHQH19B-XTC, 1 port (40 Gb/s)

      - Device: DRAM-SSD (64 GB)
      - Workload size: 16 threads x 3 GB (48 GB)
      - Request size: 4 KB / 1 MB
      - I/O type: buffered/direct, sequential/random, read/write
      - Benchmark tool: FIO micro-benchmark
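     For reference, a minimal fio job file approximating one of these workloads might look like the following. This is a sketch: the ioengine and the device path are assumptions, not taken from the slides.

        [global]
        ioengine=libaio
        direct=1
        thread
        numjobs=16
        size=3g
        # hypothetical device node for the DRAM-SSD
        filename=/dev/dramssd0

        # 4 KB random write; rerun with bs=1m, rw=write/read/randread,
        # and direct=0 for the buffered configurations
        [randwrite-4k]
        rw=randwrite
        bs=4k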

  8.  I/O scheduler policy: CFQ -> NOOP
     [Charts: throughput (MB/s) for SRP (CFQ), SRP (NOOP), and Local across wr/r_wr/rd/r_rd, buffered and direct I/O, at 4 KB (small size) and 1 MB (large size). Annotations mark request merge and read-ahead effects, reasonable throughput for large requests, and a remaining gap to Local for small requests.]

  9. [Diagram: the block I/O path. The elevator used for request merging and the plug-unplug mechanism cause delays; the path is too long for a fast device.]

  10.  Remove software overheads in the I/O path
       Bypass the SCSI layer
       Discard the existing I/O scheduler (a sketch of the replacement queue follows below)
      ▪ Remove elevator merging and plug-unplug
      ▪ Maintain a wait queue based on the bio structure
      ▪ A very simple & fast I/O scheduler
       BRP (Block RDMA Protocol)
      ▪ Commands are also based on the bio structure, not on SCSI commands
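     To make the wait-queue idea concrete, here is a minimal user-space C sketch of a FIFO bio-style queue with no sorting, merging, or plug-unplug. The struct and its fields are hypothetical stand-ins for the kernel's bio; this is an illustration of the idea, not the paper's implementation.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical stand-in for the kernel's struct bio. */
        struct bio_req {
            long            sector;   /* starting sector */
            int             nbytes;   /* request size    */
            struct bio_req *next;
        };

        /* FIFO wait queue: no sorting, no merging, no plug-unplug delay. */
        static struct bio_req *head, *tail;
        static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;

        static void submit(struct bio_req *b)   /* enqueue in arrival order */
        {
            b->next = NULL;
            pthread_mutex_lock(&lock);
            if (tail) tail->next = b; else head = b;
            tail = b;
            pthread_mutex_unlock(&lock);
        }

        static struct bio_req *dispatch(void)   /* dequeue for device I/O */
        {
            pthread_mutex_lock(&lock);
            struct bio_req *b = head;
            if (b && !(head = b->next)) tail = NULL;
            pthread_mutex_unlock(&lock);
            return b;
        }

        int main(void)
        {
            for (long s = 0; s < 4; s++) {      /* enqueue four 4 KB requests */
                struct bio_req *b = malloc(sizeof *b);
                b->sector = s * 8;
                b->nbytes = 4096;
                submit(b);
            }
            for (struct bio_req *b; (b = dispatch()); free(b))
                printf("dispatch sector=%ld bytes=%d\n", b->sector, b->nbytes);
            return 0;
        }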

  11. [Diagram: the target's event handler. It analyzes incoming events and executes the proper operations for each I/O request: jobs for executing RDMA data transfers, jobs for sending responses to the initiator, jobs for terminating I/O requests, and jobs for device I/O. All of these operations are independent of each other and can be processed in parallel.]

  12. [Diagram: the same event handler backed by a thread pool. Instead of the serial execution of the baseline, the jobs (RDMA data transfer, sending responses to the initiator, terminating I/O requests, device I/O) are dispatched to worker threads.]

  13.  Increase parallelism on the target side
       All procedures for I/O requests are processed in a thread pool (see the sketch below)
      ▪ Induces multiple concurrent device I/Os
       Exploits the high bandwidth of the fast storage device
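     A minimal pthread-based sketch of this dispatch model. The job kinds mirror the event-handler diagram, but the names, queue size, and shutdown scheme are hypothetical choices for the illustration.

        #include <pthread.h>
        #include <stdio.h>

        #define NWORKERS 4
        #define QCAP     16

        /* Hypothetical job kinds from the event-handler diagram. */
        enum job_kind { RDMA_XFER, SEND_RESPONSE, DEVICE_IO, TERMINATE };
        static const char *kind_name[] =
            { "rdma-transfer", "send-response", "device-io", "terminate" };

        static enum job_kind queue[QCAP];
        static int head, tail;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

        static void enqueue(enum job_kind k)
        {
            pthread_mutex_lock(&lock);
            queue[tail++ % QCAP] = k;   /* sketch: assumes no overflow */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }

        static void *worker(void *arg)
        {
            long id = (long)arg;
            for (;;) {
                pthread_mutex_lock(&lock);
                while (head == tail)
                    pthread_cond_wait(&cond, &lock);
                enum job_kind k = queue[head++ % QCAP];
                pthread_mutex_unlock(&lock);
                if (k == TERMINATE)
                    return NULL;
                printf("worker %ld: %s\n", id, kind_name[k]);
            }
        }

        int main(void)
        {
            pthread_t tid[NWORKERS];
            for (long i = 0; i < NWORKERS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
            /* Event handler: analyze events, push independent jobs. */
            enqueue(RDMA_XFER);
            enqueue(DEVICE_IO);
            enqueue(SEND_RESPONSE);
            enqueue(DEVICE_IO);
            for (int i = 0; i < NWORKERS; i++) enqueue(TERMINATE);
            for (int i = 0; i < NWORKERS; i++) pthread_join(tid[i], NULL);
            return 0;
        }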

  14. [Diagram: initiator-target exchange, baseline vs. temporal merge. Baseline: each command goes through pre-processing, an RDMA data transfer, and post-processing before its completion is returned. With temporal merge: several commands are combined into one jumbo command, so a single pre-processing, RDMA transfer, and post-processing round serves them all.]

  15.  RDMA data transfer with temporal merge (a sketch follows below)
       Merge small-sized data regardless of spatial contiguity
       Enabled only in I/O-intensive situations
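     A user-space C sketch of the temporal-merge idea: small pending payloads are packed back to back into one jumbo buffer and shipped with a single (here simulated) RDMA transfer once the pending count signals an I/O-intensive phase. All names and the threshold are hypothetical.

        #include <stdio.h>
        #include <string.h>

        #define MAX_PENDING     8
        #define MERGE_THRESHOLD 4      /* hypothetical "I/O-intensive" trigger */
        #define REQ_SIZE        4096

        struct small_req { char data[REQ_SIZE]; };

        static struct small_req pending[MAX_PENDING];
        static int npending;

        /* Stand-in for one RDMA transfer of `len` bytes. */
        static void rdma_send(const void *buf, int len)
        {
            (void)buf;
            printf("RDMA transfer: %d bytes in one operation\n", len);
        }

        static void flush_pending(void)
        {
            static char jumbo[MAX_PENDING * REQ_SIZE];
            /* Pack payloads contiguously, regardless of their on-disk locations. */
            for (int i = 0; i < npending; i++)
                memcpy(jumbo + i * REQ_SIZE, pending[i].data, REQ_SIZE);
            rdma_send(jumbo, npending * REQ_SIZE);
            npending = 0;
        }

        static void submit(const struct small_req *r)
        {
            pending[npending++] = *r;
            if (npending >= MERGE_THRESHOLD)  /* intensive: merge into jumbo */
                flush_pending();
        }

        int main(void)
        {
            struct small_req r = { .data = {0} };
            for (int i = 0; i < 5; i++)   /* 4 merge into one transfer; 1 waits */
                submit(&r);
            if (npending)                 /* light load: send what remains */
                flush_pending();
            return 0;
        }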

  16.  BRP-1: removes software overheads in the I/O path
       BRP-2: BRP-1 + increased parallelism
       BRP-3: BRP-2 + temporal merge in I/O-intensive situations
       "BRP" alone refers to BRP-3

  17.  Latency comparison: direct I/O, 4 KB, dd test

      I/O Type   SRP (usec)   BRP (usec)   Latency reduction
      Read       63 (51)      43 (31)      -31.7% (-39.2%)
      Write      75 (62)      54 (41)      -28.0% (-33.8%)

      ( ): values excluding device I/O latency (read: 12 usec, write: 13 usec)
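     As a quick consistency check, the reduction percentages follow directly from the raw latencies (a worked recomputation of the table's numbers):

        \[
        \text{Read:}\quad \frac{63-43}{63} \approx 31.7\%, \qquad \frac{51-31}{51} \approx 39.2\%
        \]
        \[
        \text{Write:}\quad \frac{75-54}{75} = 28\%, \qquad \frac{62-41}{62} \approx 33.8\%
        \]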

  18. [Chart: throughput (MB/s) for SRP (NOOP), BRP, and Local across wr/r_wr/rd/r_rd, buffered and direct I/O]

  19. [Chart: throughput (MB/s) for SRP (NOOP), BRP, and Local across wr/r_wr/rd/r_rd, buffered and direct I/O]

  20. FIO benchmark, random write, 4 KB, direct I/O
     [Chart: throughput (MB/s) vs. number of threads (4 to 512) for SRP (NOOP), BRP-1, BRP-2, and BRP-3T]
      BRP-3T: always executes temporal merge

  21. FIO benchmark, 4 KB, 16 threads
     [Chart: normalized throughput for Local, SRP (NOOP), BRP-1, BRP-2, and BRP-3 on r_wr (buff), r_wr (direct), and r_rd (direct); one configuration annotated at 256 threads]

  22.  SAN with high-performance storage
       Proposed a new SAN solution
      ▪ Removes software overheads in the I/O path
      ▪ Increases parallelism on the target side
      ▪ Temporal merge for RDMA data transfer
       Implemented the optimized SAN as a prototype

  23. Thank you! Q&A
