  1. Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand
     Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
     Network-Based Computing Laboratory, Department of Computer Science & Engineering, The Ohio State University

  2. Introduction
     • Computer systems have increased significantly in processing capability over the last few years in various ways
       – Multi-core architectures are becoming more prevalent
       – High-speed I/O interfaces such as PCI-Express have enabled high-speed interconnects such as InfiniBand to deliver higher performance
     • The area that has improved the least during this time is the memory controller

  3. Traditional Memory Design
     • Traditional memory controller design has limited the number of DIMMs per memory channel as signal rates have increased
     • Due to the high pin count (240) required for each channel, adding additional channels is costly
     • The end result is equal or even lower memory capacity in recent years

  4. Fully-Buffered DIMMs (FB-DIMMs)
     • FB-DIMM uses serial lanes with a buffer chip on each DIMM to eliminate this tradeoff
     • Each channel requires only 69 pins
     • Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism
     [Image courtesy of Intel Corporation]

  5. Evaluation
     • With multi-core systems coming, a scalable memory subsystem is increasingly important
     • Our goal is to compare FB-DIMM against a traditional design and evaluate the scalability
     • Evaluation Process
       – Test memory subsystem on a single node
       – Evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs)

  6. Outline
     • Introduction & Goals
     • Memory Subsystem Evaluation
       – Experimental testbed
       – Latency and throughput
     • Network results
     • Conclusions and Future work

  7. Evaluation Testbed
     • Intel “Bensley” system
       – Two 3.2 GHz dual-core Intel Xeon “Dempsey” processors
       – FB-DIMM-based memory subsystem
     • Intel Lindenhurst system
       – Two 3.4 GHz Intel Xeon processors
       – Traditional memory subsystem (2 channels)
     • Both contain:
       – Two x8 PCI-Express slots
       – DDR2-533-based memory
       – Two dual-port Mellanox MT25208 InfiniBand HCAs

  8. Bensley Memory Configurations
     • The standard allows up to 6 channels with 8 DIMMs/channel, for 192 GB
     • Our systems have 4 channels, each with 4 DIMM slots
     • To fill 4 DIMM slots we have 3 combinations
     [Figure: Bensley memory layout with Processor 0 and Processor 1, two branches (Branch 0, Branch 1), four channels (Channel 0-3), and four DIMM slots (Slot 0-3) per channel]

  9. Subsystem Evaluation Tool
     • lmbench 3.0-a5: open-source benchmark suite for evaluating system-level performance
     • Latency
       – Memory read latency
     • Throughput
       – Memory read benchmark
       – Memory write benchmark
       – Memory copy benchmark
     • Aggregate performance is obtained by running multiple long-running processes and reporting the sum of averages
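     A minimal sketch of the kind of dependent-load (pointer-chasing) read-latency loop that lmbench's lat_mem_rd performs, written here in plain C for illustration; it is not the lmbench source, and the working-set size, stride, and iteration count are arbitrary choices. lmbench additionally sweeps the working-set size and uses access patterns that defeat hardware prefetching.

         #define _POSIX_C_SOURCE 199309L
         #include <stdio.h>
         #include <stdlib.h>
         #include <time.h>

         int main(void)
         {
             size_t stride = 16;                               /* elements between dependent loads */
             size_t n = (64u * 1024 * 1024) / sizeof(void *);  /* ~64 MB working set (illustrative) */
             long iters = 1L << 26;

             void **buf = malloc(n * sizeof *buf);
             if (!buf) return 1;

             /* Link every stride-th element into one cycle so each load depends
                on the previous one and cannot be overlapped by the CPU. */
             for (size_t i = 0; i < n; i += stride)
                 buf[i] = &buf[(i + stride) % n];

             void **p = &buf[0];
             struct timespec t0, t1;
             clock_gettime(CLOCK_MONOTONIC, &t0);
             for (long i = 0; i < iters; i++)
                 p = (void **)*p;                              /* dependent load chain */
             clock_gettime(CLOCK_MONOTONIC, &t1);

             double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
             /* Print p so the compiler cannot optimize the chase loop away. */
             printf("avg read latency: %.1f ns (%p)\n", ns / iters, (void *)p);
             free(buf);
             return 0;
         }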

  10. Bensley Memory Throughput
     [Figure: three panels (Copy, Read, Write) of aggregate throughput (MB/sec) vs. number of processes (1, 2, 4)]
     • To study the impact of additional channels we evaluated using 1, 2, and 4 channels
     • Throughput increases significantly from one to two channels in all operations

  11. Access Latency Comparison
     [Figure: read latency (ns), unloaded vs. loaded, for Bensley (4 GB, 8 GB, 16 GB) and Lindenhurst (2 GB, 4 GB)]
     • Comparison when unloaded and loaded
     • Loaded is when a memory read throughput test is run in the background while the latency test is running
     • From unloaded to loaded latency:
       – Lindenhurst: 40% increase
       – Bensley: 10% increase
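     A minimal sketch, assuming a POSIX system, of how a "loaded" latency run can be reproduced: a forked child streams reads over a large buffer to keep the memory controller busy while the parent runs the latency measurement (for example the pointer-chase sketch above, or lmbench's lat_mem_rd). The buffer size and the "./lat_read" binary name are placeholders, not from the original evaluation.

         #define _POSIX_C_SOURCE 200112L
         #include <signal.h>
         #include <stdlib.h>
         #include <sys/types.h>
         #include <sys/wait.h>
         #include <unistd.h>

         /* Background load: stream reads over a large buffer forever. */
         static void read_load(void)
         {
             size_t n = (256u * 1024 * 1024) / sizeof(long);   /* 256 MB buffer, illustrative */
             volatile long *buf = calloc(n, sizeof *buf);
             long sink = 0;
             if (!buf) _exit(1);
             for (;;)
                 for (size_t i = 0; i < n; i++)
                     sink += buf[i];                           /* volatile: reads are not elided */
         }

         int main(void)
         {
             pid_t child = fork();
             if (child == 0)
                 read_load();                                  /* child never returns */

             /* Parent: run the latency measurement while the load is active. */
             system("./lat_read");                             /* hypothetical binary name */

             kill(child, SIGKILL);                             /* stop the background load */
             waitpid(child, NULL, 0);
             return 0;
         }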

  12. Memory Throughput Comparison
     [Figure: three panels (Copy, Read, Write) of aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for Lindenhurst and the Bensley 4 GB, 8 GB, and 16 GB configurations]
     • Comparison of Lindenhurst and Bensley platforms with increasing memory size
     • Performance increases with two concurrent read or write operations on the Bensley platform

  13. Outline
     • Introduction & Goals
     • Memory Subsystem Evaluation
       – Experimental testbed
       – Latency and throughput
     • Network results
     • Conclusions and Future work

  14. OSU MPI over InfiniBand
     • Open-source high performance implementations
       – MPI-1 (MVAPICH)
       – MPI-2 (MVAPICH2)
     • Has enabled a large number of production IB clusters all over the world to take advantage of InfiniBand
       – Largest being the Sandia Thunderbird Cluster (4512 nodes with 9024 processors)
     • Directly downloaded and used by more than 395 organizations worldwide (in 30 countries)
       – Time-tested and stable code base with novel features
     • Available in software stack distributions of many vendors
     • Available in the OpenFabrics (OpenIB) Gen2 stack and OFED
     • More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

  15. Experimental Setup
     [Figure: process-to-HCA assignment for the round-robin and process-binding modes, with processes P0-P3 and HCA 0 / HCA 1 on each node]
     • Evaluation is with two InfiniBand DDR HCAs, using the “multi-rail” feature of MVAPICH
     • Results with one process use both rails in a round-robin pattern
     • 2- and 4-process-pair results are done using a process binding assignment
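     For reference, a minimal uni-directional bandwidth loop in the style of the OSU MPI bandwidth test is sketched below (illustrative C/MPI, not the actual OSU benchmark source; WINDOW, ITERS, and the message-size sweep are arbitrary values). Striping across the two rails is handled inside MVAPICH, so the application-level code is ordinary MPI.

         #include <mpi.h>
         #include <stdio.h>
         #include <stdlib.h>

         #define WINDOW  64          /* non-blocking messages in flight per iteration */
         #define ITERS   100
         #define MAX_MSG (1 << 20)   /* sweep message sizes up to 1 MB */

         int main(int argc, char **argv)
         {
             int rank;
             MPI_Request req[WINDOW];

             MPI_Init(&argc, &argv);
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             char *buf = malloc(MAX_MSG);
             if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);

             for (int size = 1; size <= MAX_MSG; size *= 2) {
                 MPI_Barrier(MPI_COMM_WORLD);
                 double start = MPI_Wtime();

                 for (int i = 0; i < ITERS; i++) {
                     if (rank == 0) {                 /* sender: post a window, then wait for an ack */
                         for (int w = 0; w < WINDOW; w++)
                             MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
                         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                         MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                     } else if (rank == 1) {          /* receiver: post matching receives, then ack */
                         for (int w = 0; w < WINDOW; w++)
                             MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
                         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                         MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
                     }
                 }

                 if (rank == 0) {
                     double elapsed = MPI_Wtime() - start;
                     double mb = (double)size * WINDOW * ITERS / 1e6;
                     printf("%7d bytes  %8.1f MB/s\n", size, mb / elapsed);
                 }
             }

             free(buf);
             MPI_Finalize();
             return 0;
         }

     A single-pair run would launch two ranks, one per node (e.g. mpirun -np 2); the 2- and 4-pair cases launch more rank pairs with the binding assignment shown in the figure.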

  16. Uni-Directional Bandwidth
     [Figure: throughput (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
     • Comparison of Lindenhurst and Bensley with dual DDR HCAs
     • Due to higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages

  17. Bi-Directional Bandwidth
     [Figure: throughput (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
     • Improvement at 1K:
       – Lindenhurst: 1 to 2 processes: 15%
       – Bensley: 1 to 2 processes: 75%, 2 to 4 processes: 45%
     • Lindenhurst peak bi-directional bandwidth is only 100 MB/sec greater than uni-directional

  18. Messaging Rate
     [Figure: messaging rate (millions of messages/sec) vs. message size (1 byte to 256 KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]
     • For very small messages, both show similar performance
     • At 512 bytes: the Lindenhurst 2-process case is only 52% higher than 1 process, while Bensley still shows 100% improvement

  19. Outline
     • Introduction & Goals
     • Memory Subsystem Evaluation
       – Experimental testbed
       – Latency and throughput
     • Network results
     • Conclusions and Future work

  20. Conclusions and Future Work
     • Performed a detailed analysis of the memory subsystem scalability of Bensley and Lindenhurst
     • Bensley shows a significant advantage in scalable throughput and capacity in all measures tested
     • Future work:
       – Profile real-world applications on a larger cluster and observe the effects of contention in multi-core architectures
       – Expand evaluation to include NUMA-based architectures

  21. Acknowledgements
     Our research is supported by the following organizations:
     • Current funding support by [sponsor logos]
     • Current equipment support by [sponsor logos]

  22. Web Pointers
     {koop, huanwei, vishnu, panda}@cse.ohio-state.edu
     http://nowlab.cse.ohio-state.edu/
     MVAPICH Web Page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
