

  1. Modeling Communication Costs in Blade Servers Qiuyun Wang, Benjamin Lee Duke University October 4th, 2015 Duke Computer Architecture

  2. Case for Blade Servers An era of big data

  3. Case for Blade Servers
An era of big data needs big memory.
• Machines with large memory (e.g., HP Moonshot server cartridge)
• Distributed memory systems

  4. Case for Blade Servers
Figure 1: Two blade server nodes connected through Ethernet (labels: "a node", "a blade") [1,2].
[1] K. Lim, J. Chang, T. Mudge, P. Ranganathan. Disaggregated memory for expansion and sharing in blade servers.
[2] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong. Cost effective data center servers.

  5. Case for Blade Servers
Figure 1: Two blade server nodes connected through Ethernet [1,2].
Figure 2: 2D view of a server node design with four blades (blade 0-3), inter-processor links (e.g., HyperTransport) and inter-blade links (e.g., PCIe).
Blade servers provide compute and memory capacity in a dense form factor.
[1] K. Lim, J. Chang, T. Mudge, P. Ranganathan. Disaggregated memory for expansion and sharing in blade servers.
[2] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong. Cost effective data center servers.

  6. Case for Blade Servers
Applications: in-memory computational frameworks
• Big-data analytics frameworks: e.g., Spark
• Graph workloads: e.g., GraphLab, Spark GraphX
• In-memory databases: e.g., MonetDB
Challenges: hardware-software co-design costs both engineering effort and time. A fast and cost-effective way to understand the system is through technology models.

  7. Motivation for Technology Models
We identify and derive key technology parameters to analyze their effects on system performance, throughput, and energy. These models can help to
• choose hardware technologies and configurations
• understand performance and energy impacts
• close the loop for hardware-software co-design

  8. Agenda
1. Derive technology models
2. Characterize non-uniform memory access
3. Develop NUMA-aware schedulers

  9. Communication Technologies
• Memory: DDR3
• Inter-processor: HyperTransport, Intel QuickPath
• Inter-blade: PCIe 3.0, InfiniBand
Figure 2: A blade server node design with inter-blade links and inter-processor links.

  10. Delay and Energy Estimates
Figure 3: Derived and surveyed technology and architectural parameters (key estimates).

  11. With these Estimates
• Explore system organizations for blade servers
• Analyze communication delay and energy
• Address challenges in system management, e.g., non-uniform memory access (NUMA)

  12. Agenda
1. Derive technology models
2. Characterize non-uniform memory access
3. Develop NUMA-aware schedulers

  13. NUMA Effects
• Processors access different memory regions with different latencies — non-uniform memory access (NUMA)
• NUMA degrades application performance
• Multiple communication paths introduce multiple levels of NUMA
Figure 4: Single-thread performance degradation (CPI normalized to local access) for inter-processor and inter-blade NUMA access.

  14. NUMA-aware Scheduling Policies
Figure 5: NUMA-aware scheduling algorithms [3].
[3] M. Zaharia et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling.

  15. NUMA-aware Scheduling Policies
• Local: local execution
• IP-1: inter-processor 1-hop execution
• IP-2: inter-processor 2-hop execution
• IB: inter-blade execution
Applications' NUMA effects vary; throughput and latency goals differ. Choose the optimal policy accordingly.
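The policy choice can be sketched as a simple selector. This is an illustrative sketch only: the function name `choose_policy` and the decision thresholds are assumptions, not the schedulers evaluated in the deck.

```python
# Hypothetical sketch of the policy choice described above: compute-intensive
# (CI) tasks tolerate NUMA well, while memory-intensive (MI) tasks should
# avoid it except under high load. Policy names follow the slide:
# Local, IP-1 (1-hop), IP-2 (2-hop), IB (inter-blade).

def choose_policy(compute_intensive: bool, high_load: bool) -> str:
    """Pick a placement policy for a task (illustrative rules only)."""
    if compute_intensive:
        # CI tasks lose little to NUMA: permit inter-blade execution.
        return "IB"
    if high_load:
        # MI tasks in highly loaded servers: selectively permit a nearby hop.
        return "IP-1"
    # Otherwise keep memory-intensive tasks local.
    return "Local"

print(choose_policy(compute_intensive=True, high_load=True))   # IB
print(choose_policy(compute_intensive=False, high_load=False)) # Local
```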

  16. Methods - NUMA Simulation
Characterize application sensitivity to NUMA over each type of communication technology.
Simulator: MARSSx86 + DRAMSim (CPU, DRAM, interconnects).
Add additional latency for different communication paths.
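The latency-injection idea can be illustrated with a back-of-the-envelope model. The per-path latency numbers below are placeholders, not the deck's derived estimates:

```python
# Illustrative model of injecting extra latency per communication path,
# mirroring the MARSSx86 + DRAMSim setup described above. All latency
# numbers are placeholder assumptions, not measured values.

BASE_DRAM_NS = 60.0   # assumed local DRAM access latency (ns)

EXTRA_NS = {          # assumed added latency per communication path
    "Local": 0.0,
    "IP-1": 50.0,     # one inter-processor hop (e.g., HyperTransport)
    "IP-2": 100.0,    # two inter-processor hops
    "IB": 400.0,      # inter-blade link (e.g., PCIe)
}

def effective_latency(path: str) -> float:
    """Memory latency seen by the core for a given communication path (ns)."""
    return BASE_DRAM_NS + EXTRA_NS[path]

for path in EXTRA_NS:
    print(path, effective_latency(path))
```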

  17. Methods - Remote vs Local
Benchmarks:
• 1-7: Apache Spark
• 8-11: Phoenix MapReduce
• 12-20: PARSEC 2.0
Figure 7: Remote vs. local access fractions per benchmark (assuming heap is remote; x-axis is benchmark id; y-axis is percentage).

  18. Methods - Queueing Simulation
Model task queues and analyze queueing dynamics.
Figure 8: Queueing simulation parameters (one blade server node):
• Cores per socket: 16
• Sockets per blade: 4
• Blades per node: 4
• Task size: 100M instructions
• Inter-arrival time: exponential distribution, λ = 6000 t/s (varied to change system utilization)
• Service time per core: # instructions / IPC / core frequency (changes based on NUMA effects)
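A minimal sketch of such a queueing simulation, using the Figure 8 parameters (16 cores/socket × 4 sockets × 4 blades = 256 cores, λ = 6000 t/s, 100M-instruction tasks). The IPC and core-frequency values are assumptions, and this simplifies to one shared FCFS queue with deterministic service:

```python
# Minimal queueing-simulation sketch: tasks arrive with exponential
# inter-arrival times and are served FCFS by a pool of cores.
# IPC and frequency are placeholder assumptions.
import heapq
import random

CORES = 16 * 4 * 4          # cores per node (Figure 8)
LAMBDA = 6000.0             # task arrivals per second
TASK_INSTR = 100e6          # 100M instructions per task
IPC, FREQ_HZ = 1.0, 2.5e9   # assumed IPC and core frequency

def simulate(n_tasks=10000, seed=0):
    """Return mean task response time (seconds)."""
    rng = random.Random(seed)
    service = TASK_INSTR / IPC / FREQ_HZ   # service time per task (s)
    t = 0.0
    free_at = [0.0] * CORES                # next-free time per core (min-heap)
    heapq.heapify(free_at)
    total_resp = 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(LAMBDA)       # next arrival time
        core_free = heapq.heappop(free_at) # earliest-available core
        start = max(t, core_free)          # wait if all cores are busy
        finish = start + service
        heapq.heappush(free_at, finish)
        total_resp += finish - t
    return total_resp / n_tasks

print(simulate())
```

Raising λ (or the per-task service time, to reflect NUMA effects) pushes utilization toward 1 and inflates queueing delay, which is how the deck varies system load.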

  19. Results — Throughput
• Increase the system load to test the maximum sustained throughput.
• Avoiding NUMA always increases throughput.
Figure: Maximum sustained throughput (normalized) for Local, IP-1, IP-2, and IB across benchmarks 1-20.
• Compute-intensive: 7, 9-11, 13-20
• Memory-intensive: 1-6, 8, 12

  20. Results — Latency/QoS
• Permitting NUMA can improve the quality of service.
• CI tasks should choose IB to permit NUMA.
• MI tasks should choose IP-1 and IP-2 to selectively permit NUMA in highly loaded servers.
Figure: 95th-percentile response time under high utilization (y-axis: speed-up relative to baseline) for Local, IP-1, IP-2, and IB.
• Compute-intensive: 7, 9-11, 13-20
• Memory-intensive: 1-6, 8, 12

  21. Results — Communication Energy
• If data is near, remote access is more beneficial (3-4x) for saving energy.
• If data is far, remote access is less beneficial because of high-cost links.
• Energy benefits depend on page reuse rate and communication channels.
Figure: Data migration energy, normalized to remote access, over inter-processor 1-hop, inter-processor 2-hop, and inter-blade links.
• Compute-intensive: 7, 9-11, 13-20
• Memory-intensive: 1-6, 8, 12
• 18-20 are out of scope
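The migrate-vs-remote trade-off can be sketched as a break-even calculation: migrating a page pays a one-time link cost, after which accesses are local, while staying remote pays the link cost on every reuse. All energy values below are placeholder assumptions, not the surveyed Figure 3 numbers:

```python
# Back-of-the-envelope sketch of the migration-vs-remote-access trade-off.
# Energy values are placeholder assumptions for illustration only.

ENERGY_PJ = {        # assumed energy per 64 B access (pJ)
    "local": 20.0,
    "IP-1": 60.0,    # near link: remote access is relatively cheap
    "IB": 400.0,     # far link: remote access is expensive
}
PAGE_TRANSFERS = 64  # 4 KiB page migrated as 64 x 64 B transfers

def migration_wins(link: str, reuses: int) -> bool:
    """True if migrating the page costs less energy than `reuses` remote accesses."""
    migrate = ENERGY_PJ[link] * PAGE_TRANSFERS + ENERGY_PJ["local"] * reuses
    remote = ENERGY_PJ[link] * reuses
    return migrate < remote

print(migration_wins("IB", 200))  # True: high reuse amortizes the migration
print(migration_wins("IB", 10))   # False: low reuse favors remote access
```

This matches the slide's point: the benefit of avoiding remote access depends on the page reuse rate and on which communication channel the data sits behind.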

  22. Results — Communication Channels
Figure 9: Link utilization percentages for application 1 across local DRAM, inter-processor 1-hop, inter-processor 2-hop, and inter-blade channels, under each policy (Local, IP-1, IP-2, IB).
• Use link utilization percentage to estimate average communication power.
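The utilization-based power estimate amounts to scaling peak link power by measured utilization. A minimal sketch, assuming power scales linearly with utilization; the peak values follow the next slide's figures (HyperTransport ~40 W, PCIe ~60 W at peak):

```python
# Sketch of the power estimate described above: average communication power
# approximated as peak link power times link utilization. Linear scaling
# is an assumption of this sketch.

PEAK_W = {"HyperTransport": 40.0, "PCIe": 60.0}  # peak link power (W)

def avg_power(link: str, utilization: float) -> float:
    """Average link power, assuming power scales linearly with utilization."""
    assert 0.0 <= utilization <= 1.0
    return PEAK_W[link] * utilization

# e.g., Spark workloads at ~25% link utilization (per the next slide):
print(avg_power("HyperTransport", 0.25))  # 10.0
print(avg_power("PCIe", 0.25))            # 15.0
```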

  23. Results — Communication Power
• HyperTransport and PCIe dissipate around 40 W and 60 W, respectively, at peak utilization.
• S1-S6 suggest that these Spark workloads use about 25% of the link bandwidth.
Figure: Communication power (W) for Local, IP-1, IP-2, and IB across benchmarks.
• Compute-intensive: 7, 9-11, 13-20
• Memory-intensive: 1-6, 8, 12
• 12 is out of scope

  24. Conclusions and Future Directions
• Model blade servers for emerging big-data applications.
• Study NUMA-aware schedulers and their effects on throughput, latency, and power.
• Provide guidelines for choosing an optimal policy.
Future directions:
• Extend validation to real system measurements.

  25. Modeling Communication Costs in Blade Servers Qiuyun Wang, Benjamin Lee Duke University October 4th, 2015
