  1. A Micro-Benchmark Evaluation of Catamount and Cray Linux Environment (CLE) Performance
     Jeff Larkin, Cray Inc. <larkin@cray.com>
     Jeff Kuehn, ORNL <kuehn@ornl.gov>

  2. THE BIG QUESTION: Does CLE waddle like a penguin, or run like a catamount? (CUG2008)

  3. Overview
     - Background
       - Motivation
       - Catamount and CLE
       - Benchmarks
       - Benchmark System
     - Benchmark Results
       - HPCC
       - IMB
     - Conclusions

  4. BACKGROUND

  5. Motivation
     - Last year at CUG, "CNL" was in its infancy
     - Since CUG07:
       - Significant effort has been spent scaling on large machines
       - CNL reached GA status in Fall 2007
       - Compute Node Linux (CNL) was renamed Cray Linux Environment (CLE)
       - A significant number of sites have already made the change
       - Many codes have already been ported from Catamount to CLE
     - Catamount's scalability has always been touted, so how does CLE compare?
     - Fundamentals of communication performance: HPCC, IMB
     - What should sites/users know before they switch?

  6. Background: Catamount
     - Developed by Sandia for Red Storm; adopted by Cray for the XT3
     - Extremely lightweight:
       - Simple memory model: no virtual memory, no mmap
       - Reduced system calls
       - Single-threaded
       - No Unix sockets
       - No dynamic libraries
       - Few interrupts to user codes
     - Virtual Node (VN) mode added for dual-core

  7. Background: CLE
     - First, we tried a full SUSE Linux kernel; then we "put Linux on a diet"
     - With the help of ORNL and NERSC, we began running at large scale
     - By Fall 2007, we released Linux for the compute nodes
     - What did we gain? Threading, Unix sockets, I/O buffering

  8. Background: Benchmarks
     - HPCC
       - Suite of several benchmarks, released as part of the DARPA HPCS program
       - MPI performance; performance for varied temporal and spatial localities
       - Benchmarks are run in 3 modes:
         - SP: 1 node runs the benchmark
         - EP: every node runs a copy of the same benchmark
         - Global: all nodes run the benchmark together
     - Intel MPI Benchmarks (IMB) 3.0
       - Formerly the Pallas benchmarks
       - Benchmarks standard MPI routines at varying scales and message sizes
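The ping-pong style tests in IMB time a message bounced between two ranks and report half the round trip as one-way latency, with bandwidth derived from the message size. A minimal sketch of that arithmetic (the helper name and the sample round-trip figure are illustrative, not taken from IMB itself):

```python
# Sketch of how a ping-pong benchmark turns a measured round-trip
# time into the latency and bandwidth figures it reports.
# The sample numbers below are hypothetical, not measurements.

def pingpong_metrics(msg_bytes, round_trip_sec):
    """Return (one-way latency in usec, effective bandwidth in MB/s)."""
    latency_sec = round_trip_sec / 2.0               # half the round trip
    bandwidth_mbs = (msg_bytes / latency_sec) / 1e6  # bytes/sec -> MB/s
    return latency_sec * 1e6, bandwidth_mbs

# Hypothetical example: 1024-byte message, 16 usec measured round trip
lat_us, bw_mbs = pingpong_metrics(1024, 16e-6)
print(f"latency = {lat_us:.1f} usec, bandwidth = {bw_mbs:.0f} MB/s")
```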

  9. Background: Benchmark System
     - All benchmarks were run on the same system, "Shark," with the latest OS versions as of Spring 2008
     - System basics:
       - Cray XT4
       - 2.6 GHz dual-core Opterons (able to run up to 1280 cores)
       - DDR2-667 memory, 2 GB/core
     - OS versions tested:
       - Catamount (1.5.61)
       - CLE, MPT2 (2.0.50)
       - CLE, MPT3 (2.0.50, xt-mpt 3.0.0.10)

  10. BENCHMARK RESULTS

  11. HPCC

  12. Parallel Transpose (Cores) [chart: GB/s (0-140) vs. processor cores (0-1500); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

  13. Parallel Transpose (Sockets) [chart: GB/s (0-120) vs. sockets (0-600); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

  14. MPI Random Access [chart: GUP/s (0-3) vs. processor cores (0-1500); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]
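The GUP/s figure comes from HPCC's RandomAccess benchmark, which streams XOR updates into a large table at pseudo-random indices. A serial Python sketch of that update kernel, assuming toy table and update sizes (the real benchmark sizes the table to memory and runs the updates across all MPI ranks):

```python
# Serial sketch of the HPCC RandomAccess update kernel behind the
# MPI Random Access (GUP/s) results. Sizes here are toy values.
import time

POLY = 0x7  # feedback polynomial of the 64-bit shift-register stream

def random_access(log2_size, n_updates):
    size = 1 << log2_size
    table = list(range(size))  # HPCC initializes table[i] = i
    ran = 1
    for _ in range(n_updates):
        # advance the 64-bit pseudo-random sequence
        ran = ((ran << 1) ^ (POLY if ran & (1 << 63) else 0)) & (2**64 - 1)
        table[ran & (size - 1)] ^= ran  # XOR-update a random entry
    return table

t0 = time.perf_counter()
random_access(log2_size=10, n_updates=4096)
elapsed = time.perf_counter() - t0
gups = 4096 / elapsed / 1e9  # GUP/s = giga-updates per second
print(f"{gups:.6f} GUP/s (toy problem, single process)")
```

The benchmark is memory-latency bound by design: every update touches an effectively random cache line, which is why GUP/s is far more sensitive to OS noise and network small-message performance than the FLOP-oriented tests.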

  15. MPI-FFT (Cores) [chart: GFlops/s (0-250) vs. processor cores (0-1200); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

  16. MPI-FFT (Sockets) [chart: GFlops/s (0-250) vs. sockets (0-600); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

  17. Naturally Ordered Latency (512 cores)

      Configuration    Time (usec)
      Catamount SN       6.41346
      CLE MPT2 N1        9.08375
      CLE MPT3 N1        9.41753
      Catamount VN      12.3024
      CLE MPT2 N2       13.8044
      CLE MPT3 N2        9.799

  18. Naturally Ordered Bandwidth (512 cores)

      Configuration    Bandwidth (MB/s)
      Catamount SN       1.07688
      CLE MPT2 N1        0.900693
      CLE MPT3 N1        0.81866
      Catamount VN       0.171141
      CLE MPT2 N2        0.197301
      CLE MPT3 N2        0.329071

  19. IMB

  20. IMB Ping Pong Latency (N1) [chart: time (usec, 0-12) vs. message size (B, 0-1200); series: Catamount, CLE MPT2, CLE MPT3]

  21. IMB Ping Pong Latency (N2) [chart: avg usec (0-10) vs. message size (B, 0-1200); series: Catamount, CLE MPT2, CLE MPT3]

  22. IMB Ping Pong Bandwidth [chart: MB/s (0-600) vs. message size (B, 0-1200); series: Catamount, CLE MPT2, CLE MPT3]
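The shape of a ping-pong bandwidth curve follows a simple latency/bandwidth model: small messages are dominated by per-message latency, while large messages approach the link's peak rate. A sketch of that model, with hypothetical latency and peak values rather than measurements from Shark:

```python
# Hockney-style model: transfer time = latency + size / peak_bandwidth.
# Shows why measured bandwidth climbs with message size.
# LATENCY and PEAK below are illustrative, not measured values.

def effective_bandwidth(msg_bytes, latency_sec, peak_bytes_per_sec):
    transfer_time = latency_sec + msg_bytes / peak_bytes_per_sec
    return msg_bytes / transfer_time  # bytes/sec actually achieved

LATENCY = 6e-6  # 6 usec one-way latency (hypothetical)
PEAK = 2e9      # 2 GB/s peak link bandwidth (hypothetical)

for size in (64, 1024, 16384, 1048576):
    bw = effective_bandwidth(size, LATENCY, PEAK) / 1e6
    print(f"{size:>8} B -> {bw:8.1f} MB/s")
```

Under this model, the small-message end of the curve is where a kernel's interrupt and noise behavior shows up most clearly, which is consistent with the latency-focused comparisons above.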

  23. MPI Barrier (Lin/Lin) [chart: time (usec, 0-160) vs. processor cores (0-1500); series: Catamount, CLE MPT2, CLE MPT3]

  24. MPI Barrier (Lin/Log) [chart: time (usec, 0-160) vs. processor cores (1-10000, log scale); series: Catamount, CLE MPT2, CLE MPT3]

  25. MPI Barrier (Log/Log) [chart: time (usec, 0.1-1000, log scale) vs. processor cores (1-10000, log scale); series: Catamount, CLE MPT2, CLE MPT3]
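The log/log view makes the barrier's scaling visible: tree- and dissemination-style barrier algorithms complete in ceil(log2(P)) communication rounds, so cost grows logarithmically with process count. A small model sketch, assuming an illustrative per-round latency (not a measured value, and not necessarily the algorithm any particular MPT release uses):

```python
# Model of a logarithmic barrier: ceil(log2(P)) rounds, each costing
# roughly one small-message latency. Round latency is illustrative.
import math

def barrier_model_usec(procs, round_latency_usec):
    rounds = math.ceil(math.log2(procs)) if procs > 1 else 0
    return rounds * round_latency_usec

for p in (2, 16, 128, 1024):
    print(f"{p:>5} procs -> {barrier_model_usec(p, 8.0):6.1f} usec (model)")
```

On log/log axes this model is a slowly rising staircase, so a measured curve that bends upward faster than log2(P) points at per-round costs growing with scale (e.g., OS noise) rather than at the algorithm itself.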

  26. SendRecv (Catamount/CLE MPT2) [chart]

  27. SendRecv (Catamount/CLE MPT3) [chart]

  28. Broadcast (Catamount/CLE MPT2) [chart]

  29. Broadcast (Catamount/CLE MPT3) [chart]

  30. Allreduce (Catamount/CLE MPT2) [chart]

  31. Allreduce (Catamount/CLE MPT3) [chart]

  32. AlltoAll (Catamount/CLE MPT2) [chart]

  33. AlltoAll (Catamount/CLE MPT3) [chart]

  34. CONCLUSIONS

  35. What we saw
      Catamount:
      - Runs single core (SN/N1) slightly better
      - Seems to handle small messages and small core counts slightly better
      CLE:
      - Does very well on dual-core
      - Likes large messages and large core counts
      - MPT3 helps performance and closes the gap between QK (Catamount) and CLE

  36. What's left to do?
      - We'd really like to try this again on a larger machine
        - Does CLE continue to beat Catamount above 1024 cores, or will the lines converge or cross?
      - What about I/O?
        - Linux adds I/O buffering; how does this affect I/O performance at scale?
      - How does this translate into application performance?
        - See "Cray XT4 Quadcore: A First Look," Richard Barrett, et al., Oak Ridge National Laboratory (ORNL)

  37. Does CLE waddle like a penguin, or run like a catamount? CLE RUNS LIKE A BIG CAT!

  38. Acknowledgements
      This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Thanks to Steve, Norm, Howard, and others for help investigating and understanding these results.
