
High Performance Computing @ AUB GradEx Workshop - Mher Kazandjian



  1. High Performance Computing @ AUB GradEx Workshop Mher Kazandjian November 2018 American University of Beirut

  2. How is this talk structured? • History of computing • Scientific computing workflows • Computer architecture overview • Do's and Don'ts • Demos and walkthroughs

  3. Goals • Demonstrate how you (as users) can benefit from AUB's HPC facilities • Attract users, because: • we want to boost scientific computing research • we want to help you • we have capacity. This presentation is based on actual feedback and use cases collected from users over the past year

  4. History of computing Alan Turing 1912-1954

  5. Growth over time 12 orders of magnitude since 1960

  6. Growth over time: ~12 orders of magnitude since 1960. For the $1000 you would have spent on hardware in 1970, the same money today buys hardware that can do ~10^12 times more calculations

  7. What is HPC used for today? ● Solving scientific problems ● Data mining and deep learning ● Military research and security ● Cloud computing ● Blockchain (cryptocurrency)

  8.  What is HPC used for today? ● https://blog.openai.com/ai-and-compute/

  9. Growth over time: multicore CPUs hit the markets in ~2005 and users at home started benefiting from parallelism. Prior to that, applications that scaled well were restricted to mainframes / datacenters and HPC clusters

  10. HPC @ AUB: 8 compute nodes in 2006. Specs per node: 4 cores, 8 GB RAM; ~80 GFlops

  11. HPC is all about scalability • The high-speed network is the "most" important component

  12. But what is scalability? Performance improvement as the number of cores (resources) increases for the same problem size - hard scalability (also known as strong scaling)

  13. But what is scalability? This is a CPU under a microscope

  14. But what is scalability? 2 sec Prog.exe Serial runtime = T_serial

  15. But what is scalability? 1 sec Prog.exe Prog.exe parallel runtime = T_parallel

  16. But what is scalability? 0.5 sec Prog.exe Prog.exe Prog.exe Prog.exe parallel runtime = T_parallel

  17. But what is scalability? 0.5 sec Prog.exe Prog.exe Prog.exe Prog.exe Very nice! But this is rarely the case in practice
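  A way to put numbers on the previous three slides (the formulas are implied by the T_serial / T_parallel notation but are my addition, not spelled out on the slides): the speedup on p cores is S(p) = T_serial / T_parallel(p), and the parallel efficiency is E(p) = S(p) / p. In the example above, S(4) = 2 s / 0.5 s = 4 and E(4) = 1, i.e. ideal scaling; real applications typically have E(p) < 1, and it drops as p grows.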

  18. First demo – First scalability diagram

  19. But what is scalability? Repeat the same process across multiple processors Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe

  20. But what is scalability? Wait! - how do these processors talk to each other? - how much data needs to be transferred for a certain task? - how fast do the processes communicate with each other? - how often should the processes communicate with each other? Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe

  21. At the single chip level Through the cache memory of the CPU Typical latency ~ ns (or less) Typical bandwidth > 150 GB/s

  22. At the single chip level: through the RAM (Random Access Memory). Typical latency ~ a few to tens of ns. Typical bandwidth ~ 10 to 50 GB/s (sometimes more). https://ark.intel.com/#@Processors

  23. Second demo: bandwidth and some lingo - An array is just a bunch of bytes - Bandwidth is the speed with which information is transferred - A double precision float is 8 bytes - an array of one million elements is 1000 x 1000 x 8 bytes = 8 MB - if I measure the time to initialize this array, I can measure how fast the CPU can access the RAM (since initializing the array means visiting each memory address and setting it to zero) - bandwidth = size of array / time to initialize it

  24. Second demo: bandwidth and some lingo - An array is just a bunch of bytes - Bandwidth is the speed with which information is transferred - A double precision float is 8 bytes - an array of one million elements is 1000 x 1000 x 8 bytes = 8 MB - if I measure the time to initialize this array, I can measure how fast the CPU can access the RAM (since initializing the array means visiting each memory address and setting it to zero) - bandwidth = size of array / time to initialize it. Intel i7-6700HQ - https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz- - Advertised bandwidth = 34 GB/s - measured bandwidth (single-thread quickie) = 22.8 GB/s
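  The slides describe the recipe but not the code; a minimal C sketch of that measurement could look like the following (array size, timer, and build flags are my assumptions, not the original demo code, and for an array this small you would want to repeat the loop or enlarge the array to get past cache effects):

      /* bandwidth.c - rough single-threaded memory bandwidth estimate
         build: gcc -O2 bandwidth.c -o bandwidth */
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define N (1000L * 1000L)            /* one million doubles = 8 MB */

      int main(void) {
          double *a = malloc(N * sizeof(double));
          if (a == NULL) return 1;

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long i = 0; i < N; i++)
              a[i] = 0.0;                  /* visit every memory address once */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          double gb  = N * sizeof(double) / 1e9;
          /* read something back so the compiler cannot drop the write loop */
          printf("a[0]=%g  wrote %.3f GB in %.6f s  ->  %.2f GB/s\n",
                 a[0], gb, sec, gb / sec);
          free(a);
          return 0;
      }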

  25. At the single motherboard level: Through QPI (Quick Path Interconnect) - typical latency for small data ~ ns - typical bandwidth ~ 100 GB/s. Through the RAM (Random Access Memory) - typical latency ~ a few to tens of ns - typical bandwidth ~ 10 to 100 GB/s (sometimes more). TIP: server = node = compute node = NUMA node

  26. Second demo: bandwidth, multi-threaded - https://github.com/jeffhammond/STREAM (another benchmark) - 2-socket Intel Xeon E5-2665 server (https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI) - expected bandwidth for 2 sockets ~102 GB/s - measured ~75 GB/s - on a completely idle node ~95 GB/s is possible
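  STREAM itself times copy / scale / add / triad kernels over large arrays; a stripped-down OpenMP sketch in the same spirit (not the actual benchmark code; array size and kernel choice are my assumptions) shows where the multi-threaded numbers come from:

      /* stream_like.c - multi-threaded "copy" bandwidth estimate
         build: gcc -O2 -fopenmp stream_like.c -o stream_like
         run:   OMP_NUM_THREADS=16 ./stream_like */
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N 80000000L                  /* 80 million doubles = 640 MB per array */

      int main(void) {
          double *a = malloc(N * sizeof(double));
          double *b = malloc(N * sizeof(double));
          if (a == NULL || b == NULL) return 1;

          /* first touch in parallel so pages are spread across the NUMA nodes */
          #pragma omp parallel for
          for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

          double t0 = omp_get_wtime();
          #pragma omp parallel for
          for (long i = 0; i < N; i++)
              a[i] = b[i];                 /* STREAM-style copy kernel */
          double t1 = omp_get_wtime();

          /* one read + one write per element = 2 * N * 8 bytes moved */
          double gb = 2.0 * N * sizeof(double) / 1e9;
          printf("%d threads: %.1f GB/s (a[0]=%g)\n",
                 omp_get_max_threads(), gb / (t1 - t0), a[0]);
          free(a); free(b);
          return 0;
      }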

  27. At the cluster level (multiple nodes): through the network (Ethernet). Typical latency ~ 10 to 100 microseconds. Typical bandwidth ~ 100 MB/s to a few hundred MB/s

  28. At the cluster level (multiple nodes): through the network (InfiniBand - high speed network). Typical latency ~ a few microseconds down to < 1 microsecond. Typical bandwidth > 3 GB/s. Benefits over Ethernet: - remote direct memory access (RDMA) - higher bandwidth - much lower latency. https://en.wikipedia.org/wiki/InfiniBand

  29. What hardware do we have at AUB? - Arza: - 256-core, 1 TB RAM IBM cluster - production simulations, benchmarking - http://website.aub.edu.lb/it/hpc/Pages/home.aspx - vLabs - see Vassili's slide - very flexible, easy to manage, Windows support - public cloud - infinite resources (limited by $$$) - two pilot projects being tested - will be open soon for testing

  30. Parallelization libraries / software. SMP parallelism (single node): - OpenMP - CUDA - Matlab - Spark (recently deployed and tested). Distributed parallelism (cluster wide): - MPI - Spark - MPI + OpenMP (hybrid) - MPI + CUDA - MPI + CUDA + OpenMP - Spark + CUDA (not tested - any volunteers?)
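  For a feel of what these APIs look like, here is a minimal hybrid MPI + OpenMP "hello" in C (my own sketch, not slide material; the exact compiler wrapper and launcher depend on the cluster's MPI installation):

      /* hello_hybrid.c - one MPI rank per node/socket, OpenMP threads inside each rank
         build: mpicc -fopenmp hello_hybrid.c -o hello_hybrid
         run:   mpirun -np 4 ./hello_hybrid   (with OMP_NUM_THREADS set per rank) */
      #include <stdio.h>
      #include <mpi.h>
      #include <omp.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank, nranks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          /* distributed parallelism: ranks, possibly on different nodes;
             SMP parallelism: threads sharing memory inside each rank */
          #pragma omp parallel
          printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                 rank, nranks, omp_get_thread_num(), omp_get_num_threads());

          MPI_Finalize();
          return 0;
      }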

  31. Linux/Unix culture: > 99% of HPC clusters worldwide use some kind of Linux / Unix. - Clicking your way through a software install is easy for you (on Windows or Mac), but a nightmare for power users. - Linux is: - open source - free - secure (at least much more secure than Windows et al.) - no need for an antivirus that slows down your system - respects your privacy - huge community support in scientific computing - 99.8% of all HPC systems worldwide since 1996 are non-Windows machines. https://github.com/mherkazandjian/top500parser

  32. Software stack on the HPC cluster - Matlab - C, Java, C++, Fortran - Python 2 and Python 3 - Jupyter notebooks - TensorFlow (deep learning) - Scala - Spark - R - RStudio, R server (new)

  33. Cluster usage: Demo - The scheduler / resource manager: - bjobs - bqueues - bhosts - lsload - important places: - /gpfs1/my_username - /gpfs1/apps/sw - basic Linux knowledge - sample job script

  34. Cluster usage: Documentation https://hpc-aub-users-guide.readthedocs.io/en/latest/ https://github.com/hpcaubuserguide/hpcaub_userguide The guide is for you - we want you to contribute to it directly - please send us pull requests

  35. Cluster usage: Job scripts https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

  36. Cluster usage: Job scripts https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html In the user guide there are samples and templates for many use cases: - we will help you write your own if your use case is not covered - this is 90% of the getting-started work - recent success story: a Spark server job template

  37. Cluster usage: Job scripts https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

  38. How to benefit from the HPC hardware? - run many serial jobs that do not need to communicate - aka embarrassingly parallel jobs (nothing embarrassing about it though, as long as you get your job done) - e.g. - train several neural networks with different numbers of layers - do a parameter sweep for a certain model ./my_prog.exe --param 1 & ./my_prog.exe --param 2 & ./my_prog.exe --param 3 & These would execute simultaneously - difficulty: very easy

  39. How to benefit from the HPC hardware? - run many serial jobs that do not need to communicate - Demo

  40. How to benefit from the HPC hardware? - run an SMP parallel program (i.e. on one node, using threads) - e.g. - Matlab - C/C++/Python/Java - Difficulty: very easy to medium (problem dependent)

  41. How to benefit from the HPC hardware? - run an SMP parallel program (i.e. on one node, using threads) - C

  42. How to benefit from the HPC hardware? - run an SMP parallel program (i.e. on one node, using threads) - C
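  Slides 41-42 presumably showed a C listing that did not survive in this transcript; a representative example of the kind of single-node threaded code meant here (my own sketch, assuming OpenMP) is:

      /* parallel_sum.c - SMP parallelism with OpenMP threads on one node
         build: gcc -O2 -fopenmp parallel_sum.c -o parallel_sum
         run:   OMP_NUM_THREADS=8 ./parallel_sum */
      #include <stdio.h>
      #include <omp.h>

      #define N 100000000L

      int main(void) {
          double sum = 0.0;

          /* loop iterations are split across all threads on the node;
             the reduction clause combines the per-thread partial sums */
          #pragma omp parallel for reduction(+:sum)
          for (long i = 0; i < N; i++)
              sum += 1.0 / (double)(i + 1);

          printf("threads=%d  sum=%.6f\n", omp_get_max_threads(), sum);
          return 0;
      }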

  43. How to benefit from the HPC hardware? - run an SMP parallel program (i.e. on one node, using threads) - Demo: Matlab parfor

  44. How to benefit from the HPC hardware? - run an SMP parallel program (i.e. on one node, using threads) - Demo: Matlab parfor
