

  1. The old challenge: How to support users?
     mirko.rahn@itwm.fraunhofer.de, Dagstuhl, November 2017

  2. CC-HPC@itwm.fraunhofer.de

     • What we do: holistic optimization, dealing with many-core (structured) SMP machines (Xeon, Phi), large machines (100–500 DSM nodes), and very large machines (1000–5000 DSM nodes).
     • Staff: 1/3 computer scientists, 1/3 mathematicians, 1/3 physicists, 1/3 other.
     • Hardware: many cores (10^4–10^6 threads), multiple levels of memory (tape, cold disk, spinning disk, SSD, NVRAM, DRAM, HBM, cache levels 3, 2, 1, SIMD registers).
     • Costs: computation is free; data transfer dominates energy, latency, and throughput.
     • Software, high level: asynchronous communication, task-based programming (abstraction, load balancing, time skewing), frameworks & DSLs.
     • Software, mid level: hybrid process/thread, multi-level cache blocking, zero-indirection memory layouts, zero-copy dependency management.
     • Software, low level: SIMD intrinsics, coroutines (not threads), CAS & lock-free data structures.

  3. Parallel programming: Shining reality.

     16 nodes → 1584 nodes ⇒ 1 h → 1 min.

     Problem size 1000³, 8th-order operator: each of the 512 · 3 · 28 = 43008 cores holds about 28.5³ points.
     ⇒ ((28.5 − 16) / 28.5)³ ≈ 8.5% inner points.
     ⇒ Latency is what matters.
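     A quick back-of-the-envelope check of these numbers (a minimal sketch in C++; the core count and the 16 halo points per dimension are taken from the slide, nothing else is assumed):

        #include <cmath>
        #include <cstdio>

        int main ()
        {
          const double total = 1000.0 * 1000.0 * 1000.0;  // problem size 1000^3
          const int cores = 512 * 3 * 28;                 // 43008 cores
          const double n = std::cbrt (total / cores);     // ~28.5 points per dimension per core
          const double inner = (n - 16.0) / n;            // what remains after the halo of the 8th-order operator
          std::printf ("points per core and dimension: %.1f\n", n);
          std::printf ("inner points: %.1f%%\n", 100.0 * inner * inner * inner);
        }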

  4. Parallel programming: Reality.

     Server: write /global/config; broadcast "setup" to all clients.
     Client: on "setup": read /global/config.

     Some clients might fail: the distributed file system violates POSIX!?

     Server: write /global/config; fsync /global; broadcast "setup" to all clients.
     Client: on "setup": read /global/config.

     Stalls the cluster for minutes! Still some clients might fail, because their local view of the metadata is not updated.

     Server: write /global/config; broadcast "setup" to all clients.
     Client: on "setup": while (! exists /global/config): sleep 1; read /global/config.

     Typically works, so this is industrial production code.
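     The client side of the last variant as a minimal C++17 sketch (the path and the one-second poll come from the slide; the function name and everything else are assumptions for illustration):

        #include <chrono>
        #include <filesystem>
        #include <fstream>
        #include <iterator>
        #include <string>
        #include <thread>

        // Hypothetical "setup" handler: poll until the server's write is
        // visible on this node, then read the configuration.
        std::string on_setup ()
        {
          const std::filesystem::path config {"/global/config"};
          while (!std::filesystem::exists (config))
          {
            std::this_thread::sleep_for (std::chrono::seconds (1)); // sleep 1
          }
          std::ifstream in {config};
          return {std::istreambuf_iterator<char> {in}, {}};
        }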

  5. Parallel programming: Distribute data. Beginner.

     class EqualDistribution { size_t size_per_rank (); };

     EqualDistribution distribution (size, nProc);
     offset_t begin = iProc * distribution.size_per_rank (); // type error!?
     transfer (begin, distribution.size_per_rank ());

     BROKEN: 12 elements on 5 ranks ⇒ 3 elements per rank.
     ### ### ##? ##? ##?
     ⇒ There is no “size per rank”!

     class ContiguousDistribution
     {
       offset_t begin (rank_t);
       size_t size (rank_t i) = begin (i+1) - begin (i); // size_t operator- (offset_t, offset_t)
     };

     Discipline! Programmer’s discipline! Teacher’s discipline!
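     A runnable sketch of the ContiguousDistribution idea (class and member names from the slide; the balanced split begin(i) = i · size / nProc is an assumption, one common way to implement it):

        #include <cstddef>
        #include <cstdio>

        using rank_t = unsigned;
        using offset_t = std::size_t;

        class ContiguousDistribution
        {
          std::size_t _size; rank_t _nProc;
        public:
          ContiguousDistribution (std::size_t size, rank_t nProc)
            : _size (size), _nProc (nProc) {}
          // begin is monotone and begin(nProc) == size, so the ranges
          // [begin(i), begin(i+1)) cover every element exactly once.
          offset_t begin (rank_t i) const { return std::size_t (i) * _size / _nProc; }
          std::size_t size (rank_t i) const { return begin (i + 1) - begin (i); }
        };

        int main ()
        {
          ContiguousDistribution d (12, 5); // 12 elements on 5 ranks
          for (rank_t i = 0; i < 5; ++i)    // sizes 2,2,3,2,3: nothing out of bounds
            std::printf ("rank %u: [%zu, %zu)\n", i, d.begin (i), d.begin (i) + d.size (i));
        }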

  6. Parallel programming: Distributed termination detection. Junior.

     Each process i ∈ {0, …, P−1}: on init: tᵢ = rᵢ = 0; on send: tᵢ = tᵢ + 1; on recv: rᵢ = rᵢ + 1.

     Termination detection uses:

     bool Comm::messages_in_flight ()
     {
       long d = t - r; // note: signed!!
       long D;
       MPI_Allreduce (&d, &D, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
       return D != 0;
     }

     while (c.messages_in_flight ()) ... // global operation vs. resource utilization

     CORRECT! But does not scale! Also, it mixes data messages with control messages. (This is the state of the art in 2017.)
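     The same check as a self-contained sketch against the real MPI API (the counter updates are schematic; in a real run t and r change on every application send and receive):

        #include <mpi.h>

        struct Comm
        {
          long t = 0; // incremented on every send
          long r = 0; // incremented on every receive

          bool messages_in_flight () const
          {
            long d = t - r; // signed: a process may have received more than it sent
            long D = 0;
            MPI_Allreduce (&d, &D, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
            return D != 0; // sum is zero <=> every sent message was received
          }
        };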

  7. Parallel programming: Distributed termination detection. Professional.

     Solved, e.g., Friedemann Mattern: Algorithms for distributed termination detection, 1987.

     ATTENTION: Inconsistent cuts are possible!

     Termination detection at scale: asynchronously compute Σᵢ tᵢ and Σᵢ rᵢ over all i ∈ {0, …, P−1}, twice. Termination ⇔ all four values are equal.

     Library? Interface? Transformation? Language construct?
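     A sketch of the four-value idea (this illustrates the double count only, not Mattern’s full protocol; async_sum is a hypothetical non-blocking global reduction that folds in each process’s current counter):

        #include <cstdint>

        // Hypothetical asynchronous global sum, e.g. along a reduction tree;
        // it observes each process's counter at some point during the wave.
        std::uint64_t async_sum (std::uint64_t local);

        extern std::uint64_t t_local, r_local; // per-process send/receive counters

        bool terminated ()
        {
          // First wave: a (possibly inconsistent) cut of the global counts.
          const std::uint64_t T1 = async_sum (t_local);
          const std::uint64_t R1 = async_sum (r_local);
          // Second wave: if any message activity happened in between, the
          // counts have moved on and the four values cannot all agree.
          const std::uint64_t T2 = async_sum (t_local);
          const std::uint64_t R2 = async_sum (r_local);
          return T1 == R1 && T2 == R2 && T1 == T2;
        }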

  8. Parallel programming: GASPI/GPI: Notifications. Professional.

     Fine-grained synchronization: a remote notification attached to a single message.
     • write_notify (source, destination, notification)
     • waitsome (set of notifications)

     Structured stencil with double buffering:

     while (!done)
     {
       write_notify_to_all_neighbours ();
       compute_inner_region ();
       while (!all neighbour data received)
       {
         process (neighbour = wait_some (unprocessed neighbours));
       }
     }

     Communication and computation happen at the same time. Requires a lot of programming discipline. Synthesis!
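     For concreteness, one halo-exchange round against the GPI-2 C API (gaspi_write_notify, gaspi_notify_waitsome, and gaspi_notify_reset are real GPI-2 calls; the segment layout, the symmetric neighbour indexing, and the helper functions are assumptions, and error handling is omitted):

        #include <GASPI.h>

        void compute_inner_region ();   // hypothetical application hooks
        void process_boundary (int k);

        void exchange_and_compute (gaspi_segment_id_t seg, int n_neighbours,
                                   gaspi_rank_t *neighbour, gaspi_offset_t *src_off,
                                   gaspi_offset_t *dst_off, gaspi_size_t *halo_size)
        {
          // One write per neighbour; the notification travels with the data.
          for (int k = 0; k < n_neighbours; ++k)
          {
            gaspi_write_notify (seg, src_off[k], neighbour[k], seg, dst_off[k],
                                halo_size[k], /* notification id */ k,
                                /* value */ 1, /* queue */ 0, GASPI_BLOCK);
          }

          compute_inner_region (); // overlap: needs no halo data

          // Consume arrivals in whatever order they happen.
          for (int received = 0; received < n_neighbours; ++received)
          {
            gaspi_notification_id_t id;
            gaspi_notification_t value;
            gaspi_notify_waitsome (seg, 0, n_neighbours, &id, GASPI_BLOCK);
            gaspi_notify_reset (seg, id, &value);
            process_boundary (id); // the halo for this neighbour is complete
          }
        }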

  9. Parallel programming: GASPI/GPI: Notifications. Senior/Library writer.

     Zero-copy unstructured nearest neighbor:

     while (tile = lock_unlocked_and_ready_tile ())
     {
       process (tile);
       publish_progress (tile); // update ready flags of neighbors
       unlock (tile);
     }

     Task-based middleware:

     while (!done)
     {
       task = get_ready_task (); // blocking, maybe busy
       process (task);
       publish_progress (task); // might enable other tasks
     }

     Dynamic communication patterns. Debugging is often a nightmare. Interface design!
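     A minimal shared-memory sketch of such a middleware (all names hypothetical; a real implementation would also publish progress to remote ranks, e.g. via notifications):

        #include <atomic>
        #include <deque>
        #include <mutex>
        #include <vector>

        struct Task
        {
          std::atomic<int> unmet_deps {0}; // unfinished predecessors
          std::vector<Task*> successors;   // tasks waiting on this one
        };

        std::mutex ready_mutex;
        std::deque<Task*> ready_queue;     // tasks with unmet_deps == 0

        Task *get_ready_task ()            // non-blocking: caller retries ("maybe busy")
        {
          std::lock_guard<std::mutex> lock (ready_mutex);
          if (ready_queue.empty ()) return nullptr;
          Task *task = ready_queue.front ();
          ready_queue.pop_front ();
          return task;
        }

        void publish_progress (Task *task) // might enable other tasks
        {
          for (Task *succ : task->successors)
          {
            if (succ->unmet_deps.fetch_sub (1) == 1) // last dependency done
            {
              std::lock_guard<std::mutex> lock (ready_mutex);
              ready_queue.push_back (succ);
            }
          }
        }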

  10. Egor’s tool: Almost ready for the tool chain.

     Implements GPI on top of threads rather than processes.

     WARNING: ThreadSanitizer: data race (pid=4141)
       Read of size 4 at 0x7f42f5ffc024 by thread T4 (mutexes: write M101):
         #0 dump main.c:9 (exe+0x00000006b9bd)
         #1 main main.c:41 (exe+0x00000006be3f)
         #2 operator() /devel/src/gpi/gpi_detail/GlobalState.cpp:50 (exe+0x0000000787e4)
         #3 execute_native_thread_routine /src/gcc-4.8.1/x86_64-unknown-linux-gnu/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:84
       Previous write of size 1 at 0x7f42f5ffc027 by thread T38:
         #0 memcpy /src/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:577 (exe+0x000000028090)
         #1 operator() /devel/src/gpi/gpi_detail/GlobalState.cpp:188 (exe+0x0000000781c2)
         #2 gpi_detail::Executor::threadMain() /devel/src/gpi/gpi_detail/Executor.cpp:23 (exe+0x00000007f869)
         #3 execute_native_thread_routine /src/gcc-4.8.1/x86_64-unknown-linux-gnu/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:84

  11. Parallel programming: Intended data race: Weak minimum. Senior.

     Let M(t) = min over i ∈ {0, …, P−1} of fᵢ(t), where fᵢ is known only to process i and t is a point in time.
     To compute M(t) a barrier is required ⇒ not possible at scale.
     Note: the barrier latency is not the problem; the problem is the accumulation of imbalances.

     Additional knowledge: all fᵢ are strictly increasing ⇒ M(t) is strictly increasing.

     Easier to compute: a strictly increasing, eventually consistent weak minimum W(t) ≤ M(t):
     • Publish fᵢ(t) asynchronously. (Publish wave.)
     • Reduce all values upon request. (Reduction wave.)
     No synchronization between the waves ⇒ data race. The race is okay as long as fᵢ(t) is read/written atomically.

     Latency stays the same, but the work can be done asynchronously, and therefore the imbalances are smeared out.

     Detect the race, then prove the algorithm is correct with the race!
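     A minimal single-node sketch of the two waves with std::atomic (names and the fixed process count are hypothetical; on a cluster the publish wave would be a one-sided remote write into a table visible to the reducer):

        #include <algorithm>
        #include <atomic>
        #include <cstdint>

        constexpr int P = 64;                    // hypothetical process count
        std::atomic<std::uint64_t> published[P]; // one slot per process

        // Publish wave: process i posts its current, strictly increasing value.
        // Relaxed atomics make each read/write of f_i(t) indivisible, which is
        // all the correctness argument needs.
        void publish (int i, std::uint64_t f_i)
        {
          published[i].store (f_i, std::memory_order_relaxed);
        }

        // Reduction wave: no synchronization with the publishers. Every slot
        // may be stale, never from the future, so the result is <= M(t) and
        // only grows between calls: an eventually consistent weak minimum.
        std::uint64_t weak_minimum ()
        {
          std::uint64_t w = UINT64_MAX;
          for (auto const &slot : published)
            w = std::min (w, slot.load (std::memory_order_relaxed));
          return w;
        }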

  12. Parallel programming: Alltoall. Library writer.

     @iProc:
       forall other ranks:
         async_write (data[other] -> other);
       wait_for_local_completion ();
       barrier ();
       work (received_data);

     BROKEN: local completion plus barrier ⇒ all data has been sent.
     Unknown whether or not the data has been received!
     Works fine on InfiniBand (non-overtaking) but fails on Cray and TCP/Ethernet.

  13. Olaf’s tool: Not ready for the tool chain.

  14. Parallel programming: Alltoall.

     @iProc:
       forall other ranks:
         async_write_notify (data[other] -> other, notify: data from iProc);
       outstanding_messages = nProc;
       while (outstanding_messages --> 0)
         sender = wait_for_notification ();
         partial_work (received_data[sender]);
       work (received_data);

     • CORRECT. No wait. No barrier. Better overlap. Partial work possible.
     • nProc many notifications → if a memory scaling issue kicks in, trade it for latency.
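     A sketch of this pattern with the same GPI-2 calls as on slide 8 (offsets are schematic placeholders assuming one fixed-size chunk per rank; partial_work is hypothetical, and each rank uses its own rank number as the notification id so the receiver can tell who sent):

        #include <GASPI.h>

        void partial_work (gaspi_notification_id_t sender); // hypothetical hook

        void alltoall (gaspi_segment_id_t seg, gaspi_rank_t iProc,
                       gaspi_rank_t nProc, gaspi_size_t chunk)
        {
          // Send chunk `other` of the local segment to rank `other`;
          // notification id = iProc identifies the sender on the remote side.
          for (gaspi_rank_t other = 0; other < nProc; ++other)
          {
            if (other == iProc) continue;
            gaspi_write_notify (seg, /* local offset */ other * chunk, other,
                                seg, /* remote offset */ iProc * chunk, chunk,
                                /* notification id */ iProc, /* value */ 1,
                                /* queue */ 0, GASPI_BLOCK);
          }

          // Consume notifications as they arrive: no barrier, partial work.
          for (gaspi_rank_t outstanding = nProc - 1; outstanding > 0; --outstanding)
          {
            gaspi_notification_id_t sender;
            gaspi_notification_t value;
            gaspi_notify_waitsome (seg, 0, nProc, &sender, GASPI_BLOCK);
            gaspi_notify_reset (seg, sender, &value);
            partial_work (sender); // the data from rank `sender` is complete
          }
        }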

  15. Parallel programming: Reality. Undergraduate using tools.

     [Plot: gitfan, parallel efficiency normalized to 4 nodes (y axis 0.75–1.25); x axis: #nodes x #threads, from 4x16 up to 48x16.]

     Legacy symbolic linear algebra is now parallel. Tools help!
