Lessons Learned from Porting the MiniAero Application to Charm++

David S. Hollman, Janine Bennett (PI), Jeremiah Wilke (Chief Architect), Ken Franko, Hemanth Kolla, Paul Lin, Greg Sjaardema, Nicole Slattengren, Keita Teranishi, Nikhil Jain, Eric Mikida

May 7, 2015


What is MiniAero?

MiniAero is a proxy app illustrating the common computation and communication patterns in unstructured mesh codes of interest to Sandia:
- 3D, unstructured, finite volume computational fluid dynamics code
- Uses Runge-Kutta fourth-order time marching
- Has options for first and second order spatial discretization
- Includes inviscid Roe and viscous Newtonian flux options
- Baseline application is about 3800 lines of C++ code using MPI and Kokkos
- Very little task parallelism; mostly a data parallel problem
- Communication: ghost exchanges, unstructured mesh

Background and Status

Porting MiniAero to Charm++ began with a "bootcamp":
- March 9-12, 2015
- Led by Nikhil Jain and Eric Mikida
- About 10 Sandia scientists in attendance

Since the workshop, we've had one scientist working 50% time on the port and a couple of others working 10-20% time.

Current state of the code:
- running and passes test suite
- most immediately apparent optimizations done
- SMP version does not work (Kokkos incompatibility)

Outline

- Introduction
- The Process: Porting an Explicit Aerodynamics Miniapp to the Chare Model
- What was easy?
- What was harder?
- Preliminary Results and Performance
- Next Steps

What was easy?

Load balancing (synchronous):

    // at the end of the timestep...
    if (doLoadBalance && timestepCounter % loadBalanceInterval == 0) {
      serial {
        // Do the actual load rebalancing
        this->AtSync();
      }
      // Called when load balancing is completed (required)
      when ResumeFromSync() { }
    }

Checkpointing:
- To disk: CkStartCheckpoint(...)
- In memory (to partner node): CkStartMemCheckpoint(...)
- Must be done synchronously

Both of these were key features we wanted to test in AMT runtimes, and both were done essentially on the first day of coding. Both of these require only serialization on the user side.

MPI ⇒ Charm++ is relatively straightforward

"Quick start" implementation: map one chare to one MPI process.

Just like in MPI, data dependencies are expressed in terms of messages:
- sends become message-like function calls on proxies of array members
- receives become when clauses

Before (MPI):

    class OldStuffDoer {
      void do_stuff() {
        /* ... */
        generate_data();
        MPI_Irecv(data, n_send, MPI_DOUBLE, partner, /*...*/ );
        MPI_Send(other_data, n_recv, MPI_DOUBLE, partner, /*...*/ );
        /* ... */
        use_other_data();
      }
    };

After (Charm++ .ci file):

    array [ 1D ] NewStuffDoer {
      entry void receive_data( int src, int ndata, double data[ndata]);
      entry void do_stuff_1() {
        /* ... */
        generate_data();
        thisProxy[partner].receive_data( n_send, data );
      };
      entry void do_stuff_2() {
        when receive_data( int partner, int n_send, double * other_data) serial {
          memcpy(other_data_, data, n_send * sizeof ( double ));
          use_other_data();
        }
      };
    };

"Gotchas":
- static variables
- conditional communication
- a lot of size and metadata communication setup can be skipped

MPI ⇒ Charm++ is relatively straightforward... at first!

Is this the best approach for our workloads, or does it lead to unnecessary synchronizations being left over from the MPI version?

Two approaches:
- "Bottom up": Map sends and receives to function calls and whens
- "Top down": Think about task structure and dependencies of code, write this into the .ci file

Clearly, the "top down" approach will lead to better, more efficient code in most cases, but for production code a complete "top down" overhaul is completely impractical. Is there a good middle ground?

The "bottom up"-ness vs. "top down"-ness of the approach should be assessed before writing too much code (in any porting project).

What was harder: Kokkos integration and templated code

Kokkos is a performance portability layer aimed primarily at on-node parallelism:
- handles memory layout and loop structure to produce optimized kernels on multiple devices
- Application developer implements generic code; the Kokkos library implements device-specific specializations

MiniAero was originally written in "MPI+Kokkos".

What happens when you need to write templated code that uses Kokkos? Explicitly listing all specializations can get out of hand quickly. For instance...

    template < typename Device>
    struct ddot {
      const Kokkos::View<Device>& A, B;
      double result;
      ddot(
        const Kokkos::View<Device>& A_in,
        const Kokkos::View<Device>& B_in
      ) : A(A_in), B(B_in), result(0)
      { }
      inline void operator ()( int i) {
        result += A(i) * B(i);
      }
    };

    void do_stuff() {
      /* ... */
      Kokkos::parallel_for(
        num_items,
        ddot<Kokkos::Cuda>(v1, v2)
      );
    }

  50. Explicitly listjng all specializatjons can get out of hand quickly. For instance… What happens when you need to write templated code that uses Kokkos? const Kokkos::View<Device>& B_in double result; ddot( const Kokkos::View<Device>& A_in, const Kokkos::View<Device>& A, B; inline void operator ()( int i) { } /* ... */ Kokkos::parallel_for( num_items, ddot<Kokkos::Cuda>(v1, v2) ); { } What was harder: Kokkos integratjon and templated code Kokkos is a performance portability 1 template < typename Device> layer aimed primarily at on-node 2 struct ddot { parallelism 3 4 handles memory layout and loop 5 structure to produce optjmized 6 kernels on multjple devices 7 8 Applicatjon developer implements 9 ) : A(A_in), B(B_in), result(0) generic code, Kokkos library 10 11 implements device-specific 12 specializatjons 13 result += A(i) * B(i); 14 MiniAero was originally writuen in 15 }; “MPI+Kokkos” 16 17 void do_stuff() { 18 19 20 21 22 23 } May 7, 2015 11

  51. Explicitly listjng all specializatjons can get out of hand quickly. For instance… const Kokkos::View<Device>& A_in, ); ddot<Kokkos::Cuda>(v1, v2) num_items, Kokkos::parallel_for( /* ... */ } inline void operator ()( int i) { { } const Kokkos::View<Device>& B_in ddot( double result; const Kokkos::View<Device>& A, B; What was harder: Kokkos integratjon and templated code Kokkos is a performance portability 1 template < typename Device> layer aimed primarily at on-node 2 struct ddot { parallelism 3 4 handles memory layout and loop 5 structure to produce optjmized 6 kernels on multjple devices 7 8 Applicatjon developer implements 9 ) : A(A_in), B(B_in), result(0) generic code, Kokkos library 10 11 implements device-specific 12 specializatjons 13 result += A(i) * B(i); 14 MiniAero was originally writuen in 15 }; “MPI+Kokkos” 16 17 void do_stuff() { What happens when you need to 18 write templated code that uses 19 Kokkos? 20 21 22 23 } May 7, 2015 11

  52. const Kokkos::View<Device>& A_in, const Kokkos::View<Device>& A, B; ); ddot<Kokkos::Cuda>(v1, v2) num_items, Kokkos::parallel_for( /* ... */ } inline void operator ()( int i) { { } const Kokkos::View<Device>& B_in ddot( double result; What was harder: Kokkos integratjon and templated code Kokkos is a performance portability 1 template < typename Device> layer aimed primarily at on-node 2 struct ddot { parallelism 3 4 handles memory layout and loop 5 structure to produce optjmized 6 kernels on multjple devices 7 8 Applicatjon developer implements 9 ) : A(A_in), B(B_in), result(0) generic code, Kokkos library 10 11 implements device-specific 12 specializatjons 13 result += A(i) * B(i); 14 MiniAero was originally writuen in 15 }; “MPI+Kokkos” 16 17 void do_stuff() { What happens when you need to 18 write templated code that uses 19 Kokkos? 20 21 Explicitly listjng all specializatjons 22 can get out of hand quickly. For 23 } instance… May 7, 2015 11

  53. Template specialization explosion

The MiniAero solver has five different ghost exchanges. Each communicates a different Kokkos::View type:

```cpp
Kokkos::View<Device, double *,5>   m_data1;
Kokkos::View<Device, double *,5,3> m_data2;
Kokkos::View<Device, int *>        m_data3;
/* etc... */
```

so we want an entry method prototype that looks something like this:

```cpp
template <typename ViewType>
entry [local] void receive_ghost_data(ViewType& v);
```

The solver chare is already parameterized on the Kokkos device type:

```cpp
/* solver.h */
template <typename Device>
class RK4Solver
  : public CBase_RK4Solver<Device>
{
  /* ... */
};
```

```cpp
/* solver.ci */
template <typename Device>
array [1D] RK4Solver {
  /* ... */
};
```

The devices we'd like to test include Kokkos::Serial, Kokkos::Threads, Kokkos::Cuda, and Kokkos::OpenMP. That already leads to 20 different explicit signatures for receive_ghost_data(). May 7, 2015 12

  62. Template specialization: our workaround

Pattern: templated setup, non-templated entry method, templated cleanup.

```cpp
/* comm_stuff.h */
template <typename Device>
class CommStuffDoer : public CBase_CommStuffDoer<Device> {
  Kokkos::View<Device, double *,3> my_data_1_;
  Kokkos::View<Device, int *,3,5>  my_data_2_;
  /* ... */
  std::vector<double*> recv_buffers_;

  template <typename ViewT>
  void send_it(int dst, const ViewT& data) {
    size_t size = get_size(data, dst);
    double* buf = extract_data(data, dst);
    this->thisProxy[dst].recv_it(this->thisIndex, size, buf);
  }

  template <typename ViewT>
  void setup_recv(int src, ViewT& data) {
    recv_buffers_[src] = get_buffer(data, src);
  }

  template <typename ViewT>
  void finish_recv(int src, ViewT& data) {
    insert_data(data, recv_buffers_[src], src);
    delete recv_buffers_[src];
  }
};
```

```cpp
/* comm_stuff.ci */
template <typename Device>
array [1D] CommStuffDoer {
  entry void recv_it(int src, int size, double data[size]);
  entry void do_recv_done();

  entry [local] void do_recv(int src) {
    when recv_it[src](int s, int size, double data[size]) serial {
      memcpy(recv_buffers_[src], data, size*sizeof(double));
      do_recv_done();
    }
  };

  entry void do_stuff() {
    /* ... */
    serial {
      int src = /*...*/, dest = /*...*/;
      send_it(dest, my_data_1_);
      setup_recv(src, my_data_1_);
      do_recv(src);
    }
    when do_recv_done() serial {
      finish_recv(src, my_data_1_);
    }
  };
};
```

May 7, 2015 13

  75. template < typename Device> size*sizeof( double )); /* comm_stuff.ci */ }; array [ 1D ] CommStuffDoer { entry void recv_it( int src, int size, double data[size]); entry void do_recv_done(); entry [ local ] void do_recv( int src) { when recv_it[src]( int s, int size, double data[size]) serial { memcpy(recv_buffers[src], data, do_recv_done(); } } }; entry void do_stuff() { /* ... */ serial { int src = /*...*/ , dest = /*...*/ ; send_it(dest, my_data_1_); setup_recv(src, my_data_1_); do_recv(src); } when do_recv_done() serial { }; delete recv_buffers_[src]; } void send_it( int dst, const ViewT& data) { /* comm_stuff.h */ template < typename Device> class CommStuffDoer : public CBase_CommStuffDoer<Device> { Kokkos::View<Device, double *,3> my_data_1_; Kokkos::View<Device, int *,3,5> my_data_2_; /* ... */ std::vector< double *> recv_buffers_; template < typename ViewT> size_t size = get_size(data, dst); insert_data(data, recv_buffers_[src], src); double * data = extract_data(data, dst); this ->thisProxy[dst].recv_it( this ->thisIndex, size, data); } template < typename ViewT> void setup_recv( int src, ViewT& data) { }; get_buffer(data, src); } template < typename ViewT> void finish_recv( int src, ViewT& data) { finish_recv(src, my_data_1_); Template specializatjon: our workaround Patuern: templated setup, non-templated entry method, templated cleanup Is this ideal? Obviously not recv_buffers_[src] = May 7, 2015 13

  76. Template specialization: our workaround (same slide, continued). Is this typical of the effort required to make templated code work with an asynchronous many-task runtime system (AMT RTS)? Maybe.

  77. Template specialization: our workaround (same slide, continued). Does Charm++ even support templated entry methods inside templated chares? (We couldn't figure out how to do it.)

  78. Distinguishing Entry Methods from Regular Method Calls

Suppose all of the do_stuff_*() methods are ordinary, non-entry methods:

entry void do_stuff() {
  serial {
    do_stuff_1();
    do_stuff_2();
  }
};

What happens first?

Now suppose do_stuff_1() is an entry method and do_stuff_2() is a normal method. Now what happens first? How does the programmer who didn't write do_stuff_1() know this?

Perhaps using naming conventions? (e.g., EM_*()):

entry void EM_do_stuff() {
  serial {
    EM_do_stuff_1();
    do_stuff_2();
  }
};

In short, mixing entry method calls and regular method calls without using naming conventions makes it difficult to write self-documenting code.

May 7, 2015 14

