X10: New opportunities for Compiler-Driven Performance via a new Programming Model

Kemal Ebcioglu, Vijay Saraswat, Vivek Sarkar
IBM T.J. Watson Research Center
{kemal,vsaraswat,vsarkar}@us.ibm.com

Compiler-Driven Performance Workshop --- CASCON 2004
Oct 6, 2004

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.
Acknowledgments

• Contributors to X10 design & implementation ideas:
  − David Bacon
  − Bob Blainey
  − Philippe Charles
  − Perry Cheng
  − Julian Dolby
  − Kemal Ebcioglu
  − Guang Gao (U Delaware)
  − Allan Kielstra
  − Robert O'Callahan
  − Filip Pizlo (Purdue)
  − Christoph von Praun
  − V.T. Rajan
  − Lawrence Rauchwerger (Texas A&M)
  − Vijay Saraswat (contact for lang. spec.)
  − Vivek Sarkar
  − Mandana Vaziri
  − Jan Vitek (Purdue)
• IBM PERCS Team members:
  − Research
  − Systems & Technology Group
  − Software Group
  − PI: Mootaz Elnozahy
• University partners:
  − Cornell
  − LANL
  − MIT
  − Purdue University
  − RPI
  − UC Berkeley
  − U. Delaware
  − U. Illinois
  − U. New Mexico
  − U. Pittsburgh
  − UT Austin
  − Vanderbilt University
Performance and Productivity Challenges facing Future Large-Scale Systems

1) Memory wall: severe non-uniformities in bandwidth & latency in the memory hierarchy
2) Frequency wall: multiple layers of hierarchical, heterogeneous parallelism to compensate for the slowdown in frequency scaling
3) Scalability wall: software will need to deliver ~10^5-way parallelism to utilize large-scale parallel systems

[Figure: the parallelism/memory hierarchy, from clusters (scale-out) through SMPs (Proc Clusters with PEs, L1 $, L2/L3 caches, memory), multiple cores on a chip, coprocessors (SPUs), SMTs, SIMD, and ILP.]
IBM PERCS Project (Productive Easy-to-use Reliable Computing Systems)

Goals: increase overall productivity --- increase development productivity, increase the number of applications written, increase performance of applications, increase execution productivity.

[Figure: the PERCS software stack]
• PERCS Programming Tools: performance-guided parallelization and transformation, static & dynamic checking, separation of concerns --- all integrated into a single development environment (Eclipse)
• PERCS Programming Model (alongside MPI and OpenMP): static and dynamic compilers for base languages with programming-model extensions
  − Mature languages: C/C++, Fortran, Java
  − Experimental languages: X10, UPC, StreamIt, HTA/Matlab
• Language Runtime + Dynamic Compilation + Continuous Optimization
• PERCS System Software (K42)
• PERCS System Hardware
Limitations in exploiting Compiler-Driven Performance in Current Parallel Programming Models

• MPI: local memories + message-passing
  − Parallelism, locality, and "global view" are completely managed by programmer
  − Communication, synchronization, consistency operations specified at low level of abstraction
  ⇒ Limited opportunities for compiler optimizations
• Java threads, OpenMP: shared-memory parallel programming model
  − Uniform symmetric view of all shared data
  − Non-transparent performance --- programmer cannot manage data locality and thread affinity at different hierarchy levels (cluster, SMT, ...)
  ⇒ Limited effectiveness of compiler optimizations
• HPF, UPC: partitioned global address space + SPMD execution model
  − User specifies data distribution & parallelism; compiler generates communications using owner-computes rule
  − Large overheads in accessing shared data; compiler optimizations can help applications with simple data access patterns
  ⇒ Limited applicability of compiler optimizations
X10 Design Guidelines: Design for Productivity & Compiler/Runtime-driven Performance

• Support common parallel programming idioms
  − Data parallelism
  − Control parallelism
  − Divide-and-conquer
  − Producer-consumer / streaming
  − Message-passing
• Raise level of abstraction for constructs that should be amenable to optimized implementation (see the sketch below)
  − Monitors → atomic sections
  − Threads → async activities
  − Barriers → clocks
• Introduce new constructs to model hierarchical parallelism and non-uniform data access
  − Places
  − Distributions
• Start with state-of-the-art OO language primitives as foundation
  − No gratuitous changes
  − Build on existing skills
• Ensure that every program has a well-defined semantics
  − Independent of implementation
  − Simple concurrency model & memory model
• Defer fault tolerance and reliability issues to lower levels of system
  − Assume tightly-coupled system with dedicated interconnect
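A minimal sketch of the monitor/thread/barrier mapping above, written in the X10-style notation used later in this deck (async, atomic, clocks, here). The counter field and the clock c are illustrative placeholders, not taken from any language specification, and the exact syntax was still evolving in 2004.

    // Java today: a thread entering a monitor
    //   synchronized (this) { counter++; }
    // X10 sketch: a lightweight async activity running an unconditional atomic section
    async (here) atomic { counter++; }   // 'counter' is a hypothetical place-local field

    // Java today: barrier.await() across a group of threads
    // X10 sketch: activities registered on clock c advance phases together
    next c;                              // advance to the next clock phase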
Logical View of X10 Programming Model (Work in progress)

[Figure: each place holds a slice of the partitioned global heap, a place-local heap, and a set of activities, each with its own activity-local heap and stack; inbound/outbound async requests and replies flow between places; value class instances can live anywhere. Granularity of a place can range from a single h/w thread to an entire scale-up system.]

• Place = collection of resident activities and data
  − Maps to a data-coherent unit in a large scale system
• Four storage classes:
  − Partitioned global
  − Place-local
  − Activity-local
  − Value class instances (can be copied/migrated freely)
• Activities can be created by
  − async statements (one-way msgs)
  − future expressions
  − foreach & ateach constructs
• Activities are coordinated by
  − Unconditional atomic sections
  − Conditional atomic sections
  − Clocks (generalization of barriers)
  − Force (for result of future)

(These constructs are combined in the sketch below.)
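A small sketch tying these pieces together, assuming the 2004-era notation used on the surrounding slides (async, future, force/!, clocks); the array a and the clock c are illustrative.

    // One-way activity: run an atomic update at the place that owns a[i]
    async (a[i]) atomic { a[i] += 1; }

    // Async expression: evaluate at the owning place, returning a future (^int);
    // forcing it (force f, or !f) suspends the caller until the value is known
    ^int f = future (a[j]) { a[j] * 2 };
    int v = !f;

    // Clock coordination: every activity registered on clock c waits at 'next c'
    // until all of them reach it (a generalization of a barrier)
    next c;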
Async activities: abstraction of threads

• Async statement
  − async(P){S}: run S at place P
  − async(D){S}: run S at the place containing datum D
  − S may contain local atomic operations or additional async activities for same/different places
• Async expression (future)
  − F = future(P){E}, or F = future(D){E}: return the value of expression E, evaluated in place P (or the place containing datum D)
  − force F or !F: suspend until value is known

• Example: percolate process to data.

    public void put(K key, V value) {
      int hash = key.hashCode() % D.size;
      async (D[hash]) {
        for (_ b = buckets[hash]; b != null; b = b.next) {
          if (b.k.equals(key)) {
            b.v = value;
            return;
          }
        }
        buckets[hash] =
          new Bucket<K,V>(key, value, buckets[hash]);
      }
    }

• Example: percolate data to process.

    public ^V get(K key) {
      int hash = key.hashCode() % D.size;
      return future (D[hash]) {
        for (_ b = buckets[hash]; b != null; b = b.next) {
          if (b.k.equals(key)) {
            return b.v;
          }
        }
        return new V();
      };
    }

Distributed hash-table example (a hypothetical caller is sketched below)
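A hypothetical caller for the hash table above, in the same notation: put percolates the update to the bucket's place as a one-way async, while get returns a future that the caller forces only when it actually needs the value. The variable names are illustrative.

    table.put(key, value);     // fire-and-forget: runs at the place owning bucket[hash]

    ^V f = table.get(key);     // spawns the lookup remotely, returns immediately
    // ... other local work can overlap with the remote lookup ...
    V v = !f;                  // force the future (equivalently: force f); suspends
                               // the current activity until the lookup completes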
RandomAccess (GUPS) example

    public void run(int a[] blocked, int seed[] cyclic, int value smallTable[]) {
      ateach (start : seed clocked c) {
        int ran = start;
        for (int count : 1..N_UPDATES/place.MAX_PLACES) {
          ran = Math.random(ran);
          int j = F(ran);               // function F() can be in C/Fortran
          int k = smallTable[g(ran)];
          async (a[j]) atomic { a[j] ^= k; }
        } // for
      } // ateach
      next c;
    }
Regions and Distributions

• Regions
  − The domain of some array; a collection of array indices
  − region R = [0..99];
  − region R2 = [0..99,0..199];
• Region operators
  − region Intersect = R3 && R4;
  − region Union = R3 || R4;
  − Etc.
• Distributions
  − Map region elements to places
    • distribution D = cyclic(R);
  − Domain and range restriction:
    • distribution D2 = D | R;
    • distribution D3 = D | P;
• Regions/Distributions can be used like type and place parameters (see the sketch below)
  − <region R, distribution D> void m(...)
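Putting the two columns together --- a sketch, assuming the declaration syntax above plus the T[D] array types and the ateach construct from the next slide; the array A, its assumed allocation syntax, and the zero-initialization body are illustrative.

    region R = [0..99, 0..199];           // 2-D collection of array indices
    distribution D = cyclic(R);           // map each index of R to a place, round-robin
    distribution D2 = D | [0..99, 0..99]; // domain restriction to a sub-region

    int[D] A = new int[D];                // assumed allocation syntax: each element of A
                                          // lives at the place D maps its index to

    ateach (i : D.region)                 // one activity per index, at the owning place
        A[i] = 0;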
ArrayCopy: an example of high-level optimization of async activities

Version 1 (original):

    <value T, D, E> public static void
    arrayCopy( T[D] a, T[E] b) {
      // Spawn an activity for each index to
      // fetch and copy the value
      ateach (i : D.region)
        a[i] = async b[i];
      next c;  // Advance clock
    }

Version 2 (optimized):

    <value T, D, E> public static void
    arrayCopy( T[D] a, T[E] b) {
      // Spawn one activity per place
      ateach ( D.places )
        for ( j : D | here )
          a[j] = async b[j];
      next c;  // Advance clock
    }

Version 3 (further optimized):

    <value T, D, E> public static void
    arrayCopy( T[D] a, T[E] b) {
      // Spawn one activity per D-place and one
      // future per place p to which E maps an
      // index in (D | here).
      ateach ( D.places ) {
        region LocalD = (D | here).region;
        ateach ( p : E[LocalD] ) {
          region RemoteE = (E | p).region;
          region Common = LocalD && RemoteE;
          a[Common] = async b[Common];
        }
      }
      next c;  // Advance clock
    }