Current Architectures for Parallel Processing

"With the development of new kinds of equipment of greater capacity, and particularly of greater speed, it is almost certain that new methods will have to be developed in order to make the fullest use of this new equipment. It is necessary not only to design machines for the mathematics, but also to develop a new mathematics for the machines."
Douglas Rayner Hartree, 1952

António Lira Fernandes
MICEI'0304
Universidade do Minho

Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
• Programming Models
• Top500

Parallel Computing - What is it?
• Parallel computing is when a program uses concurrency to either:
—decrease the runtime for the solution to a problem;
—increase the size of the problem that can be solved.
• Parallel computing gives you more performance to throw at your problems.
Current Parallel Approaches
• Single computers – multiple processing elements
—tightly-coupled systems (SMP & ccNUMA)
• Clusters
—loosely-coupled systems
• Fusion of the two
—hybrid-coupled systems (Super Clusters)

Taxonomy of Parallel Processor Architectures (Flynn)
• Four architectures:
—SISD
—SIMD
—MIMD (tightly coupled)
—MIMD (loosely coupled)
[Diagram legend: IS = instruction stream, DS = data stream, CU = control unit, MU = memory unit, PU = processing unit, LM = local memory, PE = processor element]

Outline
• Introduction
• Taxonomy
• Memory Models
—Shared
—Distributed
• Bus / Interconnection
• Programming Models
• Top500
Memory Models
• Distributed memory
• Shared memory
—Uniform Memory Access (UMA)
—Non-Uniform Memory Access (NUMA) – distributed shared memory
[Figure: UMA vs. NUMA shared-memory organizations]

Memory Models
• Why a NUMA architecture?
—the UMA system bus gets saturated (if there is too much traffic)
—the UMA crossbar gets too complex (too expensive)
—the UMA architecture does not scale beyond a certain level
• Typical NUMA problems
—high synchronization costs (of the subsystem interconnect)
—high memory access latencies (sometimes not)
—might need memory-sensitive strategies (see the allocation sketch below)

Outline
• Introduction
• Taxonomy
• Memory Models
—Shared
—Distributed
• Bus / Interconnection
• Programming Models
• Top500

Memory Models
• Interconnected "von Neumann" computers by Ethernet, Myrinet, FDDI, ATM
• Distributed memory, e.g. Summit Beowulf
• Heterogeneous mixture of processors
• Less expensive
• LANs and WANs are also being used, but the communication costs are higher.
[Figure: a cluster]
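As a concrete illustration of what a "memory-sensitive strategy" can mean on a NUMA machine, here is a minimal sketch, not from the original deck, that places each buffer on a chosen NUMA node so that work running on that node touches local rather than remote memory. It assumes a Linux system with the libnuma library (compile with -lnuma); node numbers and buffer sizes are illustrative only.

/* NUMA-aware allocation sketch: place each buffer on its own node.
 * Assumes Linux + libnuma (build: gcc numa_sketch.c -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {            /* -1 means no NUMA support here */
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int nodes = numa_max_node() + 1;       /* number of NUMA nodes */
    size_t bytes = 1 << 20;                /* 1 MiB per node, illustrative */

    for (int node = 0; node < nodes; node++) {
        /* Request memory bound to this node; pages end up on 'node',
         * so threads pinned there see low-latency local accesses. */
        double *buf = numa_alloc_onnode(bytes, node);
        if (buf == NULL) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            return EXIT_FAILURE;
        }
        buf[0] = 42.0;                     /* first touch: pages are placed now */
        printf("allocated %zu bytes on node %d\n", bytes, node);
        numa_free(buf, bytes);
    }
    return EXIT_SUCCESS;
}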
Clusters - Beowulf
• The first cluster, Beowulf, was developed in 1994 by Thomas Sterling and Don Becker, NASA researchers.
• Total performance: 60 Mflops.
• 16 nodes with the following configuration:
—486DX4 100 MHz (performance: 4.5 Mflops);
—256 KB cache;
—16 MB RAM;
—540 MB hard disk;
—Ethernet network.

NUMA vs. cluster computing
• NUMA can be viewed as a very tightly coupled form of cluster computing.
• Using a cluster architecture, NUMA can be implemented entirely in software.

Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
—Bus
–Time shared or common bus
–Multiport memory
–Central control unit
—Interconnection
• Programming Models
• Top500

Time Shared Bus
• Simplest form
• Structure and interface similar to a single-processor system
• Following features provided:
—Arbitration - any module can be temporary bus master
—Time sharing - if one module has the bus, others must wait and may have to suspend
• Now have multiple processors as well as multiple I/O modules
Shared Bus

Multiport Memory
• Direct independent access of memory modules by each processor
• Logic required to resolve conflicts
• Little or no modification to processors or modules required
• Advantages and disadvantages...

Multiport Memory Diagram

Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
—Bus
—Interconnection
–Static
–Dynamic
• Programming Models
• Top500
Static Interconnection
• Cube
• Mesh (Intel Paragon)
• Tree (Thinking Machines CM-5)

Dynamic Interconnection
• Paths are established as needed
—Bus based (SGI Power Challenge)
—Crossbar
—Multistage networks

Bus vs. Network
(source: João Luís Sobral, 2002)

The Hardware is in great shape
(source: Tim Mattson)
Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
• Programming Models
• Top500

Programming Models
• Message-passing (PVM, MPI)
—Individual processes exchange messages
—Works on clusters and on parallel computers (topology transparent to the user)
—Manual transformation to parallel code
• Threading (OpenMP/threads)
—Efficient only on shared-memory systems
—One process (environment), multiple threads
—Cheap, implicit communication
—Different scheduling approaches
—Limited (semi-)automatic transformation to parallel code

Writing a parallel application

Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
• Programming Models
—Message-passing (PVM, MPI)
—Threading (OpenMP/threads)
• Top500
MPI
• MPI 1 (1994) and later MPI 2 (1997) are designed as a communication API for multi-processor computers.
• Passing messages between processes.
• Implemented using the communication library of the machine's vendor.
• Adds an abstraction level between the user and this vendor library, to guarantee the portability of the program code.
• Works on heterogeneous workstation clusters.
• High-performance communication on large multiprocessors.
• Rich variety of communication mechanisms (see the send/receive sketch below).

MPI
• Pros:
—Very portable
—Requires no special compiler
—Requires no special hardware, but can make use of high-performance hardware
—Very flexible - can handle just about any model of parallelism
—No shared data! (You don't have to worry about processes "treading on each other's data" by mistake.)
—Free libraries can be downloaded (Linux PC)
—Forces you to decompose your problem
• Cons:
—All-or-nothing parallelism (difficult to incrementally transform existing serial codes to parallel)
—No shared data - requires distributed data structures
—Could be thought of as the assembler of parallel computing - you generally have to write more code
—Partitioning operations on distributed arrays can be messy

Outline
• Introduction
• Taxonomy
• Memory Models
• Bus / Interconnection
• Programming Models
—Message-passing (PVM, MPI)
—Threading (OpenMP/threads)
• Top500

OpenMP
• Is an API for multithreaded applications.
—A set of compiler directives, library routines and environment variables.
• The initial specification covered basic loop-based parallelism in Fortran (77 and up), C, and C++.
• Uses the fork-join model of parallel execution.
• Usually used to parallelize the most time-consuming loops.
• Threads communicate by sharing variables.
• To control race conditions we use synchronization to protect data conflicts. (Synchronization is expensive, so change how data is accessed where possible.)
• Is available for a variety of platforms (see the loop sketch below).
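The message-passing style described on the MPI slides above can be made concrete with a short example. This is a minimal sketch, not part of the original deck: it assumes an MPI implementation such as MPICH or Open MPI, is built with mpicc, and is launched with mpirun -np 2.

/* Minimal MPI point-to-point sketch: rank 0 sends one integer to rank 1.
 * Build: mpicc mpi_sketch.c -o mpi_sketch   Run: mpirun -np 2 ./mpi_sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                       /* start the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* which process am I?    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* how many processes?    */

    if (rank == 0 && size > 1) {
        int value = 42;
        /* No shared data: the value travels in an explicit message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();                               /* shut down cleanly      */
    return 0;
}

Every process runs the same program and its rank decides its role; this is exactly the "no shared data, explicit communication" model listed in the pros and cons above.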
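Likewise, a minimal OpenMP sketch of the loop-based, shared-variable style just described (again not from the deck; it assumes a compiler with OpenMP support, e.g. gcc -fopenmp, and the array size is arbitrary):

/* Minimal OpenMP sketch: parallelize a time-consuming loop over shared data.
 * Build: gcc -fopenmp omp_sketch.c -o omp_sketch */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Fork: the master thread spawns a team and the iterations are split
     * among the threads. 'a' and 'sum' are shared, 'i' is private to each
     * thread, and the reduction clause avoids a race condition on 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;
        sum += a[i];
    }
    /* Join: only the master thread continues past the parallel region. */

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}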
Fork-Join Parallelism
• The master thread spawns a team of threads as needed.
• Parallelism is added incrementally: the sequential program evolves into a parallel program.

OpenMP
• Pros:
—Incremental parallelism - existing serial codes can be transformed to parallel one bit at a time
—Quite simple set of directives
—Shared data
—Partitioning operations on arrays is very simple
• Cons:
—Requires proprietary compilers
—Requires shared-memory multiprocessors
—Shared data
—Having to think about which data is shared and which data is private
—Generally not as scalable (more synchronization points)
—Not well suited for non-trivial data structures like linked lists, trees, etc.

MPI vs OpenMP
• Pure MPI
—Pros:
–Portable to distributed and shared memory machines
–Scales beyond one node
–No data placement problem
—Cons:
–Difficult to develop and debug
–High latency, low bandwidth
–Explicit communication
–Large granularity
–Difficult load balancing
• Pure OpenMP
—Pros:
–Easy to implement parallelism
–Low latency, high bandwidth
–Implicit communication
–Coarse and fine granularity
–Dynamic load balancing
—Cons:
–Only on shared memory machines
–Scales within one node only
–Possible data placement problem
–No specific thread order

Why Hybrid
• The hybrid MPI/OpenMP paradigm is the software trend for clusters of SMP architectures.
• Elegant in concept and architecture:
—MPI across nodes
—OpenMP within nodes
—Good usage of shared-memory system resources (memory, latency, and bandwidth).
• Avoids the extra overhead of MPI communication within a node.
• OpenMP adds fine granularity (larger message sizes) and allows increased and/or dynamic load balancing.
• Some problems have two-level parallelism naturally.
• Some problems can only use a restricted number of MPI tasks.
• Could have better scalability than both pure MPI and pure OpenMP.
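To make the hybrid approach above concrete, here is a minimal sketch, not from the original deck, that combines the two previous examples: MPI decomposes the work across processes (one per node), and an OpenMP loop provides fork-join parallelism within each process. The partial-sum computation is purely illustrative.

/* Hybrid sketch: MPI across nodes, OpenMP within each node.
 * Build: mpicc -fopenmp hybrid_sketch.c -o hybrid   Run: mpirun -np 2 ./hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* Ask for an MPI thread level that allows OpenMP inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Coarse-grained decomposition across MPI ranks ... */
    const long n = 10000000;
    long chunk = n / size;
    long begin = rank * chunk;
    long end   = (rank == size - 1) ? n : begin + chunk;

    /* ... and fine-grained, fork-join parallelism within each rank. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = begin; i < end; i++)
        local += 1.0 / (double)(i + 1);

    /* One MPI message per node instead of one per core. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum of first %ld terms = %f\n", n, total);

    MPI_Finalize();
    return 0;
}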