Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings.
Objective
To present the most prominent approaches to parallel computer organization.
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation
Flynn's Taxonomy

                                     Instruction streams executed simultaneously
                                          one            many
Data streams processed      one           SISD           MISD
simultaneously              multiple      SIMD           MIMD
Single Instruction, Single Data (SISD)
- Single processor.
- Single instruction stream.
- Data stored in a single memory.
- Uni-processor (von Neumann architecture).
[Figure: control unit → processing unit → memory unit, with one instruction stream and one data stream]
Single Instruction, Multiple Data (SIMD)
- A single machine instruction controls simultaneous execution.
- Multiple processing elements.
- Each processing element has an associated data memory.
- The same instruction is executed by all processing elements, but on different sets of data.
- Main subclasses: vector and array processors.
[Figure: one control unit broadcasting a single instruction stream to several processing elements, each with its own memory unit and data stream]
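The SIMD idea can be sketched in a few lines: one "instruction" (an add) applied to many data elements at once. This is only a conceptual model; the lanes here are simulated sequentially in plain Python, whereas real SIMD hardware executes all lanes in the same step.

```python
def simd_add(a, b):
    """Apply the same ADD instruction to every element pair (one lane each)."""
    assert len(a) == len(b)
    # Each zip pair corresponds to one processing element with its own data.
    return [x + y for x, y in zip(a, b)]

# One instruction, four data lanes:
result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])   # [11, 22, 33, 44]
```

Vector and array processors generalize exactly this pattern to whole registers or processor grids.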
Multiple Instruction, Single Data (MISD)
- A sequence of data is transmitted to a set of processors.
- Each processor executes a different instruction sequence on different "parts" of the same data sequence.
- Has never been implemented.
Multiple Instruction, Multiple Data (MIMD)
- A set of processors simultaneously executes different instruction sequences.
- Different sets of data.
- Main subclasses: multiprocessors and multicomputers.
[Figure: multiprocessor organization; several control unit / processing unit pairs, each with its own instruction and data stream, all sharing one memory unit]
Multiple Instruction, Multiple Data (MIMD)
- A set of processors simultaneously executes different instruction sequences on different sets of data.
- Main subclasses: multiprocessors and multicomputers.
[Figure: multicomputer organization; several control unit / processing unit pairs, each with a private memory unit, linked by an interconnection network]
Taxonomy Tree

Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor: shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer: distributed memory (loosely coupled)
      Clusters
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation
MIMD: Overview
- A set of general-purpose processors.
- Each can execute all necessary instructions.
- Further classified by the method of processor communication.
Communication Models: Multiprocessors
- All CPUs can execute all necessary instructions.
- All access the same physical shared memory.
- All share the same address space.
- Communication through shared memory via LOAD/STORE instructions → tightly coupled.
- Simple programming model.
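A minimal sketch of this model, using threads to stand in for CPUs: both "processors" see the same address space, so communication is just a store by one and a load by the other (the dictionary, lock, and thread roles are illustrative, not part of the slides).

```python
import threading

shared = {"value": 0}        # a memory location visible to both "CPUs"
lock = threading.Lock()      # arbitrates concurrent access to shared memory

def writer():
    with lock:
        shared["value"] = 42             # STORE to shared memory

def reader(out):
    with lock:
        out.append(shared["value"])      # LOAD from shared memory

out = []
t1 = threading.Thread(target=writer)
t1.start(); t1.join()                    # writer finishes before the reader runs
t2 = threading.Thread(target=reader, args=(out,))
t2.start(); t2.join()
# out now holds the value the other "processor" stored
```

No explicit message is ever sent: the shared address space itself is the communication channel, which is why this model is called tightly coupled.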
Communication Models: Multiprocessors (example)
a) Multiprocessor with 16 CPUs sharing a common memory.
b) Memory divided into 16 sections, each one processed by one processor.
Communication Models: Multicomputers
- Each CPU has a private memory → distributed memory system.
- Each CPU has its own address space.
- Communication through send/receive primitives → loosely coupled system.
- More complex programming model.
Communication Models: Multicomputers (example)
- Multicomputer with 16 CPUs, each with its own private memory.
- Image (see previous figure) distributed among the 16 CPUs.
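The send/receive style can be sketched with two operating-system processes connected by a pipe: each process has its own private memory, and data moves only through explicit messages (the `worker` function and the doubling computation are invented for the example).

```python
import multiprocessing as mp

def worker(conn):
    data = conn.recv()       # RECEIVE a message from the other "node"
    conn.send(data * 2)      # SEND a result back; no memory is shared
    conn.close()

def run_demo():
    parent, child = mp.Pipe()                      # the "interconnection network"
    p = mp.Process(target=worker, args=(child,))
    p.start()
    parent.send(21)                                # SEND work to the remote node
    result = parent.recv()                         # RECEIVE its answer
    p.join()
    return result

if __name__ == "__main__":
    print(run_demo())        # the worker's private computation, delivered by message
```

Compared with the shared-memory sketch above, every interaction here is an explicit primitive, which is what makes the multicomputer programming model more complex.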
Communication Models: Multiprocessors vs. Multicomputers
Multiprocessors:
- Potentially easier to program.
- Building a shared memory for hundreds of CPUs is not easy → not scalable.
- Memory contention is a potential performance bottleneck.
Multicomputers:
- More difficult to program.
- Building multicomputers with thousands of CPUs is not difficult → scalable.
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation
Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
- Two or more similar processors of comparable capacity.
- Processors share the same memory and I/O.
- Processors are connected by a bus or another internal connection.
- Memory access time is approximately the same for each processor.
SMP Advantages
- Performance: if some work can be done in parallel.
- Availability: since all processors can perform the same functions, failure of a single processor does not necessarily halt the system.
- Incremental growth: users can enhance performance by adding processors.
- Scaling: vendors can offer a range of products based on the number of processors.
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation
Time-Shared Bus
Characteristics:
- Simplest form; structure and interface similar to a single-processor system.
Features provided:
- Addressing: distinguishes the modules on the bus.
- Arbitration: any module can be a temporary bus master.
- Time sharing: if one module holds the bus, others must wait and may have to suspend.
Now there are multiple processors as well as multiple I/O modules.
Time-Shared Bus: SMP
[Figure: SMP organization around a time-shared bus]
Time-Shared Bus
Advantages:
- Simplicity
- Flexibility
- Reliability
Disadvantages:
- Performance limited by the bus cycle time.
- Each processor should therefore have a local cache to reduce the number of bus accesses.
- Local caches lead to cache coherence problems, solved in hardware (see later).
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation
Cache Coherence Problem
1. CPU A reads the data (miss).
2. CPU K reads the same data (miss).
3. CPU K writes (changes) the data (hit).
4. CPU A reads the data (hit): outdated!
[Figure: CPUs A and K with private caches connected to a shared memory over a shared bus; after step 3, cache A still holds the old value while cache K holds the new one]
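The four steps above can be reproduced with a toy model: two per-CPU caches in front of one shared memory, with no coherence mechanism at all. All names and the write-through behavior are illustrative assumptions, not the slides' hardware.

```python
memory = {"x": 1}                  # shared memory
cache_a, cache_k = {}, {}          # private caches of CPU A and CPU K

def read(cache, addr):
    if addr not in cache:          # miss: fetch the line from shared memory
        cache[addr] = memory[addr]
    return cache[addr]             # hit: served from the local cache

def write(cache, addr, value):
    cache[addr] = value            # hit in the local cache
    memory[addr] = value           # write-through to memory, but the
                                   # OTHER cache is never told

read(cache_a, "x")                 # 1. CPU A reads x (miss)
read(cache_k, "x")                 # 2. CPU K reads x (miss)
write(cache_k, "x", 9)             # 3. CPU K writes x (hit)
stale = read(cache_a, "x")         # 4. CPU A reads x (hit): still sees 1
```

Even though memory already holds the new value, CPU A keeps serving the stale copy from its own cache, which is exactly the coherence problem.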
Snoopy Protocols
- Each cache controller may have a snoop unit that monitors the shared bus, detects any coherence-relevant activity, and acts so as to ensure data coherence.
- This increases bus traffic.
Snoopy Protocols
1. CPU K writes (changes) the data (hit).
2. The write propagates to the shared memory.
3. The snoop invalidates or updates the data in CPU A.
[Figure: CPU K's write travels over the shared bus to memory; cache A's copy is invalidated or updated from the old value to the new one]
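Extending the earlier toy model, a snoop that watches bus writes and applies a write-invalidate policy removes the stale read. The snoop is modeled here simply as a loop over the other caches; this is a sketch of the idea, not of any specific protocol.

```python
memory = {"x": 1}                      # shared memory
caches = {"A": {}, "K": {}}            # private caches, indexed by CPU

def read(cpu, addr):
    cache = caches[cpu]
    if addr not in cache:              # miss: fetch from shared memory
        cache[addr] = memory[addr]
    return cache[addr]

def write(cpu, addr, value):
    caches[cpu][addr] = value          # 1. write hits the local cache
    memory[addr] = value               # 2. write propagates over the bus
    for other, cache in caches.items():
        if other != cpu:
            cache.pop(addr, None)      # 3. snoop invalidates other copies

read("A", "x")                         # CPU A caches x = 1
read("K", "x")                         # CPU K caches x = 1
write("K", "x", 9)                     # CPU K writes; A's copy is invalidated
fresh = read("A", "x")                 # A now misses and re-fetches the value
```

An update-based snoop would instead overwrite the other caches' copies with the new value; both policies restore coherence at the cost of extra bus traffic.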
MESI State Transition Diagram
[Figure: state transition diagram over the Modified, Exclusive, Shared, and Invalid cache-line states]
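A transition diagram like this one can be approximated as a lookup table over (state, event) pairs. The table below is a simplified sketch of standard MESI behavior, not a transcription of the slide's diagram: "bus_read"/"bus_write" are events snooped from other CPUs, and the outcome of a local read from Invalid (Exclusive vs. Shared, depending on whether another cache holds the line) is fixed here as Shared for simplicity.

```python
# Simplified MESI transitions for one cache line (M, E, S, I).
MESI = {
    ("I", "local_read"):  "S",   # fetch; assume another cache also holds it
    ("I", "local_write"): "M",   # read-with-intent-to-modify
    ("S", "local_read"):  "S",
    ("S", "local_write"): "M",   # upgrade: other copies must be invalidated
    ("E", "local_read"):  "E",
    ("E", "local_write"): "M",   # silent upgrade: no bus traffic needed
    ("M", "local_read"):  "M",
    ("M", "local_write"): "M",
    ("I", "bus_read"):    "I",
    ("S", "bus_read"):    "S",
    ("E", "bus_read"):    "S",   # another CPU reads: drop to Shared
    ("M", "bus_read"):    "S",   # supply the dirty line, then share it
    ("I", "bus_write"):   "I",
    ("S", "bus_write"):   "I",   # another CPU writes: invalidate
    ("E", "bus_write"):   "I",
    ("M", "bus_write"):   "I",
}

def step(state, events):
    """Run a cache line through a sequence of events."""
    for e in events:
        state = MESI[(state, e)]
    return state

# I --local_read--> S --local_write--> M --bus_read--> S
final = step("I", ["local_read", "local_write", "bus_read"])
```

Reading the table row by row recovers the usual diagram: local writes drive a line toward Modified, while snooped bus activity drives it toward Shared or Invalid.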
L1-L2 Cache Consistency
- L1 caches do not connect to the bus → they do not engage in the snoop protocol.
- Simple solution: make L1 write-through; updates and invalidations in L2 must be propagated to L1.
- Approaches for write-back L1 caches exist → more complex.
Cache Coherence: Connections Other Than a Shared Bus
Directory protocols:
- Collect and maintain information about the copies of data held in caches.
- Typically a central directory stored in main memory.
- Requests are checked against the directory and the appropriate transfers are performed.
- Creates a central bottleneck.
- Effective in large-scale systems with complex interconnection schemes (according to Stallings).
Cache Coherence: Software Solutions
- The compiler and the operating system deal with the problem.
- Overhead is transferred to compile time; design complexity is transferred from hardware to software.
- However, software tends to make conservative decisions → inefficient cache utilization.
- The code is analyzed to determine safe periods for caching shared variables.
- Combined hardware/software solutions exist.
Outline
- Taxonomy
- MIMD systems
- Symmetric multiprocessing
- Time-shared bus
- Cache coherence
- Multithreading
- Clusters
- NUMA systems
- Vector computation