1. Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings

2. Objective
To present the most prominent approaches to parallel computer organization.

3. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

4. Taxonomy (Flynn)

                                   Instructions being executed simultaneously
                                         1           many
Data being processed      1            SISD          MISD
simultaneously            multiple     SIMD          MIMD

5. Single Instruction, Single Data (SISD)
• Single processor
• Single instruction stream
• Data stored in single memory
• Uni-processor (von Neumann architecture)
[Figure: a control unit sends an instruction stream to a processing unit, which exchanges a data stream with a memory unit]

6. Single Instruction, Multiple Data (SIMD)
• A single machine instruction controls simultaneous execution.
• Multiple processing elements.
• Each processing element has an associated data memory.
• The same instruction is executed by all processing elements, but on different sets of data.
• Main subclasses: vector and array processors.
[Figure: one control unit broadcasts a single instruction stream to many processing elements, each with its own memory unit and data stream]
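To make the SIMD idea concrete, here is a small C sketch (an illustration, not from the slides) using x86 SSE intrinsics: the single machine instruction generated by _mm_add_ps performs four floating-point additions at once.

```c
/* SIMD illustration: one instruction operates on four data elements.
   Requires an x86-64 compiler; the array contents are arbitrary examples. */
#include <stdio.h>
#include <xmmintrin.h>              /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);      /* prints: 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```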

7. Multiple Instruction, Single Data (MISD)
• A sequence of data is transmitted to a set of processors.
• Each processor executes a different instruction sequence on different “parts” of the same data sequence.
• Never been implemented.

8. Multiple Instruction, Multiple Data (MIMD)
• A set of processors simultaneously executes different instruction sequences.
• Different sets of data.
• Main subclasses: multiprocessors and multicomputers.
[Figure: a multiprocessor, in which several control unit / processing unit pairs, each with its own instruction and data streams, share a single memory unit]

9. Multiple Instruction, Multiple Data (MIMD)
• A set of processors simultaneously executes different instruction sequences.
• Different sets of data.
• Main subclasses: multiprocessors and multicomputers.
[Figure: a multicomputer, in which each node couples a control unit, a processing unit and its own memory unit, and the nodes communicate over an interconnection network]

10. Taxonomy tree

Processor organizations
    Single instruction, single data stream (SISD)
    Single instruction, multiple data stream (SIMD)
        Vector processor
        Array processor
    Multiple instruction, single data stream (MISD)
    Multiple instruction, multiple data stream (MIMD)
        Multiprocessor: shared memory (tightly coupled)
            Symmetric multiprocessor (SMP)
            Nonuniform memory access (NUMA)
        Multicomputer: distributed memory (loosely coupled)
            Clusters

11. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

12. MIMD - Overview
• A set of general purpose processors.
• Each can process all the instructions necessary.
• Further classified by the method of processor communication.

13. Communication Models: Multiprocessors
• All CPUs are able to process all necessary instructions.
• All access the same physical shared memory.
• All share the same address space.
• Communication through shared memory via LOAD/STORE instructions → tightly coupled.
• Simple programming model.
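A minimal sketch of the tightly coupled model, assuming POSIX threads (the names producer and shared_value are illustrative): the two threads communicate simply by a STORE and a LOAD on the same location in the shared address space.

```c
/* Shared-memory communication: threads of one process share one address
   space, so passing data is an ordinary store followed by a load. */
#include <stdio.h>
#include <pthread.h>

static int shared_value;                 /* visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                   /* communication is a STORE */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    printf("read %d from shared memory\n", shared_value);  /* a LOAD */
    return 0;
}
```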

14. Communication Models: Multiprocessors (example)
a) A multiprocessor with 16 CPUs sharing a common memory.
b) An image divided into 16 sections, each processed by one CPU.

15. Communication Models: Multicomputers
• Each CPU has a private memory → distributed memory system.
• Each CPU has its own address space.
• Communication through send/receive primitives → loosely coupled system.
• More complex programming model.
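By contrast, a minimal sketch of the loosely coupled model, assuming an MPI installation (the ranks and tag value are illustrative): each process owns a private copy of the data, which crosses address spaces only through explicit send/receive primitives. Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out.

```c
/* Message-passing communication: private address spaces, explicit
   send/receive primitives. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* receive */
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```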

16. Communication Models: Multicomputers (example)
a) A multicomputer with 16 CPUs, each with its own private memory.
b) The image (see previous figure) distributed among the 16 private memories.

17. Communication Models: Multiprocessors × Multicomputers
• Multiprocessors:
  • Potentially easier to program.
  • Building a shared memory for hundreds of CPUs is not easy → not scalable.
  • Memory contention is a potential performance bottleneck.
• Multicomputers:
  • More difficult to program.
  • Building multicomputers with thousands of CPUs is not difficult → scalable.

18. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

19. Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
• Two or more similar processors of comparable capacity.
• The processors share the same memory and I/O.
• The processors are connected by a bus or another internal connection.
• Memory access time is approximately the same for each processor.

20. SMP Advantages
• Performance: if some of the work can be done in parallel.
• Availability: since all processors can perform the same functions, failure of a single processor does not necessarily halt the system.
• Incremental growth: users can enhance performance by adding processors.
• Scaling: vendors can offer a range of products based on the number of processors.

21. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

22. Time Shared Bus
Characteristics:
• Simplest form.
• Structure and interface similar to a single-processor system.
• Features provided:
  • Addressing: distinguishes modules on the bus.
  • Arbitration: any module can be temporary bus master (a toy sketch follows below).
  • Time sharing: if one module holds the bus, others must wait and may have to suspend.
• There are now multiple processors as well as multiple I/O modules.
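The arbitration idea can be pictured with a toy round-robin arbiter in C (purely illustrative; real arbitration happens in bus hardware): among the modules currently requesting the bus, grant the one that follows the previous master, so every module eventually gets a turn.

```c
/* Toy round-robin bus arbiter: fairness through rotating priority. */
#include <stdio.h>
#include <stdbool.h>

#define MODULES 4

/* Next bus master among the requesting modules, or -1 if the bus is idle. */
static int arbitrate(const bool request[MODULES], int last_master) {
    for (int i = 1; i <= MODULES; i++) {
        int candidate = (last_master + i) % MODULES;
        if (request[candidate])
            return candidate;
    }
    return -1;
}

int main(void) {
    bool request[MODULES] = {true, false, true, true};  /* modules 0, 2, 3 */
    int master = 0;
    for (int cycle = 0; cycle < 4; cycle++) {
        master = arbitrate(request, master);
        printf("cycle %d: bus granted to module %d\n", cycle, master);
    }
    return 0;
}
```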

23. Time Shared Bus - SMP
[Figure: symmetric multiprocessor organized around a time-shared bus]

24. Time Shared Bus
Advantages:
• Simplicity
• Flexibility
• Reliability
Disadvantages:
• Performance is limited by the bus cycle time.
• Each processor should therefore have a local cache to reduce the number of bus accesses.
• Local caches lead to problems with cache coherence, solved in hardware (see later).

25. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

26. Cache Coherence Problem
1. CPU A reads data x (miss)
2. CPU K reads the same data (miss)
3. CPU K writes (changes) the data (hit)
4. CPU A reads the data (hit) and gets the outdated value!
[Figure: CPUs A and K, each with a private cache, attached to a shared bus and shared memory; after step 3, cache A still holds the old value of x]
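The four steps can be reproduced with a small simulation in C (an illustration, not from the slides; the structures are made up): each CPU gets a private cache line and there is no invalidation mechanism, so the final read returns the stale value.

```c
/* Simulation of the coherence problem: two private caches over one
   shared memory, with no coherence mechanism at all. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { bool valid; int data; } CacheLine;

static int shared_mem = 1;                 /* the shared variable "x" */

static int cpu_read(CacheLine *c) {
    if (!c->valid) {                       /* miss: fetch from memory */
        c->data = shared_mem;
        c->valid = true;
    }
    return c->data;                        /* hit: served from the cache */
}

static void cpu_write(CacheLine *c, int v) {
    c->data = v;                           /* update own cache... */
    shared_mem = v;                        /* ...and write through to memory */
}

int main(void) {
    CacheLine a = {false, 0}, k = {false, 0};
    printf("1. CPU A reads x: %d (miss)\n", cpu_read(&a));
    printf("2. CPU K reads x: %d (miss)\n", cpu_read(&k));
    cpu_write(&k, 2);                      /* 3. CPU K writes x (hit) */
    printf("4. CPU A reads x: %d (hit), although memory holds %d -> stale!\n",
           cpu_read(&a), shared_mem);
    return 0;
}
```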

27. Snoopy Protocols
Cache controllers may have a snoop unit, which
• monitors the shared bus to detect any coherence-relevant activity, and
• acts so as to ensure data coherence.
Snooping increases bus traffic.

28. Snoopy Protocols
1. CPU K writes (changes) the data (hit)
2. The write propagates to the shared memory
3. The snoop invalidates or updates the copy in CPU A's cache
[Figure: CPU K's write appears on the shared bus; cache A's snoop observes it and invalidates or updates its copy, so both caches and memory agree]

29. MESI State Transition Diagram
[Figure: state transition diagram of the MESI protocol, whose four cache-line states are Modified, Exclusive, Shared and Invalid]
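Since the diagram itself is not reproduced here, the C sketch below captures a few representative transitions of an invalidate-based MESI protocol; it is a simplified illustration, and the real state machine is implemented in the cache controller hardware.

```c
/* Simplified MESI line states and representative transitions. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;

/* Local read: a miss fetches the line as Exclusive or Shared. */
static MesiState on_local_read(MesiState s, int others_have_copy) {
    if (s == INVALID)
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                        /* M, E, S: read hit, no change */
}

/* Local write: from S or I, ownership is first requested on the bus. */
static MesiState on_local_write(MesiState s) {
    (void)s;
    return MODIFIED;
}

/* Snoop sees another CPU read this line. */
static MesiState on_snooped_read(MesiState s) {
    if (s == MODIFIED)               /* dirty: write back, then share */
        return SHARED;
    if (s == EXCLUSIVE)
        return SHARED;
    return s;
}

/* Snoop sees another CPU write (or claim ownership of) this line. */
static MesiState on_snooped_write(MesiState s) {
    (void)s;
    return INVALID;                  /* our copy is now stale */
}

int main(void) {
    MesiState s = INVALID;
    s = on_local_read(s, 0);         /* I -> E: no other copies   */
    s = on_local_write(s);           /* E -> M: silent upgrade    */
    s = on_snooped_read(s);          /* M -> S: write back first  */
    s = on_snooped_write(s);         /* S -> I: invalidated       */
    printf("final state: %s\n", s == INVALID ? "INVALID" : "?");
    return 0;
}
```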

30. L1-L2 Cache Consistency
• L1 caches do not connect to the bus → they cannot take part in the snoop protocol.
• Simple solution:
  • L1 is write-through, so every L1 write reaches the (snooping) L2.
  • Updates and invalidations in L2 must be propagated to L1.
• Approaches for a write-back L1 exist, but they are more complex.

31. Cache Coherence for interconnections other than a shared bus
Directory protocols:
• Collect and maintain information about where copies of data reside.
• Typically a central directory stored in main memory.
• Requests are checked against the directory, and the appropriate transfers are performed.
• The central directory is a potential bottleneck.
• Effective in large-scale systems with complex interconnection schemes, according to Stallings.
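A sketch of what a directory entry might look like, assuming one presence bit per cache for each memory block (the names and states are illustrative, not Stallings' notation): on a write request the directory sends invalidations only to the caches whose presence bits are set, so no broadcast bus is needed.

```c
/* Directory entry: per-block state plus one presence bit per cache. */
#include <stdio.h>
#include <stdint.h>

#define NUM_CPUS 64

typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;                /* bit i set => cache i holds a copy */
} DirEntry;

/* A CPU asks to write the block: invalidate every other sharer. */
static void handle_write(DirEntry *e, int writer) {
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != writer && (e->sharers & (1ULL << i)))
            printf("directory: send invalidate to cache %d\n", i);
    e->sharers = 1ULL << writer;     /* only the writer keeps a copy */
    e->state = EXCLUSIVE_DIRTY;
}

int main(void) {
    DirEntry e = { SHARED_CLEAN, (1ULL << 0) | (1ULL << 3) | (1ULL << 7) };
    handle_write(&e, 3);             /* invalidates caches 0 and 7 only */
    return 0;
}
```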

32. Cache Coherence: Software Solutions
• Compiler and operating system deal with the problem.
• The overhead is transferred to compile time.
• Design complexity is transferred from hardware to software.
• However, software tends to make conservative decisions → inefficient cache utilization.
• The code is analyzed to determine safe periods for caching shared variables.
• Combined hardware + software solutions exist.

33. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation
