1. Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings

2. Objective
To present the most prominent approaches to parallel computer organization.

3. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

4. Taxonomy (Flynn)

                                   Instructions being executed simultaneously
                                         1           many
Data being processed      1            SISD          MISD
simultaneously            multiple     SIMD          MIMD

5. Single Instruction, Single Data (SISD)
• Single processor
• Single instruction stream
• Data stored in single memory
• Uni-processor (von Neumann architecture)
[Figure: a control unit sends an instruction stream to a processing unit, which exchanges a data stream with a memory unit]

6. Single Instruction, Multiple Data (SIMD)
• A single machine instruction controls simultaneous execution.
• Multiple processing elements.
• Each processing element has an associated data memory.
• The same instruction is executed by all processing elements, but on different sets of data.
• Main subclasses: vector and array processors.
[Figure: one control unit broadcasts a single instruction stream to many processing elements, each with its own memory unit and data stream]
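To make the SIMD idea concrete, here is a small C sketch (an illustration, not from the slides) using x86 SSE intrinsics: the single machine instruction generated by _mm_add_ps performs four floating-point additions at once.

```c
/* SIMD illustration: one instruction operates on four data elements.
   Requires an x86-64 compiler; the array contents are arbitrary examples. */
#include <stdio.h>
#include <xmmintrin.h>              /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);      /* prints: 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```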

7. Multiple Instruction, Single Data (MISD)
• A sequence of data is transmitted to a set of processors.
• Each processor executes a different instruction sequence on different “parts” of the same data sequence.
• Never been implemented.

8. Multiple Instruction, Multiple Data (MIMD)
• A set of processors simultaneously executes different instruction sequences.
• Different sets of data.
• Main subclasses: multiprocessors and multicomputers.
[Figure: a multiprocessor, in which several control unit / processing unit pairs, each with its own instruction and data streams, share a single memory unit]

9. Multiple Instruction, Multiple Data (MIMD)
• A set of processors simultaneously executes different instruction sequences.
• Different sets of data.
• Main subclasses: multiprocessors and multicomputers.
[Figure: a multicomputer, in which each node couples a control unit, a processing unit and its own memory unit, and the nodes communicate over an interconnection network]

10. Taxonomy tree

Processor organizations
    Single instruction, single data stream (SISD)
    Single instruction, multiple data stream (SIMD)
        Vector processor
        Array processor
    Multiple instruction, single data stream (MISD)
    Multiple instruction, multiple data stream (MIMD)
        Multiprocessor: shared memory (tightly coupled)
            Symmetric multiprocessor (SMP)
            Nonuniform memory access (NUMA)
        Multicomputer: distributed memory (loosely coupled)
            Clusters

11. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

12. MIMD - Overview
• A set of general purpose processors.
• Each can process all the instructions necessary.
• Further classified by the method of processor communication.

13. Communication Models: Multiprocessors
• All CPUs are able to process all necessary instructions.
• All access the same physical shared memory.
• All share the same address space.
• Communication through shared memory via LOAD/STORE instructions → tightly coupled.
• Simple programming model.
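A minimal sketch of the tightly coupled model, assuming POSIX threads (the names producer and shared_value are illustrative): the two threads communicate simply by a STORE and a LOAD on the same location in the shared address space.

```c
/* Shared-memory communication: threads of one process share one address
   space, so passing data is an ordinary store followed by a load. */
#include <stdio.h>
#include <pthread.h>

static int shared_value;                 /* visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                   /* communication is a STORE */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    printf("read %d from shared memory\n", shared_value);  /* a LOAD */
    return 0;
}
```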

14. Communication Models: Multiprocessors (example)
a) A multiprocessor with 16 CPUs sharing a common memory.
b) An image divided into 16 sections, each processed by one CPU.

15. Communication Models: Multicomputers
• Each CPU has a private memory → distributed memory system.
• Each CPU has its own address space.
• Communication through send/receive primitives → loosely coupled system.
• More complex programming model.
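By contrast, a minimal sketch of the loosely coupled model, assuming an MPI installation (the ranks and tag value are illustrative): each process owns a private copy of the data, which crosses address spaces only through explicit send/receive primitives. Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out.

```c
/* Message-passing communication: private address spaces, explicit
   send/receive primitives. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* receive */
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```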

16. Communication Models: Multicomputers (example)
a) A multicomputer with 16 CPUs, each with its own private memory.
b) The image (see previous figure) distributed among the 16 private memories.

17. Communication Models: Multiprocessors × Multicomputers
• Multiprocessors:
  • Potentially easier to program.
  • Building a shared memory for hundreds of CPUs is not easy → not scalable.
  • Memory contention is a potential performance bottleneck.
• Multicomputers:
  • More difficult to program.
  • Building multicomputers with thousands of CPUs is not difficult → scalable.

18. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

19. Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
• Two or more similar processors of comparable capacity.
• The processors share the same memory and I/O.
• The processors are connected by a bus or another internal connection.
• Memory access time is approximately the same for each processor.

20. SMP Advantages
• Performance: if some of the work can be done in parallel.
• Availability: since all processors can perform the same functions, failure of a single processor does not necessarily halt the system.
• Incremental growth: users can enhance performance by adding processors.
• Scaling: vendors can offer a range of products based on the number of processors.

21. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

22. Time Shared Bus
Characteristics:
• Simplest form.
• Structure and interface similar to a single-processor system.
• Features provided:
  • Addressing: distinguishes modules on the bus.
  • Arbitration: any module can be temporary bus master (a toy sketch follows below).
  • Time sharing: if one module holds the bus, others must wait and may have to suspend.
• There are now multiple processors as well as multiple I/O modules.
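The arbitration idea can be pictured with a toy round-robin arbiter in C (purely illustrative; real arbitration happens in bus hardware): among the modules currently requesting the bus, grant the one that follows the previous master, so every module eventually gets a turn.

```c
/* Toy round-robin bus arbiter: fairness through rotating priority. */
#include <stdio.h>
#include <stdbool.h>

#define MODULES 4

/* Next bus master among the requesting modules, or -1 if the bus is idle. */
static int arbitrate(const bool request[MODULES], int last_master) {
    for (int i = 1; i <= MODULES; i++) {
        int candidate = (last_master + i) % MODULES;
        if (request[candidate])
            return candidate;
    }
    return -1;
}

int main(void) {
    bool request[MODULES] = {true, false, true, true};  /* modules 0, 2, 3 */
    int master = 0;
    for (int cycle = 0; cycle < 4; cycle++) {
        master = arbitrate(request, master);
        printf("cycle %d: bus granted to module %d\n", cycle, master);
    }
    return 0;
}
```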

23. Time Shared Bus - SMP
[Figure: symmetric multiprocessor organized around a time-shared bus]

24. Time Shared Bus
Advantages:
• Simplicity
• Flexibility
• Reliability
Disadvantages:
• Performance is limited by the bus cycle time.
• Each processor should therefore have a local cache to reduce the number of bus accesses.
• Local caches lead to problems with cache coherence, solved in hardware (see later).

25. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation

26. Cache Coherence Problem
1. CPU A reads data x (miss)
2. CPU K reads the same data (miss)
3. CPU K writes (changes) the data (hit)
4. CPU A reads the data (hit) and gets the outdated value!
[Figure: CPUs A and K, each with a private cache, attached to a shared bus and shared memory; after step 3, cache A still holds the old value of x]
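The four steps can be reproduced with a small simulation in C (an illustration, not from the slides; the structures are made up): each CPU gets a private cache line and there is no invalidation mechanism, so the final read returns the stale value.

```c
/* Simulation of the coherence problem: two private caches over one
   shared memory, with no coherence mechanism at all. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { bool valid; int data; } CacheLine;

static int shared_mem = 1;                 /* the shared variable "x" */

static int cpu_read(CacheLine *c) {
    if (!c->valid) {                       /* miss: fetch from memory */
        c->data = shared_mem;
        c->valid = true;
    }
    return c->data;                        /* hit: served from the cache */
}

static void cpu_write(CacheLine *c, int v) {
    c->data = v;                           /* update own cache... */
    shared_mem = v;                        /* ...and write through to memory */
}

int main(void) {
    CacheLine a = {false, 0}, k = {false, 0};
    printf("1. CPU A reads x: %d (miss)\n", cpu_read(&a));
    printf("2. CPU K reads x: %d (miss)\n", cpu_read(&k));
    cpu_write(&k, 2);                      /* 3. CPU K writes x (hit) */
    printf("4. CPU A reads x: %d (hit), although memory holds %d -> stale!\n",
           cpu_read(&a), shared_mem);
    return 0;
}
```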

27. Snoopy Protocols
Cache controllers may have a snoop unit, which
• monitors the shared bus to detect any coherence-relevant activity, and
• acts so as to ensure data coherence.
Snooping increases bus traffic.

28. Snoopy Protocols
1. CPU K writes (changes) the data (hit)
2. The write propagates to the shared memory
3. The snoop invalidates or updates the copy in CPU A's cache
[Figure: CPU K's write appears on the shared bus; cache A's snoop observes it and invalidates or updates its copy, so both caches and memory agree]

29. MESI State Transition Diagram
[Figure: state transition diagram of the MESI protocol, whose four cache-line states are Modified, Exclusive, Shared and Invalid]
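Since the diagram itself is not reproduced here, the C sketch below captures a few representative transitions of an invalidate-based MESI protocol; it is a simplified illustration, and the real state machine is implemented in the cache controller hardware.

```c
/* Simplified MESI line states and representative transitions. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;

/* Local read: a miss fetches the line as Exclusive or Shared. */
static MesiState on_local_read(MesiState s, int others_have_copy) {
    if (s == INVALID)
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                        /* M, E, S: read hit, no change */
}

/* Local write: from S or I, ownership is first requested on the bus. */
static MesiState on_local_write(MesiState s) {
    (void)s;
    return MODIFIED;
}

/* Snoop sees another CPU read this line. */
static MesiState on_snooped_read(MesiState s) {
    if (s == MODIFIED)               /* dirty: write back, then share */
        return SHARED;
    if (s == EXCLUSIVE)
        return SHARED;
    return s;
}

/* Snoop sees another CPU write (or claim ownership of) this line. */
static MesiState on_snooped_write(MesiState s) {
    (void)s;
    return INVALID;                  /* our copy is now stale */
}

int main(void) {
    MesiState s = INVALID;
    s = on_local_read(s, 0);         /* I -> E: no other copies   */
    s = on_local_write(s);           /* E -> M: silent upgrade    */
    s = on_snooped_read(s);          /* M -> S: write back first  */
    s = on_snooped_write(s);         /* S -> I: invalidated       */
    printf("final state: %s\n", s == INVALID ? "INVALID" : "?");
    return 0;
}
```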

30. L1-L2 Cache Consistency
• L1 caches do not connect to the bus → they cannot take part in the snoop protocol.
• Simple solution:
  • L1 is write-through, so every L1 write reaches the (snooping) L2.
  • Updates and invalidations in L2 must be propagated to L1.
• Approaches for a write-back L1 exist, but they are more complex.

31. Cache Coherence for interconnections other than a shared bus
Directory protocols:
• Collect and maintain information about where copies of data reside.
• Typically a central directory stored in main memory.
• Requests are checked against the directory, and the appropriate transfers are performed.
• The central directory is a potential bottleneck.
• Effective in large-scale systems with complex interconnection schemes, according to Stallings.
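A sketch of what a directory entry might look like, assuming one presence bit per cache for each memory block (the names and states are illustrative, not Stallings' notation): on a write request the directory sends invalidations only to the caches whose presence bits are set, so no broadcast bus is needed.

```c
/* Directory entry: per-block state plus one presence bit per cache. */
#include <stdio.h>
#include <stdint.h>

#define NUM_CPUS 64

typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;                /* bit i set => cache i holds a copy */
} DirEntry;

/* A CPU asks to write the block: invalidate every other sharer. */
static void handle_write(DirEntry *e, int writer) {
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != writer && (e->sharers & (1ULL << i)))
            printf("directory: send invalidate to cache %d\n", i);
    e->sharers = 1ULL << writer;     /* only the writer keeps a copy */
    e->state = EXCLUSIVE_DIRTY;
}

int main(void) {
    DirEntry e = { SHARED_CLEAN, (1ULL << 0) | (1ULL << 3) | (1ULL << 7) };
    handle_write(&e, 3);             /* invalidates caches 0 and 7 only */
    return 0;
}
```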

32. Cache Coherence: Software Solutions
• Compiler and operating system deal with the problem.
• The overhead is transferred to compile time.
• Design complexity is transferred from hardware to software.
• However, software tends to make conservative decisions → inefficient cache utilization.
• The code is analyzed to determine safe periods for caching shared variables.
• Combined hardware + software solutions exist.

33. Outline
• Taxonomy
• MIMD systems
• Symmetric multiprocessing
• Time shared bus
• Cache coherence
• Multithreading
• Clusters
• NUMA systems
• Vector Computation
