CSC2/458 Parallel and Distributed Systems Machines and Models

CSC2/458 Parallel and Distributed Systems Machines and Models Sreepathi Pai January 23, 2018 URCS

Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics

Goals What is the goal of parallel programming?

Scalability Why is scalability important?

Speedup $\mathrm{Speedup}(n) = \frac{T_1}{T_n}$ • $T_1$ is time on one processor • $T_n$ is time on $n$ processors

Amdahl’s Law Let: • $T_1$ be $T_{\text{serial}} + T_{\text{parallelizable}}$ • $T_n$ is then $T_{\text{serial}} + \frac{T_{\text{parallelizable}}}{n}$, assuming perfect scalability. Divide both terms $T_1$ and $T_n$ by $T_1$ to obtain serial and parallelizable ratios: $\mathrm{Speedup}(n) = \frac{1}{r_{\text{serial}} + \frac{r_{\text{parallelizable}}}{n}}$
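The formula on this slide can be evaluated directly. The following is a minimal sketch, assuming the two ratios sum to 1 (so that r_parallelizable = 1 - r_serial); the function name `amdahl_speedup` is made up for this illustration.

```python
def amdahl_speedup(r_serial: float, n: int) -> float:
    """Speedup(n) = 1 / (r_serial + r_parallelizable / n)."""
    if not 0.0 <= r_serial <= 1.0:
        raise ValueError("r_serial must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: the serial and parallelizable ratios sum to 1.
    r_parallelizable = 1.0 - r_serial
    return 1.0 / (r_serial + r_parallelizable / n)


if __name__ == "__main__":
    # Speedup for a program that is 10% serial, on a growing processor count.
    for n in (1, 2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.1, n), 2))
```

With a fixed serial ratio, the printed speedups grow quickly at first and then flatten, which is the behaviour the next slide takes to the limit.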

Amdahl’s Law – In the limit $\mathrm{Speedup}(\infty) = \frac{1}{r_{\text{serial}}}$ This is also known as strong scalability: work is fixed and the number of processors is varied. What are the implications of this?

Scalability Limits Assuming infinite processors, what is the speedup if: • serial ratio $r_{\text{serial}}$ is 0.5 (i.e. 50%) • serial ratio is 0.1 (i.e. 10%) • serial ratio is 0.01 (i.e. 1%)
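One way to work through these three cases is to evaluate the limit 1 / r_serial and compare it with the speedup at large but finite processor counts. This is a sketch only; the helper names `speedup_limit` and `amdahl_speedup` are illustrative, and the parallelizable ratio is assumed to be 1 - r_serial.

```python
def amdahl_speedup(r_serial: float, n: int) -> float:
    """Finite-n speedup, assuming r_parallelizable = 1 - r_serial."""
    return 1.0 / (r_serial + (1.0 - r_serial) / n)


def speedup_limit(r_serial: float) -> float:
    """Speedup with infinitely many processors: 1 / r_serial."""
    if not 0.0 < r_serial <= 1.0:
        raise ValueError("r_serial must lie in (0, 1]")
    return 1.0 / r_serial


if __name__ == "__main__":
    for r in (0.5, 0.1, 0.01):
        finite = [round(amdahl_speedup(r, n), 2) for n in (10, 1000, 100000)]
        print(f"r_serial={r}: limit={speedup_limit(r):.0f}, n=10/1e3/1e5 -> {finite}")
```

The limits are 2, 10, and 100 respectively: even a 1% serial fraction caps the speedup at 100, no matter how many processors are added.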

Current Top 5 supercomputers • Sunway TaihuLight (10.6M cores) • Tianhe 2 (3.1M cores) • Piz Daint (361K cores) • Gyoukou (19.8M cores) • Titan (560K cores) Source: Top 500

Weak Scalability • Work increases as the number of processors increases • Parallel work should increase linearly with processors • Work $W = \alpha W + (1 - \alpha) W$ • $\alpha$ is the serial fraction of work • Scaled work $W' = \alpha W + n (1 - \alpha) W$ • Empirical observation • Usually referred to as Gustafson’s Law Source: http://www.johngustafson.net/pubs/pub13/amdahl.htm
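The scaled-work expression can be tried out numerically. The sketch below computes W' and the ratio W'/W, which equals the scaled speedup under the assumption (not stated on the slide) that n processors finish W' in the same time one processor needs for W. The function names are illustrative.

```python
def scaled_work(w: float, alpha: float, n: int) -> float:
    """W' = alpha * W + n * (1 - alpha) * W."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return alpha * w + n * (1.0 - alpha) * w


def scaled_speedup(alpha: float, n: int) -> float:
    """W' / W = alpha + n * (1 - alpha).

    Assumption: the parallel part scales perfectly, so n processors
    complete W' in the time one processor completes W.
    """
    return scaled_work(1.0, alpha, n)


if __name__ == "__main__":
    # 10% serial work, growing machine size.
    for n in (1, 10, 100, 1000):
        print(n, scaled_work(100.0, 0.1, n), scaled_speedup(0.1, n))
```

Unlike the fixed-work case, the ratio here keeps growing linearly in n, because only the parallel part of the work is scaled up.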

Organization of Parallel Computers Components of parallel machines: • Processing Elements • Memories • Interconnect • how processors and memories are connected to each other

Flynn’s Taxonomy • Based on notion of “streams” • Instruction stream • Data stream • Taxonomy based on number of each type of streams • Single Instruction - Single Data (SISD) • Single Instruction - Multiple Data (SIMD) • Multiple Instruction - Single Data (MISD) • Multiple Instruction - Multiple Data (MIMD) Flynn, M. J. (1966), “Very High Speed Computing Systems”, Proceedings of the IEEE, http://ieeexplore.ieee.org/document/1447203/

SIMD Implementations: Vector Machines The Cray-1 (circa 1977): • $V_x$ – vector registers • 64 elements • 64 bits per element • Vector length register ($V_{\text{len}}$) • Vector mask register Richard Russell, “The Cray-1 Computer System”, Comm. ACM 21,1 (Jan 1978), 63-72

Vector Instructions – Vertical [Figure: two source vectors added element by element into a destination vector] For $0 \le i < V_{\text{len}}$: dst[i] = src1[i] + src2[i] • Most arithmetic instructions
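The vertical (element-wise) operation can be emulated with a scalar loop. This is a minimal sketch using plain Python lists as stand-ins for vector registers; `vector_add` and its `vlen` parameter are hypothetical names, and the sketch covers every element from index 0 up to the vector length.

```python
from typing import List


def vector_add(src1: List[int], src2: List[int], vlen: int) -> List[int]:
    """Emulate a vertical add: dst[i] = src1[i] + src2[i] for 0 <= i < vlen."""
    if vlen < 0 or vlen > min(len(src1), len(src2)):
        raise ValueError("vlen must not exceed the register length")
    # Each lane is independent of every other lane, which is what lets
    # hardware process all of them at once.
    return [src1[i] + src2[i] for i in range(vlen)]


if __name__ == "__main__":
    print(vector_add([1, 2, 3, 4], [10, 20, 30, 40], 4))
```

Because no lane reads another lane's result, the loop iterations can run in any order or all at once.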

Vector Instructions – Horizontal [Figure: min over the vector (1, 2, 3, 4) yields the scalar 1] For $0 \le i < V_{\text{len}}$: dst = min(src1[i], dst) Note that dst is a scalar. • Mostly reductions (min, max, sum, etc.) • Not well supported • Cray-1 did not have this
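A horizontal operation folds one vector into a scalar. The sketch below emulates the min reduction with a scalar loop; `vector_min` is a hypothetical name, and seeding dst with the first element is an assumption, since the slide does not say how dst is initialized.

```python
from typing import List


def vector_min(src1: List[int], vlen: int) -> int:
    """Emulate a horizontal min: dst = min(src1[i], dst) over the first vlen lanes."""
    if vlen < 1 or vlen > len(src1):
        raise ValueError("vlen must be between 1 and the register length")
    dst = src1[0]  # assumption: seed the scalar with the first lane
    for i in range(1, vlen):
        # Each step depends on the previous value of dst, so the lanes
        # are no longer independent as they are in a vertical operation.
        dst = min(src1[i], dst)
    return dst


if __name__ == "__main__":
    print(vector_min([1, 2, 3, 4], 4))
```

The loop-carried dependence on dst is one reason reductions are harder to support in vector hardware than element-wise arithmetic.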

Vector Instructions – Shuffle/Permute [Figure: src = (1, 2, 3, 4), mask = (0, 3, 1, 1), dst = (1, 4, 2, 2)] dst = shuffle(src1, mask) • Poor support on older implementations • Reasonably well-supported on recent implementations
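A shuffle picks each destination lane from the source lane named by the mask. The following is a minimal sketch with Python lists standing in for registers; the bounds check is an addition for safety in this illustration, not something the slide specifies.

```python
from typing import List


def shuffle(src: List[int], mask: List[int]) -> List[int]:
    """Emulate dst = shuffle(src, mask): dst[i] = src[mask[i]]."""
    for m in mask:
        if not 0 <= m < len(src):
            raise ValueError("mask entries must index into src")
    # A source lane may be selected several times or not at all.
    return [src[m] for m in mask]


if __name__ == "__main__":
    print(shuffle([1, 2, 3, 4], [0, 3, 1, 1]))
```

Note that the mask here holds lane indices, unlike the boolean mask used for predication.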

Masking/Predication [Figure: elements of src1 greater than 5 are multiplied by the matching elements of src2; with g5mask = (0, 1, 0, 1) the result is dst = (?, 14, ?, 6)] g5mask = gt(src1, 5) dst = mul(src1, src2, g5mask)
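The two-step pattern, build a mask and then apply an operation only where the mask is set, can be sketched as follows. The input values are chosen for this illustration, and the optional `dst` argument (the values kept in masked-off lanes) is an assumption: the slide's mul takes only three operands and leaves those lanes unspecified, which the sketch represents as None.

```python
from typing import List, Optional


def gt(src: List[int], threshold: int) -> List[bool]:
    """Build a mask that is True in every lane where src exceeds threshold."""
    return [x > threshold for x in src]


def mul(src1: List[int], src2: List[int], mask: List[bool],
        dst: Optional[List[int]] = None) -> List[Optional[int]]:
    """Multiply lane by lane, but only where the mask is set.

    Assumption: masked-off lanes keep the old dst value if one is
    supplied, and are left as None (unspecified) otherwise.
    """
    old: List[Optional[int]] = list(dst) if dst is not None else [None] * len(mask)
    return [a * b if m else o for a, b, m, o in zip(src1, src2, mask, old)]


if __name__ == "__main__":
    src1 = [2, 7, 5, 6]  # illustrative values
    src2 = [2, 2, 4, 1]
    g5mask = gt(src1, 5)
    print(g5mask, mul(src1, src2, g5mask))
```

Predication lets a vector unit handle per-element conditionals without branching: every lane executes the instruction, and the mask decides which results are written.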

MISD - ? Flynn, M. J. (1966), “Very High Speed Computing Systems”, Proceedings of the IEEE, http://ieeexplore.ieee.org/document/1447203/

What type of machine is this? Hyperthreaded Core Different colours in RAM indicate different instruction streams. Source: https://en.wikipedia.org/wiki/Hyper-threading

What type of machine is this? GPU Each instruction is 32-wide. Source: https://devblogs.nvidia.com/inside-pascal/

What type of machine is this? TPU Matrix Multiply Unit Source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

TPU Overview Source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

Modern Multicores • Multiple Cores (MIMD) • (Short) Vector Instruction Sets (SIMD) • MMX, SSE, AVX (Intel) • 3DNow (AMD) • NEON (ARM)

Metrics we care about • Latency • Time to complete task • Lower is better • Throughput • Rate of completing tasks • Higher is better • Utilization • Time “worker” (processor, unit) is busy • Higher is better • Speedup • Higher is better
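The first three metrics can be computed from a simple log of task start and end times. This is a sketch under stated assumptions: a single worker, non-overlapping tasks, and times in seconds; the `Task` type and function names are invented for this illustration.

```python
from typing import List, Tuple

Task = Tuple[float, float]  # (start_time, end_time) in seconds, one worker


def mean_latency(tasks: List[Task]) -> float:
    """Average time to complete a task (lower is better)."""
    return sum(end - start for start, end in tasks) / len(tasks)


def elapsed(tasks: List[Task]) -> float:
    """Wall-clock time from the first start to the last end."""
    return max(end for _, end in tasks) - min(start for start, _ in tasks)


def throughput(tasks: List[Task]) -> float:
    """Tasks completed per second (higher is better)."""
    return len(tasks) / elapsed(tasks)


def utilization(tasks: List[Task]) -> float:
    """Fraction of elapsed time the worker was busy (higher is better).

    Assumption: tasks do not overlap, so busy time is the sum of durations.
    """
    return sum(end - start for start, end in tasks) / elapsed(tasks)


if __name__ == "__main__":
    log = [(0.0, 2.0), (3.0, 5.0), (6.0, 8.0)]  # illustrative task log
    print(mean_latency(log), throughput(log), utilization(log))
```

In this log each task takes 2 seconds, but the idle gaps between tasks lower both throughput and utilization without changing latency, which shows that the metrics measure different things.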

Reducing Latency • Use cheap operations • Which of these operations are expensive? • Bitshift • Integer Divide • Integer Multiply • Latency fundamentally bounded by physics

Increasing Throughput • Parallelize! • Lots of techniques, focus of this class • Add more processors • Need lots of work though to benefit

Speedup • Measure speedup w.r.t. fastest serial code • Not parallel program on 1 processor • Always report runtime • Never speedup alone • Are superlinear speedups possible?
