  1. Lecture 1 CSE 260 – Parallel Computation (Fall 2015) Scott B. Baden Introduction

  2. Welcome to CSE 260!
  • Your instructor is Scott Baden
    – baden+260@ucsd.edu
    – Office hours in EBU3B Room 3244: today at 3:30, next week TBA (or make an appointment)
  • Your TA is Siddhant Arya
    – sarya@eng.ucsd.edu
  • The class home page is http://www.cse.ucsd.edu/classes/fa15/cse260-a
  • Course resources
    – Piazza (Moodle for grades)
    – Stampede @ TACC: create an XSEDE portal account if you don't have one
    – Bang @ UCSD

  3. What you'll learn in this class
  • How to solve computationally intensive problems effectively on parallel computers: multicore processors, GPUs, clusters
    – Parallel programming: multithreading, message passing, vectorization, accelerator programming (OpenMP, CUDA, SIMD)
    – Parallel algorithms: discretization, sorting, linear algebra; communication-avoiding (CA) algorithms; irregular problems
    – Performance programming: latency hiding, managing locality within complicated memory hierarchies, load balancing, efficient data motion

  4. Background
  • CSE 260 will build on your existing background, generalizing programming techniques, algorithm design, and analysis
  • Background
    – Graduate standing
    – Recommended undergraduate background: computer architecture, operating systems, C/C++ programming
    – I will level the playing field for non-CSE students; see me if you are unsure about your background
  • Your background
    – CSME?
    – Parallel computation?
    – Numerical analysis?

  5. Background Markers
  • C/C++, Java, Fortran?
  • TLB misses
  • MPI
  • RPC
  • Multithreading
  • CUDA, GPUs
  • Abstract base class
  • Sparse factorization
  • Navier-Stokes equations, e.g. incompressibility and continuity:
      ∇ · u = 0,   Dρ/Dt + ρ ∇ · v = 0
  • Taylor series:
      f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + ...

  6. Course Requirements
  • 5 assignments
    – Pre-survey and registration: due Sunday @ 9pm
    – 3 programming labs
      • Teams of 2, with the option to switch teams
      • Find a partner using the "looking for a partner" Moodle forum
      • Each includes a lab report, with greater emphasis (in grading) on each successive lab
    – 1 in-class test toward the end of the course

  7. Text and readings
  • Required texts
    – Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2012)
    – An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
  • Assigned class readings will also include on-line materials
  • Lecture slides: www.cse.ucsd.edu/classes/fa15/cse260-a/Lectures

  8. Policies
  • Academic integrity
    – Do your own work
    – Plagiarism and cheating will not be tolerated
  • By taking this course, you implicitly agree to abide by the course policies: www.cse.ucsd.edu/classes/fa15/cse260-a/Policies.html

  9. Classroom participation
  • Class participation is important to keep the lecture active
  • Consider the slides as talking points, with class discussions driven by your interest
  • Complete the assigned readings before lecture and be prepared to discuss them in class
  • Different lecture modalities
    – The 2 minute pause
    – In-class problem solving

  10. The 2 minute pause
  • An opportunity in class to develop your understanding of the lecture
    – By trying to explain it to someone else
    – By getting your brain actively working on it
  • What will happen
    – I pose a question
    – You discuss it with 1-2 people around you
      • Most important is your understanding of why the answer is correct
    – After most people seem to be done
      • I'll ask for quiet
      • A few will share what their group talked about
        – Good answers are those where you were wrong, then realized…

  11. An Introduction to Parallel Computation
  • Principles
  • Technological disruption and its impact
  • Motivation – applications

  12. What is parallel processing?
  • We decompose a workload onto simultaneously executing physical processing resources to improve some aspect of performance (see the sketch below)
    – Speedup: 100 processors run 100× faster than one
    – Capability: tackle a larger problem, more accurately
    – Algorithmic, e.g. search
    – Locality: more cache memory and bandwidth
  • Multiple processors co-operate to process a related set of tasks – tightly coupled
  • Generally requires some form of communication and/or synchronization to manage the workload distribution
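
  As a concrete illustration of decomposing a workload across processing resources (a minimal sketch, not from the slides; OpenMP is one of the models this course covers):

      // Sketch: an OpenMP loop that splits saxpy iterations across cores.
      // Compile with, e.g.:  g++ -fopenmp saxpy.cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int n = 1 << 20;
          std::vector<double> x(n, 1.0), y(n, 2.0);
          const double a = 3.0;

          // Each thread gets a disjoint chunk of iterations; the implicit
          // barrier at the end of the loop synchronizes the team.
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];

          printf("y[0] = %g\n", y[0]);  // expect 5
          return 0;
      }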

  13. Have you written a parallel program?
  • Threads
  • MPI
  • RPC
  • C++11 Async (sketched below)
  • CUDA
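
  For anyone who hasn't seen the "C++11 Async" item in practice, here is a minimal, hypothetical sketch (not course code): summing the two halves of an array on two threads with std::async.

      #include <future>
      #include <iostream>
      #include <numeric>
      #include <vector>

      int main() {
          std::vector<int> v(1000, 1);
          auto mid = v.begin() + v.size() / 2;

          // Launch the first half-sum on another thread; compute the
          // second half on this one, then join via future::get().
          auto lo = std::async(std::launch::async,
                               [&] { return std::accumulate(v.begin(), mid, 0); });
          int hi = std::accumulate(mid, v.end(), 0);

          std::cout << lo.get() + hi << "\n";  // 1000
          return 0;
      }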

  14. The difference between Parallel Processing, Concurrency & Distributed Computing
  • Parallel processing
    – Performance (and capacity) is the main goal
    – More tightly coupled than distributed computation
  • Concurrency
    – Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions (see the mutex sketch below)
    – Performance need not be the main goal
  • Distributed computation
    – Geographically distributed
    – Multiple resources computing & communicating unreliably
    – "Cloud" computing: large amounts of storage, different from clusters in the cloud
    – Looser, coarser-grained communication and synchronization
  • May or may not involve separate physical resources, e.g. multitasking ("virtual parallelism")
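
  A minimal sketch of concurrency control in the sense used above (my hypothetical example, not from the deck): a mutex serializes updates to shared state so the result is correct, even though the serialization is about correctness rather than performance.

      #include <iostream>
      #include <mutex>
      #include <thread>

      int main() {
          long counter = 0;
          std::mutex m;
          auto work = [&] {
              for (int i = 0; i < 100000; i++) {
                  std::lock_guard<std::mutex> lock(m);  // critical section
                  ++counter;
              }
          };
          std::thread t1(work), t2(work);
          t1.join();
          t2.join();
          // Always 200000 with the lock; without it, lost updates could
          // leave the counter smaller.
          std::cout << counter << "\n";
          return 0;
      }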

  15. Granularity
  • A measure of how often a computation communicates, and at what scale
    – Distributed computer: a whole program
    – Multicomputer: a function, a loop nest
    – Multiprocessor: the above, plus a memory reference
    – Multicore: a single-socket implementation of a multiprocessor
    – GPU: kernel thread
    – Instruction-level parallelism: instruction, register

  16. An Introduction to Parallel Computation
  • Principles
  • Technological disruption and its impact
  • Motivation – applications

  17. Why is parallelism inevitable?
  • Physical limitations on heat dissipation impede processor clock speed increases
  • To make the processor faster, we replicate the computational elements

  18. Technological trends of scalable HPC systems
  • Hybrid processors
  • Complicated software-managed parallel memory hierarchy
  • Memory/core is shrinking
  • Communication costs are increasing relative to computational rate
  [Figure: peak performance (PFLOP/s) of Top500 systems, 2008-2013, with trend annotations "2x/year" and "2x/3-4 years" [Top500, '13]]

  19. The age of the multi-core processor
  • On-chip parallel computer
  • IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
  • First dual-core laptops (2005-6)
  • GPUs (nVidia, ATI): the desktop supercomputer
  • In smart phones, behind the dashboard (blog.laptopmag.com/nvidia-tegrak1-unveiled)
  • Everyone has a parallel computer at their fingertips
  • If we don't use parallelism, we lose it!
  [Image credit: realworldtech.com]

  20. The GPU
  • Specialized many-core processor
  • Massively multithreaded, long vectors
  • Reduced on-chip memory per core
  • Explicitly manage the memory hierarchy
  [Figure: GFLOPS of many-core GPUs (AMD, NVIDIA) vs. multicore CPUs (Intel dual-core, quad-core), 2001-2009. Courtesy: John Owens; Christopher Dyken, SINTEF]

  21. Performance and Implementation Issues
  • To cope with growing data motion costs (relative to computation):
    – Conserve locality
    – Hide latency
  • Little's Law [1961]: # threads = performance × latency, i.e. T = p × λ, with p and λ both increasing over time; e.g. p = 1-8 flops/cycle, λ = 500 cycles/word (worked example below)
  [Figure: the processor-memory gap: processor vs. memory (DRAM) performance by year]
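
  To make the slide's numbers concrete (my arithmetic, not on the slide): taking the upper end, p = 8 flops/cycle and λ = 500 cycles/word, Little's Law gives

      T = p × λ = 8 × 500 = 4000

  that is, on the order of 4,000 independent operations must be in flight to keep the processor busy while waiting on memory, which is why latency hiding demands such massive multithreading.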

  22. Consequences of evolutionary disruption
  • Transformational: new capabilities for predictive modelling, healthcare… benefits to society
  • Changes the common wisdom for solving a problem, including the implementation
  • Simplified processor design, but more user control over the hardware resources

  23. Today's mobile computer would have been yesterday's supercomputer
  • Cray-1 supercomputer
    – 80 MHz processor
    – 240 Mflops peak
    – 3.4 Mflops Linpack
    – 8 megabytes memory
    – Water cooled
    – 1.8m H x 2.2m W
    – 4 tons
    – Over $10M in 1976
  (www.anandtech.com/show/8716/apple-a8xs-gpu-gxa6850-even-better-than-i-thought)
