  1. Lecture 1 CSE 260 – Parallel Computation (Fall 2015) Scott B. Baden Introduction

  2. Welcome to CSE 260!
  • Your instructor is Scott Baden
    – baden+260@ucsd.edu
    – Office hours in EBU3B Room 3244: today at 3:30, next week TBA (or make an appointment)
  • Your TA is Siddhant Arya
    – sarya@eng.ucsd.edu
  • The class home page is http://www.cse.ucsd.edu/classes/fa15/cse260-a
  • Course resources
    – Piazza (Moodle for grades)
    – Stampede @ TACC: create an XSEDE portal account if you don't have one
    – Bang @ UCSD

  3. What you'll learn in this class
  • How to solve computationally intensive problems effectively on parallel computers: multicore processors, GPUs, clusters
    – Parallel programming: multithreading, message passing, vectorization, accelerator programming (OpenMP, CUDA, SIMD)
    – Parallel algorithms: discretization, sorting, linear algebra; communication-avoiding (CA) algorithms; irregular problems
    – Performance programming: latency hiding, managing locality within complicated memory hierarchies, load balancing, efficient data motion

  4. Background
  • CSE 260 will build on your existing background, generalizing programming techniques, algorithm design, and analysis
  • Background
    – Graduate standing
    – Recommended undergraduate background: computer architecture, operating systems, C/C++ programming
    – I will level the playing field for non-CSE students; see me if you are unsure about your background
  • Your background
    – CSME?
    – Parallel computation?
    – Numerical analysis?

  5. Background Markers
  • C/C++, Java, Fortran?
  • TLB misses
  • MPI
  • RPC
  • Multithreading
  • CUDA, GPUs
  • Abstract base class
  • Sparse factorization
  • Navier-Stokes equations, e.g. incompressibility and continuity:
      ∇ · u = 0,   Dρ/Dt + ρ ∇ · v = 0
  • Taylor series:
      f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + ...

  6. Course Requirements
  • 5 assignments
    – Pre-survey and registration: due Sunday @ 9pm
    – 3 programming labs
      • Teams of 2, with the option to switch teams
      • Find a partner using the "looking for a partner" Moodle forum
      • Each includes a lab report, with greater emphasis (in grading) on each successive lab
    – 1 in-class test toward the end of the course

  7. Text and readings
  • Required texts
    – Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2012)
    – An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
  • Assigned class readings will also include on-line materials
  • Lecture slides: www.cse.ucsd.edu/classes/fa15/cse260-a/Lectures

  8. Policies
  • Academic integrity
    – Do your own work
    – Plagiarism and cheating will not be tolerated
  • By taking this course, you implicitly agree to abide by the course policies: www.cse.ucsd.edu/classes/fa15/cse260-a/Policies.html

  9. Classroom participation
  • Class participation is important to keep the lecture active
  • Consider the slides as talking points, with class discussions driven by your interest
  • Complete the assigned readings before lecture and be prepared to discuss them in class
  • Different lecture modalities
    – The 2 minute pause
    – In-class problem solving

  10. The 2 minute pause
  • An opportunity in class to develop your understanding of the lecture
    – By trying to explain it to someone else
    – By getting your brain actively working on it
  • What will happen
    – I pose a question
    – You discuss it with 1-2 people around you
      • Most important is your understanding of why the answer is correct
    – After most people seem to be done
      • I'll ask for quiet
      • A few will share what their group talked about
        – Good answers are those where you were wrong, then realized…

  11. An Introduction to Parallel Computation
  • Principles
  • Technological disruption and its impact
  • Motivation – applications

  12. What is parallel processing?
  • We decompose a workload onto simultaneously executing physical processing resources to improve some aspect of performance (see the sketch below)
    – Speedup: 100 processors run 100× faster than one
    – Capability: tackle a larger problem, more accurately
    – Algorithmic, e.g. search
    – Locality: more cache memory and bandwidth
  • Multiple processors co-operate to process a related set of tasks – tightly coupled
  • Generally requires some form of communication and/or synchronization to manage the workload distribution
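
  As a concrete illustration of decomposing a workload across processing resources (a minimal sketch, not from the slides; OpenMP is one of the models this course covers):

      // Sketch: an OpenMP loop that splits saxpy iterations across cores.
      // Compile with, e.g.:  g++ -fopenmp saxpy.cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int n = 1 << 20;
          std::vector<double> x(n, 1.0), y(n, 2.0);
          const double a = 3.0;

          // Each thread gets a disjoint chunk of iterations; the implicit
          // barrier at the end of the loop synchronizes the team.
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];

          printf("y[0] = %g\n", y[0]);  // expect 5
          return 0;
      }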

  13. Have you written a parallel program?
  • Threads
  • MPI
  • RPC
  • C++11 Async (sketched below)
  • CUDA
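
  For anyone who hasn't seen the "C++11 Async" item in practice, here is a minimal, hypothetical sketch (not course code): summing the two halves of an array on two threads with std::async.

      #include <future>
      #include <iostream>
      #include <numeric>
      #include <vector>

      int main() {
          std::vector<int> v(1000, 1);
          auto mid = v.begin() + v.size() / 2;

          // Launch the first half-sum on another thread; compute the
          // second half on this one, then join via future::get().
          auto lo = std::async(std::launch::async,
                               [&] { return std::accumulate(v.begin(), mid, 0); });
          int hi = std::accumulate(mid, v.end(), 0);

          std::cout << lo.get() + hi << "\n";  // 1000
          return 0;
      }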

  14. The difference between Parallel Processing, Concurrency & Distributed Computing
  • Parallel processing
    – Performance (and capacity) is the main goal
    – More tightly coupled than distributed computation
  • Concurrency
    – Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions (see the mutex sketch below)
    – Performance need not be the main goal
  • Distributed computation
    – Geographically distributed
    – Multiple resources computing & communicating unreliably
    – "Cloud" computing: large amounts of storage, different from clusters in the cloud
    – Looser, coarser-grained communication and synchronization
  • May or may not involve separate physical resources, e.g. multitasking ("virtual parallelism")
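
  A minimal sketch of concurrency control in the sense used above (my hypothetical example, not from the deck): a mutex serializes updates to shared state so the result is correct, even though the serialization is about correctness rather than performance.

      #include <iostream>
      #include <mutex>
      #include <thread>

      int main() {
          long counter = 0;
          std::mutex m;
          auto work = [&] {
              for (int i = 0; i < 100000; i++) {
                  std::lock_guard<std::mutex> lock(m);  // critical section
                  ++counter;
              }
          };
          std::thread t1(work), t2(work);
          t1.join();
          t2.join();
          // Always 200000 with the lock; without it, lost updates could
          // leave the counter smaller.
          std::cout << counter << "\n";
          return 0;
      }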

  15. Granularity
  • A measure of how often a computation communicates, and at what scale
    – Distributed computer: a whole program
    – Multicomputer: a function, a loop nest
    – Multiprocessor: the above, plus a memory reference
    – Multicore: a single-socket implementation of a multiprocessor
    – GPU: kernel thread
    – Instruction-level parallelism: instruction, register

  16. An Introduction to Parallel Computation
  • Principles
  • Technological disruption and its impact
  • Motivation – applications

  17. Why is parallelism inevitable?
  • Physical limitations on heat dissipation impede processor clock speed increases
  • To make the processor faster, we replicate the computational elements

  18. Technological trends of scalable HPC systems
  • Hybrid processors
  • Complicated software-managed parallel memory hierarchy
  • Memory/core is shrinking
  • Communication costs are increasing relative to computational rate
  [Figure: peak performance (PFLOP/s) of Top500 systems, 2008-2013, with trend annotations "2x/year" and "2x/3-4 years" [Top500, '13]]

  19. The age of the multi-core processor
  • On-chip parallel computer
  • IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
  • First dual-core laptops (2005-6)
  • GPUs (nVidia, ATI): the desktop supercomputer
  • In smart phones, behind the dashboard (blog.laptopmag.com/nvidia-tegrak1-unveiled)
  • Everyone has a parallel computer at their fingertips
  • If we don't use parallelism, we lose it!
  [Image credit: realworldtech.com]

  20. The GPU
  • Specialized many-core processor
  • Massively multithreaded, long vectors
  • Reduced on-chip memory per core
  • Explicitly manage the memory hierarchy
  [Figure: GFLOPS of many-core GPUs (AMD, NVIDIA) vs. multicore CPUs (Intel dual-core, quad-core), 2001-2009. Courtesy: John Owens; Christopher Dyken, SINTEF]

  21. Performance and Implementation Issues
  • To cope with growing data motion costs (relative to computation):
    – Conserve locality
    – Hide latency
  • Little's Law [1961]: # threads = performance × latency, i.e. T = p × λ, with p and λ both increasing over time; e.g. p = 1-8 flops/cycle, λ = 500 cycles/word (worked example below)
  [Figure: the processor-memory gap: processor vs. memory (DRAM) performance by year]
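
  To make the slide's numbers concrete (my arithmetic, not on the slide): taking the upper end, p = 8 flops/cycle and λ = 500 cycles/word, Little's Law gives

      T = p × λ = 8 × 500 = 4000

  that is, on the order of 4,000 independent operations must be in flight to keep the processor busy while waiting on memory, which is why latency hiding demands such massive multithreading.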

  22. Consequences of evolutionary disruption
  • Transformational: new capabilities for predictive modelling, healthcare… benefits to society
  • Changes the common wisdom for solving a problem, including the implementation
  • Simplified processor design, but more user control over the hardware resources

  23. Today's mobile computer would have been yesterday's supercomputer
  • Cray-1 supercomputer
    – 80 MHz processor
    – 240 Mflops peak
    – 3.4 Mflops Linpack
    – 8 megabytes memory
    – Water cooled
    – 1.8m H x 2.2m W
    – 4 tons
    – Over $10M in 1976
  (www.anandtech.com/show/8716/apple-a8xs-gpu-gxa6850-even-better-than-i-thought)
