CS 5220: Single core architecture
David Bindel
2017-08-29

Just for fun
http://www.youtube.com/watch?v=fKK933KK6Gg
Is this a fair portrayal of your CPU?
(See Rich Vuduc’s talk, “Should I port my code to a GPU?”)

The idealized machine
• Address space of named words
• Basic operations are register read/write, logic, arithmetic
• Everything runs in the program order
• High-level language → “obvious” machine code
• All operations take about the same amount of time

The real world
• Memory operations are not all the same!
  • Registers and caches lead to variable access speeds
  • Different memory layouts dramatically affect performance
• Instructions are non-obvious!
  • Pipelining allows instructions to overlap
  • Functional units run in parallel (and out of order)
  • Instructions take different amounts of time
  • Different costs for different orders and instruction mixes
Our goal: enough understanding to help the compiler out.

Prelude
We hold these truths to be self-evident:
1. One should not sacrifice correctness for speed
2. One should not re-invent (or re-tune) the wheel
3. Your time matters more than computer time
Less obvious, but still true:
1. Most of the time goes to a few bottlenecks
2. The bottlenecks are hard to find without measuring
3. Communication is expensive (and often a bottleneck)
4. A little good hygiene will save your sanity
   • Automate testing, time carefully, and use version control

A sketch of reality
Today, a play in two acts [1]:
1. Act 1: One core is not so serial
2. Act 2: Memory matters
[1] If you don’t get the reference to This American Life, go find the podcast!

Act 1
One core is not so serial.

Parallel processing at the laundromat
• Three stages to laundry: wash, dry, fold.
• Three loads: darks, lights, underwear
• How long will this take?

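Parallel processing at the laundromat
• Serial version: wash, dry, fold for each load in turn, filling time slots 1 to 9
• Pipeline version: the three loads overlap (one washes while another dries), filling time slots 1 to 5
[Figure: timelines of the serial and pipelined schedules. The time freed by pipelining is labeled “Dinner?”, “Cat videos?”, “Gym and tanning?”]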

Pipelining
• Pipelining improves bandwidth, but not latency
• Potential speedup = number of stages
  • But what if there’s a branch?
• Different pipelines for different functional units
  • Front-end has a pipeline
  • Functional units (FP adder, FP multiplier) pipelined
  • Divider is frequently not pipelined
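The laundromat arithmetic can be checked with a small sketch. It assumes every stage takes one time slot and that nothing stalls the pipeline; the function names are illustrative and not from the lecture.

```python
def serial_time(loads: int, stages: int) -> int:
    """Time slots needed when each load finishes all stages before the next starts."""
    return loads * stages


def pipelined_time(loads: int, stages: int) -> int:
    """Time slots needed when a new load enters the pipeline every slot.

    The first load takes `stages` slots; each later load finishes one slot
    after the previous one (assumes unit-time stages and no stalls).
    """
    if loads == 0:
        return 0
    return stages + loads - 1


# Three loads (darks, lights, underwear), three stages (wash, dry, fold).
print(serial_time(3, 3))     # 9
print(pipelined_time(3, 3))  # 5

# With many loads the speedup approaches the number of stages, but the
# latency of a single load is still `stages` slots.
speedup = serial_time(1000, 3) / pipelined_time(1000, 3)
print(round(speedup, 3))     # 2.994
```

A single load still takes three slots in both schedules, which is the sense in which pipelining helps bandwidth but not latency.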

Out-of-order execution
Modern CPUs are wide and out-of-order:
• Wide: Fetch/decode or retire multiple ops at once
  • Limits: Instruction mix (different ports for different ops)
  • NB: May dynamically translate to micro-ops
• Out-of-order: Looks in-order, internally not!
  • Limits: Data dependencies
• Details are very hard to work out manually
  • Don’t generally know the micro-op breakdown!
  • Tricky to think through even if we did
• Compilers help a lot with this
  • But they need a good mix of independent ops

SIMD
• Single Instruction Multiple Data
• Cray-1 (1976): 8 registers × 64 words of 64 bits each
• Old idea had a resurgence in mid-late 90s (for graphics)
• Now short vectors are ubiquitous...
  • Totient CPUs: 256 bits (four doubles) in a vector (AVX)
  • Totient accel: 512 bits (eight doubles) in a vector (AVX-512)
  • And then there are GPUs!
• Alignment often matters

Example: My laptop
MacBook Pro (Retina, 13 in, late 2013).
• Intel Core i5-4288U CPU at 2.6 GHz. 2 core / 4 thread.
• AVX units provide up to 8 double flops/cycle (Simultaneous vector add + vector multiply)
• Wide dynamic execution: up to four full instructions at once
• Haswell has two FMA ports, so can retire two at a time
• Operations internally broken down into “micro-ops”
  • Cache micro-ops – like a hardware JIT?!
Theoretical peak: 83.2 GFlop/s?
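One accounting that reproduces the 83.2 GFlop/s figure is sketched below. It assumes both cores issue two 4-wide double-precision FMAs every cycle and that each FMA counts as two flops; the slide does not spell out this breakdown, so treat the factors as an assumption. The function and parameter names are hypothetical.

```python
def peak_gflops(clock_ghz: float, cores: int, fma_ports: int,
                vector_lanes: int, flops_per_fma: int = 2) -> float:
    """Theoretical peak in GFlop/s under the stated assumptions."""
    flops_per_cycle_per_core = fma_ports * vector_lanes * flops_per_fma
    return clock_ghz * cores * flops_per_cycle_per_core


# 2.6 GHz, 2 cores, 2 FMA ports, 4 doubles per 256-bit AVX vector.
peak = peak_gflops(clock_ghz=2.6, cores=2, fma_ports=2, vector_lanes=4)
print(f"{peak:.1f} GFlop/s")  # 83.2 GFlop/s
```

Real code rarely sustains this rate: it requires a steady stream of independent, vectorized FMAs with all operands already in registers.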

Punchline
• Special features: SIMD instructions, maybe FMAs, ...
• Compiler understands how to utilize these in principle
  • Rearranges instructions to get a good mix
  • Tries to make use of FMAs, SIMD instructions, etc
• In practice, needs some help:
  • Set optimization flags, pragmas, etc
  • Rearrange code to make things obvious and predictable
  • Use special intrinsics or library routines
  • Choose data layouts, algorithms that suit the machine
• Goal: You handle high-level, compiler handles low-level.

Act 2
Memory matters.

My machine
• Theoretical peak flop rate: 83.2 GFlop/s
• Peak memory bandwidth: 25.6 GB/s
• Arithmetic intensity = flops / memory accesses
• Example: Sum several million doubles (AI = 1) – how fast?
• So what can we do? Not much if lots of fetches, but...
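A rough answer to the "how fast?" question, as a sketch. It assumes the sum streams every 8-byte double from main memory exactly once at the full 25.6 GB/s and does one add per double; the function name is illustrative.

```python
def bandwidth_bound_gflops(bandwidth_gb_s: float, bytes_per_access: int,
                           flops_per_access: float) -> float:
    """Upper bound on flop rate when every operand must come from memory."""
    giga_accesses_per_s = bandwidth_gb_s / bytes_per_access
    return giga_accesses_per_s * flops_per_access


# Summing doubles: one add per 8-byte double fetched (AI = 1 flop/access).
bound = bandwidth_bound_gflops(bandwidth_gb_s=25.6, bytes_per_access=8,
                               flops_per_access=1.0)
print(f"{bound:.1f} GFlop/s")                  # 3.2 GFlop/s
print(f"{100 * bound / 83.2:.1f}% of peak")    # 3.8% of peak
```

Under these assumptions the sum is limited by memory, not arithmetic, which is why reusing data already in cache matters so much.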

Cache basics
Programs usually have locality
• Spatial locality: things close to each other tend to be accessed consecutively
• Temporal locality: use a “working set” of data repeatedly
Cache hierarchy built to use locality.

Cache basics
• Memory latency = how long to get a requested item
• Memory bandwidth = how fast memory can provide data
• Bandwidth improving faster than latency
Caches help:
• Hide memory costs by reusing data
  • Exploit temporal locality
• Use bandwidth to fetch a cache line all at once
  • Exploit spatial locality
• Use bandwidth to support multiple outstanding reads
• Overlap computation and communication with memory
  • Prefetching
This is mostly automatic and implicit.

Cache basics
• Store cache lines of several bytes
• Cache hit when copy of needed data in cache
• Cache miss otherwise. Three basic types:
  • Compulsory miss: never used this data before
  • Capacity miss: filled the cache with other things since this was last used – working set too big
  • Conflict miss: insufficient associativity for access pattern
• Associativity
  • Direct-mapped: each address can only go in one cache location (e.g. store address xxxx1101 only at cache location 1101)
  • n-way: each address can go into one of n possible cache locations (store up to 16 words with addresses xxxx1101 at cache location 1101).
Higher associativity is more expensive.
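To see a conflict miss in action, here is a toy cache model. It is a sketch under simplifying assumptions: one word per cache line, LRU replacement within a set, and the set index taken from the low bits of the address. The class and function names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache: `num_sets` sets, `ways` lines per set, LRU within a set."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Touch one word address; return True on a hit."""
        lines = self.sets[address % self.num_sets]  # low bits pick the set
        if address in lines:
            lines.move_to_end(address)              # mark most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)               # evict least recently used
        lines[address] = True
        return False


def run(cache: SetAssociativeCache, trace) -> tuple:
    """Replay a list of addresses and return (hits, misses)."""
    for address in trace:
        cache.access(address)
    return cache.hits, cache.misses


# Alternate between two addresses that share their low bits (xxxx1101).
trace = [0b00001101, 0b00011101] * 8

# Same total capacity (16 words), different associativity.
direct_mapped = run(SetAssociativeCache(num_sets=16, ways=1), trace)
two_way = run(SetAssociativeCache(num_sets=8, ways=2), trace)
print(direct_mapped)  # (0, 16): the two addresses keep evicting each other
print(two_way)        # (14, 2): only the two compulsory misses remain
```

The working set here is just two words, so none of the direct-mapped misses after the first two are capacity misses; they come entirely from insufficient associativity.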

Teaser
We have N = 10^6 two-dimensional coordinates, and want their centroid. Which of these is faster and why?
1. Store an array of (x_i, y_i) coordinates. Loop i and simultaneously sum the x_i and the y_i.
2. Store an array of (x_i, y_i) coordinates. Loop i and sum the x_i, then sum the y_i in a separate loop.
3. Store the x_i in one array, the y_i in a second array. Sum the x_i, then sum the y_i.
Let’s see!
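The three variants can be written out as follows. This sketch only makes the access patterns concrete: in Python, interpreter overhead swamps cache effects, so timing it will not answer the teaser; a compiled language is needed for that. The function names and the small N are illustrative assumptions.

```python
import random
from array import array


def centroid_interleaved_one_pass(xy: array, n: int) -> tuple:
    """Variant 1: array of (x, y) pairs, sum x and y in the same loop."""
    sx = sy = 0.0
    for i in range(n):
        sx += xy[2 * i]
        sy += xy[2 * i + 1]
    return sx / n, sy / n


def centroid_interleaved_two_pass(xy: array, n: int) -> tuple:
    """Variant 2: array of (x, y) pairs, one stride-2 loop per coordinate."""
    sx = 0.0
    for i in range(n):
        sx += xy[2 * i]
    sy = 0.0
    for i in range(n):
        sy += xy[2 * i + 1]
    return sx / n, sy / n


def centroid_separate(xs: array, ys: array) -> tuple:
    """Variant 3: separate x and y arrays, one unit-stride loop per array."""
    n = len(xs)
    sx = 0.0
    for x in xs:
        sx += x
    sy = 0.0
    for y in ys:
        sy += y
    return sx / n, sy / n


# Small N so the demo runs quickly; the slide uses N = 10^6.
n = 1000
random.seed(0)
xs = array("d", (random.random() for _ in range(n)))
ys = array("d", (random.random() for _ in range(n)))
xy = array("d")
for x, y in zip(xs, ys):
    xy.append(x)
    xy.append(y)

c1 = centroid_interleaved_one_pass(xy, n)
c2 = centroid_interleaved_two_pass(xy, n)
c3 = centroid_separate(xs, ys)
print(c1 == c2 == c3)  # True: same additions in the same order
```

Variant 2 streams the whole interleaved array through the cache twice and uses only half of each cache line per pass, whereas variants 1 and 3 use every byte they fetch.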

Caches on my laptop (I think)
• 32 KB L1 data and instruction caches (per core), 8-way associative
• 256 KB L2 cache (per core), 8-way associative
• 3 MB L3 cache (shared by all cores)

A memory benchmark (membench)
for array A of length L from 4 KB to 8 MB by 2x
  for stride s from 4 bytes to L/2 by 2x
    time the following loop
      for i = 0 to L by s
        load A[i] from memory
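The loop structure above can be sketched in Python as follows. This is not the membench code itself, which is a compiled benchmark; the names `sweep` and `time_strided_loads` are made up here, and in Python the interpreter overhead hides most cache effects, so the sketch is mainly useful for seeing which (size, stride) pairs get measured.

```python
import time


def sweep(min_size=4 * 1024, max_size=8 * 1024 * 1024, min_stride=4):
    """Yield (array size, stride) pairs in bytes, doubling each as in the pseudocode."""
    size = min_size
    while size <= max_size:
        stride = min_stride
        while stride <= size // 2:
            yield size, stride
            stride *= 2
        size *= 2


def time_strided_loads(buf: bytearray, stride: int, repeats: int = 1) -> float:
    """Return the average time per load (ns) for a strided pass over `buf`."""
    total = 0
    count = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for i in range(0, len(buf), stride):
            total += buf[i]  # the load being timed
            count += 1
    elapsed = time.perf_counter() - start
    return elapsed / count * 1e9


# Reduced sweep (4 KB to 16 KB) so the demo finishes quickly.
pairs = list(sweep(max_size=16 * 1024))
print(len(pairs))  # 33
for size, stride in pairs[:3]:
    ns = time_strided_loads(bytearray(size), stride)
    print(size, stride, f"{ns:.1f} ns/load")
```

A real measurement would repeat each loop many times, subtract the loop overhead, and keep the compiler from optimizing the loads away.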

membench on my laptop – what do you see?
[Figure: time per access (ns) versus stride (bytes, 2^3 to 2^24), one curve per array size from 4.0K to 64.0M.]

membench on my laptop – what do you see?
[Figure: heat map of access time; axes are log2(stride) and log2(size).]

membench on my laptop – what do you see?
• Vertical: 64B line size (2^6), 4K page size (2^12)
• Diagonal: 8-way cache associativity, 512 entry L2 TLB
• Horizontal: 32K L1 (2^15), 256K L2 (2^18), 6 MB L3
[Figure: heat map of access time; axes are log2(stride) and log2(size).]

membench on Totient – what do you see?
[Figure: heat map of access time; axes are log2(stride) and log2(size).]

The moral
Even for simple programs, performance is a complicated function of architecture!
• Need to understand at least a little to write fast programs
• Would like simple models to help understand efficiency
• Would like common tricks to help design fast codes
  • Example: blocking (also called tiling)
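As one illustration of blocking, here is a tiled matrix transpose. The slide only names the technique, so the choice of transpose as the example, the function names, and the block size are all assumptions. The sketch shows the loop structure; the payoff in cache reuse shows up in a compiled language rather than in Python.

```python
def transpose_naive(a: list, n: int) -> list:
    """Transpose an n x n row-major matrix stored as a flat list."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]  # writes jump by n elements each step
    return out


def transpose_blocked(a: list, n: int, block: int = 32) -> list:
    """Same result, but work on block x block tiles so each tile stays in cache."""
    out = [0.0] * (n * n)
    for ii in range(0, n, block):                       # tile row
        for jj in range(0, n, block):                   # tile column
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out


n = 5
a = [float(k) for k in range(n * n)]
print(transpose_blocked(a, n, block=2) == transpose_naive(a, n))  # True
```

The idea is to pick the block size so that a source tile and a destination tile fit in cache together, turning the long-stride accesses of the naive loop into reuse within each tile.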

Coda
The Roofline Model.

Roofline model
S. Williams, A. Waterman, D. Patterson, “Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures,” CACM, April 2009.

Roofline plot basics
Log-log plot (base 2)
• x: Operational intensity (flops/byte)
• y: Attainable performance (GFlop/s)
• Diagonals: Memory limits
• Horizontals: Compute limits
• Papers: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
• Tools: https://bitbucket.org/berkeleylab/cs-roofline-toolkit
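The basic roofline bound is the minimum of the compute limit and the memory limit at a given operational intensity. A minimal sketch, using the laptop figures from earlier in the lecture (83.2 GFlop/s peak, 25.6 GB/s bandwidth) as default parameters; the function name is illustrative.

```python
def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float = 83.2,
                      bandwidth_gb_s: float = 25.6) -> float:
    """Roofline bound: min(compute limit, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)


# Sample the roofline at powers of two, matching the base-2 log-log axes.
for exponent in range(-3, 4):
    intensity = 2.0 ** exponent
    print(f"{intensity:6.3f} flops/byte -> {attainable_gflops(intensity):5.1f} GFlop/s")

# The ridge point where the diagonal meets the horizontal.
ridge = 83.2 / 25.6
print(f"ridge at {ridge:.2f} flops/byte")  # ridge at 3.25 flops/byte
```

Kernels to the left of the ridge point are memory-bound. For example, one flop per 8-byte double is 0.125 flops/byte, which gives a bound of 3.2 GFlop/s on this machine.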
