Using HMM to Blur the Lines between CPU and GPU Programming
John Hubbard, May 10, 2017
Heterogeneous Memory Management: Overview
Agenda for HMM:
- HMM Benefits
- SW-HW stack: where does HMM fit in?
- Definitions
- How HMM works
- Profiling with HMM
- A little bit of history
- References
- Conclusion
HMM Benefits
HMM Benefits
- Simpler code
Standard Unified Memory (CUDA 8.0):

    #include <stdio.h>
    #define LEN sizeof(int)

    __global__ void compute_this(int *pDataFromCpu)
    {
        atomicAdd(pDataFromCpu, 1);
    }

    int main(void)
    {
        int *pData = NULL;
        cudaMallocManaged(&pData, LEN);
        *pData = 1;

        compute_this<<<512,1000>>>(pData);
        cudaDeviceSynchronize();

        printf("Results: %d\n", *pData);
        cudaFree(pData);
        return 0;
    }

Unified Memory + HMM:

    #include <stdio.h>
    #include <stdlib.h>
    #define LEN sizeof(int)

    __global__ void compute_this(int *pDataFromCpu)
    {
        atomicAdd(pDataFromCpu, 1);
    }

    int main(void)
    {
        int *pData = (int*)malloc(LEN);
        *pData = 1;

        compute_this<<<512,1000>>>(pData);
        cudaDeviceSynchronize();

        printf("Results: %d\n", *pData);
        free(pData);
        return 0;
    }
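A possible build line (an assumption, not from the slides), matching the hmm_app binary profiled later in this deck and targeting Pascal-class hardware, which the HMM path requires for replayable GPU page faults:

    nvcc -arch=sm_60 hmm_app.cu -o hmm_app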
HMM Benefits
- Simpler code
- Code is still tunable
Profiling with Unified Memory: Visual Profiler (screenshot)
Source: https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal
HMM Benefits
- Simpler code
- Code is still tunable
- Libraries can be used without changing them
HMM Benefits
- Simpler code
- Code is still tunable
- Libraries can be used without changing them
- New programming languages are easily supported
SW-HW stack: where does HMM fit in? (top to bottom)
- CUDA application
- libcudart
- libcuda
- (user-space / kernel boundary)
- GPU driver, including the Unified Memory driver (with HMM support)
- Linux kernel, reached through the Linux kernel API and the HMM API
- GPU hardware
Definitions
- Page: 4KB, 64KB, 2MB, etc. of physically contiguous memory. Smallest unit handled by the OS.
- OS: Operating System.
- Kernel: Linux operating system internals (not a CUDA kernel!).
- Page table: sparse tree containing virtual-to-physical address translations.
- Page table entry: a single (page's worth of) virtual-to-physical translation.
- To map a (physical) page: create a page table entry for that page.
- Unmap: remove a page table entry. Subsequent program accesses will cause page faults.
- Page fault: a CPU (or GPU) exception caused by a missing page table entry for a virtual address.
- Page migration: unmap a page from the CPU, copy it to the GPU, map it on the GPU (or the reverse). Also GPU-to-GPU.
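A minimal CPU-side sketch (not from the slides, ordinary Linux C) that makes these definitions concrete: it queries the OS page size and maps an anonymous region; the first write to each page triggers a CPU page fault, which the kernel services by creating a page table entry before the program resumes.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);   /* typically 4KB on x86-64 */
        size_t len = 4 * (size_t)page_size;

        /* Reserve 4 pages of virtual address space; no physical pages are mapped yet. */
        char *buf = (char *)mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* First touch of each page causes a page fault; the kernel allocates a
           physical page and creates the page table entry, then execution resumes. */
        for (size_t i = 0; i < len; i += page_size)
            buf[i] = 1;

        printf("page size: %ld bytes, touched %zu pages\n", page_size, len / page_size);
        munmap(buf, len);
        return 0;
    }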
How HMM works - 1: the basic cycle
- GPU page fault → migrate the page to the GPU
- CPU page fault → migrate the page to the CPU
How HMM works - 2: CPU page fault
1. CPU page fault occurs
2. HMM receives the page fault and calls the UM driver
3. UM copies the page data to the CPU and unmaps the page from the GPU
4. HMM maps the page on the CPU
5. The OS kernel resumes the CPU code
How HMM works - 3: GPU page fault
1. GPU page fault occurs
2. The UM driver receives the page fault
3. The UM driver fails to find the page in its records
4. UM asks HMM about the page; HMM has a malloc record of the page
5. UM tells HMM that the page will be migrated from CPU to GPU
6. HMM unmaps the page from the CPU
7. UM copies the page data to the GPU
8. UM causes the GPU to resume execution ("replays" the page fault)
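A sketch of where the two fault paths above would fire, reusing the malloc-based example from these slides and assuming an HMM-enabled kernel/driver stack; the comments map source lines to the GPU-fault and CPU-fault sequences just described.

    #include <stdio.h>
    #include <stdlib.h>

    __global__ void compute_this(int *pDataFromCpu)
    {
        atomicAdd(pDataFromCpu, 1);
    }

    int main(void)
    {
        int *pData = (int *)malloc(sizeof(int));  // plain malloc: HMM keeps a record of this page
        *pData = 1;                               // page is resident and mapped on the CPU

        // First GPU access faults: UM asks HMM about the page, HMM unmaps it from
        // the CPU, UM copies it to the GPU and replays the fault ("How HMM works - 3").
        compute_this<<<512, 1000>>>(pData);
        cudaDeviceSynchronize();

        // First CPU access after the kernel faults: HMM calls the UM driver, the page
        // is copied back and mapped on the CPU, and the read resumes ("How HMM works - 2").
        printf("Results: %d\n", *pData);

        free(pData);
        return 0;
    }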
Profiling with Unified Memory + HMM

This is the code that we are profiling in the next slide (the Unified Memory + HMM version):

    #include <stdio.h>
    #include <stdlib.h>
    #define LEN sizeof(int)

    __global__ void compute_this(int *pDataFromCpu)
    {
        atomicAdd(pDataFromCpu, 1);
    }

    int main(void)
    {
        int *pData = (int*)malloc(LEN);
        *pData = 1;

        compute_this<<<512,1000>>>(pData);
        cudaDeviceSynchronize();

        printf("Results: %d\n", *pData);
        free(pData);
        return 0;
    }
Profiling with Unified Memory + HMM: nvprof

    $ /usr/local/cuda/bin/nvprof --unified-memory-profiling per-process-device ./hmm_app
    ==19835== NVPROF is profiling process 19835, command: ./hmm_app
    Results: 512001
    ==19835== Profiling application: ./hmm_app
    ==19835== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
    100.00%  1.2904ms      1  1.2904ms  1.2904ms  1.2904ms  compute_this(int*)

    ==19835== Unified Memory profiling result:
    Device "GeForce GTX 1050 Ti (0)"
    Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
        2  32.000KB  4.0000KB  60.000KB  64.00000KB  42.62400us  Host To Device
        2  32.000KB  4.0000KB  60.000KB  64.00000KB  37.98400us  Device To Host
        1         -         -         -           -  1.179410ms  GPU Page fault groups
    Total CPU Page faults: 2

    ==19835== API calls:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     98.88%  388.41ms      1  388.41ms  388.41ms  388.41ms  cudaMallocManaged
      0.39%  1.5479ms    190  8.1470us     768ns  408.58us  cuDeviceGetAttribute
      0.33%  1.3125ms      1  1.3125ms  1.3125ms  1.3125ms  cudaDeviceSynchronize
      0.19%  739.71us      2  369.86us  363.81us  375.90us  cuDeviceTotalMem
      0.13%  524.45us      1  524.45us  524.45us  524.45us  cudaFree
      0.04%  137.87us      1  137.87us  137.87us  137.87us  cudaLaunch
      0.03%  126.84us      2  63.417us  58.109us  68.726us  cuDeviceGetName
      0.00%  11.524us      1  11.524us  11.524us  11.524us  cudaConfigureCall
      0.00%  6.4950us      1  6.4950us  6.4950us  6.4950us  cudaSetupArgument
      0.00%  6.2160us      6  1.0360us     768ns  1.2570us  cuDeviceGet
      0.00%  4.5400us      3  1.5130us     838ns  2.6540us  cuDeviceGetCount
Typical Bandwidths, in GB/s
- CPU: DDR4, local access: 96
- GPU: Pascal, local access: 750
- PCIe 3.0: 12
- NVLink 1.0: 80
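As a rough back-of-the-envelope illustration (not from the slides) of why page placement matters: migrating a single 2MB page over PCIe 3.0 at roughly 12 GB/s takes on the order of 170 microseconds, while reading the same 2MB from local Pascal memory at roughly 750 GB/s takes only about 3 microseconds, so repeatedly faulting and migrating hot pages across the bus can easily dominate runtime.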
Tuning still works
- cudaMemPrefetchAsync: this is the new cudaMemcpy
- cudaMemAdvise, with:
  - cudaMemAdviseSetReadMostly
  - cudaMemAdviseSetPreferredLocation
  - cudaMemAdviseSetAccessedBy
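A minimal sketch (not from the slides) of these hints, shown here on cudaMallocManaged memory; with HMM the intent is that the same calls also apply to plain malloc'd allocations. Device 0 and the default stream are illustrative assumptions.

    #include <stdio.h>

    __global__ void compute_this(int *p) { atomicAdd(p, 1); }

    int main(void)
    {
        int *pData = NULL;
        size_t len = sizeof(int);
        cudaMallocManaged(&pData, len);
        *pData = 1;

        // Hint that the GPU (device 0) is the page's preferred home, and prefetch it
        // there before the kernel launches instead of taking a GPU page fault on first access.
        cudaMemAdvise(pData, len, cudaMemAdviseSetPreferredLocation, 0);
        cudaMemPrefetchAsync(pData, len, 0, 0);

        compute_this<<<512, 1000>>>(pData);

        // Prefetch the result back to the CPU (cudaCpuDeviceId) so the printf
        // below does not take a CPU page fault.
        cudaMemPrefetchAsync(pData, len, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        printf("Results: %d\n", *pData);
        cudaFree(pData);
        return 0;
    }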
Profiling with Unified Memory: Visual Profiler (screenshot)
Source: https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal
HMM History
HMM History
- Prehistoric: Pascal replayable page-faulting hardware is envisioned and spec'd out
- 2012: discussions with Red Hat and Jerome Glisse begin
- April 2014: CUDA 6.0: first ever release of Unified Memory; CPU page faults but no GPU page faults. Works surprisingly well...
- May 2014: HMM v1 posted to linux-mm and linux-kernel
- November 2014: HMM patchset review: Linus Torvalds: "NONE OF WHAT YOU SAY MAKES ANY SENSE"
- Mid-2016: Pascal GPUs become available (a Linux kernel prerequisite)
- March 2017: linux-mm summit: HMM a major topic of discussion
- May 2017: HMM v21 posted (3-year anniversary)
References
- https://devblogs.nvidia.com/parallelforall/inside-pascal/
- https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
- http://docs.nvidia.com/cuda/cuda-c-programming-guide
- http://www.spinics.net/lists/linux-mm/msg126148.html (HMM v21 patchset)
Conclusion