GRADUATE FELLOW FAST FORWARD Bill Dally, Chief Scientist and SVP Research, NVIDIA Thursday, May 11, 2017
GRADUATE FELLOWSHIP PROGRAM
Funding for Ph.D. students revolutionizing disciplines with the GPU
Engage:
• Build mindshare
• Facilitate recruiting
Learn:
• Keep a finger on the pulse of leading academic research
• Keep up with all the applications that are powered by GPUs
Leverage:
• Track relevant research
• Help to guide researchers working on relevant problems
GRADUATE FELLOWSHIP PROGRAM
144 Graduate Fellowships awarded ($3.8M) since program inception in 2002
Eligibility/Application Process:
• Ph.D. candidates in at least their 2nd year
• Nomination(s) by Professor(s)/Advisor
• 1-2 page research proposal
Selection Process:
• A committee of NVIDIA scientists and engineers reviews applications
• Applications evaluated for originality, potential, and relevance
CURRENT 2016-2017 GRAD FELLOWS
• Saman Ashkiani, UC Davis
• Yong He, CMU
• Yatish Turakhia, Stanford
• Gang Wu, University of Sussex
• Minjie Wang, NYU
• Jiajun Wu, MIT (NVIDIA Foundation Fellow)
CURRENT 2016-2017 GRAD FELLOW FINALISTS
• Ahmed Elkholy, University of Illinois at Urbana-Champaign
• Achuta Kadambi, Massachusetts Institute of Technology
• Caroline Trippel, Princeton
• Yu-Hang Tang, Brown University
• Ling-Qi Yan, University of California at Berkeley
AGENDA
6 talks, 3 minutes each
JIAJUN WU, MIT
SINGLE IMAGE 3D INTERPRETER NETWORK Jiajun Wu, May 11, 2017
GOAL
3D INTERPRETER NETWORK (3D-INN)
Pipeline: 2D Keypoint Estimation → 3D Interpreter → 3D-to-2D Projection, supervised with 2D keypoint labels
Three-step training paradigm:
I: 2D Keypoint Estimation
II: 3D Interpreter
III: End-to-end Fine-tuning
3D ESTIMATION: QUALITATIVE RESULTS
Training: our Keypoint-5 dataset, 2K images per category
Results shown on the Keypoint-5 dataset, the IKEA Dataset [Lim et al., '13], and the SUN Database [Xiao et al., '11] (input vs. after end-to-end fine-tuning)
CHAIR EMBEDDING
Manifold of chairs based on their inferred viewpoint
CONTRIBUTIONS OF 3D-INN
• Single-image 3D perception
• Real 2D labels + synthetic 3D models, connected via keypoints
• A 3D-to-2D projection layer for end-to-end training (a minimal sketch follows)
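For context, a minimal sketch (not the authors' implementation) of what a 3D-to-2D keypoint projection computes, assuming a scaled orthographic camera with an Euler-angle rotation; the parameter names and the toy cube keypoints are illustrative. In 3D-INN this operation is built as a network layer so the 2D re-projection loss can be back-propagated end to end.

```python
# Minimal sketch: scaled orthographic projection of 3D keypoints to 2D.
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix from Euler angles (radians), Rz * Ry * Rx convention."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_keypoints(X, rx, ry, rz, scale, t2d):
    """Project N x 3 keypoints X to N x 2 image coordinates."""
    R = euler_to_rotation(rx, ry, rz)
    cam = X @ R.T                      # rotate points into the camera frame
    return scale * cam[:, :2] + t2d    # drop depth, then scale and translate

# Toy usage: project the 8 corners of a unit cube (a stand-in for a chair skeleton)
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
uv = project_keypoints(corners, rx=0.3, ry=0.6, rz=0.0,
                       scale=120.0, t2d=np.array([160.0, 120.0]))
print(uv.round(1))
```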
YATISH TURAKHIA, STANFORD
DARWIN: A GENOMICS CO-PROCESSOR Yatish Turakhia, 05/11/2017
GENOME ANALYSIS PIPELINE
1. DNA sequencer produces reads (e.g., ATGTCGAT, CGATACGA, GAGTCATC, ACTGACGT) from the patient genome (3 billion base pairs)
2. Read assembly (sequence alignment) against a reference:
   REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
   PATIENT:   --ATGTCAATGAT-CAGAGGATATTAGGATAT-
3. Find the disease-causing mutations
• Long reads (>10 Kbp) offer a better resolution of the mutation spectrum but have a high error rate (15-40%)
• >1,300 CPU hours for reference-guided assembly of noisy long reads
• >15,600 CPU hours for de novo assembly of noisy long reads
DARWIN: SEQUENCE ALIGNMENT FRAMEWORK
Hardware: 40 nm ASIC, 300 mm², 9 W, implementing D-SOFT (seed) and GACT (extend) over a query (Q) and reference (R); a software layer exposes the D-SOFT and GACT APIs
High speed and programmability:
1. D-SOFT: tunable speed/sensitivity to match different error profiles (a simplified seed-filtering sketch follows)
2. GACT: first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware and making it well suited to long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware
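For intuition, a simplified Python sketch of diagonal-band seed filtering in the spirit of D-SOFT. It counts seed hits per diagonal band rather than non-overlapping seed bases, and the seed length, band width, threshold, and sequences are made-up values, not Darwin's parameters.

```python
# Simplified seed-and-filter sketch: bin seed hits by diagonal band and keep
# bands whose hit count clears a threshold as candidate alignment locations.
from collections import defaultdict

K, BAND, THRESH = 4, 8, 2       # seed length, diagonal-band width, hit threshold (toy values)

def build_seed_table(ref):
    """Map every k-mer in the reference to its list of positions."""
    table = defaultdict(list)
    for j in range(len(ref) - K + 1):
        table[ref[j:j + K]].append(j)
    return table

def dsoft_filter(query, ref_table):
    """Return approximate reference offsets of diagonal bands with enough seed hits."""
    band_hits = defaultdict(int)
    for i in range(len(query) - K + 1):
        for j in ref_table.get(query[i:i + K], ()):
            band_hits[(j - i) // BAND] += 1     # band index of this hit's diagonal
    return sorted(b * BAND for b, n in band_hits.items() if n >= THRESH)

ref = "TTACGTTAGCGGTAACGGTAGGCTAA"
print(dsoft_filter("ACGTTAGCCGTAACGTAGGCT", build_seed_table(ref)))
```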
DARWIN: REFERENCE-GUIDED ASSEMBLY
Reads (~10 Kbp) are aligned to the reference genome (~3 Gbp): D-SOFT produces candidate alignment start locations from seed hits, then GACT extends each candidate tile by tile (1st GACT tile, then extended GACT tiles) with trace-back, accumulating the alignment score
DARWIN: DE NOVO ASSEMBLY
Reads are aligned against each other: D-SOFT produces candidate start locations from seed hits, then GACT extends tile by tile with trace-back to infer read overlaps (a simplified sketch of GACT-style tiling follows)
D-SOFT hardware (40-100X speedup):
1. Sequential accesses to multiple DRAM channels
2. Random accesses using large on-chip memory (64 MB)
GACT hardware (6000X speedup):
1. 512 Processing Elements (PEs) solving dynamic programming equations every cycle
2. Trace-back pointers maintained in on-chip memory (2 KB/PE)
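For intuition, a simplified Python sketch of GACT-style tiled extension (not the Darwin hardware or its exact algorithm): each tile runs ordinary dynamic programming in O(T²) memory, and only the non-overlap part of its trace-back is committed, so memory stays constant no matter how long the reads are. Scores, tile size, overlap, and sequences are toy values.

```python
import numpy as np

MATCH, MISMATCH, GAP = 2, -3, -2
T, O = 8, 3                             # tile size and tile overlap (toy values)

def tile_dp(q, r):
    """Smith-Waterman-style scores over one tile (O(T^2) memory)."""
    H = np.zeros((len(q) + 1, len(r) + 1), dtype=int)
    for i in range(1, len(q) + 1):
        for j in range(1, len(r) + 1):
            s = MATCH if q[i - 1] == r[j - 1] else MISMATCH
            H[i, j] = max(0, H[i - 1, j - 1] + s, H[i - 1, j] + GAP, H[i, j - 1] + GAP)
    return H

def traceback(H, q, r, i, j):
    """Walk back from cell (i, j) until the score drops to zero; oldest cell first."""
    path = []
    while i > 0 and j > 0 and H[i, j] > 0:
        path.append((i, j))
        s = MATCH if q[i - 1] == r[j - 1] else MISMATCH
        if H[i, j] == H[i - 1, j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i, j] == H[i - 1, j] + GAP:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def gact_extend(query, ref):
    """Left-to-right tile-by-tile extension; memory never exceeds one tile."""
    qi = rj = 0
    aligned = []                        # committed (query_pos, ref_pos) pairs
    while qi < len(query) and rj < len(ref):
        q_t, r_t = query[qi:qi + T], ref[rj:rj + T]
        H = tile_dp(q_t, r_t)
        i_max, j_max = np.unravel_index(int(np.argmax(H)), H.shape)
        if H[i_max, j_max] == 0:
            break
        path = traceback(H, q_t, r_t, i_max, j_max)
        last_tile = qi + T >= len(query) or rj + T >= len(ref)
        # commit the whole path on the last tile; otherwise hold back the
        # overlap region so the next tile can re-resolve it
        commit = path if last_tile else [(i, j) for i, j in path if i <= T - O and j <= T - O]
        if not commit:
            break
        aligned += [(qi + i, rj + j) for i, j in commit]
        qi += commit[-1][0]
        rj += commit[-1][1]
        if last_tile:
            break
    return aligned

pairs = gact_extend("ACGTTAGCCGTAACGTAGGCT", "ACGTTAGCGGTAACGGTAGGCT")
print(len(pairs), "aligned positions")
```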
DARWIN PERFORMANCE
Reference-guided assembly
READ TYPE           | ERROR RATE | SENSITIVITY (SOFTWARE) | SENSITIVITY (DARWIN) | DARWIN SPEEDUP
Pacific Biosciences | 15%        | 95.95%                 | 99.91%               | 4,110X
Oxford Nanopore 2D  | 30%        | 98.11%                 | 98.40%               | 4,080X
Oxford Nanopore 1D  | 40%        | 97.10%                 | 97.40%               | 128X
De novo assembly
READ TYPE           | ERROR RATE | SENSITIVITY (SOFTWARE) | SENSITIVITY (DARWIN) | DARWIN SPEEDUP
Pacific Biosciences | 15%        | 99.80%                 | 99.89%               | 250X
THANK YOU!
SAMAN ASHKIANI, UC DAVIS
DYNAMIC DATA STRUCTURES FOR THE GPU Saman Ashkiani, 05/11/2017
DYNAMIC DATA STRUCTURES FOR THE GPU
Objective: a general-purpose data structure that can be updated at runtime
• Supports updates (insert/delete): batched or individual
• Efficient queries (lookup, count, range, etc.): batched or individual
Motivation: more types of GPU data structures in the programmer's toolbox
Challenges: GPUs have thousands of parallel threads, so an efficient non-blocking data structure is needed
• Most classic non-blocking ideas are hard to implement efficiently in a SIMD fashion
• Efficient dynamic memory allocation is hard on GPUs
• Safe memory reclamation: no dynamic memory management on GPUs
OUR IMPLEMENTATIONS
GPU LSM (a CPU-side sketch of the LSM idea follows)
• Dictionary data structure: multiple sorted arrays of different sizes
• Updates: batch insertion/deletion (average: 225 M updates/s on K40)
• Queries: lookup, count, and range (133, 60, and 30 M queries/s)
• Based on radix sort, merge, and binary search
• Paper draft: http://ece.ucdavis.edu/~ashkiani/gpu_lsm.pdf
CONCURRENT HASH TABLE
• Hash table with chaining
• Updates: concurrent insertion/deletion (497 M updates/s on K40)
• Queries: lookup (860 M queries/s on K40)
• Each bucket: a warp-friendly linked list
• Warp-synchronous programming to better fit the SIMD model
• Our own dynamic memory allocator for nodes
• Memory reclamation: safe removal of deleted nodes for future reuse
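For intuition, a CPU-side Python sketch of the log-structured merge (LSM) idea behind the GPU LSM (not the authors' GPU code): sorted levels that double in size, batch inserts resolved by sort-and-merge cascades, deletions as tombstones, and lookups by binary search from newest to oldest. On the GPU these steps map onto radix sort, merge, and parallel binary search; the batch size and keys below are illustrative, and keys within one batch are assumed distinct.

```python
from bisect import bisect_left

DELETE = object()                       # tombstone marker for deletions

class LSMDict:
    def __init__(self, batch_size=4):
        self.b = batch_size
        self.stamp = 0                  # decreases, so newer entries sort first per key
        self.levels = []                # levels[i]: sorted (key, stamp, value) triples

    def insert_batch(self, pairs):      # pairs: exactly b (key, value) items
        assert len(pairs) == self.b
        self.stamp -= 1
        carry = sorted((k, self.stamp, v) for k, v in pairs)     # GPU: radix sort the batch
        for i in range(len(self.levels) + 1):
            if i == len(self.levels):
                self.levels.append(carry)
                return
            if not self.levels[i]:
                self.levels[i] = carry
                return
            carry = sorted(self.levels[i] + carry)               # GPU: merge two sorted runs
            self.levels[i] = []                                  # cascade the doubled run down

    def delete_batch(self, keys):
        self.insert_batch([(k, DELETE) for k in keys])

    def lookup(self, key):
        for level in self.levels:                                # smaller levels hold newer data
            i = bisect_left(level, (key,))                       # leftmost = newest entry for key
            if i < len(level) and level[i][0] == key:
                return None if level[i][2] is DELETE else level[i][2]
        return None

# Toy usage
d = LSMDict(batch_size=4)
d.insert_batch([(3, "c"), (1, "a"), (7, "g"), (5, "e")])
d.insert_batch([(2, "b"), (8, "h"), (6, "f"), (4, "d")])
d.delete_batch([5, 6, 9, 10])
print(d.lookup(3), d.lookup(5))         # -> c None
```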
GANG WU, UNIVERSITY OF SUSSEX
HIGH-SPEED FLUORESCENCE LIFETIME IMAGING BASED ON ANN AND GPU Gang Wu, 11th May 2017
CONTENTS
• What is FLIM
• Project Aims
• FLIM Theories
• ANN-GPU-FLIM
• Results
WHAT IS FLIM
Fluorescence-lifetime imaging microscopy (FLIM) is a technique for producing an image based on the differences in the exponential decay rate of the fluorescence from a fluorescent sample (figure: gold nanorods).
Applications: surgery guidance, disease therapies, disease diagnosis
PROJECT AIMS
Current systems: CPU-based traditional FLIM analysis is very slow (tens of minutes for one image)
Aims: high-speed FLIM analysis
• Fast algorithm (artificial neural network)
• Highly parallel hardware (GPU)
FLIM THEORIES
System pipeline: laser → sample → detector → TCSPC → lifetime analysis (this work, on the GPU) → CPU with GUI
FLIM THEORIES
Fluorescence decay histogram: photon counts per time bin, characterized by the decay parameters (amplitude, fractional contribution f_D, and lifetimes τ_F, τ_D), illustrated by the sketch below
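For illustration (not the authors' code), a sketch of a bi-exponential decay model behind such histograms, with Poisson shot noise and a simple center-of-mass lifetime estimate as a fast baseline. Histograms like these are the kind of input an ANN lifetime estimator is trained on; the bin width, lifetimes, and photon counts are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS, BIN_W = 256, 0.039              # time bins and bin width in ns (assumed)
t = (np.arange(N_BINS) + 0.5) * BIN_W   # bin-center times

def decay_histogram(f, tau1, tau2, n_photons=5000):
    """Bi-exponential decay I(t) ~ f*exp(-t/tau1) + (1-f)*exp(-t/tau2), Poisson noise."""
    ideal = f * np.exp(-t / tau1) + (1 - f) * np.exp(-t / tau2)
    ideal *= n_photons / ideal.sum()    # normalize to the photon budget
    return rng.poisson(ideal)           # shot noise

def com_lifetime(hist):
    """Center-of-mass estimate of the average lifetime (fast, no iteration)."""
    return float((t * hist).sum() / hist.sum())

hist = decay_histogram(f=0.3, tau1=0.6, tau2=2.5)
print("estimated average lifetime: %.2f ns" % com_lifetime(hist))
```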
ANN-GPU-FLIM
Principle: FLIM data (per-pixel photon-count histograms over time bins) are fed to an artificial neural network, which is trained once and then produces FLIM images
RESULTS
Accuracy performance: different optimized areas, comparable performance
Time performance (image size 256×256):
ALGORITHM | TIME-CPU (S) | TIME-GPU (S) | SPEEDUP (GPU VS CPU)
ANN       | 0.89         | 0.1          | 8.9
LSM       | 41.5         | 3.8          | 10.8
OVERALL SPEEDUP (ANN on GPU vs. LSM on CPU): 415
YONG HE, CMU
EVOLVING SHADER COMPILATION FOR PERFORMANCE AND MAINTAINABILITY Yong He, May 2017
EVOLVING SHADER COMPILERS
Meeting performance goals with productivity constraints
• Modern games feature increasingly realistic graphics
• A game's shader library has grown 100x more complex
• Shading languages are still the same as ten years ago: they lack functionality for achieving high performance without compromising code modularity and extensibility
AUTOMATIC APPROXIMATE SHADER COMPILATION
Performance (fast code) and productivity (fast shader compilation)
A System for Rapid, Automatic Shader Level-of-Detail. Yong He, Tim Foley, Natalya Tatarchuk, Kayvon Fatahalian. SIGGRAPH Asia 2015