Towards Accelerator-Rich Architectures and Systems

Zhenman Fang, Postdoc
Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research
https://sites.google.com/site/fangzhenman/
The Trend of Accelerator-Rich Chips

Increasing number of accelerators in Apple SoCs (estimated): the count of specialized IP blocks grows steadily from the A4 (2010) to the A10 (2016), far outnumbering the CPU and GPU blocks.
§ Fixed-function accelerators (ASIC: Application-Specific Integrated Circuit) instead of general-purpose processors, e.g., audio, video, face, imaging, DSP, ...

[Figure: die photo of Apple A8 SoC, www.anandtech.com/show/8562/chipworks-a8]
[Figure: number of IP blocks per Apple SoC generation, A4 through A10; Maltiel Consulting estimates and Harvard's estimates [Shao, IEEE Micro'15]]
The Trend of Accelerator-Rich Cloud

Cloud service providers have begun to deploy FPGAs in their datacenters: Microsoft's Catapult reports a 2x throughput improvement [Putnam, ISCA'14].

Field-Programmable Gate Array (FPGA) accelerators:
✓ Reconfigurable commodity hardware
✓ Energy-efficient: a high-end board draws only ~25W
The Trend of Accelerator-Rich Cloud (cont.)

Accelerators are becoming first-class citizens:
§ Intel expects 30% of datacenter nodes to include FPGAs by 2020, following its $16.7 billion acquisition of Altera.
Post-Moore Era: Potential for Customized Accelerators

With Moore's law ending, accelerators promise 10x-1000x gains in performance per watt by trading off flexibility for performance.

[Figure: energy efficiency vs. flexibility of processors, FPGAs, and ASICs; source: Bob Brodersen, Berkeley Wireless group]
Challenges in Making Accelerator-Rich Architectures and Systems Mainstream

How to characterize and accelerate killer applications?
How to efficiently integrate accelerators into future chips?
§ E.g., a naïve integration only achieves 12% of ideal performance [HPCA'17]

"Extended" Amdahl's law:

$$\mathrm{overall\_speedup} = \frac{1}{\dfrac{\mathrm{kernel\%}}{\mathrm{acc\_speedup}} + (1 - \mathrm{kernel\%}) + \mathrm{integration}}$$

where the three denominator terms are the accelerator, CPU, and integration-overhead contributions, respectively.
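As a quick sanity check of the formula, here is a minimal sketch with purely illustrative numbers (the kernel fraction, accelerator speedup, and overhead below are not taken from the cited paper):

```cpp
#include <cstdio>

// "Extended" Amdahl's law from the slide above. kernel_frac is the
// fraction of runtime spent in the accelerated kernel, acc_speedup the
// accelerator's speedup on that kernel, and integration the
// integration overhead normalized to the original total runtime.
double overall_speedup(double kernel_frac, double acc_speedup,
                       double integration) {
    return 1.0 / (kernel_frac / acc_speedup
                  + (1.0 - kernel_frac)
                  + integration);
}

int main() {
    double ideal = overall_speedup(0.9, 100.0, 0.0);  // no overhead: ~9.2x
    double naive = overall_speedup(0.9, 100.0, 0.8);  // heavy overhead: ~1.1x
    printf("ideal %.1fx, naive %.1fx -> %.0f%% of ideal\n",
           ideal, naive, 100.0 * naive / ideal);      // ~12% of ideal
}
```

With these made-up numbers, even a 100x accelerator covering 90% of the work collapses to roughly 12% of the ideal speedup, the kind of gap the [HPCA'17] result describes.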
Challenges in Making Accelerator-Rich Architectures and Systems Mainstream (cont.)

How to deploy commodity accelerators in big data systems?
§ E.g., a naïve integration may lead to a 1000x slowdown [HotCloud'16]
How to program such architectures and systems?
Overview of My Research

1. Application Drivers: workload characterization and acceleration
2. Accelerator-Rich Architectures (ARA): modeling and optimizing CPU-accelerator interaction
3. Accelerator-Rich Systems: Accelerator-as-a-Service (AaaS) in cloud deployment
4. Compiler Support: from many-core to accelerator-rich architectures
Dimension #1: Application Drivers

Image processing [ISPASS'11]:
✓ Analysis and combination of task, pipeline, and data parallelism
✓ 13x speedup on a 16-core CPU; 46x speedup on a GPU

Deep learning [ICCAD'16]:
✓ Caffeine: an FPGA engine for Caffe
✓ 1.46 TOPS for the 8-bit conv layer; 100x speedup for the FCN layer
✓ 5.7x energy savings over a GPU

Genomics [D&T'17]:
✓ 2.6x speedup for in-memory genome sort (SAMtools)
✓ Record 9.6 GB/s throughput for genome compression on Intel-Altera HARPv2; 50x speedup over zlib

How do accelerators achieve such speedups?
Dimension #1: Application Drivers (e.g., a convolution accelerator)

Kernel (convolutional matrix multiplication):

$$Out[m][r][c] = \sum_{n=0}^{N} \sum_{i=0}^{K_1} \sum_{j=0}^{K_2} W[m][n][i][j] \cdot In[n][S_1 \cdot r + i][S_2 \cdot c + j]$$

Key optimizations:
#1: Caching (on-chip buffers)
#2: Customized pipeline
#3: Parallelization
#4: Double buffering
#5: DRAM re-organization
#6: Precision customization

[Figure: on-chip convolution accelerator datapath between DRAM and double input/weight/output buffers, with parallel multiply-add units]
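A minimal HLS-style sketch of this loop nest (sizes, names, and pragma placement are illustrative, not Caffeine's configuration; a full design also tiles the loops and partitions the arrays to sustain the pipeline):

```cpp
// Illustrative convolution loop nest in Xilinx HLS C++ style.
// N input / M output feature maps, K1 x K2 kernel, strides S1/S2;
// all sizes are placeholders.
const int N = 16, M = 16, R = 32, C = 32, K1 = 3, K2 = 3, S1 = 1, S2 = 1;

void conv(float In[N][S1 * R + K1][S2 * C + K2],
          float W[M][N][K1][K2],
          float Out[M][R][C]) {
  // #1 caching / #4 double buffering: in a full design, tiles of In
  // and W are DMA'd into on-chip buffers while the previous tile
  // computes (omitted here for brevity).
  for (int m = 0; m < M; ++m)
    for (int r = 0; r < R; ++r)
      for (int c = 0; c < C; ++c) {
#pragma HLS pipeline II=1   // #2: customized pipeline
        float acc = 0;
        for (int n = 0; n < N; ++n)
          for (int i = 0; i < K1; ++i)
            for (int j = 0; j < K2; ++j)
#pragma HLS unroll          // #3: parallelization, parallel multiply-adds
              acc += W[m][n][i][j] * In[n][S1 * r + i][S2 * c + j];
        Out[m][r][c] = acc;
      }
}
```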
Dimension #1: Application Drivers (measured results)

✓ Programmed in Xilinx High-Level Synthesis (HLS)
✓ Results collected on an Alpha Data PCIe-7v3 FPGA board

[Figure: achieved throughput, up to 1.46 TOPS (8-bit) and 68.3 GFLOPS, with other bars at 36.5, 7.6, 3.2, 1.8, and 0.005]
Dimension #2: Accelerator-Rich Architectures (ARA)

Key components:
§ GAM: Global Accelerator Manager
§ SPM: ScratchPad Memory (with DMA, per accelerator)
§ ISA extension

[Figure: Overview of Accelerator-Rich Architecture; cores C1..Cm (each with L1 cache and TLB), accelerators Acc1..Accn (each with SPM and DMA), and the GAM, connected by a customizable Network-on-Chip to a shared LLC and the DRAM controller]
Dimension #2: Accelerator-Rich Architectures (ARA)

ARA Modeling:
✓ PARADE simulator: gem5 + HLS [ICCAD'15]; open source: http://vast.cs.ucla.edu/software/parade-ara-simulator
✓ Fast ARAPrototyper flow on FPGA-SoC [arXiv'16]

Multicore Modeling:
✓ Transformer simulator [DAC'12, LCTES'12]

ARA Optimization:
✓ Sources of accelerator gains [FCCM'16]
✓ CPU-Acc co-design: address translation for a unified memory space, 7.6x speedup, 6.4% gap to ideal [HPCA'17 best paper nominee]
✓ AIM: near-memory acceleration gives another 4x speedup [MemSys'17]

More information in the ISCA'15 & MICRO'16 tutorials: http://accelerator.eecs.harvard.edu/micro16tutorial/
Dimension #3: Accelerator-Rich Systems

Three parties in an accelerator-enabled cloud:
§ Cloud service provider: 1 server with an FPGA ≈ 3 CPU servers
§ Big data application developer (e.g., Spark): easy and efficient accelerator invocation and sharing
§ Accelerator designer (e.g., FPGA): easy accelerator registration into the cloud

Blaze prototype: Accelerator-as-a-Service [DAC'16]
§ Works with Spark and YARN; open source: https://github.com/UCLA-VAST/blaze

CPU-FPGA platform choice [HotCloud'16, ACM SOCC'16]:
1) mainstream PCIe, or
2) coherent PCIe (CAPI), or
3) Intel-Altera HARP (coherent, in one package)
Dimension #4: Compiler Support

§ Source-to-source compiler for memory system improvement on many-core processors: coordinated data prefetching, 1.5x speedup on Xeon Phi [ICS'14, TACO'15, ongoing]
§ Future work: compiler support for accelerator-rich architectures & systems
Overview of My Research

§ Application Drivers: image processing [ISPASS'11], deep learning [ICCAD'16], genomics [D&T'17]
§ Compiler Support: memory system improvement [ICS'14, TACO'15, ongoing]
§ Accelerator-Rich Systems: AaaS with Blaze [DAC'16]; CPU-FPGA: PCIe or QPI? [HotCloud'16, ACM SOCC'16]
§ Accelerator-Rich Architectures: PARADE [ICCAD'15]; sources of gains [FCCM'16]; ARAPrototyper [arXiv'16]; CPU-Acc address translation [HPCA'17 best paper nominee]; Transformer [DAC'12, LCTES'12]; near-memory acceleration [MemSys'17]; tutorials [ISCA'15 & MICRO'16]
§ Tool: System-Level Automation
Chip-Level CPU-Accelerator Co-design: Address Translation for Unified Memory Space [HPCA'17 Best Paper Nominee]

Better programmability and performance
Virtual Memory and Address Translation 101

Virtual memory and its benefits:
§ Shared memory for multi-process
§ Memory isolation for security
§ Conceptually more memory

Address translation:
§ Translation Lookaside Buffer (TLB): caches address translation results
§ Memory Management Unit (MMU): performs virtual-to-physical address translation
§ Page table: virtual-to-physical address mapping at page granularity

[Figure: a core's TLB and MMU translating the per-process virtual memory space to physical memory via the in-memory page table]
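A minimal sketch of this lookup path (toy data structures, not a hardware description; the page-walk stub stands in for the multi-level walk discussed on the next slide):

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t PAGE_BITS = 12;  // 4 KB pages

// Toy TLB: virtual page number (VPN) -> physical page number (PPN).
std::unordered_map<uint64_t, uint64_t> tlb;

// Stub page-table walk; a real x86-64 radix walk dereferences 4
// levels of page-table entries, i.e., 4 memory accesses.
uint64_t page_table_walk(uint64_t vpn) { return vpn; }

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_BITS;
    auto hit = tlb.find(vpn);
    uint64_t ppn = (hit != tlb.end())
                       ? hit->second                        // TLB hit: fast path
                       : (tlb[vpn] = page_table_walk(vpn)); // miss: walk, then fill
    return (ppn << PAGE_BITS) | (vaddr & ((1ull << PAGE_BITS) - 1));
}
```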
Inefficiency in Today's ARA Address Translation

Today's ARAs translate accelerator addresses through an IOMMU with a small IOTLB (e.g., 32 entries), while each core keeps its own TLB and MMU.

The IOMMU only achieves 12% of the performance of ideal address translation:
#1 Inefficient TLB support: TLBs are not specialized to provide low latency and capture page locality.
#2 High page walk latency: on an IOTLB miss, 4 main memory accesses are required to walk the page table.

[Figure: ARA block diagram with accelerator DMA reaching main memory through the IOMMU/IOTLB; chart of performance relative to ideal address translation for medical imaging (Deblur, Denoise, Regist., Segment.), commercial (Black., Stream., Swapt.), vision (DispMap, LPCIP), and navigation (EKFSLAM, RobLoc) benchmarks, with geometric mean]
Accelerator Performance Is Highly Sensitive to Address Translation Latency

[Figure: geometric-mean performance relative to ideal address translation as translation latency sweeps from 0 to 1024 cycles; performance drops steadily as latency grows]

Must provide efficient address translation support.
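A back-of-the-envelope model of why such a curve falls off (purely illustrative numbers; the slide's data comes from simulating the actual benchmarks):

```cpp
#include <cstdio>

// Toy model: an accelerator whose kernel takes ideal_cycles with
// zero-latency translation and needs one translation per page touched.
// Relative performance = ideal / (ideal + translations * latency).
int main() {
    const double ideal_cycles = 100000;  // made-up kernel time
    const double translations = 1000;    // made-up number of pages touched
    for (int lat : {0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}) {
        double rel = ideal_cycles / (ideal_cycles + translations * lat);
        printf("latency %4d cycles -> %.2f of ideal\n", lat, rel);
    }
}
```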
Characteristic #1: Regular Bulk Transfer of Consecutive Data (Pages)

[Figure: TLB miss behavior of the BlackScholes benchmark: one large memory reference accesses consecutive pages]

Opportunities for relatively simple TLB and page walker designs, as sketched below.
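One way to picture the opportunity (an illustration only, not the paper's actual design): since a streaming accelerator touches virtually contiguous pages, a miss handler can batch-fill translations for the pages that will follow. Reusing the toy `tlb` and `page_table_walk` from the earlier sketch:

```cpp
// Illustrative miss handler exploiting the streaming pattern above:
// after walking the missing page, prefetch translations for the next
// few contiguous pages. PREFETCH_DEPTH is a made-up design knob.
constexpr int PREFETCH_DEPTH = 4;

void handle_tlb_miss(uint64_t vpn) {
    tlb[vpn] = page_table_walk(vpn);
    // Consecutive-page accesses make these entries likely to be
    // needed next; filling them now hides future walk latency.
    for (int d = 1; d <= PREFETCH_DEPTH; ++d)
        tlb[vpn + d] = page_table_walk(vpn + d);
}
```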
Characteristic #2: Impact of Data Tiling – Breaking a Page Across Multiple Accelerators

§ Original: 32 × 32 × 32 data array (spanning pages 0-31)
§ Rectangular tiling: 16 × 16 × 16 tiles
§ Each tile is mapped to a different accelerator for parallel processing
§ But one page is then split across 4 accelerators!

A shared TLB can be very helpful; the sketch below checks the arithmetic.
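A quick check of the "one page, 4 accelerators" claim (assuming 4-byte elements, row-major layout, and 4 KB pages, which the slide does not state explicitly):

```cpp
#include <cstdio>
#include <set>

// 32x32x32 row-major array, 16x16x16 tiles, one tile per accelerator.
// Counts how many distinct accelerators touch each 4 KB page.
int main() {
    const int N = 32, T = 16;
    const int elems_per_page = 4096 / 4;               // 4-byte elements (assumed)
    const int num_pages = N * N * N / elems_per_page;  // 32 pages
    for (int p = 0; p < num_pages; ++p) {
        std::set<int> accs;
        for (int e = p * elems_per_page; e < (p + 1) * elems_per_page; ++e) {
            int x = e / (N * N), y = (e / N) % N, z = e % N;
            accs.insert((x / T) * 4 + (y / T) * 2 + (z / T));  // tile id
        }
        printf("page %2d -> %zu accelerators\n", p, accs.size());  // 4 each
    }
}
```

Each 4 KB page holds one full x-slice of the array, which the 2 × 2 split in the y and z dimensions spreads across 4 tiles, hence 4 accelerators share every page.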