GTC 2015, Session S5429
Creating Dense Mixed GPU and FPGA Systems With Tegra K1s Using OpenCL & CUDA
Lance Brown, Director - HPC, ColoradoEngineering.com
Lance.brown@coloradoengineering.com
719-641-7287 (cell)
27 March 2015, ColoradoEngineering.com - Public Release

We Can Solve Really Cool Problems Now
• Heterogeneous computing is more than CPU + GPU
• ARM processors changed the game
  • NVIDIA: GPU + ARM, CUDA
  • TI: DSP + ARM, OpenCL
  • Altera: FPGA + ARM, OpenCL
• Scalable from handheld to Enterprise & HPC

Why Listen to CEI?
• Been using FPGAs since 1985
• Been solving massively parallel problems for over 30 years
• We have designed, and are designing, multiple 24- and 32-layer boards featuring Altera FPGAs & NVIDIA GPUs
• Early adopter of new technologies and experts at marrying existing technologies in new ways

Game Changer #1: Altera’s Hard Floating Point Unit IP & OpenCL
• FPGAs have traditionally supported soft floating point
• Altera introduced IEEE 754 Hard Floating Point with Arria 10
• Arria 10 FPGAs are rated from 140 GigaFLOPS (GFLOPS) to 1.5 TeraFLOPS (TFLOPS)
• Details at: https://www.altera.com/en_US/pdfs/literature/po/bg-floating-point-fpga.pdf
• OpenCV & Suricata Implementations Using OpenCL
• Partial Reconfiguration for Streamlined OpenCL Development
• On Intel’s 14 nm FinFET Fab

Game Changer #2: NVIDIA Makes Tegra K1 Available
• GPU + ARM at low power
• Very important: camera interfaces galore
• Can do significant processing at each edge node now
• Jetson Kit: awesome eval kit & affordable
• More importantly, the chipset is available through Arrow!
• Details at: https://developer.nvidia.com/hardware-design-and-development

CEI’s Epiphany: Ultimate CV Platform
Altera Arria 10 (1500 GFLOPS) + NVIDIA Tegra K1 (326 GFLOPS)?

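The two headline figures on this slide can be turned into a rough peak budget. The sketch below is only an upper bound under the assumption that peak rates simply add, which real workloads will not achieve. The 192 CUDA cores and 852 MHz clock used to cross-check the 326 GFLOPS figure are assumptions taken from public Tegra K1 specifications, not from this deck, and the function names are hypothetical.

```python
# Rough peak-throughput budget for one Arria 10 plus N Tegra K1s.
# Peak figures come from the slides; summing them gives an upper bound only.

ARRIA10_PEAK_GFLOPS = 1500.0  # top-end Arria 10, per the slides
TK1_PEAK_GFLOPS = 326.0       # Tegra K1, per the slides


def tk1_peak_estimate(cuda_cores=192, flops_per_cycle=2, clock_ghz=0.852):
    """Cross-check: cores x 2 FLOPs per FMA x clock (assumed public specs)."""
    return cuda_cores * flops_per_cycle * clock_ghz


def combined_peak_gflops(num_tk1):
    """Sum of peak rates for one Arria 10 and num_tk1 Tegra K1 modules."""
    return ARRIA10_PEAK_GFLOPS + num_tk1 * TK1_PEAK_GFLOPS


if __name__ == "__main__":
    print(f"TK1 cross-check: {tk1_peak_estimate():.1f} GFLOPS")
    for n in (1, 2):
        print(f"Arria 10 + {n} x TK1: {combined_peak_gflops(n):.0f} GFLOPS peak")
```

With two TK1 modules, as on the dual-TK1 board, the summed peak is a little over 2.1 TFLOPS single precision.
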
First Union: Dual TK1s + Arria 10 (HPC-A10-K1GPU)
[Block diagram of the HPC-A10-K1GPU board. Recoverable labels: Tegra K1 System-On-Modules (TK1-SOM), each with USB 3.0, GigE, HDMI, 2/4/8 Gbit DDR3 and 16/32/64 GB eMMC; HPC-A10 (Arria 10) with 1/2/4 GB DDR3, 2/4 GB DDR3L, two QDR II+ memories (144 Mb, 1334 MT/s), optional Micron HMC, VITA 57 FMC, DisplayPort source and sink, SMA connectors, CLK-IN, USB Blaster, JTAG, UART and two QSFP+ cages (1 x 40 GbE or 4 x 10 GbE each); PCIe switch with x4 PCIe Gen2 links (including an extra x4 Gen2 link carrying clocks and I2C) and x8 PCIe Gen3; K61 health monitoring; external power. Each module measures 2 inches. The HPC-A10 is available stand-alone.]

HPC-A10-K1GPU Design Details
• NVIDIA GPUDirect support
• TK1s are root nodes
• TK1s can be field upgraded
• 8 high-speed 10 GbE ports
• CUDA on TK1
• OpenCL on Arria 10
• 2 GB/s to each TK1
• HMC is 17X faster than DDR3
• 12 to 25 camera/sensor I/Os

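The "2 GB/s to each TK1" figure follows from the x4 PCIe Gen2 link named on the board diagram. A minimal sketch of that arithmetic, using the standard PCIe line rates and encodings (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3); the function name is hypothetical and the result is raw one-direction bandwidth before packet overhead:

```python
def pcie_bandwidth_gbytes(gt_per_s, lanes, payload_bits, encoded_bits):
    """Raw one-direction PCIe link bandwidth in GB/s, before protocol overhead."""
    gbits_per_s = gt_per_s * lanes * payload_bits / encoded_bits
    return gbits_per_s / 8.0


# Gen2: 5 GT/s per lane, 8b/10b encoding (the TK1 links on this board)
tk1_link = pcie_bandwidth_gbytes(5.0, 4, 8, 10)

# Gen3: 8 GT/s per lane, 128b/130b encoding (the x8 Gen3 port on the diagram)
host_link = pcie_bandwidth_gbytes(8.0, 8, 128, 130)

if __name__ == "__main__":
    print(f"x4 Gen2: {tk1_link:.2f} GB/s per direction")
    print(f"x8 Gen3: {host_link:.2f} GB/s per direction")
```

Achievable throughput will be somewhat lower once transaction-layer overhead is included.
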
Single Node
• 1 to 21 Cameras/Sensors
• Makes dumb cameras smart
• 10/40 GbE Sensors
• OpenCL on FPGA
• CUDA on Tegra
[Diagram: a single node surrounded by cameras (marked C), with DisplayPort, two 4 x 10 GbE ports, USB/GigE and FMC connections.]

Tesla K80s + HPC-A10-K1GPU
[Diagram: the single node (cameras marked C, DisplayPort, two 4 x 10 GbE ports, USB/GigE, FMC) shown alongside four Tesla K80s, with the link labeled GPUDirect.]

Sensor Gateway: Smart Host Bus Adapter (HBA)
[Diagram: a sensor cloud (radar, MRI, PET, camera, EW, etc.) feeding two gateway boards over 40 GbE; each board has an FMC site and 40 GbE links to a Tesla K80 cluster.]

Programming FPGAs with OpenCL
• Easy to do now
• https://youtu.be/o5WtYiY5Hao
• Proficient in a day or two
• CAPI support too
• 95% to 99% as efficient as VHDL

EDGE Node Processing
• Process on the EDGE using GRID
• Distributed deep learning node
• Low cost
• 4G enabled
• Fusion of Radar, EO, IO and Sound
• Download apps from Google Play
• Feedback to Tesla K80s via GRID
• SmartCity ready
• Military-level device security built in
[Diagram labels: directional mics, patch antennas and 5 MP cameras around the node; COMMS (4G LTE, WiFi, Bluetooth, USB) carrying alerts and streaming video; NVIDIA Tegra K1/X1 (computer vision, video compression); Altera Cyclone V; 24 GHz radar system; appliance uses: motion detection, security camera, queuing.]

Distributed Aperture System
• Large vehicle/Military ADAS
• SA360 systems
• Retrofit casino camera systems
• Make any sensor system smart
• Tegra K1/X1s are scalable
• Mixture of CUDA & OpenCL
[Diagram: distributed sensors. Two rows of four NVIDIA Tegra X1 nodes, each labeled x4 ARM, 8 GB DDR4, 64 GB eMMC, CUDA/Linux, OpenCV, H.264/H.265, with USB3 or GigE and HDMI inputs, and each linked over x4 Gen2 PCIe (2 GB/s) to a central Altera Arria 10 SoC (x2 ARM, OpenCL) with HMC, QDR-II+ or QDR-IV, 40/10 GbE ports and removable SATA storage. A further Tegra X1 with the same labels acts as the main display GPU, with USB3 or GigE, 4/8 GB memory and HDMI output.]

Challenges: Hardware, Interconnects & Software
• FPGA + GPU
  • CUDA, OpenCL or CUDA + OpenCL
  • Working with MDA & AFRL on solutions
• Bandwidth
  • Tegra K1/X1 are x4 Gen2 PCIe, which limits the number and resolution of sensors attached to the Tegra
  • More processing has to be done on the Tegra, but that is okay since Tegras keep increasing in power every year
  • Gen3 PCIe would be awesome
  • PCIe backplane: using 40 GbE ports eliminates the PCIe bottleneck
• Root Nodes
  • Tegra wants to be the root complex, so non-transparent switches need to be used
  • If Tegra could be an endpoint, a whole new world would open up

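To make the bandwidth point concrete, the sketch below estimates how many uncompressed camera streams fit through one link. The 5 MP sensor size comes from the edge-node slide; the 2 bytes per pixel and the frame rates are illustrative assumptions, the function name is hypothetical, and protocol overhead is ignored.

```python
import math


def max_streams(link_gbytes_per_s, megapixels, bytes_per_pixel, fps):
    """Whole uncompressed video streams that fit in a link (no overhead)."""
    bytes_per_stream = megapixels * 1e6 * bytes_per_pixel * fps
    return math.floor(link_gbytes_per_s * 1e9 / bytes_per_stream)


PCIE_GEN2_X4_GBYTES = 2.0        # per the slides: 2 GB/s to each Tegra
FORTY_GBE_GBYTES = 40.0 / 8.0    # 40 Gbit/s line rate expressed in GB/s

if __name__ == "__main__":
    # Assumed format: 5 MP frames at 2 bytes per pixel
    for fps in (30, 60):
        pcie = max_streams(PCIE_GEN2_X4_GBYTES, 5, 2, fps)
        gbe = max_streams(FORTY_GBE_GBYTES, 5, 2, fps)
        print(f"{fps} fps: {pcie} streams over x4 Gen2 PCIe, {gbe} over 40 GbE")
```

Under these assumptions a single x4 Gen2 link carries roughly half a dozen raw 5 MP streams at 30 fps, which is why sensor count and resolution per Tegra are limited and why reducing data on the FPGA or at the edge matters.
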
Future Architectures: Even Cooler Designs Possible
• Altera
  • Arria 10 SoC
    • Eliminates need for x86 CPU to run OpenCL
    • Truly stand-alone appliances
    • 100 GbE interfaces
  • Stratix 10 and Stratix 10 SoC
    • >10 TFLOPS for 100 W
    • Details: https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html
• NVIDIA VOLTA
  • Looking for NVLink intermingling with FPGAs
• Virtual FPGAs + Virtual GPUs
  • Allow instant scaling and data protection

Summary
• GPU + FPGA can solve amazing and fun problems
• Tegra K1/X1 provide incredible capability at low cost, which reduces the size of the FPGA needed
• OpenCL and Hard Floating Point IP make the Altera FPGAs a great partner for NVIDIA GPUs
• CEI is making scalable solutions to allow application developers to deploy from handheld to enterprise/HPC

Hardware & Software Capabilities
Hardware
• System / Subsystem Designs
• 30+ complex board designs
• 32 layer PCBs with blind and buried vias
• High speed (100s MHz x GHz)
• Analog (RF & I/Q Receivers)
• Digital (FPGAs, DSPs, general purpose)
• ADC and DAC
• Standard and custom IO (busses, fabrics, SerDes, etc.)
• Ruggedization and thermal management
• CSWaP
• Serial I/O (e.g. PCIe, SerDes)
• DO-254
Software
• Enterprise & Embedded SW
• Net Centric, SOA, web services, J2EE, SQL
• C/C++
• CUDA & OpenCL
• Embedded real time code, RTOS, hardware drivers, Fault Detection / Fault Isolation, etc.
• Simulations, APIs, and GUIs
• Cognitive Software
• Device Drivers
• National Instruments LabVIEW
• DO-178C
• FPGA designs (VHDL/Verilog/Simulink)
• RF Design

For More Information on Standard Products and Custom Engineering Services
Call Us: 719-388-8582 (office)
Email Us: lance.brown@coloradoengineering.com
Visit Us: Colorado Springs, CO (Sunny 300+ Days)
Browse Us: www.ColoradoEngineering.com