Green Multicore
David Moloney, CTO, Movidius
24 November 2011
Overview
• Fabless semiconductor company founded in 2005
  – VC backed (completing C-round today @ 12:00)
  – Focus on computational imaging and video
• Uniquely positioned for this market with a software-programmable media processor with state-of-the-art GOPS/W performance
  – Enables SW derivatives of the base silicon platform
  – Current 65nm product in mass-production and expected to ship 1-3M qty in 2012
  – Next gen 28nm product in design will deliver the power of a desktop GPU in an 8x8mm BGA @ 350mW
Myriad of Applications
Robotics, Consumer Electronics, Mobile phones, Automotive, Video/DSC Cameras, Computational Cameras, Medical, HPC, Camera Modules, Aerospace, Wireless Cameras
Technology - Platform Approach
[Diagram: layered platform stack. Labels: 3D Capture, Video Edit, Applications, Products, Software Modules, 3D Video, Anaglyph-3D, Silicon Platform, Foundation Technology]
Mobile Video Processing Workload
GPU FLOPS/W Trend
[Figure: GPU GFLOPS/W historical trend, y-axis 0 to 7 GFLOPS/W, x-axis Nvidia GeForce parts from G 100 to GTX 480. Annotation: GPU GFLOPS/W growing @ 1.4x per year]
Movidius SHAVE Processor
• Unique proprietary architecture
  – Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance
• Streaming Hybrid Architecture Vector Engine
  – Hybrid of RISC, DSP, VLIW & GPU architectural features
  – 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32
• Excellent Graphics and matrix mathematics support
  – HW texture unit for good graphics performance
  – Predicated execution to eliminate branches
  – Compiler-friendly architecture
  – HW support for compressed data-structures (e.g. matrices)
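The two architectural points above that lend themselves to a small illustration are the 128-bit vector width and predicated execution. The following is a minimal Python model, not SHAVE code: the function names are hypothetical, and the per-lane select only mimics the idea of computing every lane and letting a mask choose the result.

```python
VECTOR_BITS = 128  # vector width quoted on the slide


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit vector."""
    return VECTOR_BITS // element_bits


def predicated_select(mask, if_true, if_false):
    """Per-lane select: both inputs exist for every lane, the mask picks one.

    Illustrative model only; this is not SHAVE instruction semantics.
    """
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def clamp_to_zero(values):
    """Clamp negatives to zero with a lane mask instead of a per-element branch."""
    mask = [v >= 0 for v in values]
    return predicated_select(mask, values, [0] * len(values))


# Lane counts for the integer widths listed on the slide (fp16/fp32 match 16/32).
lane_counts = {bits: lanes(bits) for bits in (8, 16, 32)}
```

Halving the element width doubles the lane count, which is consistent with narrower data types achieving higher operations per watt on the same datapath.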
Myriad Silicon Platform
[Block diagram: SW-controlled I/O multiplexing (SPI, SDIO, I2S, I2C, LCD, Cam, UART, USB2 OTG, TS, MEBI, SEBI, FLSH, JTAG, GPS, TIM); RISC processor plus eight SHAVE engines (SVE0 to SVE7), each with an L1 cache, a TMU and 128kB of CMX memory; NAL; L2 cache; bridge and main bus; stacked 16/64MB DDR SDRAM die. Annotations: 50GFLOPS/W (IEEE 754 SP); Movidius IP]
65nm Myriad SoC
[Block diagram: SHAVE processor at 180MHz with variable-length instructions. Functional units: PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, plus DCU, IDC and TMU. Register files: IRF 32x32, SRF 32x32, VRF 32x128. Memory: 128kB SRAM per SHAVE, 1kB L1 caches, 2-way L2 cache, 16/64MB SDRAM die behind a DDR2 controller, Myriad 128-bit AXI bus. Bandwidth labels: 17.3GB/Sec, 12.2GB/Sec, 8.6GB/Sec, 5.8GB/Sec (x2), 2.9GB/Sec, 1.5GB/Sec]
Myriad GOPS/Watt (Arithmetic)
[Figure: bar chart of Myriad arithmetic GOPS/W by data type (int8, int16, int32, fp16, fp32), y-axis 0 to 200 GOPS/W, with per-unit operation counts for the IAU, SAU and VAU. Bar labels: 181, 99, 91, 49, 45]
Myriad 65nm CMOS LP Die
[Die photo: Myriad die showing SHAVE and CMX blocks, the RISC sub-system and an analog region; 16MB SDRAM die stacked with the Myriad die]

Ref     Author    Year  FLOPS/core  Cores  GFLOPS   W      GFLOPS/W
Myriad  Movidius  2011  12          8      17.28    0.35   49.4
(1      KAIST     2011  -           -      5.8      0.28   21.1
(2      Intel     2007  -           80     1000     98.00  10.2
(4      Adapteva  2010  2           16     24.96    1.00   25.0
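The Myriad figures in the comparison table can be cross-checked with a few lines of arithmetic. The sketch below assumes that "FLOPS/core" means floating-point operations per core per clock cycle and that the 180MHz clock shown on the 65nm SoC slide applies; both readings are assumptions, and the function names are hypothetical.

```python
def peak_gflops(flops_per_core_per_cycle: float, cores: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS = ops per core per cycle x cores x clock in GHz."""
    return flops_per_core_per_cycle * cores * clock_ghz


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as quoted in the table's last column."""
    return gflops / watts


# Myriad row: 12 FLOPS/core, 8 cores, 180MHz (assumed clock), 0.35W
myriad_gflops = peak_gflops(12, 8, 0.18)
myriad_efficiency = gflops_per_watt(myriad_gflops, 0.35)

# Rows where the table gives GFLOPS and W directly
intel_efficiency = gflops_per_watt(1000, 98.00)
adapteva_efficiency = gflops_per_watt(24.96, 1.00)
```

Under those assumptions the Myriad row reproduces the table's 17.28 GFLOPS and, after rounding, its 49.4 GFLOPS/W; the Intel and Adapteva rows likewise round to 10.2 and 25.0.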
Now I’ve got a Green Compute Platform
What can I do with it?
MA1135 - 3D Converter Box Application
[Figure: HDMI in image; HDMI “stripes” out]
20/Apr/2011
Myriad Example Applications
[Figure: several example applications, each shown as a mapping across SHAVE 1 to SHAVE 8]
Application Development
[Flow diagram: Partition App, Compile, Profile, Optimize, Power; targets SHAVE0 to SHAVE7]
• Tools: X86 C-code in Intel Parallel Studio / Visual Studio; Movidius C-compiler (LLVM); Movidius Profiler; Movidius Assembly Optimizer
• Code: refactor code; data-layout transformations; use of DMA to handle streams; loop unrolling; inline assembler
• Power: data-layout, DRAM access and DMA-for-streams design; run quickly and switch off to minimize leakage; optimise clock-rates for each SHAVE; power-off domains
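The "run quickly and switch off to minimize leakage" point can be illustrated with a simple per-frame energy model. This is a sketch under stated assumptions: dynamic energy per frame is taken to be the same at either clock rate, leakage power is constant while the domain is powered, and every number below is hypothetical rather than a Myriad measurement.

```python
def frame_energy_mj(dynamic_mj: float, leak_mw: float, powered_ms: float) -> float:
    """Energy per frame in millijoules.

    dynamic_mj : switching energy for the frame's work (hypothetical value)
    leak_mw    : leakage power while the domain is powered (hypothetical value)
    powered_ms : time the domain stays powered during the frame
    mW x ms gives microjoules, hence the division by 1000.
    """
    return dynamic_mj + leak_mw * powered_ms / 1000.0


# Same work per frame in both cases; only the powered-on time differs.
race_to_idle = frame_energy_mj(dynamic_mj=3.0, leak_mw=50.0, powered_ms=10.0)
run_slowly = frame_energy_mj(dynamic_mj=3.0, leak_mw=50.0, powered_ms=40.0)
leakage_saved = run_slowly - race_to_idle
```

In this model the saving is exactly the leakage that would have accumulated over the extra powered-on time, which is why finishing early and powering off a domain pays off when leakage is significant.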
Lightfield Requirements
• Replaces glass with SW
  – CUDA implementation of Georgiev (Adobe) LF algorithm
  – Very computationally expensive
  – Interpolation key kernel
  – GeForce GT120 at 130 GFLOPs and 50W (2.6GFLOPs/W)
    • http://en.wikipedia.org/wiki/GeForce_100_Series
  – GPU completes refocusing in 30ms (33.3fps)
  – 4fps on Myriad 65nm
[Image labels: Lytro, Raytrix]
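The figures on this slide can be turned into a rough energy-per-frame comparison. The sketch below assumes the GPU draws its quoted 50W while refocusing and that Myriad draws the 0.35W listed in the comparison table; both are simplifying assumptions, and the function names are hypothetical.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Efficiency figure as quoted on the slide."""
    return gflops / watts


def fps_from_frame_time(frame_ms: float) -> float:
    """Frames per second for a given per-frame processing time."""
    return 1000.0 / frame_ms


def joules_per_frame(watts: float, fps: float) -> float:
    """Energy spent on one refocused frame at a given power and frame rate."""
    return watts / fps


gpu_efficiency = gflops_per_watt(130, 50)       # GT120 figures from the slide
gpu_fps = fps_from_frame_time(30)               # 30ms per refocus
gpu_energy = joules_per_frame(50, gpu_fps)      # assumes the full 50W is drawn
myriad_energy = joules_per_frame(0.35, 4)       # assumes 0.35W at 4fps
```

Under these assumptions the GPU is roughly eight times faster per frame, while Myriad spends well under a tenth of the energy per refocused frame.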
Performance Roadmap (Nvidia) http://bit.ly/t6zo2j
Fragrak 28nm Platform
[Block diagram: SW-controlled I/O multiplexing (MIPI CSI and DSI, SPI, SDIO, I2S, I2C, LCD, UART, USB2 OTG, TS, MEBI, SEBI, FLSH, JTAG, GPS, TIM); RISC processor and SHAVE engines grouped in clusters (labels 04, 08, 12, 16) with 128kB and 256kB CMX memories; ICB and XCB interconnect; NAL; bridge and main bus; 512kB L2; stacked 256/512MB LP DDR3 SDRAM die. Annotations: 450GFLOPS/W (IEEE 754 SP); Movidius IP]
GFLOPS/W in Context
[Figure: log-scale chart (0.10 to 1000.00 GFLOPS/W) of Nvidia GeForce, Tesla and Fermi GPUs, from GeForce G 100 and Tesla C870 to GeForce GTX 480, against Movidius 65nm (2011) and Movidius 28nm (2012). Annotations: GPU rate of increase 1.4x per year; 7 years to hit 50GFLOPS/W! Data labels: 0.40, 2.02, 3.95, 4.99, 6.05, 6.19; Myriad 49.37; Myriad2 438.86]
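The "7 years to hit 50GFLOPS/W" annotation follows from compound growth at 1.4x per year. The sketch below takes 6.19 GFLOPS/W, the highest GPU data label on the chart, as the starting point; that choice of baseline is an assumption, as is the function name.

```python
import math


def years_to_reach(start: float, target: float, growth_per_year: float) -> float:
    """Years of compound growth needed to go from start to target.

    Solves start * growth_per_year ** years == target for years.
    """
    return math.log(target / start) / math.log(growth_per_year)


GPU_BASELINE = 6.19  # assumed starting point: highest GPU label on the chart

years_to_myriad = years_to_reach(GPU_BASELINE, 50.0, 1.4)
whole_years_to_myriad = math.ceil(years_to_myriad)
years_to_myriad2 = years_to_reach(GPU_BASELINE, 438.86, 1.4)
```

With that baseline the model gives a little over six years to reach 50 GFLOPS/W, so the seventh annual step is the first to exceed it, and between twelve and thirteen years to reach the Myriad2 figure.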
Myriad of Cameras – 1 Platform
• Standard camera
  – All optical focusing: bulky lenses & autofocus for close-ups
  – Wide aperture good for low-light but limits depth-of-field
  – Scale and cost due to established manufacturing processes
• Lightfield camera (Plenoptic = Lightfield)
  – Post-capture refocusing in software (Lytro)
  – Computationally expensive (GPU-based = cloud)
  – Decouples aperture from Depth of Field (DoF)
• Array Camera (Stereo is a 2x1 special case)
  – Uses array of MxN completely focused cameras
  – Composite & interpolate array of low-res cameras (Levoy)
  – Individual camera control allows: HDR capture, fault-tolerance, slow-motion, power-saving etc.
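One common way to composite an array of all-in-focus cameras into a refocused image is shift-and-add: each view is shifted in proportion to its camera position and the results are averaged. The sketch below is a one-dimensional toy with integer shifts and wrap-around borders, not the algorithm used on Myriad; the scene, camera offsets and function names are all hypothetical.

```python
def shift(row, s):
    """Circularly shift a 1-D image right by s pixels (wrap-around for simplicity)."""
    n = len(row)
    return [row[(i - s) % n] for i in range(n)]


def capture(scene, cam_offset, disparity):
    """Toy camera model: an object at this depth moves cam_offset * disparity pixels."""
    return shift(scene, cam_offset * disparity)


def refocus(views, cam_offsets, disparity):
    """Shift-and-add: undo the assumed disparity per view, then average the views."""
    n = len(views[0])
    total = [0.0] * n
    for view, offset in zip(views, cam_offsets):
        aligned = shift(view, -offset * disparity)
        for i, value in enumerate(aligned):
            total[i] += value
    return [t / len(views) for t in total]


scene = [0, 0, 0, 10, 0, 0, 0, 0]           # single bright point
offsets = [-1, 0, 1]                         # 3x1 camera array
views = [capture(scene, o, 2) for o in offsets]

in_focus = refocus(views, offsets, 2)        # focal plane matches the object
out_of_focus = refocus(views, offsets, 0)    # wrong plane: the point is smeared
```

Choosing the disparity selects the focal plane after capture, and the per-pixel shift, add and interpolate work is the kind of streaming kernel that makes this approach computationally expensive at real image sizes.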
Movidius Computational Imaging Applications
[Diagram: platform stack (Foundation Technology, Silicon Platform, Software Modules, Products) with product examples: Conventional Cameras, 3D Stereo Cameras, Lightfield Cameras, Array Cameras. Annotation: Tiny 8x8mm Myriad BGA]
Summary
• Movidius 65nm silicon platform
  – Ground-breaking functionality in SW
  – Enabled by ground-breaking GFLOPS/W
  – Compact form-factor
  – In mass-production today
  – 10x better GFLOPS/W than GPU
• Next generation 28nm SoC
  – 9x perf/watt available in 2012
  – 100x better GFLOPS/W than GPU
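The summary ratios can be compared against the GFLOPS/W values quoted earlier in the deck (Myriad 49.37, Myriad2 438.86). The GPU baseline is not stated on the summary slide, so the sketch below tries two candidates from the deck, the GT120 at 2.6 GFLOPS/W and the highest charted GPU label at 6.19 GFLOPS/W; picking these baselines is an assumption.

```python
MYRIAD_65NM = 49.37     # GFLOPS/W, from the "GFLOPS/W in Context" chart
MYRIAD2_28NM = 438.86   # GFLOPS/W, from the same chart

# Candidate GPU baselines (assumed; the summary slide does not name one)
GPU_BASELINES = {"GT120": 2.6, "highest charted GPU": 6.19}


def ratio(numerator: float, denominator: float) -> float:
    """Simple efficiency ratio."""
    return numerator / denominator


generation_gain = ratio(MYRIAD2_28NM, MYRIAD_65NM)
gain_65nm = {name: ratio(MYRIAD_65NM, gpu) for name, gpu in GPU_BASELINES.items()}
gain_28nm = {name: ratio(MYRIAD2_28NM, gpu) for name, gpu in GPU_BASELINES.items()}
```

The generation-to-generation gain rounds to the quoted 9x, and the quoted 10x and 100x GPU advantages fall between the values obtained from the two candidate baselines, so they read as order-of-magnitude figures.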
Any questions?
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°248481 (PEPPHER Project, www.peppher.eu)