Challenging the Intel Xeon: ARM and OpenPower Now you really have - PowerPoint PPT Presentation

Challenging the Intel Xeon: ARM and OpenPower Now you really have to optimize

Mighty Intel … • “Intel had a 99.2 percent market share in server chips” (IDC, 2015 – Quoted on InfoWorld) • “We started experimenting with SoCs two years ago. … didn't work well because the single-thread performance was too low, resulting in higher latency for our web platform” – Facebook Engineering

…sits solid on the Throne • Best & most mature process technology in the world – 14 nm FinFET tri-gate (2014) • Power management the competition can only dream of • Richest software ecosystem

Sizing Servers • Established in 2006 at Howest*, funded by the Flemish government since 2007 • 4 – 6 FTE (2007-2016) • 2 – 3 trainees • Specialized in independent performance optimization research • * Howest = Technical University in West Flanders (Kortrijk, Belgium)

March 2016 IWT VIS TR 135096

March 2012 • Java performance – + 60% for Xeon E5 v1 – +19% for Xeon E5 v4 • OLTP – + 51% for Xeon E5 v1 – +19% for Xeon E5 v4

Recognize this one? • Moore’s law • “[Transistors] were shrinking so fast that every year twice as many could fit onto a chip.” • 1975: “adjusted the pace to a doubling every two years”

There is Moore • CPU processing power per dollar • DRAM & NAND: price per megabit – a 35% per year reduction in price • Also drives the Cloud / Internet • “Google will do anything to beat Moore’s law”
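The two rates quoted on these slides can be compared with a little arithmetic. The sketch below assumes smooth exponential change, which is a simplification the slides do not state, and the function names are illustrative only. It converts "doubling every two years" into a yearly growth factor and "35% per year reduction in price" into a halving time.

```python
import math


def annual_growth_from_doubling(years_to_double):
    """Yearly growth factor implied by doubling every `years_to_double` years."""
    return 2 ** (1 / years_to_double)


def halving_time(annual_reduction):
    """Years needed for a price to halve at a constant yearly reduction (0..1)."""
    return math.log(0.5) / math.log(1 - annual_reduction)


if __name__ == "__main__":
    # Doubling every two years is roughly 41% growth per year.
    print(f"growth factor per year: {annual_growth_from_doubling(2):.3f}")
    # A 35% yearly price drop halves the price in a bit over a year and a half.
    print(f"halving time in years:  {halving_time(0.35):.2f}")
```

Under these assumptions, a 35% yearly price reduction halves the price per megabit slightly faster than the two-year doubling pace of the 1975 formulation.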

MOORE'S LAW IS “SILICON VALLEY'S BEATING HEART”

The Thermal Wall: 2004

A few examples today

Power density  Product line   Cores  Clock  Year  Name          Process  Min die size  Power
(W/cm²)                              (GHz)                               (mm²)         (W)

Historical ref points
103            Pentium 4      1      3.8    2004  "Prescott"    90 nm    112           115
27             Pentium 3      1      1      1999  "Coppermine"  180 nm   106           29

Today
75             Core i7-6xxx   4      4      2016  "Skylake"     14 nm    122           91
57             Xeon E5        8      3.4    2016  "Broadwell"   14 nm    246           140
50             Core i7-4xxx   4      4      2014  "Haswell"     22 nm    177           88

GPUs
58             GeForce 1000   3584   1.6    2016  "Pascal"      16 nm    520           300
44             GeForce 800    2880   0.9    2014  "Kepler"      28 nm    571           250
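The power density figures in this table appear to be the chip's power divided by its die size, expressed in W/cm². The sketch below recomputes them from the power and die-size values on the slide; the units (watts and mm²) and the helper name are assumptions inferred from the numbers, not something the slide states.

```python
def power_density_w_per_cm2(power_w, die_mm2):
    """Power density in W/cm² from power in watts and die area in mm²."""
    return power_w / die_mm2 * 100  # 1 cm² = 100 mm²


# (power in W, die size in mm²), taken from the slide's table.
CHIPS = {
    "Pentium 4 (Prescott)": (115, 112),
    "Pentium 3 (Coppermine)": (29, 106),
    "Core i7-6xxx": (91, 122),
    "Xeon E5 (Broadwell)": (140, 246),
    "Core i7-4xxx": (88, 177),
    "GeForce 1000 (Pascal)": (300, 520),
    "GeForce 800 (Kepler)": (250, 571),
}

if __name__ == "__main__":
    for name, (power_w, die_mm2) in CHIPS.items():
        density = power_density_w_per_cm2(power_w, die_mm2)
        print(f"{name:24s} {density:6.1f} W/cm²")
```

Rounded to whole numbers, the results match the slide's first column, with the 2004 Pentium 4 still showing the highest density of the chips listed.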

A bumpy road • 90 nm (2004), strained silicon (35% faster switching) • 45 nm (2008) “high-k dielectric” – reduced leakage • 22 nm (2012) “Tri-gate” (reduces both switching and leakage power) – Research started in 2002!! • THE WALL: photolithography process uses light with a 193 nanometre wavelength – EUV (13.5 nm)

2013 • Still optimistic • Intel, AMD, TSMC, GlobalFoundries, and IBM => • “Moore’s Law Roadmap”

2016 • 10 nm Postponed to late 2017 • 7 nm: Big Question mark! • NO more Silicon, but Indium Gallium Arsenide (InGaAs) at 7 nm • Nanotubes? Graphene?

• 4% loss per generation!

Problem: big data gets brains • Data gets too complex for humans to analyze

And Now? • Field Programmable Gate Array (FPGA) • ASICs (Application-Specific ICs) • Graphics Processing Unit (GPU) • MIC (Many Integrated Core)

The market has changed too EVOLVING MARKET, NEW PLAYERS

Total Market: something has changed

Cavium Thunder-X • First 64-bit ARM server vs “mid-range” Xeon E5 • 48 “simple 2 IPC” cores @ 2 GHz @ 120 W – Single-thread performance is 3-5x lower • 28 nm technology • Gigabyte servers

Software ecosystem • No Java Native Access libraries • Spark crashes with machine language message • MySQL, LAMP, most Java applications work

Performance / watt

Conclusion ARMv8 (64) • Niche-oriented Cavium Thunder-X • Future chips from Qualcomm, Cavium (maybe Avago/Broadcom) • AMD & AppliedMicro not competitive (yet??) • A few big customers: – PayPal (VPN, firewall, some web services) – Already conquering the Chinese market (HiSilicon, Huawei) • Fragmented market • Still immature ecosystem: – JNA & ElasticSearch, Spark

OPENPOWER

POWER8 disadvantages • Very power hungry: 10 cores @ 190 W TDP + Mem buffers (60-80W) vs 22 cores @ 145W Xeon • JNA not supported • Some software still a bit unoptimized (MySQL)
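To put the "very power hungry" bullet in numbers, the sketch below divides the quoted power figures by the core counts on this slide. The helper name is illustrative, and the comparison is deliberately crude: it ignores per-core performance and the fact that a POWER8 core runs 8 threads, so it is only a rough illustration.

```python
def watts_per_core(tdp_w, cores, extra_w=0.0):
    """Rough power budget per core: (TDP + extra platform power) / cores."""
    return (tdp_w + extra_w) / cores


if __name__ == "__main__":
    # POWER8: 10 cores @ 190 W TDP plus 60-80 W for the memory buffers.
    p8_low = watts_per_core(190, 10, extra_w=60)
    p8_high = watts_per_core(190, 10, extra_w=80)
    # Xeon: 22 cores @ 145 W TDP.
    xeon = watts_per_core(145, 22)
    print(f"POWER8: {p8_low:.1f} to {p8_high:.1f} W per core")
    print(f"Xeon:   {xeon:.1f} W per core")
```

On these figures the POWER8 budget comes to roughly 25 to 27 W per core against about 6.6 W per core for the Xeon.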

When OpenPOWER makes sense • Based upon the most complex core on the market (8 threads, 8 IPC, 3.5+ GHz) • (Some) pricing competitive with HP/Dell • 32 DIMM slots per CPU (Intel: 12) • Open from firmware to software • Google & Rackspace have a new OpenPOWER server • Some software runs as fast as on the best Xeons (MongoDB, PostgreSQL) • Software ecosystem has grown fast …

OpenPower Ecosystem

IBM: first integrator of NVLink

“Deep Learning” P100

Page Migration Engine & POWER8 with NVLink: Barriers to Entry Removed • Far easier to create new applications on Tesla P100 • NVIDIA Page Migration Engine ensures unified memory space – Unified memory: address space spans CPU and GPU, 1TB+ – Hardware managed transfers: eliminates explicit data transfers – Testing program implementing these advantages • POWER8 with NVLink ensures speedy data throughput – 1TB memory space requires faster CPU:GPU data movement – Bus masks transfer times – Close code-base to parallel CPU code

[Slide graphic, barriers listed: Too large a memory space required • Too complicated to move data • Too much custom coding for GPU data movement • Moves too much data • Software UVM feature too limiting • Requires page faulting support]

Percona MySQL 5.7

Few Large or many small nodes? SPARK TESTING

Our test • 300 GB of gzip-compressed “Common Crawl” web archives • Body text extracted by “BoilerPipe” • Natural Language Processing (Stanford) • Aggregate: group by & sort entity counts • Generate recommendations with Alternating Least Squares

Realtime in-memory processing with Spark

Spark Optimization • Number of virtual cores per executor (JVM): – 1 per 2 logical cores (Intel: 1, IBM: 4) • Number of executors = number of physical cores – 1 • spark.default.parallelism = about 1.5-2 tasks per executor • GCThreads = 1 per virtual core per executor • Speed up = 10-20%
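The rules of thumb on this slide can be turned into concrete settings with a small helper. This is a minimal sketch under several assumptions: the function name is hypothetical, the sizing is taken to be per node, "physical cores – 1" is read as "minus one", and the slide's "GCThreads" is mapped to the JVM flag `-XX:ParallelGCThreads`, which the slide does not name explicitly. The configuration keys themselves are standard Spark properties.

```python
def spark_settings(physical_cores, threads_per_core, tasks_per_executor=2.0):
    """Derive Spark settings from the slide's rules of thumb (per node, assumed)."""
    # 1 virtual core per 2 logical cores of a physical core (Intel: 1, IBM: 4).
    cores_per_executor = max(1, threads_per_core // 2)
    # One executor per physical core, minus one (reading "– 1" as subtraction).
    executors = max(1, physical_cores - 1)
    # About 1.5-2 tasks per executor.
    parallelism = round(executors * tasks_per_executor)
    return {
        "spark.executor.instances": executors,
        "spark.executor.cores": cores_per_executor,
        "spark.default.parallelism": parallelism,
        # Assumption: "GCThreads" means the HotSpot ParallelGCThreads flag.
        "spark.executor.extraJavaOptions":
            f"-XX:ParallelGCThreads={cores_per_executor}",
    }


if __name__ == "__main__":
    # Core counts borrowed from the POWER8 slide: 22-core Xeon, 10-core POWER8.
    for label, cores, smt in [("Xeon", 22, 2), ("POWER8", 10, 8)]:
        print(label)
        for key, value in spark_settings(cores, smt).items():
            print(f"  --conf {key}={value}")
```

The printed `--conf` pairs could be passed to `spark-submit`; the slide reports a 10-20% speed-up from this kind of tuning, so the values are best treated as a starting point for measurement.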

• 20% gain per generation

Conclusions so far • Moore’s law is dead: opportunity for niche players • OpenPower has some tangible advantages • Next generation of ARM servers should be watched • New innovations … – Combining streaming, sensor data & static data – Deep learning • … will require much more tuning & specialized chips

Rate My Session!
