Challenging the Intel Xeon: ARM and OpenPower Now you really have to optimize
Mighty Intel … • “Intel had a 99.2 percent market share in server chips” (IDC, 2015 – Quoted on InfoWorld) • “We started experimenting with SoCs two years ago. … didn't work well because the single-thread performance was too low, resulting in higher latency for our web platform” – Facebook Engineering
…sits solid on the Throne • Best & most mature process technology in the world – 14 nm FinFET tri-gate (2014) • Power management the competition can only dream of • Richest software ecosystem
Sizing Servers • Established in 2006 at Howest*, funded by the Flemish government since 2007 • 4-6 FTE (2007-2016) • 2-3 trainees • Specialized in independent performance optimization research • *Howest = Technical University in West Flanders (Kortrijk, Belgium)
March 2016 IWT VIS TR 135096
March 2012 • Java performance – +60% for Xeon E5 v1 – +19% for Xeon E5 v4 • OLTP – +51% for Xeon E5 v1 – +19% for Xeon E5 v4
Recognize this one? • Moore’s law • “[Transistors] were shrinking so fast that every year twice as many could fit onto a chip.” • 1975: Moore “adjusted the pace to a doubling every two years”
There is Moore • CPU processing power per dollar • DRAM & NAND: price per megabit – a 35% per year reduction in price • Also drives the Cloud / Internet • “Google will do anything to beat Moore’s law”
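To put the two rates side by side, here is a minimal sketch of the compounding arithmetic. It assumes constant yearly rates, which is a simplification; the function names are made up for illustration.

```python
import math


def annual_growth_from_doubling(years_to_double: float) -> float:
    """Yearly growth factor implied by a doubling every `years_to_double` years."""
    return 2 ** (1 / years_to_double)


def years_to_halve(annual_reduction: float) -> float:
    """Years until a price halves at a constant yearly reduction (0.35 means 35%)."""
    return math.log(0.5) / math.log(1 - annual_reduction)


if __name__ == "__main__":
    # Doubling every two years corresponds to roughly 41% growth per year.
    print(f"Growth per year at a 2-year doubling: {annual_growth_from_doubling(2) - 1:.1%}")
    # A 35% yearly price reduction halves the price in about 1.6 years.
    print(f"Years to halve the price at -35%/year: {years_to_halve(0.35):.2f}")
```

Under these assumptions, a 35% yearly price drop is slightly faster than a doubling every two years.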
MOORE'S LAW IS “SILICON VALLEY'S BEATING HEART”
The Thermal Wall: 2004
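A few examples today
| Group | Power density (W/cm²) | Product line | Cores | Clock (GHz) | Year | Name | Process | Min die size (mm²) | Power (W) |
|---|---|---|---|---|---|---|---|---|---|
| Historical ref points | 103 | Pentium 4 | 1 | 3.8 | 2004 | "Prescott" | 90 nm | 112 | 115 |
| Historical ref points | 27 | Pentium 3 | 1 | 1 | 1999 | "Coppermine" | 180 nm | 106 | 29 |
| Today | 75 | Core i7-6xxx | 4 | 4 | 2016 | "Skylake" | 14 nm | 122 | 91 |
| Today | 57 | Xeon E5 | 8 | 3.4 | 2016 | "Broadwell" | 14 nm | 246 | 140 |
| Today | 50 | Core i7-4xxx | 4 | 4 | 2014 | "Haswell" | 22 nm | 177 | 88 |
| GPUs | 58 | GeForce 1000 | 3584 | 1.6 | 2016 | "Pascal" | 16 nm | 520 | 300 |
| GPUs | 44 | GeForce 800 | 2880 | 0.9 | 2014 | "Kepler" | 28 nm | 571 | 250 |
(Power density = power / die size; the units are inferred from the figures.)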
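The power density figures on this slide appear to be the power figure divided by the die size. The sketch below reproduces them, assuming power is in watts and die size in mm² (so the result is W/cm²); those units are not stated on the slide.

```python
# (name, power in W, die size in mm^2), taken from the slide's figures.
CHIPS = [
    ("Pentium 4 Prescott", 115, 112),
    ("Pentium 3 Coppermine", 29, 106),
    ("Core i7-6xxx", 91, 122),
    ("Xeon E5 Broadwell", 140, 246),
    ("Core i7-4xxx", 88, 177),
    ("GeForce 1000 Pascal", 300, 520),
    ("GeForce 800 Kepler", 250, 571),
]


def power_density_w_per_cm2(power_w: float, die_mm2: float) -> float:
    """Power density in W/cm^2 (100 mm^2 = 1 cm^2)."""
    return power_w / (die_mm2 / 100.0)


if __name__ == "__main__":
    for name, power_w, die_mm2 in CHIPS:
        print(f"{name:22s} {power_density_w_per_cm2(power_w, die_mm2):6.1f} W/cm^2")
```

Read this way, the 2004 Pentium 4 still has the highest power density on the slide, well above the 2016 parts.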
A bumpy road • 90 nm (2004): strained silicon (35% faster switching) • 45 nm (2008): “high-k dielectric” – reduced leakage • 22 nm (2012): “Tri-gate” (reduces both switching and leakage power) – Research started in 2002! • THE WALL: photolithography still uses light with a 193 nm wavelength – EUV (13.5 nm)
2013 • Still optimistic • Intel, AMD, TSMC, GlobalFoundries, and IBM => • “Moore’s Law Roadmap”
2016 • 10 nm Postponed to late 2017 • 7 nm: Big Question mark! • NO more Silicon, but Indium Gallium Arsenide (InGaAs) at 7 nm • Nanotubes? Graphene?
• 4% loss per generation!
Problem: big data gets brains • Data gets too complex for humans to analyze
And Now? • Field Programmable Gate Array (FPGA) • ASICs (Application-Specific ICs) • Graphics Processing Unit (GPU) • MIC (Many Integrated Core)
The market has changed too EVOLVING MARKET, NEW PLAYERS
Total Market: something has changed
Cavium Thunder-X • First 64-bit ARM server vs “mid-range” Xeon E5 • 48 “simple 2 IPC” cores @ 2 GHz @ 120 W – Single-thread performance is 3-5x lower • 28 nm technology • Gigabyte servers
Software ecosystem • No Java Native Access (JNA) libraries • Spark crashes with a machine-language error message • MySQL, LAMP, most Java applications work
Performance / watt
Conclusion ARMv8 (64-bit) • Niche-oriented Cavium Thunder-X • Future chips from Qualcomm, Cavium (maybe Avago/Broadcom) • AMD & AppliedMicro not competitive (yet?) • A few big customers: – PayPal (VPN, firewall, some web services) – Already conquering the Chinese market (HiSilicon, Huawei) • Fragmented market • Still immature ecosystem: – JNA & Elasticsearch, Spark
OPENPOWER
POWER8 disadvantages • Very power hungry: 10 cores @ 190 W TDP + memory buffers (60-80 W) vs 22 cores @ 145 W for the Xeon • JNA not supported • Some software still a bit unoptimized (MySQL)
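The power figures on this slide can be turned into a rough watts-per-core comparison. This is only a sketch of the arithmetic: it uses TDP rather than measured power, and it says nothing about performance per watt, since a POWER8 core runs up to 8 threads.

```python
def watts_per_core(tdp_w: float, cores: int, extra_w: float = 0.0) -> float:
    """TDP (plus optional extra platform power) divided by the core count."""
    return (tdp_w + extra_w) / cores


# POWER8: 10 cores, 190 W TDP, plus 60-80 W for the memory buffers.
power8_low = watts_per_core(190, 10, extra_w=60)
power8_high = watts_per_core(190, 10, extra_w=80)
# Xeon: 22 cores at 145 W TDP.
xeon = watts_per_core(145, 22)

if __name__ == "__main__":
    print(f"POWER8: {power8_low:.1f}-{power8_high:.1f} W per core")
    print(f"Xeon:   {xeon:.1f} W per core")
```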
When OpenPOWER makes sense • Based on the most complex core on the market (8 threads, 8 IPC, 3.5+ GHz) • (Some) pricing competitive with HP/Dell • 32 DIMM slots per CPU (Intel: 12) • Open from firmware to software • Google & Rackspace have a new OpenPOWER server • Some software runs as fast as on the best Xeons (MongoDB, PostgreSQL) • Software ecosystem has grown fast …
OpenPower Ecosystem
IBM: first integrator of NVLink
“Deep Learning” P100
Page Migration Engine & POWER8 with NVLink: Barriers to Entry Removed
• Far easier to create new applications on Tesla P100
– NVIDIA Page Migration Engine ensures unified memory space
  • Unified memory: address space spans CPU and GPU, 1 TB+
  • Hardware-managed transfers: eliminates explicit data transfers
  • Testing program implementing these advantages
– POWER8 with NVLink ensures speedy data throughput
  • 1 TB memory space requires faster CPU:GPU data movement
  • Bus masks transfer times
– Close code-base to parallel CPU code
• Barriers removed (diagram labels): too large a memory space required; too complicated to move data; too much custom coding for GPU data movement; moves too much data; software UVM feature too limiting; requires page faulting support
Percona MySQL 5.7
Few Large or many small nodes? SPARK TESTING
Our test • 300 GB GZIP “Common Crawl” web archives • Body text extracted with “BoilerPipe” • Natural Language Processing (Stanford) • Aggregate: group by & sort entity counts • Generate recommendations with Alternating Least Squares
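As an illustration of the "group by & sort entity counts" step, here is a plain-Python stand-in. The actual test ran on Spark; the function name and the tiny input below are made up for the example and are not part of the test data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_entities(docs: Iterable[List[str]], n: int) -> List[Tuple[str, int]]:
    """Count entity mentions across documents and return the n most frequent."""
    counts: Counter = Counter()
    for entities in docs:
        counts.update(entities)  # group by entity name and add up the mentions
    return counts.most_common(n)  # sorted by count, highest first


if __name__ == "__main__":
    # Hypothetical entity lists, one per crawled page.
    sample = [["Intel", "IBM"], ["Intel", "ARM"], ["Intel", "IBM"]]
    print(top_entities(sample, 2))
```

On a cluster the same group-and-sort runs as a distributed aggregation over all pages, which is where the executor and parallelism settings come into play.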
Realtime in-memory processing with Spark
Spark Optimization • Number of virtual cores per executor (JVM): – 1 per 2 logical cores (Intel: 1, IBM: 4) • Number of executors = number of physical cores minus 1 • spark.default.parallelism = roughly 1.5-2 tasks per executor • GC threads = 1 per virtual core per executor • Speedup = 10-20%
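The rules of thumb above can be sketched as a small helper. This is one reading of the slide, not the authors' tooling: it assumes "logical cores" means hardware threads per physical core (2 on the Xeon, 8 on POWER8), that parallelism is the executor count times 1.5-2, and that the GC thread count maps to the JVM flag `-XX:ParallelGCThreads`. The function name `spark_settings` is hypothetical.

```python
def spark_settings(physical_cores: int, threads_per_core: int,
                   tasks_per_executor: float = 2.0) -> dict:
    """Derive Spark settings from the slide's rules of thumb (assumed reading)."""
    # 1 virtual core per 2 logical cores: Intel (2 threads) -> 1, IBM (8 threads) -> 4.
    executor_cores = max(1, threads_per_core // 2)
    # One executor per physical core, leaving one core free.
    executors = max(1, physical_cores - 1)
    # Roughly 1.5-2 tasks per executor.
    parallelism = round(executors * tasks_per_executor)
    return {
        "spark.executor.instances": executors,
        "spark.executor.cores": executor_cores,
        "spark.default.parallelism": parallelism,
        # Assumption: 1 GC thread per virtual core, passed as a JVM option.
        "spark.executor.extraJavaOptions": f"-XX:ParallelGCThreads={executor_cores}",
    }


if __name__ == "__main__":
    print("22-core Xeon:  ", spark_settings(22, threads_per_core=2))
    print("10-core POWER8:", spark_settings(10, threads_per_core=8))
```

The resulting values would still need to be validated per workload; the slide reports a 10-20% speedup from this kind of tuning.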
• 20% gain per generation
Conclusions so far • Moore’s law is dead: opportunity for niche players • OpenPower has some tangible advantages • Next generation of ARM servers should be watched • New innovations … – Combining streaming, sensor data & static data – Deep learning • … will require much more tuning & specialized chips