The future is not what it used to be
The Future is not what it used to be... Erik Hagersten


  1. The Future is not what it used to be... Erik Hagersten

  2. Then... ENIAC, 1946 (“5 kHz”): 18,000 vacuum tubes, programmed by patch cables. AVDARK 2012 | Dept of Information Technology | www.it.uu.se | Erik Hagersten | user.it.uu.se/~eh

  3. Then (in Sweden)
     - BARK (~1950): 8,000 relays, 80 km of cables
     - BESK (~1953): 2,400 vacuum tubes, “20 kHz” (world record)

  4. “Recently”: APZ 212, 1983. Ericsson’s Supercomputer (“5 MHz”)

  5. APZ 212 marketing brochure quotes:
     - “Very compact”
     - 6 times the performance
     - 1/6th the size
     - 1/5 the power consumption
     - “A breakthrough in computer science”
     - “Why more CPU power?”
     - “All the power needed for future development”
     - “…800,000 BHCA, should that ever be needed”
     - “SPC computer science at its most elegance”
     - “Using 64 kbit memory chips”
     - “1500 W power consumption”

  6. 65 years of “improvements”: Speed, Size, Price, Price/performance, Reliability, Predictability, Energy, Safety, Usability…

  7. “Moore’s Law”. Pop: double performance every 18 to 24 months. [Figure: performance (log scale, 1 to 1000) versus year, with single-core and multicore curves; the year 2006 is marked.]

  8. Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

  9. Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

  10. Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

  11. Exponential development: doubling/halving times (according to Kurzweil)
     - Dynamic RAM memory (bits per dollar): 1.5 years
     - Average transistor price: 1.6 years
     - Microprocessor cost per transistor cycle: 1.1 years
     - Total bits shipped: 1.1 years
     - Processor performance in MIPS: 1.8 years
     - Transistors in Intel microprocessors: 2.0 years

     [Figure: log scale (1 to 1000) versus time.]
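
A doubling (or halving) time of T years corresponds to a change by a factor of 2^(1/T) per year. The following is a minimal Python sketch that converts the times listed on this slide into yearly factors; the function and variable names are illustrative, not from the talk.

```python
# Doubling/halving times in years, as listed on the slide (attributed to Kurzweil).
DOUBLING_TIMES_YEARS = {
    "Dynamic RAM memory (bits per dollar)": 1.5,
    "Average transistor price": 1.6,
    "Microprocessor cost per transistor cycle": 1.1,
    "Total bits shipped": 1.1,
    "Processor performance in MIPS": 1.8,
    "Transistors in Intel microprocessors": 2.0,
}


def annual_factor(doubling_time_years: float) -> float:
    """Factor by which a quantity changes in one year if it doubles
    (or halves) every `doubling_time_years` years."""
    return 2.0 ** (1.0 / doubling_time_years)


if __name__ == "__main__":
    for name, years in DOUBLING_TIMES_YEARS.items():
        print(f"{name}: x{annual_factor(years):.2f} per year")
```

For the price and cost entries the factor is best read as a yearly reduction (divide by it rather than multiply).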

  12. Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

  13. Linear scale, 1940 to 2017 (2x performance every 18th month). [Figure: “Doubling every 18th month since 1940”; performance on a linear axis (0 to 4E+15) versus year (1940 to 2010).]

  14. Exponential development
     - Example: doubling every 2nd year. How long does it take for a 1000x improvement?
     - Example: doubling every 18th month. How long does it take for a 1000x improvement?

     [Figure: the same growth on a log scale (1 to 1000) and on a linear scale, versus time.]
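
The two questions on this slide can be answered with a logarithm: since 1000 is close to 2^10, a 1000x improvement takes about ten doublings, i.e. roughly 20 years at one doubling every 2 years and roughly 15 years at one doubling every 18 months. A short Python sketch of the calculation (the function name is illustrative):

```python
import math


def years_to_factor(factor: float, doubling_time_years: float) -> float:
    """Years needed to improve by `factor` when the quantity doubles
    every `doubling_time_years` years."""
    return math.log2(factor) * doubling_time_years


if __name__ == "__main__":
    for months in (24, 18):
        years = years_to_factor(1000, months / 12)
        print(f"Doubling every {months} months: 1000x takes about {years:.1f} years")
```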

  15. Looking Forward. Three rules of common wisdom:
     - Do not bet against exponential trends
     - Do not bet against exponential trends
     - Do not bet against exponential trends

     But is it possible to continue “Moore’s Law”? Are there show-stoppers? Can we utilize an exponential growth in #cores?

  16. Not everything scales as fast! Example: 470.LBM, “Lattice Boltzmann Method”, to simulate incompressible fluids in 3D. [Figure: throughput (0 to 3.5) versus number of cores used (1 to 4), with 1.0 at one core.] Throughput (as defined by SPEC): amount of work performed per time unit when several instances of the application are executed simultaneously. Our TP study: compare the TP improvement when going from 1 core to 4 cores.
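
The throughput definition on this slide can be sketched in a few lines of Python. The timings below are hypothetical and only illustrate the arithmetic; they are not measurements from the talk, and the function names are illustrative.

```python
def throughput(instances: int, wall_time_s: float) -> float:
    """Work per time unit: completed instances divided by wall-clock time."""
    return instances / wall_time_s


def throughput_improvement(time_1: float, instances_n: int, time_n: float) -> float:
    """Throughput with `instances_n` concurrent instances, relative to the
    throughput of a single instance running alone."""
    return throughput(instances_n, time_n) / throughput(1, time_1)


if __name__ == "__main__":
    # Hypothetical timings: one instance takes 100 s alone;
    # four concurrent instances all finish after 250 s.
    print(throughput_improvement(100.0, 4, 250.0))
```

With these made-up numbers, four cores deliver about 1.6 times the single-core throughput rather than 4 times, which is the kind of gap the slide is pointing at.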

  17. Nerd Curve: 470.LBM. [Figure: cache miss rate versus cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem was fixed. Marked miss rates: 5.0% and 3.5%. Marked points: running one thread, running four threads.] Less work per memory byte moved @ four threads.

  18. Remember: It is getting worse! Computation vs bandwidth; #Cores ~ #Transistors. [Figure: (#T * T_freq) / (#P * P_freq), on a scale of 0 to 6, by year from 2007 to 2015; diagram of CPUs connected through #Pins to DRAM.] Source: International Technology Roadmap for Semiconductors (ITRS). From Karlsson and Hagersten, “Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution”, IPDPS, March 2007 [graph updated with more recent data]. HPCwire, Feb 2011 [cites Linley Gwennap and Justin Rattner]: “Without Silicon Photonics, Moore's Law Won't Matter”. HPCwire, Feb 2011: “Growing Data Deluge Prompts Processor Redesign”.
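
The metric plotted on this slide divides a proxy for compute capacity (#T * T_freq) by a proxy for off-chip bandwidth (#P * P_freq). The Python sketch below shows how that ratio grows when transistor counts rise faster than pin counts. The growth rates are hypothetical placeholders chosen for illustration; they are not ITRS data, and the function name is not from the talk.

```python
def compute_to_bandwidth(transistors: float, t_freq: float,
                         pins: float, p_freq: float) -> float:
    """Slide metric: (#T * T_freq) / (#P * P_freq)."""
    return (transistors * t_freq) / (pins * p_freq)


if __name__ == "__main__":
    # Hypothetical growth rates, for illustration only (not ITRS data):
    # transistor count doubles every 2 years, pin count grows 10% per year,
    # and both frequencies stay flat.
    base = compute_to_bandwidth(1.0, 1.0, 1.0, 1.0)
    for year in range(0, 9, 2):
        ratio = compute_to_bandwidth(2.0 ** (year / 2), 1.0, 1.1 ** year, 1.0)
        print(f"year +{year}: {ratio / base:.2f}x the baseline ratio")
```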

  19. Case study: Limited by bandwidth

  20. Nerd Curve (again). [Figure: cache miss rate versus cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem was fixed. Marked miss rates when running four threads: original application 5.0%, optimized application 2.5%.] Twice the amount of work per memory byte moved.
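
The “twice the amount of work per memory byte moved” on this slide follows from the two miss rates if memory traffic is taken to be proportional to the miss rate. The Python sketch below makes that assumption explicit (same number of memory accesses, fixed cache line size, prefetch and write-back traffic ignored); the function name is hypothetical.

```python
def relative_work_per_byte(miss_rate_before: float, miss_rate_after: float) -> float:
    """Work per memory byte moved after a change, relative to before.

    Assumes the same number of memory accesses and a fixed cache line size,
    so that the bytes moved are proportional to the miss rate.
    """
    if miss_rate_before <= 0 or miss_rate_after <= 0:
        raise ValueError("miss rates must be positive")
    return miss_rate_before / miss_rate_after


if __name__ == "__main__":
    # Miss rates read off this slide: original 5.0%, optimized 2.5%.
    print(relative_work_per_byte(0.050, 0.025))
```

The same ratio applied to the 3.5% and 5.0% marks on the earlier Nerd Curve gives about 0.7, assuming those marks correspond to one and four threads respectively, which would match the “less work per memory byte moved” remark there.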

  21. Better Memory Usage! Example: 470.LBM, modified to promote better cache utilization. [Figure: throughput (0 to 3.5) versus number of cores used (1 to 4), compared with the original code.]

  22. Example 2: A Scalable Parallel Application. [Figure: performance (0 to 4) versus number of cores (1 to 4). App: Cigar.] Looks like a perfectly scalable application! Are we done?

  23. Example 2: The Same Application Optimized. [Figure: performance (0 to 30) versus number of cores (1 to 4), original versus optimized, annotated 7.3x. App: Cigar.] Looks like a perfectly scalable application! Are we done? Duplicate one data structure.

  24. Implementation Trends

  25. Predicting the future is hard. Predicting: “Chip Multiprocessor”, aka multicores [from PARA Bergen 2000]. Chip Multiprocessor (CMP): simple fast CPU; many open questions. [Diagram: four CPUs labeled “threads t”, each with a first-level cache ($1), sharing an L2 cache (L2$), with a memory interface (Mem I/F) and an external interface (External I/F).]

  26. Multi-CMPs [from PARA Bergen 2000]. Explicit parallelism: #chips x #threads/chip
     - Global shared memory
     - Global/local comm cost > 10
     - Gotta’ explore small caches
     - Gotta’ explore locality!
     - OS scalability?
     - Application scalability?

     [Diagram: c chips, each with its own memory (Mem), connected by an interconnect.]

  27. Why Multicores Now? How is “Moore’s Law” doing? [Figure: performance (log scale) versus time, with single-core and multicore curves; ~2007 is marked.]
     1. Not enough ILP/MLP to get payoff from using more transistors
     2. Signal propagation delay » transistor delay
     3. Power consumption: P_dyn ~ C * f * V^2
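
The power relation P_dyn ~ C * f * V^2 is the usual argument for trading one fast core for several slower ones. The Python sketch below works through one hypothetical operating point; the scaling values are illustrative assumptions, not figures from the talk, and the function name is made up for this example.

```python
def relative_dynamic_power(freq_scale: float, voltage_scale: float,
                           capacitance_scale: float = 1.0) -> float:
    """Dynamic power relative to a baseline, using P_dyn ~ C * f * V^2."""
    return capacitance_scale * freq_scale * voltage_scale ** 2


if __name__ == "__main__":
    # Hypothetical operating point: run at 80% frequency and 80% voltage.
    per_core = relative_dynamic_power(0.8, 0.8)
    print(f"one slowed core: {per_core:.3f}x power")
    print(f"two slowed cores: {2 * per_core:.3f}x power, "
          f"{2 * 0.8:.1f}x peak throughput if the work parallelizes")
```

Under these assumptions a slowed core uses roughly half the dynamic power, so two of them fit in about the original power budget while offering more peak throughput, provided the software exposes enough parallelism.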

  28. Darling, I shrunk the computer. Sequential execution (≈ one program): Mainframes; Super Minis; Microprocessor. Paradigm shift: Chip Multiprocessor (CMP), a multiprocessor on a chip! Need TLP to make one chip run fast.

  29. HPC in the Rear Mirror... [Figure: timeline from 1980 to 2010 of HPC platform generations: Vector, Nifty Parallel, Killer Micro SMPs, Beowulf x86 Linux Clusters, MC Clusters, MC + Accelerators, and “????” for what comes next. Annotations marked *: promise of performance, forced by technology, COTS cost, UNIX, scalability. Annotations marked †: COTS perf, high cost, bad scaling, hard to use, no standards, not general, expensive, “????”. Other labels: convergence, commercial computing, management, naive view.]
