Institut für Mathematik, Universität Augsburg
Ulrich Rüde

Efficiency of numerical algorithms on future high performance supercomputers

Ulrich Rüde
Institut für Mathematik, Universität Augsburg
http://scicomp.math.uni-augsburg.de/ruede/me.html
DFG project: Datenlokale Iterationsverfahren (data-local iterative methods)
March 1998

Outline
• The efficiency paradox
• What is wrong about our algorithms and programs
• Cache oriented iterative methods
• High performance computer architecture
• Scientific computing in the future

Ideally ...
Mathematical arguments predict that
• multigrid with nested iteration can solve scalar elliptic PDE with approximately
  - 100 operations
  - storing 8 reals per unknown
• on a workstation
  - that can do 1 × 10^9 operations (= 1000 MFlop) per second
  - in 64 × 10^6 words (= 512 MByte) of storage
and so, we can solve for 10^7 unknowns in about one second.

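The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slide; the variable names are illustrative and not part of the original talk.

```python
# Figures quoted on the slide; names are illustrative assumptions.
OPS_PER_UNKNOWN = 100        # multigrid with nested iteration
REALS_PER_UNKNOWN = 8        # storage per unknown, in words
OPS_PER_SECOND = 1e9         # 1000 MFlop workstation
STORAGE_WORDS = 64e6         # 512 MByte at 8 bytes per word

# How many unknowns fit into memory, and how long 10^7 unknowns would take.
unknowns_that_fit = STORAGE_WORDS / REALS_PER_UNKNOWN
seconds_for_1e7 = 1e7 * OPS_PER_UNKNOWN / OPS_PER_SECOND

print(f"unknowns that fit in memory: {unknowns_that_fit:.1e}")
print(f"time for 1e7 unknowns: {seconds_for_1e7:.2f} s")
```

Memory holds 8 × 10^6 unknowns, which the slide rounds to 10^7, and the operation count gives about one second.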
In practice ...
our programs
• can sometimes only do 10^5 unknowns
• on a (massively parallel) supercomputer
• where it does not run for hours

Run time comparison of iterative algorithms on uniform grids
• Standard Multigrid
• Adaptive Multigrid
• SOR
• SOR with cache optimization

Efficiency of Poisson Solvers
Benchmark suggested by Botta et al. in: How fast the Laplace Equation was solved in 1995.
[Figure: Performance of Multigrid Poisson Solver. Time per unknown (microseconds, 0 to 16) versus level L (grid size = 2^L, L = 4 to 14) for Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200.]
• With 1 GFlop performance, 250 operations per unknown should be executed in 0.25 microseconds.
• For small data sets we thus have up to 25% peak performance, for large data sets < 7% peak performance.

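The relation between a measured time per unknown and the fraction of peak performance can be sketched as follows. The function name and the sample timings are hypothetical; only the 250 operations per unknown and the 1 GFlop peak come from the slide.

```python
def fraction_of_peak(time_per_unknown_us, ops_per_unknown=250, peak_flops=1e9):
    """Return achieved/peak performance for a measured time per unknown.

    time_per_unknown_us is the measured time in microseconds (hypothetical input).
    """
    ideal_us = ops_per_unknown / peak_flops * 1e6   # 0.25 microseconds at 1 GFlop
    return ideal_us / time_per_unknown_us


# Hypothetical timings, chosen only to illustrate the two regimes on the slide.
for t_us in (1.0, 4.0):
    print(f"{t_us:.1f} us per unknown -> {100 * fraction_of_peak(t_us):.1f}% of peak")
```

Under these assumptions, 1 microsecond per unknown corresponds to roughly 25% of peak, and anything slower than about 3.6 microseconds per unknown falls below 7%.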
Efficiency (cont'd)
• Benchmark requires error reduction of 10^6 in the residual.
• This is oversatisfied by 5 V(2,1) cycles costing 250 floating point operations per unknown.
• Using a V(2,1)-FMG algorithm this could be reduced by a factor 4.
Comparison with two best performers from Botta paper
[Figure: Performance of Multigrid Poisson Solver. Time per unknown (microseconds, 0 to 100) versus level L (grid size = 2^L, L = 4 to 14) for Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200; MILU-rrb (on HP755); NGILU (on HP755).]

Which algorithms and data structures spoil performance?
Perform red-black relaxation on Digital Alpha PWS 500au using a
• structured grid with constant coefficients
• structured grid with variable coefficients
• unstructured grid, implemented with linked list, but all data ideally cache aligned
• unstructured grid, data non cache aligned
• structured grid, constant coefficients, optimized for cache performance

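For reference, the first variant in the list (structured grid, constant coefficients) can be sketched as below. This is a plain, untuned illustration for the 2D Poisson problem with the 5-point stencil, not the benchmark code from the talk; the function name and the list-of-lists grid layout are assumptions.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = f, 5-point stencil.

    u and f are (n+2) x (n+2) lists of lists; the outer rows and columns
    of u hold the Dirichlet boundary values and are never modified.
    """
    n = len(u) - 2
    for colour in (0, 1):                 # 0: points with i + j even, 1: i + j odd
        for i in range(1, n + 1):
            # First interior column j with (i + j) % 2 == colour.
            j_start = 1 + ((i + 1 + colour) % 2)
            for j in range(j_start, n + 1, 2):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  + h * h * f[i][j])
    return u


# Usage: boundary value 1, zero right-hand side; the iterates approach u = 1.
n = 4
u = [[1.0] * (n + 2) for _ in range(n + 2)]
for i in range(1, n + 1):
    for j in range(1, n + 1):
        u[i][j] = 0.0
f = [[0.0] * (n + 2) for _ in range(n + 2)]
for _ in range(200):
    red_black_sweep(u, f, h=1.0 / (n + 1))
```

Each colour is a separate pass over the whole grid, so on a large grid every point is loaded from main memory twice per sweep. That is the kind of global sweep the cache-optimized variant is meant to avoid.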
Performance of RB
[Figure: Performance of Red Black Relaxation. MegaFlop (0 to 700) versus grid size (16 to 2048) for: structured, const coeff; structured, const coeff, cache tuned; structured, variable coefficients; unstructured, cache aligned access; unstructured, non-cache aligned.]
[Figure: Performance of Red Black Relaxation, detail. MegaFlop (0 to 180) versus grid size (16 to 1024) for: structured, variable coefficients; unstructured, cache aligned access; unstructured, non-cache aligned.]
Performance depending on vector length

Memory Hierarchy (DEC PWS 600)
Example Architecture

[Diagram: CPU with 32 registers; 1000 W Level 1 cache; 12000 W Level 2 cache; 0.5 MW external Level 3 cache; 64 MW main memory; 1 GW disk space.]

Level         Capacity    Bandwidth (MB/s)   Latency
FP Register   256 B       28,800             1.7 ns
Cache 1       8 KB        19,200             1.7 ns
Cache 2       96 KB       9,600              5.0 ns
Cache 3       2 MB        873                23.3 ns
Main Mem      1,536 MB    1,070              105.0 ns

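To connect the table with the grid sizes used in the measurements, the following sketch looks up the smallest level that can hold one n × n array. The capacities are taken from the table; the interpretation of KB and MB as powers of 1024, the 8-byte values, and the single-array working set are simplifying assumptions, and the function names are illustrative.

```python
# Capacities from the table, in bytes (assuming 1 KB = 1024 B, 1 MB = 1024 KB).
LEVELS = [
    ("FP Register", 256),
    ("Cache 1", 8 * 1024),
    ("Cache 2", 96 * 1024),
    ("Cache 3", 2 * 1024 ** 2),
    ("Main Mem", 1536 * 1024 ** 2),
]


def grid_bytes(n, bytes_per_value=8):
    """Size of one n x n array of floating point values."""
    return n * n * bytes_per_value


def smallest_level_holding(n_bytes):
    """Name of the smallest memory level whose capacity is at least n_bytes."""
    for name, capacity in LEVELS:
        if n_bytes <= capacity:
            return name
    return "Disk"


for n in (32, 64, 128, 512, 1024):
    print(n, smallest_level_holding(grid_bytes(n)))
```

A real solver keeps several arrays per grid, so the transitions between levels occur at smaller grid sizes than this single-array estimate suggests.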
Backus (1977): "I propose to call this tube the Von-Neumann bottleneck."

What are the consequences? To avoid inefficiency we must:

Avoid dynamic structures. No linked lists, binary trees, etc. on too low granularity. How to implement sparse matrices then? We don't.

Exploit instruction-level parallelism. Prepare the codes such that automatic restructuring tools and compilers (optimizers) can extract the parallelism. F90 and HPF array syntax are counter-productive, since we also need to

Exploit data locality. Do not program in global sweeps! We cannot save in forming Ax, but we can save when Ax, A^2 x, A^3 x, ... are needed. This is awkward programming and in the future we will need tools for this job!

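The remark about Ax, A^2 x, A^3 x can be illustrated with a small sketch. It assumes A is the 1D stencil (-1, 2, -1) with zero boundary values, which is a choice made here for illustration and not taken from the talk. The fused version produces y = Ax and z = A^2 x in a single pass, using each y value shortly after it is computed instead of sweeping over the data twice.

```python
def apply_stencil(x):
    """Global sweep: y = A x for the 1D stencil (-1, 2, -1), zero boundaries."""
    n = len(x)
    return [2 * x[i]
            - (x[i - 1] if i > 0 else 0)
            - (x[i + 1] if i < n - 1 else 0)
            for i in range(n)]


def fused_two_applications(x):
    """Compute y = A x and z = A (A x) in one pass over x."""
    n = len(x)
    if n == 0:
        return [], []
    y = [0] * n
    z = [0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0
        right = x[i + 1] if i < n - 1 else 0
        y[i] = 2 * x[i] - left - right
        if i >= 1:
            # y[i-2], y[i-1], y[i] are now known, so row i-1 of z can be finished
            # while those values are still close to the processor.
            j = i - 1
            z[j] = 2 * y[j] - (y[j - 1] if j > 0 else 0) - y[i]
    # Last row of z has no right neighbour.
    j = n - 1
    z[j] = 2 * y[j] - (y[j - 1] if j > 0 else 0)
    return y, z


x = [1, 2, 3, 4, 5]
y, z = fused_two_applications(x)
```

The bookkeeping already looks awkward for two applications of a three-point stencil, which is the point the slide makes about needing tools for this job.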
PAM - Patch Adaptive Multigrid
• nodes are grouped in (non-overlapping) patches of fixed size
• each level consists of a collection of patches
• patches may be present (live) or
• patches may be virtual (ghosts)

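One possible way to represent the structure described above is sketched below. It is not the PAM implementation; the class names, the patch size of 8, and the 2D index tuples are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

PATCH_SIZE = 8  # hypothetical fixed patch edge length


@dataclass
class Patch:
    index: Tuple[int, int]                        # position of the patch within its level
    values: Optional[List[List[float]]] = None    # None means the patch is a ghost

    @property
    def is_live(self) -> bool:
        return self.values is not None

    def activate(self) -> None:
        """Turn a ghost into a live patch by allocating its node values."""
        if self.values is None:
            self.values = [[0.0] * PATCH_SIZE for _ in range(PATCH_SIZE)]


@dataclass
class Level:
    patches: Dict[Tuple[int, int], Patch] = field(default_factory=dict)

    def patch(self, index: Tuple[int, int]) -> Patch:
        """Return the patch at index, creating it as a ghost if it is absent."""
        return self.patches.setdefault(index, Patch(index))

    def live_patches(self) -> List[Patch]:
        return [p for p in self.patches.values() if p.is_live]


level = Level()
level.patch((0, 0)).activate()   # a live patch with storage
level.patch((0, 1))              # a ghost: known to the level, no storage
```

Storing node values in fixed-size contiguous blocks per patch, rather than per node, keeps the dynamic bookkeeping at a coarse granularity.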
History and Future of Microprocessors

Technology data
Yr   Type    MHz     µm     Trans       CPI   MFlop
82   80286   12      1.5    0.14 M      30    0.4
85   80386   33      1.0    0.28 M      12    3.0
97   21164   625     0.35   9.30 M      0.5   1.25 G
11   Int-X   10000   ?      1000.00 M   ?     ?

Improvement Factors
              82 - 97             97 - 2011
MHz           50                  16
Transistors   65                  100
MFlop         3000 (≈ 50 × 65)    ???

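The improvement factors can be recomputed from the technology data. The sketch below uses only the table entries; the dictionary layout and key names are illustrative, and the two-digit years are read as 1982, 1997 and 2011.

```python
# Table entries; transistor counts in millions, MFlop for the 21164 is 1.25 G = 1250.
TECHNOLOGY = {
    1982: {"mhz": 12, "transistors_m": 0.14, "mflop": 0.4},
    1997: {"mhz": 625, "transistors_m": 9.30, "mflop": 1250.0},
    2011: {"mhz": 10000, "transistors_m": 1000.0, "mflop": None},  # unknown on the slide
}


def improvement(key, year_from, year_to):
    """Ratio of a table entry between two years."""
    return TECHNOLOGY[year_to][key] / TECHNOLOGY[year_from][key]


for key in ("mhz", "transistors_m", "mflop"):
    print(key, "1982-1997:", round(improvement(key, 1982, 1997), 1))
for key in ("mhz", "transistors_m"):
    print(key, "1997-2011:", round(improvement(key, 1997, 2011), 1))
```

The exact ratios come out slightly different from the rounded values on the slide: about 52 and 66 for clock rate and transistor count from 1982 to 1997, and about 3100 for MFlop, which is close to the product of the first two.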
Dave Patterson (1997 in SIAM News): "Instruction level parallelism is running out of steam."

Interpretation
• A microprocessor today
  - is faster than the fastest supercomputer 15 years ago.
  - has the internal parallelism equivalent to the largest parallel processor 15 years ago.
• A microprocessor in 2011
  - could be faster than the fastest supercomputer today (if we find a way to exploit what technology will make possible)
  - could employ as much internal parallelism as a massively parallel computer today.

J. Goodman & D. Burger (1997 in an IEEE Computer editorial): "The circumstances in which computer architects will find themselves in the next 15 years are truly daunting."

Memory systems
• To sustain the peak performance of a 1.25 GFlop processor today, the memory system needs a bandwidth of 30 GByte/sec, but typical (main) memory systems only deliver 1 GByte/sec.
• A hypothetical 1.25 TFlop processor would need 30 TByte/sec memory bandwidth. If we assume that this processor will have a 4096 bit memory bus, it would still require a bus clock of 60 GHz.

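The bandwidth figures can be checked with the short calculation below. It assumes decimal units (1 GByte = 10^9 bytes) and one bus transfer per clock cycle; the function names are illustrative.

```python
def bytes_per_flop(bandwidth_bytes_per_s, flops):
    """Memory traffic per floating point operation implied by the slide's figures."""
    return bandwidth_bytes_per_s / flops


def bus_clock_hz(bandwidth_bytes_per_s, bus_width_bits):
    """Bus clock needed if one full-width transfer happens per cycle (assumption)."""
    return bandwidth_bytes_per_s / (bus_width_bits / 8)


today = bytes_per_flop(30e9, 1.25e9)      # 24 bytes, i.e. three 8-byte words per flop
future_clock = bus_clock_hz(30e12, 4096)  # about 58.6e9 Hz

print(f"{today:.0f} bytes per flop")
print(f"{future_clock / 1e9:.1f} GHz bus clock")
```

The result of roughly 58.6 GHz matches the 60 GHz quoted on the slide after rounding, and 24 bytes per flop corresponds to three 8-byte operands per operation.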