Spin glass simulations on Janus. R. (lele) Tripiccione, Dipartimento di Fisica, Università di Ferrara, raffaele.tripiccione@unife.it. UCHPC, Rodos (Greece), Aug. 27th, 2012
Warning / Disclaimer / Fine print: I'm an outsider here ---> a physicist's view on an application-specific architecture. A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture. However, a few points of contact with mainstream CS may still exist ...
On the menu today WHAT?: spin-glass simulations in short WHY?: computational challenges HOW?: the JANUS systems DID IT WORK?: measured and expected performance (and comparison with “conventional” systems) Take-away lessons / Conclusions
Our computational problem. Bring a spin-glass (*) system of e.g. 48^3 grid points to thermal equilibrium: a challenge never attempted so far ---> follow the system for 10^12 – 10^13 Monte Carlo (*) steps on ~100 independent system instances. (*) to be defined in the next slides. Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year ...)
Statistical mechanics in brief .... Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values over its microscopic structure. A (hopefully familiar) example: explain why magnets have a transition temperature beyond which they lose their magnetic state.
The Ising model ..... The tiny little magnets are named spins; they take just two values. A “configuration” is a specific value assignment for all spins in the system. The “macro”-behavior is dictated by the energy function at the “micro” level: each spin interacts only with its nearest neighbours in a discrete D-dimensional mesh, $U(\{S\}) = -\sum_{\langle ij \rangle} J\, S_i S_j$, with $J > 0$. Statistical physics bridges the gap from micro to macro ....
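The bridge from micro to macro mentioned on this slide is, stated explicitly, the Boltzmann distribution (a standard textbook formula, added here for completeness; it is also what the Metropolis probability $e^{-\Delta U/kT}$ on a later slide samples from):

P(\{S\}) = \frac{e^{-U(\{S\})/kT}}{Z},
\qquad
Z = \sum_{\{S\}} e^{-U(\{S\})/kT},
\qquad
\langle A \rangle = \sum_{\{S\}} A(\{S\})\, P(\{S\}) .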
The spin-glass model ..... Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior: interesting per se, a model of complexity, interesting for industrial applications. An apparently trivial change in the energy function makes spin-glasses much more complex than Ising systems. Studying these systems is a computational nightmare ...
Why are Spin Glasses so hard?? A very simple change in the energy function (defined on e.g. a discrete 3-D lattice), $U = -\sum_{\langle ij \rangle} J_{ij}\, \sigma_i \sigma_j$, with $\sigma_i \in \{+1, -1\}$ and $J_{ij} \in \{+1, -1\}$, hides tremendously complex dynamics, due to the extremely irregular energy landscape in configuration space (frustration):
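A minimal illustration of frustration (not on the original slide): take a single square plaquette with couplings $J_{12}=J_{23}=J_{34}=+1$ and $J_{41}=-1$. Since $\sigma_1\sigma_2 \cdot \sigma_2\sigma_3 \cdot \sigma_3\sigma_4 \cdot \sigma_4\sigma_1 = 1$ for any spin assignment, the product of the four bond terms equals the product of the couplings:

\prod_{\langle ij\rangle \in \square} J_{ij} = -1
\;\Longrightarrow\;
\text{at least one bond is unsatisfied, so}\quad
\min_{\{\sigma\}} U_{\square} = -3 + 1 = -2 \;>\; -4 .

No spin assignment reaches the "ideal" energy of -4, and several different assignments tie at -2: this degeneracy is the seed of the rugged energy landscape.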
Monte Carlo algorithms These beasts are best studied numerically by Monte Carlo algorithms Monte Carlo algorithms navigate in configuration space in such a way that: ----> any configuration will show up according to its probability to be realized in the real world (at a given temperature) MC algorithms come in several versions … … most versions have remarkably similar requirements in terms of their algorithmic structure.
The Metropolis algorithm. An endless loop .....
- pick up one (or several) spin(s)
- compute the energy U
- flip it/them and compute the new energy U'
- compute $\Delta U = U' - U$
- if $\Delta U \le 0$, accept the change unconditionally
- else accept the change only with probability $e^{-\Delta U / kT}$
- pick up new spin(s) and do it again
... just a few C lines
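The C code shown on the original slide is not reproduced in this text dump, so here is a minimal sketch of what those few lines might look like for the ±J model defined earlier: a single-site Metropolis move plus a full lattice sweep. The lattice size, the array layout, and the use of drand48()/exp() are illustrative assumptions only; a production code would read the acceptance probability from a small precomputed table, as the later slides describe.

/* Minimal sketch (not the actual Janus code) of a Metropolis sweep
 * for a 3D +/-1 spin glass with +/-1 couplings, periodic boundaries. */
#include <stdlib.h>
#include <math.h>

#define L 48                                   /* linear lattice size (assumed) */

static int s[L][L][L];                         /* spins, +1 or -1               */
static int Jx[L][L][L], Jy[L][L][L], Jz[L][L][L];  /* couplings on +x,+y,+z bonds */

static void update_site(int x, int y, int z, double beta)   /* beta = 1/kT */
{
    int xm = (x + L - 1) % L, xp = (x + 1) % L;
    int ym = (y + L - 1) % L, yp = (y + 1) % L;
    int zm = (z + L - 1) % L, zp = (z + 1) % L;

    /* local field h = sum over the 6 neighbours of J_ij * s_j */
    int h = Jx[x][y][z]  * s[xp][y][z] + Jx[xm][y][z] * s[xm][y][z]
          + Jy[x][y][z]  * s[x][yp][z] + Jy[x][ym][z] * s[x][ym][z]
          + Jz[x][y][z]  * s[x][y][zp] + Jz[x][y][zm] * s[x][y][zm];

    int dU = 2 * s[x][y][z] * h;               /* energy change if s is flipped */

    /* Metropolis test; a real code replaces exp() with a probability LUT */
    if (dU <= 0 || drand48() < exp(-beta * (double)dU))
        s[x][y][z] = -s[x][y][z];
}

void metropolis_sweep(double beta)
{
    for (int z = 0; z < L; z++)
        for (int y = 0; y < L; y++)
            for (int x = 0; x < L; x++)
                update_site(x, y, z, beta);
}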
Monte Carlo algorithms. Common features:
- bit-manipulation operations on spins (+ LUT access), see the bit-sliced sketch below
- (good-quality/long) random numbers
- a huge degree of available parallelism
- regular program flow (orderly loops on the grid sites)
- regular, predictable memory access pattern
- information exchange (processor <-> memory) is huge
- however, the size of the data base is tiny
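As an illustration of the bit-manipulation point above, here is a sketch of the standard multi-spin-coding trick on a conventional CPU (a well-known technique offered as an assumption, not the Janus host code): spins and couplings of 64 independent replicas are packed one bit per word lane, a bond is unsatisfied when s_i XOR s_j XOR J_ij is 1, and the six neighbour contributions are summed with a bit-sliced adder.

/* Multi-spin coding sketch: 64 replicas processed per 64-bit word.
 * Encoding (assumed): spin up = 1 bit, antiferromagnetic bond = 1 bit. */
#include <stdint.h>

/* one "unsatisfied bond" bit per replica */
static inline uint64_t unsat(uint64_t si, uint64_t sj, uint64_t Jij)
{
    return si ^ sj ^ Jij;
}

/* Bit-sliced count of unsatisfied bonds over the 6 neighbours.
 * u[0..5] each hold one bit per replica; the count (0..6) is returned
 * as three bit-planes c2 c1 c0 per lane. */
static void count_unsat(const uint64_t u[6],
                        uint64_t *c0, uint64_t *c1, uint64_t *c2)
{
    /* two full adders: (u0,u1,u2) -> (h1,l1), (u3,u4,u5) -> (h2,l2) */
    uint64_t l1 = u[0] ^ u[1] ^ u[2];
    uint64_t h1 = (u[0] & u[1]) | (u[2] & (u[0] ^ u[1]));
    uint64_t l2 = u[3] ^ u[4] ^ u[5];
    uint64_t h2 = (u[3] & u[4]) | (u[5] & (u[3] ^ u[4]));

    /* add the two 2-bit partial sums per lane */
    uint64_t carry = l1 & l2;
    *c0 = l1 ^ l2;
    *c1 = h1 ^ h2 ^ carry;
    *c2 = (h1 & h2) | (carry & (h1 ^ h2));
}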
Compute intensive, you mean?? One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 picosecond. If you want to understand what happens in just the first seconds of a real experiment you need O(10^12) time steps on ~100 replicas of a 100^3 system ---> 10^20 updates. Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years.
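Spelling out the arithmetic behind those numbers (no new data, just the figures already on the slide):

10^{12}\ \text{steps} \times 100\ \text{replicas} \times 100^{3}\ \text{spins} \approx 10^{20}\ \text{spin updates},
\qquad
10^{20} \times 1\ \text{ns} = 10^{11}\ \text{s} \approx 3000\ \text{years}.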
Compute intensive, you mean?? The dynamics is dramatically slow (see picture), so even a simulated box whose size is a small multiple of the correlation length will give accurate physics results. Good news: we're in business even if we simulate a very small box .... However ....
Hard scaling vs. weak scaling. Amdahl's law (strong scaling),
$S_A = \dfrac{1}{(1-p) + p/N}$
... vs. Gustafson's law (weak scaling),
$S_G = (1-p) + N\,p$
In our case, enlarging the system size is meaningless, as we do not yet have the resources to study even a “small” system ----> the ultimate quest for strong scaling ....
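A quick worked example (p = 0.99 parallel fraction and N = 1024 processors are assumed numbers, chosen only for illustration):

S_A = \frac{1}{(1-0.99) + 0.99/1024} \approx 91,
\qquad
S_G = (1-0.99) + 1024 \times 0.99 \approx 1014 .

Weak scaling looks comfortable on paper, but since the physics fixes a small lattice, only the strong-scaling number matters here.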
The JANUS project. An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin glass systems. A collaboration of: Universities of Rome (La Sapienza) and Ferrara; Universities of Madrid, Zaragoza, Badajoz; BIFI (Zaragoza); Eurotech. Partially supported by Microsoft, Xilinx.
The nature of the available parallelism. Spin-glass simulations have two levels of available parallelism. 1) Embarrassingly trivial: need statistics on several replicas ---> farm it out to independent processors. 2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> can update in parallel any set of non-mutually-interacting spins; make it a black-white checkerboard: it opens the way to tens of thousands of independent threads (see the sketch below)... 1) & 2) do not commute.
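A sketch of the checkerboard idea in plain C (an illustration, not the Janus implementation; it reuses the hypothetical update_site() and L from the earlier sketch, and a real parallel code would also need one independent RNG stream per thread):

/* Half-sweep over one checkerboard colour: sites of equal parity have
 * no bonds between them, so they can all be updated concurrently. */
void metropolis_halfsweep(double beta, int parity)    /* parity = 0 or 1 */
{
    #pragma omp parallel for collapse(2)    /* any parallel backend works */
    for (int z = 0; z < L; z++)
        for (int y = 0; y < L; y++)
            for (int x = (y + z + parity) & 1; x < L; x += 2)
                update_site(x, y, z, beta);
}

/* One full Monte Carlo sweep = black half-sweep + white half-sweep. */
void checkerboard_sweep(double beta)
{
    metropolis_halfsweep(beta, 0);
    metropolis_halfsweep(beta, 1);
}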
The ideal spin glass machine ..... A further question: what is the appropriate system scale at which this parallelism is best exploited? One update engine:
- computes the local contribution to $U = -\sum_{\langle ij \rangle} J_{ij}\, \sigma_i \sigma_j$
- addresses a probability table
- compares with a freshly generated random number
- assigns the new spin value
(a probability-table sketch follows below)
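For the ±J model the energy change of a single flip takes only a handful of discrete values, so the "probability table" step can literally be a tiny lookup indexed by the local field. A hedged C sketch of that idea (table layout and names are assumptions, not the actual engine):

/* Probability LUT sketch for the +/-J model in 3D.
 * The local field h = sum_j J_ij * s_j over 6 neighbours can only be
 * -6,-4,-2,0,2,4,6, so dU = 2*s_i*h takes 7 values and the Metropolis
 * acceptance probability min(1, exp(-beta*dU)) fits in a 13-entry table
 * indexed by s_i*h (shifted to 0..12). */
#include <math.h>
#include <stdlib.h>

static double acc[13];                  /* acc[s_i*h + 6] = acceptance prob. */

void build_lut(double beta)
{
    for (int sh = -6; sh <= 6; sh++) {
        double dU = 2.0 * sh;
        acc[sh + 6] = (dU <= 0.0) ? 1.0 : exp(-beta * dU);
    }
}

/* Metropolis test with no exp() in the inner loop */
int accept_flip(int s_i, int h)
{
    return drand48() < acc[s_i * h + 6];
}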
The ideal spin glass machine ..... All this is just a bunch (~1000) of gates, and in spite of that a typical CPU core, with O(10^7+) gates, can process perhaps 4 spins at each clock cycle. If you can arrange your stock of gates the way that best suits the algorithm, you can easily expect ~1000 update engines on one chip ----> the best structure is a massively-many-core organization (or perhaps an application-driven GPU??)
The ideal spin glass machine ..... is an orderly structure (a 2D grid) of a large number of “update engines”:
- each update engine handles a subset of the physical mesh
- its architectural structure is extremely simple
- each data path processes one bit at a time
- memory addressing is regular and predictable
- SIMD processing is OK
- however, memory bandwidth requirements are huge (need 7 bits to process one bit...)
- however, memory can be “local to the processor”
Simple hardware structure ---> FPGAs are OK!
The JANUS machine. A parallel system of (themselves) massively parallel processor chips. The basic hardware element: a 2-D grid of 4 x 4 (FPGA-based) processors (SPs); data links among nearest neighbours on the grid; one control processor on each board (IOP) with 2 Gbit Ethernet links to the host.
JANUS: a picture gallery
Our “large” machine: 256 (16 x 16) processors, 8 host PCs --> ~90 TIPS for spin-glass simulation. A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~100 days.
JANUS as a spin-glass engine. The 2008 implementation (XILINX Virtex4-LX200):
- 1024 update cores on each processor, pipelineable to one spin update per clock cycle ---> 88% of available logic resources
- system clock at 62.5 MHz ---> 16 ps average spin update time
- using a bandwidth of ~12000 read bits + 1000 written bits per clock cycle ---> 47% of available on-chip memory
(Measured) Performances. Let's use “conventional” units first ???? The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz); with 1024 PEs ----> ~830 GIPS. However, 11 of those ops are on very short data words; more honestly, 7...8 sustained “conventional” pipelined ops per clock cycle; with 1024 PEs ----> ~300 GIPS ---> 10 GIPS/W, sustained by ~1 Tbyte/sec combined memory bandwidth.
(Measured) Performances. Physicists like a different figure of merit ----> the spin-flip rate R, typically measured in picoseconds per flip. For each processor in the system:
$R = \dfrac{1}{N f} = \dfrac{1}{1024 \times 62.5\ \text{MHz}} \simeq 16\ \text{ps/flip}$
For one complete element of the JANUS core (16 processors):
$R = \dfrac{1}{N_p N f} = \dfrac{1}{16 \times 1024 \times 62.5\ \text{MHz}} \simeq 1\ \text{ps/flip}$
... as fast as Nature!
Physics results
Performance figures (2008-2009). Spin-glass addicts like to quote the average spin-update time:
                         SUT        GUT
  Janus module           16 ps      1 ps
  PC (Intel Core Duo)    3000 ps    700 ps
  IBM CBE (all cores)    -          65 ps
300x – 700x !!
Performance figures (2010-2011). In the last couple of years, multi/many-core processors and GPUs have entered the arena.... Still 10x – 20x !!