Efficient Correlation-Free Many-States Lattice Monte Carlo on GPUs
Jeffrey Kelling, Géza Ódor, Martin Weigel, Sibylle Gemming
8th May 2017
Member of the Helmholtz Association | FWIO | http://www.hzdr.de

1 Introduction: What is this talk about?
    surface growth, physical aging (and non-equilibrium systems)
    lattice Monte-Carlo
2 Trivial parallelism vs. SIMT
[Figure: lattice sketch labeled x, y, p, q]

Applications for Monte Carlo: Stochastic Processes
  game theory
  sociology
  finance
  ...
  e. g.: Perc, Matjaž, Eur. J. Phys. 38 (4) 045801 (2017)
  Müller, T., Heinig, K.-H. et al., Appl. Phys. Lett. 85 2373 (2004)
  Image sources:
    http://en.wikipedia.org/wiki/File:Rub_al_Khali_002.JPG
    http://hubblesite.org/newscenter/archive/releases/2007/17/image/a
    https://www.hzdr.de/db/Cms?pOid=24344&pNid=2707

Non-Equilibrium vs Equilibrium
  Equilibrium properties: only final state relevant
    optimal algorithm reaches equilibrium quickly
  Out-of-equilibrium: kinetics of interest
    optimal algorithm reproduces physical evolution
  [Figures: 8-states Potts model, k_B T = 5; 8-states Potts model; labels: disordered state, ordered state, J]

Non-Equilibrium Systems
  Interface roughness:
    $W^2(L, t) = \frac{1}{L^2} \sum_i^{L^2} h_i^2(t) - \left( \frac{1}{L^2} \sum_i^{L^2} h_i(t) \right)^2$
    $L$ = lateral system size
    $h_i$ = surface height at site $i$
  [Figure: interface roughness $W^2$ versus $t$ [Monte Carlo steps (MCS)], log-log axes; surface snapshots at 0.1 M, 0.6 M, 20.5 M and 150.05 M MCS]
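
The roughness definition above is the variance of the surface height over all lattice sites. A minimal sketch in Python, assuming the heights are given as a flat list of numbers (the function name `roughness_squared` is hypothetical and not from the talk):

```python
def roughness_squared(heights):
    """Return W^2 = <h^2> - <h>^2, averaged over all lattice sites."""
    n = len(heights)
    mean_h = sum(heights) / n                    # (1/L^2) * sum_i h_i
    mean_h2 = sum(h * h for h in heights) / n    # (1/L^2) * sum_i h_i^2
    return mean_h2 - mean_h ** 2


flat = [3] * 16          # flat surface: no roughness
stepped = [0, 1] * 8     # heights alternate between 0 and 1
print(roughness_squared(flat))     # 0.0
print(roughness_squared(stepped))  # 0.25
```

For long runs with large heights, a two-pass or shifted formula would be numerically safer than subtracting two large averages; the direct form is used here only because it mirrors the equation term by term.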

Domain Decomposition
  Random Sequential (RS) on GPU: domain decomposition
    + uncorrelated updates
    − < 48 kB per domain in smem
  Stochastic Cellular Automaton (SCA)
    update odd/even sublattice
    update probability p < 1
    + linear memory access ⇒ fast
  [Figure: lattice tiled into domains labeled 1, 2 / 4, 3]
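
The SCA scheme above can be sketched as a checkerboard sweep: first all sites of one sublattice are visited, then all sites of the other, and each site attempts an update only with probability p. This is a serial illustration under stated assumptions, not the GPU implementation from the talk; `sca_sweep` and the callback `try_update` are hypothetical names.

```python
import random


def sca_sweep(width, height, p, rng, try_update):
    """One checkerboard sweep: even sublattice first, then odd.

    Each site attempts an update with probability p; try_update(x, y)
    is a user-supplied callback that applies the model's update rule.
    """
    for parity in (0, 1):
        for y in range(height):
            for x in range(width):
                if (x + y) % 2 == parity and rng.random() < p:
                    try_update(x, y)


visited = []
sca_sweep(4, 4, 0.95, random.Random(0), lambda x, y: visited.append((x, y)))
print(len(visited), "of 16 sites attempted an update")
```

Within one sublattice no two sites are nearest neighbours, which is what allows all of them to be updated concurrently on a GPU.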

Parallel random sequential updates are hard.
Why should we care about them?

Auto-Correlation of a Lattice Gas
  $C(t, s) = \langle \phi(t) \phi(s) \rangle - \langle \phi(t) \rangle \langle \phi(s) \rangle$
  $t$, $s$: time, waiting-time
  [Figure: $C(t, s) \cdot s^{0.76}$ versus $t/s$, log-log axes; curve: Random Sequential]
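
The connected auto-correlation above can be estimated directly from stored configurations. A minimal sketch, assuming `phi_t` and `phi_s` hold the observable at time t and at waiting time s for the same sites or samples, and that the angle brackets denote a plain average over those entries (an assumption; the slide does not specify the averaging):

```python
def autocorrelation(phi_t, phi_s):
    """Return <phi(t) phi(s)> - <phi(t)> <phi(s)> over paired entries."""
    n = len(phi_t)
    mean_ts = sum(a * b for a, b in zip(phi_t, phi_s)) / n
    mean_t = sum(phi_t) / n
    mean_s = sum(phi_s) / n
    return mean_ts - mean_t * mean_s


occupied_s = [0, 1, 0, 1]   # lattice-gas occupation at waiting time s
print(autocorrelation(occupied_s, occupied_s))    # 0.25 (unchanged state)
print(autocorrelation([1, 0, 1, 0], occupied_s))  # -0.25 (fully inverted)
```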

Auto-Correlation of a Lattice Gas
  [Figure: $C(t, s) \cdot s^{0.76}$ versus $t/s$, log-log axes; curves: SCA limit (correction), Checkerboard SCA, Random Sequential]

KPZ–Equation for Surface Growth
  Kardar–Parisi–Zhang stochastic differential equation:
    $\partial_t h(\mathbf{x}, t) = v + \sigma_2 \nabla^2 h(\mathbf{x}, t) + \lambda [\nabla h(\mathbf{x}, t)]^2 + \eta(\mathbf{x}, t)$
    $v$: mean growth vel.
    $\sigma_2 \nabla^2 h(\mathbf{x}, t)$: surface tension
    $\lambda [\nabla h(\mathbf{x}, t)]^2$: local growth vel.
    $\eta(\mathbf{x}, t)$: noise
  Kardar, M., Parisi, G., Zhang, Y.-C., Phys. Rev. Lett. 56 889 (1986)
  [Figure: 2+1D octahedron model, labeled x, y, p, q; interface roughness $W^2$ versus $t$ [Monte Carlo steps (MCS)], log-log axes, slope $2\beta_{\mathrm{eff}}$]
  Ódor, G., Liedke, B., Heinig, K.-H., Phys. Rev. E 79 021125 (2009)

β and the Kim–Kosterlitz Hypothesis
  β = 1/4 ?
    Kim, J. M., Kosterlitz, J. M., Phys. Rev. Lett. 62 2289 (1989)
  octahedron model
    Δh = ±1
    β < 1/4
    Kelling, J., Ódor, G., Phys. Rev. E 84 061150 (2011)
  restricted solid-on-solid model
    Δh ≤ N
    β ≈ 1/4 for N > 1 ?
    Kim, J. M., J. Korean Phys. Soc. 67 (9) 1529 (2015)
  [Figure: $\beta_{\mathrm{eff}}$ (0.236 to 0.246) versus $1/t^{1/2}$ (0 to 0.08); curves labeled 12, 13, 16, 17]
  We need more states.

Part 2
Trivial parallelism vs. SIMT
Handling more states.

Trivial parallelism vs. SIMT
  efficient simulation of independent copies
  vector of 32, ..., 128, 256, ... layers, depending on application
  ⇒ “random” accesses to vectors in global memory
  ⇒ no caching of simulation state required
  ⇒ very efficient use of GPUs
  ⇒ (vector processors/data parallelism)
  Ito, N., Kanada, Y., Supercomputer 3 (25) 1988

Trivial parallelism vs. SIMT
  efficient simulation of independent copies
  Trivially parallel → Multi-Surface
    → large samples ⇒ good statistics
    → large parameter studies
    → large sets of initial conditions
    + random site-selection
  Ito, N., Kanada, Y., Supercomputer 3 (25) 1988

Multi-Surface Approach for GPUs
  double-tiling at device layer ... with random origin
  Multi-Surface at block layer
  [Figure: lattice tiles labeled 1, 2 / 4, 3; GPU hierarchy: global memory; multi-processor 1 ... multi-processor N, each with shared memory (up to 48 kB) and thread 1 ... thread M, separated by sync points]

Decorrelating Samples
  random site-selection is about introducing uncorrelated noise
  we want to average over independent samples
  domain growth, phase ordering: structure evolution
    random initial conditions
    independent random update acceptance (Boltzmann factors $\exp(-\Delta E / k_B T)$)
    (quenched disorder)
    ⇒ no problem
  surface growth
    flat initial conditions
    ⇒ all simulations with identical site-selection would be identical
    randomly discard every 2nd update
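
The surface-growth case above can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not a model from the talk: a 1D ring where a site grows by one if it is not higher than either neighbour. All samples share the same site sequence; each sample has its own `keep` mask that decides which updates are discarded. The names `grow`, `sites` and `keep` are hypothetical.

```python
import random


def grow(size, sites, keep):
    """Run a toy 1D growth model from a flat surface.

    sites: shared sequence of selected sites (identical for all samples)
    keep:  per-sample mask; False means this update is discarded
    """
    heights = [0] * size
    for x, accepted in zip(sites, keep):
        if not accepted:
            continue
        left = heights[(x - 1) % size]     # periodic boundary conditions
        right = heights[(x + 1) % size]
        if heights[x] <= left and heights[x] <= right:
            heights[x] += 1
    return heights


rng = random.Random(1)
size, steps = 16, 200
sites = [rng.randrange(size) for _ in range(steps)]

# Without discarding, two samples with shared site-selection are identical.
all_kept = [True] * steps
print(grow(size, sites, all_kept) == grow(size, sites, all_kept))  # True

# With independent masks that drop about half of the updates, they diverge.
mask_a = [rng.random() < 0.5 for _ in range(steps)]
mask_b = [rng.random() < 0.5 for _ in range(steps)]
print(grow(size, sites, mask_a))
print(grow(size, sites, mask_b))
```

The price of this decorrelation is that roughly half of the attempted updates are thrown away, so the shared site sequence must be about twice as long per unit of simulated time.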

Not Decorrelating Samples
  Cases where identical noise across samples is desirable:
    sampling initial conditions
    calculating response functions
    parallel annealing

RSOS Multi-Surface
  8 bits per lattice-site are enough ⇒ process 4 packed samples per thread
  4 bits per height-difference
  word 0 ≡ thread 0: sample 0, sample 1, sample 2, sample 3, each at (x, y)
  word 1 ≡ thread 1: sample 4, sample 5, sample 6, sample 7, each at (x, y)
  ...
  randomly select 2 out of 4 samples for each thread ⇒ no idle threads
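
The packing described above, four 8-bit samples per 32-bit word with two 4-bit height differences per sample, can be sketched with plain shifts and masks. The byte order within the word and the assignment of nibbles to the two lattice directions are assumptions made for illustration; all function names are hypothetical.

```python
BITS_PER_SAMPLE = 8
SAMPLE_MASK = 0xFF
WORD_MASK = 0xFFFFFFFF


def pack(samples):
    """Pack four 8-bit site states into one 32-bit word (sample 0 lowest)."""
    word = 0
    for i, value in enumerate(samples):
        word |= (value & SAMPLE_MASK) << (i * BITS_PER_SAMPLE)
    return word


def get_sample(word, i):
    """Extract the 8-bit state of sample i from a packed word."""
    return (word >> (i * BITS_PER_SAMPLE)) & SAMPLE_MASK


def set_sample(word, i, value):
    """Return the word with sample i replaced by value."""
    shift = i * BITS_PER_SAMPLE
    cleared = word & ~(SAMPLE_MASK << shift) & WORD_MASK
    return cleared | ((value & SAMPLE_MASK) << shift)


def split_site(state):
    """Split an 8-bit site state into two 4-bit height differences."""
    return state & 0xF, state >> 4


word = pack([0x12, 0x34, 0x56, 0x78])
print(hex(word))                       # 0x78563412
print(hex(get_sample(word, 2)))        # 0x56
print(hex(set_sample(word, 1, 0xAB)))  # 0x7856ab12
print(split_site(0x34))                # (4, 3)
```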

Collective Generation of Random Coordinates
  all threads access the same coordinate for each update
  ⇒ pre-compute list of update coordinates in shared memory
  each thread computes one component:
    1 generate random number
    2 apply transformations (origin shift, periodic boundary conditions)
  collectively refill list when used up
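
The steps above can be sketched serially, with a loop standing in for the threads of one block. This is only an illustration under assumptions: even-numbered threads produce x components and odd-numbered threads produce y components, each thread has its own generator, and the returned list plays the role of the shared-memory coordinate list. `refill_coordinates`, `thread_rngs`, `origin` and `lattice_size` are hypothetical names.

```python
import random


def refill_coordinates(thread_rngs, origin, lattice_size):
    """Emulate a block refilling its list of update coordinates.

    Each 'thread' contributes exactly one coordinate component.
    """
    components = []
    for tid, rng in enumerate(thread_rngs):
        axis = tid % 2                              # 0: x, 1: y
        raw = rng.randrange(lattice_size[axis])     # 1: generate random number
        shifted = raw + origin[axis]                # 2: origin shift ...
        components.append(shifted % lattice_size[axis])  # ... and periodic wrap
    # Pair up neighbouring components into (x, y) coordinates.
    return list(zip(components[0::2], components[1::2]))


thread_rngs = [random.Random(seed) for seed in range(8)]  # 8 emulated threads
coords = refill_coordinates(thread_rngs, origin=(5, 9), lattice_size=(16, 16))
print(coords)  # 4 coordinates, then every thread consumes this list in order
```

On a GPU the refill would be followed by a block-wide synchronization before any thread reads the list.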

Performance
  bit-coded: number of states ≤ 4, large systems
  multi-surface: any number of states, large samples
  [Figure: bar chart of update attempts/ns; bars: Octahedron RS, Octahedron SCA p = 0.95, Octahedron SCA p = 0.5, RSOS RS, Potts RS, Potts RS Kawasaki; value labels: 229, 11, 9, 7, 4.5]

Memory Limits: RSOS
  single-GPU implementations
  64 threads per block ⇒ 256 samples ⇒ 256 B / MS lattice site
  ⇒ $2^{12} \times 2^{12}$ sites need 4 GB of gmem
  + random number generator states
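
The memory estimate above follows from simple arithmetic, spelled out here as a check. The variable names are illustrative; the only inputs are the numbers on the slide (64 threads per block, 4 packed samples per thread at 8 bits each, a 2^12 × 2^12 lattice). Random number generator states are not included.

```python
threads_per_block = 64
samples_per_thread = 4        # four 8-bit samples packed per 32-bit word
bytes_per_sample_site = 1     # 8 bits per lattice site and sample

samples = threads_per_block * samples_per_thread
bytes_per_ms_site = samples * bytes_per_sample_site   # per multi-surface site
sites = 2**12 * 2**12
total_bytes = sites * bytes_per_ms_site

print(samples)                # 256
print(bytes_per_ms_site)      # 256
print(total_bytes / 2**30)    # 4.0 (GiB of global memory)
```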