CEC/CIGPU 2010, Barcelona, July 2010
An Analytical Study of GPU Computation for Solving QAPs by Parallel Evolutionary Computation with Independent Run
Shigeyoshi Tsutsui, Hannan Univ., JAPAN
Noriyuki Fujimoto, Osaka Prefecture Univ., JAPAN
Outline of This Talk
• Background of the research
• Effect of parallel independent run on GPU
• Quadratic Assignment Problem (QAP)
• Implementation Details on GPU
• Results
• Analytical study
• Conclusions and Future Work
Background
• In a previous study (CIGPU 2009), we applied GPU computation to solve quadratic assignment problems (QAPs) with parallel EC on a single GPU
• The results of that study showed that parallel EC on the GTX285 GPU produces a speedup of 3x to 12x compared to the Intel Core i7 965 (3.2 GHz)
• However, the analysis of the results was postponed for future work
• In this study, we propose a simplified parallel EC model and analyze how the speedup is obtained, using a statistical model of parallel runs of the algorithm
Parallel EC Models
• Master-Slave Model (one master, several slaves)
• Coarse-grained Model (Distributed EC)
• Fine-grained Model
• Hybrid Model
• Individual-level Model
Parallel EC Model on GPU
• Parallel Independent Run Model
– A variant of the coarse-grained model
– Gives a lower bound on the performance of the coarse-grained model
– Each sub-population runs on its own MP independently
– On an MP, an individual-level parallel run is performed
[Figure: sub-populations 1, 2, ..., p mapped onto the 30 multiprocessors (MPs); each MP has its own shared memory (SM) and a group of TPs, and all MPs access VRAM (global memory)]
Effect of Parallel Independent Run
• Sequential run: average run time $T_{avg}$
• Parallel independent run: average run time $T_{p,avg}$
• Obviously, $T_{avg} > T_{p,avg}$
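The inequality on this slide can be checked with a small simulation. The sketch below is illustrative only: it assumes that the parallel run stops as soon as the first of the p independent runs finds the solution, and it draws run times from an exponential distribution chosen purely for convenience (the slides do not state the distribution at this point). The function name `mean_run_times` is hypothetical.

```python
import random


def mean_run_times(p, trials=20000, seed=0):
    """Estimate T_avg (single run) and T_p,avg (first of p runs to finish)."""
    rng = random.Random(seed)
    seq_total = 0.0
    par_total = 0.0
    for _ in range(trials):
        # Assumption: run times are i.i.d. exponential with mean 1.
        times = [rng.expovariate(1.0) for _ in range(p)]
        seq_total += times[0]      # a single sequential run
        par_total += min(times)    # parallel run ends with the fastest block
    return seq_total / trials, par_total / trials


t_avg, t_p_avg = mean_run_times(p=30)
print(f"T_avg = {t_avg:.4f}, T_p,avg = {t_p_avg:.4f}")
```

Any distribution with some spread in run times gives the same qualitative result, since the minimum of several samples is never larger than any one of them.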
Quadratic Assignment Problem (QAP)
• One of the hardest combinatorial optimization problems
• Problem size is at most 150
• Given $l$ locations and $l$ facilities, the task is to assign the facilities to the locations so as to minimize the cost
– For each pair of locations $i$ and $j$, the distance is $d_{ij}$
– For each pair of facilities $r$ and $s$, the flow is $f_{rs}$
– The cost of an assignment $\phi$, where $\phi(i)$ is the location of facility $i$, is defined as:
$$cost(\phi) = \sum_{i=1}^{l} \sum_{j=1}^{l} f_{ij}\, d_{\phi(i)\phi(j)}$$
An Example of QAP ($l = 4$)
[Figure: four locations and four facilities, with the distances and flows shown on the connecting edges]
Distance matrix $d_{ij}$ (locations):
       1    2    3    4
  1    0    5   10    2
  2    5    0    6    3
  3   10    6    0    4
  4    2    3    4    0
Flow matrix $f_{rs}$ (facilities):
       1    2    3    4
  1    0   21   11   44
  2   21    0   12   30
  3   11   12    0    9
  4   44   30    9    0
An assignment $\phi$:
  facility   1  2  3  4
  location   2  1  4  3
$$cost(\phi) = \sum_{i=1}^{4} \sum_{j=1}^{4} f_{ij}\, d_{\phi(i)\phi(j)} = 1524$$
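The cost of the example assignment can be reproduced in a few lines. This is a minimal sketch: the function name `qap_cost` and the 0-based indexing are choices made here for illustration, while the matrices and the assignment (2, 1, 4, 3) are the ones on the slide.

```python
def qap_cost(flow, dist, phi):
    """Sum of flow[i][j] * dist[phi[i]][phi[j]] over all facility pairs (i, j)."""
    n = len(phi)
    return sum(flow[i][j] * dist[phi[i]][phi[j]]
               for i in range(n) for j in range(n))


# Distance matrix between locations (from the slide).
DIST = [[0, 5, 10, 2],
        [5, 0, 6, 3],
        [10, 6, 0, 4],
        [2, 3, 4, 0]]

# Flow matrix between facilities (from the slide).
FLOW = [[0, 21, 11, 44],
        [21, 0, 12, 30],
        [11, 12, 0, 9],
        [44, 30, 9, 0]]

# The assignment (2, 1, 4, 3) from the slide, shifted to 0-based indices.
PHI = [1, 0, 3, 2]

example_cost = qap_cost(FLOW, DIST, PHI)
print(example_cost)  # 1524, matching the slide
```

Because both matrices are symmetric, each unordered pair is counted twice, which is why the six pairwise products (105, 33, 264, 24, 300, 36) sum to 762 and the total is 1524.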
The Base EC Model of a Sub-population
• We use a population pool P and a working pool W
• Each individual $I_i$ ($i = 1, 2, \ldots, N$) is processed independently of the other individuals
• The sub-population is re-initialized if the number of individuals that have the current best functional value is greater than $N \times 0.6$
[Figure: for each $I_i$ in P, another parent is selected, crossover and mutation are applied to produce $I_i'$ in W, and the better individual is kept; figure labels: "Select another parent", "Apply crossover and mutation", "better", "Pairwise random selection"]
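One possible reading of this model is sketched below. Several parts are assumptions made for illustration, not details taken from the slide: the crossover (keep the positions on which both parents agree, shuffle the rest) and the swap mutation are stand-in operators, the mate is drawn uniformly at random, the better of $I_i$ and $I_i'$ replaces $I_i$, and re-initialization replaces the whole sub-population with random permutations. All function names are hypothetical.

```python
import random

DIST = [[0, 5, 10, 2], [5, 0, 6, 3], [10, 6, 0, 4], [2, 3, 4, 0]]
FLOW = [[0, 21, 11, 44], [21, 0, 12, 30], [11, 12, 0, 9], [44, 30, 9, 0]]


def qap_cost(flow, dist, phi):
    n = len(phi)
    return sum(flow[i][j] * dist[phi[i]][phi[j]]
               for i in range(n) for j in range(n))


def crossover(a, b, rng):
    # Stand-in operator (assumption): keep genes on which both parents agree,
    # then fill the remaining positions with the unused values in random order.
    child = [x if x == y else None for x, y in zip(a, b)]
    missing = [x for x in a if x not in child]
    rng.shuffle(missing)
    fill = iter(missing)
    return [next(fill) if g is None else g for g in child]


def mutate(perm, rng):
    # Stand-in operator (assumption): swap two randomly chosen positions.
    i, j = rng.sample(range(len(perm)), 2)
    perm[i], perm[j] = perm[j], perm[i]
    return perm


def generation(P, flow, dist, rng, reinit_ratio=0.6):
    """One generation; returns the new pool and the best cost before any re-init."""
    N, n = len(P), len(P[0])
    # Working pool W: one offspring I_i' per individual I_i.
    W = []
    for i in range(N):
        mate = P[rng.choice([k for k in range(N) if k != i])]
        W.append(mutate(crossover(P[i], mate, rng), rng))
    # Pairwise comparison: keep the better of I_i and I_i'.
    P = [min(P[i], W[i], key=lambda s: qap_cost(flow, dist, s)) for i in range(N)]
    costs = [qap_cost(flow, dist, s) for s in P]
    best = min(costs)
    # Re-initialize when more than N * 0.6 individuals share the best value.
    if costs.count(best) > N * reinit_ratio:
        P = [rng.sample(range(n), n) for _ in range(N)]
    return P, best


rng = random.Random(1)
N = 8
P = [rng.sample(range(4), 4) for _ in range(N)]
initial_best = min(qap_cost(FLOW, DIST, s) for s in P)
best_so_far = initial_best
for _ in range(50):
    P, best = generation(P, FLOW, DIST, rng)
    best_so_far = min(best_so_far, best)
print(initial_best, best_so_far)
```

The loop over `i` is written sequentially here; on the GPU described in the talk, each individual would be handled in parallel within one MP.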
Implementation Details on GPUs
• 30 multiprocessors (MP 1 to MP 30), each with 16 KB of shared memory and its own TPs
• Problem size is assumed to be at most 56
• Sub-population size N = 128
• Each string is an array of unsigned char
• Each MP checks or sets a flag (Foundflag, initially 0) that indicates whether a solution was found
• Constant data for the QAP ($f_{ij}$, $d_{ij}$) are held in constant memory
[Figure: MP 1 to MP 30 with their shared memory and TPs, connected to VRAM, which holds Foundflag and the constant memory]
Experimental Conditions
  CPU                Intel Core i7 965
  GPU                NVIDIA GeForce GTX285 (240 procs, VRAM 1GB) × 2
  OS                 Windows XP
  Compiler           Visual Studio 2005 with /O2
  SDK                CUDA 2.3
  Number of runs     30
  Problem instances  tai25b, kra30a, kra30b, tai30b, kra32, tai35b, ste36b, tai40b, tai50b (from QAPLIB)
The run time gain obtained by p-block parallel runs relative to single-block runs
• The gain values differ from instance to instance
– They are in the range [10, 35] for p = 30 and [10, 70] for p = 60,
– and are nearly proportional to p, except for some instances
[Figure: bar chart of gain (vertical axis, 0 to 80) for 1 GPU (p = 30) and 2 GPUs (p = 60); horizontal axis: QAP instances tai25b, kra30a, kra30b, tai30b, kra32, tai35b, ste36b, tai40b, tai50b]
Run Time Estimation of Independent Parallel Run (1)
• Sequential run: run time $t$ has probability density $f(t)$ and cumulative distribution $F(t)$
• Parallel independent run with $p$ blocks: run time $t$ has probability density $g(p,t)$ and cumulative distribution $G(p,t)$
Run Time Estimation of Independent Parallel Run (2)
Run time of parallel independent run with $p$ blocks:
$$F(t) = \int_{0}^{t} f(\tau)\,d\tau$$
$$G(p,t) = 1 - (1 - F(t))^{p}$$
$$g(p,t) = \frac{d}{dt} G(p,t) = p\,(1 - F(t))^{p-1} f(t)$$
$$M(p) = \int_{0}^{\infty} t\, g(p,t)\,dt = \int_{0}^{\infty} t\, p\,(1 - F(t))^{p-1} f(t)\,dt$$
Run time with single block run:
$$M(1) = \int_{0}^{\infty} t\, f(t)\,dt$$
Gain obtained by parallel independent run with $p$ blocks:
$$Gain(p) = \frac{M(1)}{M(p)} = \frac{\int_{0}^{\infty} t\, f(t)\,dt}{\int_{0}^{\infty} t\, p\,(1 - F(t))^{p-1} f(t)\,dt}$$
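The gain formula can be evaluated numerically for any assumed run time distribution. The sketch below uses an exponential distribution with rate 1, which is an assumption made here for illustration and is not taken from the slides; for that choice the minimum of p runs is again exponential with rate p, so the gain should come out as p. The integral for M(p) is approximated with the midpoint rule on a truncated interval, and the function names are hypothetical.

```python
import math


def mean_parallel_run_time(f, F, p, t_max, steps=200000):
    """Midpoint-rule estimate of M(p) = integral of t*p*(1-F(t))^(p-1)*f(t) dt."""
    h = t_max / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += t * p * (1.0 - F(t)) ** (p - 1) * f(t)
    return total * h


# Assumption: exponential run time distribution with rate 1.
def f(t):
    return math.exp(-t)


def F(t):
    return 1.0 - math.exp(-t)


T_MAX = 50.0  # truncation point; the tail beyond it is negligible here
m1 = mean_parallel_run_time(f, F, 1, T_MAX)
gains = {p: m1 / mean_parallel_run_time(f, F, p, T_MAX) for p in (30, 60)}
for p, gain in gains.items():
    print(f"p = {p}: gain = {gain:.3f}")
```

Replacing `f` and `F` with a distribution fitted to measured single-block run times would give an estimate of the gain for a specific instance; a distribution with a nonzero minimum run time yields a gain below p.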
Run time distribution on a single block
[Figure: run time distributions, with time (sec) on the axis, for the QAP instances tai25b, kra30a, kra30b, tai30b, kra32, tai35b, ste36b, tai40b, tai50b]