Marwa A. Al-Shandawely PDC/KTH Algorithm overview. Trivial - PowerPoint PPT Presentation


  1. Marwa A. Al-Shandawely PDC/KTH

  2. Algorithm overview. Trivial parallelization. Problems. Sequential optimization. Proposed solutions. Experimental results. Conclusions and future work.

  3. for i = 1 to n-1
         find pivotPos in column i
         if pivotPos ≠ i
             exchange rows(pivotPos, i)
         end if
         for j = i+1 to n
             A(i,j) = A(i,j)/A(i,i)
         end for j
         !$omp parallel do private(i, j)
         for j = i+1 to n+1
             for k = i+1 to n
                 A(k,j) = A(k,j) - A(k,i)×A(i,j)
             end for k
         end for j
     end for i
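
The loop nest on slide 3 can be sketched in runnable form as follows. This is a minimal Python illustration, not the author's Fortran/OpenMP code: the function name `solve_gauss`, the augmented-matrix layout, the normalization of the right-hand side, and the back-substitution step are assumptions added so the example produces a solution. The comment marks the column loop that the slide annotates with the OpenMP directive.

```python
def solve_gauss(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows and b a list of n values; neither is modified.
    """
    n = len(a)
    # Augmented matrix [a | b]: column n holds the right-hand side.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]

    for i in range(n):
        # Pivot search: largest magnitude in column i at or below row i.
        pivot_pos = max(range(i, n), key=lambda r: abs(m[r][i]))
        if m[pivot_pos][i] == 0:
            raise ValueError("matrix is singular")
        if pivot_pos != i:
            m[i], m[pivot_pos] = m[pivot_pos], m[i]

        # Normalize the pivot row, including the right-hand side.
        pivot = m[i][i]
        for j in range(i + 1, n + 1):
            m[i][j] /= pivot

        # Elimination. Each column j is updated independently of the others;
        # this is the loop the slide parallelizes over j.
        for j in range(i + 1, n + 1):
            for k in range(i + 1, n):
                m[k][j] -= m[k][i] * m[i][j]

    # Back substitution; the diagonal is implicitly 1 after normalization.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
    return x
```

Note that the pivot search and row exchange sit outside the parallel loop, which matches the later observation that pivoting is done by a single thread.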

  4. [Figure: results plotted against nThreads (2 to 8), one series each for N=1000, N=2000, N=3000, N=4000, and N=5000.]

  5. Poor data locality. Pivoting is done by the master thread. Overheads of creating and destroying threads at each iteration. Sequential optimization.

  6. Replace division by the constant pivot with multiplication by its reciprocal. Avoid loop-invariant accesses in the innermost loop. Eliminate the check for the pivot changing position. Make use of Fortran array notation.
     Loop form:
         Do k=j+1,n
             A(k,j)=A(k,j)/A(j,j)
         End do
     Array form:
         C=1/A(j,j)
         A(j+1:n,j)=A(j+1:n,j)*C
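
The division-to-reciprocal rewrite on slide 6 can be illustrated with a small Python sketch. The function names `scale_by_division` and `scale_by_reciprocal` are hypothetical, and the slice assignment only stands in for Fortran array notation; it says nothing about the speedup the Fortran version obtains.

```python
def scale_by_division(col, j):
    """Divide every entry below position j by the pivot col[j]."""
    out = list(col)
    for k in range(j + 1, len(out)):
        # The pivot out[j] is re-read and a division is issued on every pass.
        out[k] = out[k] / out[j]
    return out


def scale_by_reciprocal(col, j):
    """Same scaling, with the loop-invariant reciprocal computed once."""
    out = list(col)
    c = 1.0 / out[j]                             # hoisted out of the loop
    out[j + 1:] = [v * c for v in out[j + 1:]]   # whole-slice update
    return out
```

The two forms can differ in the last bit for general inputs, since multiplying by a rounded reciprocal is not always identical to dividing.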

  7. Pivots array. Locks array. Pivot holder: ◦ Eliminate (i) on column (i+1) ◦ Search (i+1) ◦ Store pivot (i+1) position ◦ Prepare column (i+1) ◦ Free lock (i+1) ◦ Eliminate (i) on rest of scope. [Diagram: threads P1 to P4 with the pivots and locks arrays.]
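
The pivot-holder steps on slide 7 can be sketched with Python threads. Several things here are assumptions rather than details from the slides: columns are assigned to threads cyclically, `threading.Event` objects stand in for the locks array, the right-hand side is carried as an extra column, and the name `solve_pipelined` is hypothetical. CPython threads will not show a speedup for this loop; the sketch only shows the ordering of the steps.

```python
import threading


def solve_pipelined(a, b, n_threads=4):
    """Solve a x = b with column-wise elimination pipelined across threads."""
    n = len(a)
    # Column-major copy of [a | b]; column n is the right-hand side.
    cols = [[a[r][c] for r in range(n)] for c in range(n)] + [list(b)]
    pivots = [0] * n                               # pivots array
    ready = [threading.Event() for _ in range(n)]  # stands in for the locks array
    errors = []

    def prepare(i):
        # Search pivot i, store its position, scale the column, free the lock.
        col = cols[i]
        p = max(range(i, n), key=lambda r: abs(col[r]))
        col[i], col[p] = col[p], col[i]
        pivots[i] = p
        c = 1.0 / col[i]          # ZeroDivisionError if the matrix is singular
        for k in range(i + 1, n):
            col[k] *= c
        ready[i].set()

    def eliminate(i, j):
        # Apply pivot step i (row exchange, then update) to column j.
        src, dst = cols[i], cols[j]
        p = pivots[i]
        dst[i], dst[p] = dst[p], dst[i]
        x = dst[i]
        for k in range(i + 1, n):
            dst[k] -= src[k] * x

    def worker(t):
        try:
            if t == 0:
                prepare(0)                       # thread 0 owns column 0
            for i in range(n - 1):
                ready[i].wait()                  # wait until column i is prepared
                nxt = i + 1
                if nxt % n_threads == t:         # this thread is the pivot holder
                    eliminate(i, nxt)
                    prepare(nxt)
                for j in range(nxt + 1, n + 1):  # rest of this thread's scope
                    if j % n_threads == t:
                        eliminate(i, j)
        except Exception as exc:
            errors.append(exc)
            for event in ready:                  # unblock the other threads
                event.set()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    if errors:
        raise errors[0]

    # Back substitution on the upper triangle, stored as cols[c][r] with r <= c.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = cols[n][i] - sum(cols[j][i] * x[j] for j in range(i + 1, n))
        x[i] = s / cols[i][i]
    return x
```

Because the pivot holder releases column i+1 before finishing its own remaining columns, the other threads can start step i+1 while it is still working on step i.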

  8. [Figure: results plotted against nThreads (1 to 7), one series each for N=1000, N=2000, N=3000, N=4000, and N=5000.]

  9. The original algorithm requires pivot columns to be prepared in order while the whole matrix is accessed for each pivot column. For large input sizes, the cache is evicted many times in each iteration and there is no reuse of data in the cache. False sharing on the pivots and locks arrays.

  10. Double elimination on pivot holders. ◦ Knowledge of two pivots allows data reuse. Each column is an accumulation of eliminations using previous columns! ◦ Make more pivots available at each step and eliminate each column using several pivots while it is in the cache.
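
The idea on slide 10 of applying several pivots to a column during a single visit can be sketched sequentially. The name `solve_blocked` and the `block` parameter are hypothetical; `block=2` roughly corresponds to what the slide calls double elimination, and larger values to a block of pivots. Python lists do not reproduce cache behaviour, so this shows only the reordering of the updates, under the assumption that the right-hand side is carried as an extra column.

```python
def solve_blocked(a, b, block=2):
    """Solve a x = b, applying `block` pivots to each column per visit."""
    n = len(a)
    # Column-major copy of [a | b]; column n is the right-hand side.
    cols = [[a[r][c] for r in range(n)] for c in range(n)] + [list(b)]
    pivots = [0] * n

    def eliminate(i, j):
        # Apply pivot step i (row exchange, then update) to column j.
        src, dst = cols[i], cols[j]
        p = pivots[i]
        dst[i], dst[p] = dst[p], dst[i]
        x = dst[i]
        for k in range(i + 1, n):
            dst[k] -= src[k] * x

    for start in range(0, n, block):
        stop = min(start + block, n)

        # Prepare the pivot columns of this block one after another.
        for i in range(start, stop):
            for s in range(start, i):
                eliminate(s, i)              # earlier pivots of the same block
            col = cols[i]
            p = max(range(i, n), key=lambda r: abs(col[r]))
            col[i], col[p] = col[p], col[i]
            pivots[i] = p
            c = 1.0 / col[i]                 # ZeroDivisionError if singular
            for k in range(i + 1, n):
                col[k] *= c

        # Visit each remaining column once and apply every pivot of the block
        # before moving on, instead of sweeping the matrix once per pivot.
        for j in range(stop, n + 1):
            for i in range(start, stop):
                eliminate(i, j)

    # Back substitution on the upper triangle, stored as cols[c][r] with r <= c.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = cols[n][i] - sum(cols[j][i] * x[j] for j in range(i + 1, n))
        x[i] = s / cols[i][i]
    return x
```

With `block=1` the update order reduces to one pivot per sweep, which is the access pattern the preceding slide identifies as the source of poor cache reuse.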

  11. Block of pivots: ◦ Increase work/iter. ◦ Increase locality ◦ Fewer locks ◦ Load balancing?! [Diagram: threads P1 to P4 with the pivots array and locks.]

  13. [Figure: two panels, N=2000 and N=5000, horizontal axis 2 to 8; series: Original, C=1, C=2, C=3, C=4, C=5.]

  14. [Figure: two panels, N=2000 and N=5000, horizontal axis 2 to 8; series: Original, double elimination, C=25 with double elimination.]

  15. Scalable performance on multicores is highly dependent on application implementation, data layout, and access patterns. Cache and memory access optimization techniques are vital for performance despite the loss of readability. Future work: ◦ Adaptive blocking scheme that changes the block size as a function of the matrix size, cache settings, and number of cores.
