Toward a Core Design to Distribute an Execution on a Manycore Processor

PaCT'2015, Petrozavodsk, August 31 - September 4, 2015. Toward a Core Design to Distribute an Execution on a Manycore Processor. Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune. Université de Perpignan Via Domitia.


  1. PaCT'2015, Petrozavodsk, August 31 - September 4, 2015. Toward a Core Design to Distribute an Execution on a Manycore Processor. Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune. Université de Perpignan Via Domitia, DALI-LIRMM. 1 / 33

  2. Summary. 1 Parallelization of a C Code. 2 Automatic Hardware Parallelization. 3 Determinism. 4 Conclusion. 2 / 33

  3. Parallelization of a C Code. 3 / 33

  4. Example: a sum reduction.

         long sum(long t[], unsigned int n) {
           if (n == 1) return t[0];
           if (n == 2) return t[0] + t[1];
           return sum(t, n / 2) + sum(&(t[n / 2]), n - n / 2);
         }

     This code looks sequential. Let us parallelize it. 4 / 33
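As a minimal sketch of how the reduction above can be checked, the block below restates `sum` next to a plain loop. The `assert` on `n` and the helper `sum_loop` are assumptions added for illustration; they are not part of the slides. The guard documents that the recursion only terminates for `n >= 1`.

```c
#include <assert.h>

/* Recursive sum reduction, as on the slide. Requires n >= 1. */
long sum(long t[], unsigned int n) {
    assert(n >= 1); /* added guard: n == 0 would recurse forever */
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    /* The two halves touch disjoint parts of t. */
    return sum(t, n / 2) + sum(&t[n / 2], n - n / 2);
}

/* Hypothetical reference implementation, used only for comparison. */
long sum_loop(const long t[], unsigned int n) {
    long s = 0;
    for (unsigned int i = 0; i < n; i++)
        s += t[i];
    return s;
}
```

Calling `sum(t, n)` and `sum_loop(t, n)` on the same array should give the same value. Because the two recursive calls read disjoint halves of `t`, they are natural candidates for parallel execution.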

  8. What we do today: e.g. using pthreads.

         #include <pthread.h>

         typedef struct { unsigned long *p; unsigned long i; } ST;

         void *sum(void *st) {
           ST str1, str2;
           unsigned long s, s1, s2;
           pthread_t tid1, tid2;
           if (((ST *)st)->i > 2) {
             str1.p = ((ST *)st)->p;
             str1.i = ((ST *)st)->i / 2;
             pthread_create(&tid1, NULL, sum, (void *)&str1);
             str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
             str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
             pthread_create(&tid2, NULL, sum, (void *)&str2);
             /* no join: s1 and s2 are never set on this path */
           } else if (((ST *)st)->i == 1) {
             s1 = ((ST *)st)->p[0]; s2 = 0;
           } else {
             s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1];
           }
           s = s1 + s2;
           pthread_exit((void *)s);
         }

     The code is multithreaded. Thread executions are nondeterministically ordered. Too little synchronization => the result is not deterministic. 5 / 33

  11. Synchronized threads.

          #include <pthread.h>

          typedef struct { unsigned long *p; unsigned long i; } ST;

          void *sum(void *st) {
            ST str1, str2;
            unsigned long s, s1, s2;
            pthread_t tid1, tid2;
            if (((ST *)st)->i > 2) {
              str1.p = ((ST *)st)->p;
              str1.i = ((ST *)st)->i / 2;
              pthread_create(&tid1, NULL, sum, (void *)&str1);
              pthread_join(tid1, (void *)&s1);  /* waits before the second thread starts */
              str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
              str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
              pthread_create(&tid2, NULL, sum, (void *)&str2);
              pthread_join(tid2, (void *)&s2);
            } else if (((ST *)st)->i == 1) {
              s1 = ((ST *)st)->p[0]; s2 = 0;
            } else {
              s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1];
            }
            s = s1 + s2;
            pthread_exit((void *)s);
          }

      Among all the run orderings, the synchronization keeps only the good ones (i.e. those computing the same result as a sequential execution). Too much synchronization => not parallel enough. 6 / 33

  12. Correctly synchronized threads.

          #include <pthread.h>

          typedef struct { unsigned long *p; unsigned long i; } ST;

          void *sum(void *st) {
            ST str1, str2;
            unsigned long s, s1, s2;
            pthread_t tid1, tid2;
            if (((ST *)st)->i > 2) {
              str1.p = ((ST *)st)->p;
              str1.i = ((ST *)st)->i / 2;
              pthread_create(&tid1, NULL, sum, (void *)&str1);
              str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
              str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
              pthread_create(&tid2, NULL, sum, (void *)&str2);
              /* both threads are running before either join */
              pthread_join(tid1, (void *)&s1);
              pthread_join(tid2, (void *)&s2);
            } else if (((ST *)st)->i == 1) {
              s1 = ((ST *)st)->p[0]; s2 = 0;
            } else {
              s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1];
            }
            s = s1 + s2;
            pthread_exit((void *)s);
          }

      7 / 33
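The fork-then-join pattern of the correctly synchronized version can be wrapped into a function that is callable from ordinary code. This is a sketch under stated assumptions, not the authors' implementation: the names `sum_thread` and `parallel_sum` are hypothetical, results are passed through `uintptr_t` (assuming an `unsigned long` value fits in a pointer on the target platform), and thread-creation failures simply abort through `assert`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef struct { unsigned long *p; unsigned long i; } ST;

/* Thread body: sums st->i elements starting at st->p (requires i >= 1). */
static void *sum_thread(void *arg) {
    ST *st = arg;
    unsigned long s1, s2;
    if (st->i > 2) {
        ST left  = { st->p, st->i / 2 };
        ST right = { st->p + st->i / 2, st->i - st->i / 2 };
        pthread_t tid1, tid2;
        void *r1, *r2;
        /* Fork both halves first so that they can run concurrently. */
        int rc = pthread_create(&tid1, NULL, sum_thread, &left);
        assert(rc == 0);
        rc = pthread_create(&tid2, NULL, sum_thread, &right);
        assert(rc == 0);
        /* Join afterwards: left and right stay alive until both finish. */
        pthread_join(tid1, &r1);
        pthread_join(tid2, &r2);
        s1 = (unsigned long)(uintptr_t)r1;
        s2 = (unsigned long)(uintptr_t)r2;
    } else if (st->i == 1) {
        s1 = st->p[0]; s2 = 0;
    } else {
        s1 = st->p[0]; s2 = st->p[1];
    }
    /* Returning from the thread function acts like pthread_exit. */
    return (void *)(uintptr_t)(s1 + s2);
}

/* Hypothetical entry point: runs the reduction in a root thread. */
unsigned long parallel_sum(unsigned long *p, unsigned long n) {
    ST root = { p, n };
    pthread_t tid;
    void *r;
    int rc = pthread_create(&tid, NULL, sum_thread, &root);
    assert(rc == 0);
    pthread_join(tid, &r);
    return (unsigned long)(uintptr_t)r;
}
```

A call such as `parallel_sum(t, 10)` on an array holding 1 to 10 should return 55 on every run, since each partial result is read only after its thread has been joined. One thread per tree node is far too many for large arrays; the sketch only illustrates the synchronization structure.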

  14. What we propose to do.

          long sum(long t[], unsigned int n) {
            if (n == 1) return t[0];
            if (n == 2) return t[0] + t[1];
            return sum(t, n / 2) + sum(&(t[n / 2]), n - n / 2);
          }

      This code is usually understood as sequential. 8 / 33
