The Three Beaars
Basically, Every Array Allocation Reduces Speed

Robert Bernecky
Snake Island Research Inc
18 Fifth Street, Wards Island, Toronto, Canada
bernecky@snakeisland.com   tel: +1 416 203 0854
Dyalog '12, October 10, 2012


  1. Why are arrays expensive? Consider the cost to perform Z←X+Y (a C sketch of where these ops go follows):
     ◮ (Interpreter) Parse code to find expression: 200 ops
     ◮ Increment reference counts on X, Y: 50 ops
     ◮ Conformance checks (type, rank, shape) for addition: 200 ops
     ◮ Allocate temp for result from heap: 200 ops
     ◮ Initialize temp: 100 ops
     ◮ Perform actual additions: 200 ops
     ◮ Decrement reference counts on X, Y: 50 ops
     ◮ Deallocate old Z, if any: 100 ops
     ◮ Assign Z←temp: 50 ops
     ◮ TOTAL: 1150 ops
     ◮ vs. compiled scalar code: 10 ops
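
     To make the bookkeeping concrete, here is a minimal C sketch of the two paths.
     The Array descriptor, its fields, and the helper are illustrative assumptions,
     not Dyalog internals.

       #include <stdlib.h>

       /* Hypothetical interpreter-style array descriptor (not Dyalog's). */
       typedef struct { int refcount; int rank; int shape[1]; double *data; } Array;

       /* Interpreted-style add: every call pays reference counting,
        * conformance checking, and a heap-allocated temporary. */
       Array *interp_add(Array *x, Array *y) {
           x->refcount++; y->refcount++;                 /* protect operands   */
           if (x->rank != y->rank ||                     /* conformance checks */
               x->shape[0] != y->shape[0]) return NULL;
           Array *z = malloc(sizeof *z);                 /* heap temp          */
           z->refcount = 1; z->rank = x->rank; z->shape[0] = x->shape[0];
           z->data = malloc((size_t)z->shape[0] * sizeof(double));
           for (int i = 0; i < z->shape[0]; i++)         /* the actual adds    */
               z->data[i] = x->data[i] + y->data[i];
           x->refcount--; y->refcount--;                 /* release operands   */
           return z;
       }

       /* Compiled scalar code: the same semantics collapse to one add. */
       static inline double compiled_add(double x, double y) { return x + y; }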

  2. Baby Bear - Eliminating Scalar Arrays in a Compiler
     ◮ Use classical static data flow analysis to find scalars
     ◮ Traditional optimization methods: CSE, VP, CP, etc.
     ◮ Allocate scalars on stack, instead of heap (see the before/after sketch following this list)
     ◮ Generate scalar-specific code
     ◮ This is challenging to do in an interpreter
     ◮ Experimental platform: AMD 1075T 6-core CPU, 3.2GHz (cheap ASUS M4A88T-M desktop machine)
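
     A minimal before/after C sketch of what scalar elimination buys; the
     make_scalar_array and free_array helpers are hypothetical, and interp_add
     is the illustrative sketch from slide 1.

       /* Before: a rank-0 "scalar" lives in a heap-allocated array, so
        * n+1 pays refcounts, conformance checks, and a heap temporary. */
       Array *incr_as_array(Array *n) {
           Array *one = make_scalar_array(1.0);   /* hypothetical helper */
           Array *sum = interp_add(n, one);       /* heap temp           */
           free_array(one);                       /* hypothetical helper */
           return sum;
       }

       /* After scalar replacement: the value is a stack slot and the
        * increment compiles to a single machine instruction. */
       double incr_as_scalar(double n) {
           return n + 1.0;                        /* stack, no refcounts */
       }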

  3. Baby Bear Problem: Floyd-Warshall Algorithm (a C rendering of the loop nest follows)

       z←floyd D;i;j;k
       siz←⍳(⍴D)[0]
       :For k :In siz
         :For i :In siz
           :For j :In siz
             D[i;j]←D[i;j]⌊D[i;k]+D[k;j]
           :EndFor
         :EndFor
       :EndFor
       z←D

     ◮ Problem size: 3000x3000 graph
     ◮ Dyalog APL, J interpreters: one week-ish; APEX/SAC: 103 sec
     ◮ Dynamic programming (string shuffle): >1000X speedup
     ◮ Lesson: Interpreters dislike scalar-dominated algorithms
     ◮ Lesson: Compilers are not fussy; Baby Bear problem solved!
     ◮ But, no parallelism: Adding threads just makes it slower!
     ◮ What about array-based solutions? It's Papa Bear time!
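
     For readers who don't speak APL control structures, the same scalar loop
     nest as a self-contained C sketch (flat row-major storage is an assumption
     of the sketch):

       #include <stddef.h>

       /* Floyd-Warshall shortest paths on an n x n distance matrix,
        * stored row-major in a flat array: D[i;j] gets the minimum of
        * itself and D[i;k]+D[k;j]. */
       void floyd(double *D, size_t n) {
           for (size_t k = 0; k < n; k++)
               for (size_t i = 0; i < n; i++)
                   for (size_t j = 0; j < n; j++) {
                       double thru_k = D[i*n + k] + D[k*n + j];
                       if (thru_k < D[i*n + j])
                           D[i*n + j] = thru_k;   /* take the minimum */
                   }
       }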

  4. Array-based Floyd-Warshall Algorithm
     ◮ j64-602, from J Essays (CDC STAR APL algorithm variant):

       floyd=: 3 : ('for_j. i.#y';'do. y=. y <. j ({"1 +/ {) y';'end.';'y')

     ◮ SAC: Scholz & Bernecky (classic matmul variant):

       inline int[.,.] floydSbs1( int[.,.] D)
       {
         DT = transpose( D);
         res = with {
                 ( . <= [i,j] <= .) : min( D[i,j], minval( D[i] + DT[j]));
               } : modarray( D);
         return( res);
       }

     ◮ A "with-loop" is a nested data-parallel FORALL loop

  5. Array-based Floyd-Warshall Algorithm Speedup
     ◮ Figure: Floyd-Warshall Performance — APEX/SAC (18091) vs. J, 3000x3000 shortest-path benchmark (2012-07-20). Speedup (0x to 45x) for the FOR-loop, sbs/rbe-WLF, and sbs/rbe-AWLF algorithms at -mt 1 through -mt 6; J (J Essays algorithm): 306s; APEX/SAC: 103s for the FOR-loop version, improving to 7.6s at best.
     ◮ Lesson: Array-based code and optimizers are good for you

  6. Loop Fusion
     ◮ Z = V1 + (V2 * V3)

       for( i=0; i<n; i++) { tmp[i] = V2[i] * V3[i]; }
       for( j=0; j<n; j++) { Z[j] = V1[j] + tmp[j]; }

     ◮ Loop fusion transforms this into:

       for( j=0; j<n; j++) { Z[j] = V1[j] + (V2[j] * V3[j]); }

     ◮ Benefit: Array-valued tmp removed (DCR)
     ◮ Benefit: Reduced memory subsystem traffic
     ◮ Benefit: Reduced loop overhead
     ◮ Benefit: Improved parallelism, in some compilers

  7. With-Loop Folding (WLF) and Algebraic With-Loop Folding (AWLF)
     ◮ WLF (S.B. Scholz) - a generalization of loop fusion
     ◮ Handles Arrays of Known Shape (AKS) only
     ◮ AWLF (R. Bernecky)
     ◮ Handles AKS arrays & Arrays of Known Dimension (AKD)
     ◮ Acoustic signal processing (delta modulation); a fused C rendering follows:

       logd←{¯50⌈50⌊50×(DIFF 0,⍵)÷0.01+⍵}
       DIFF←{¯1↓⍵-¯1⌽⍵}
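
     As a plain-C picture of what WLF/AWLF aim for, here is the whole logd
     pipeline fused into a single pass; I assume DIFF amounts to first
     differences of the zero-prepended signal, so each output needs only the
     current and previous input elements.

       #include <stddef.h>

       /* Fused logd: clamp( 50 * diff / (0.01 + x)) computed in one pass,
        * with no intermediate arrays for DIFF, catenate, *, /, or clamps. */
       void logd(double *z, const double *x, size_t n) {
           double prev = 0.0;                    /* the "0," prefix */
           for (size_t i = 0; i < n; i++) {
               double d = x[i] - prev;           /* DIFF of 0,x     */
               double v = 50.0 * d / (0.01 + x[i]);
               if (v >  50.0) v =  50.0;         /* upper clamp, 50⌊  */
               if (v < -50.0) v = -50.0;         /* lower clamp, ¯50⌈ */
               z[i] = v;
               prev = x[i];
           }
       }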

  8. WLF/AWLF example: Acoustic Signal Processing
     ◮ logd on 200E6-element double-precision vector
     ◮ Elapsed times: APL 10.7 sec; sac2c -nowlf -O3: 5.5 sec serial, 3.2 sec parallel (-mt 6); sac2c -doawlf -O3: 0.7 sec parallel (-mt 6)
     ◮ Speedup over APL: 1.9X serial / 3.3X parallel without WLF; 7.8X serial / 15X parallel with AWLF
     ◮ Sixteen with-loops are folded into two WLs!
     ◮ WLF/AWLF increase available parallelism

  9. With-Loop Scalarization (WLS)
     ◮ With-Loop Scalarization: (C. Grelck, S.B. Scholz, K. Trojahner)
     ◮ Operates on nested WLs in which the inner loop creates non-scalar cells
     ◮ WLS merges loop-nest pairs, forming a single WL:

       A = with ([0] <= iv < [4]) {
             B = with ([0] <= jv < [4])
                   genarray( [4], iv[0] + 2 * jv[0]);
           }
           genarray( [4], B);

     ◮ WLS transforms this into:

       A = with ([0,0] <= iv < [4,4])
             genarray( [4,4], iv[0] + 2 * iv[1]);

     ◮ Mandatory for good performance: array-valued temps removed

 10. WLF/AWLF/WLS example: Poisson 2-D Relaxation Kernel
     ◮ From Sven-Bodo Scholz: With-Loop-Folding in SAC (a C sketch of the folded loop follows):

       z←relax A;m;n
       m←(⍴A)[0]
       n←(⍴A)[1]
       B←((1⊖A)+(¯1⊖A)+(1⌽A)+(¯1⌽A))÷4
       upperA←(1,n)↑A
       lowerA←((m-1),0)↓A
       leftA←1 0↓((m-1),1)↑A
       rightA←((m-2),1)↑(1,n-1)↓A
       innerB←((m-2),n-2)↑1 1↓B
       middle←leftA,innerB,rightA
       z←upperA⍪middle⍪lowerA

     ◮ A good argument for Ken Iverson's mask verb!
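
     For orientation, a C sketch of the single loop nest that AWLF plus WLS fold
     relax into: interior points get the four-neighbour average, border points
     are copied through (out-of-place buffers are an assumption of the sketch).

       #include <stddef.h>

       /* One relaxation sweep over an m x n row-major grid: the single
        * fused loop that the folded with-loop corresponds to. */
       void relax(double *z, const double *a, size_t m, size_t n) {
           for (size_t i = 0; i < m; i++)
               for (size_t j = 0; j < n; j++) {
                   if (i == 0 || i == m-1 || j == 0 || j == n-1)
                       z[i*n + j] = a[i*n + j];     /* border: copied (mask) */
                   else
                       z[i*n + j] = ( a[(i-1)*n + j] + a[(i+1)*n + j]
                                    + a[i*n + j-1]  + a[i*n + j+1] ) / 4.0;
               }
       }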

 11. Poisson 2-D Relaxation: Multi-thread, Various Grid Sizes
     ◮ AWLF, aided by WLS, folds the relax function into 1 loop!
     ◮ 20K iterations, 250x250 grid: Dyalog APL: CPU time = 47.4s
     ◮ APEX/SAC 18091, single-thread: CPU time = 3.65s
     ◮ APEX/SAC 18091: multi-threaded (no source code changes!)

 12. Poisson 2-D Relaxation: Multi-thread, Various Grid Sizes
     ◮ Figure: APEX vs. APL CPU time performance — APEX/SAC (18091) vs. Dyalog APL 13.0 (2012-07-19), 6-core AMD Phenom II X6 1075T, 3.2GHz. Speedup (0x to 70x) at -mt 1 through -mt 6, for problem sizes 250x250/20K iterations, 500x500/20K, and 10Kx10K/100; APL times 47.4s, 204s, and 418s for the three sizes; APEX times range from 3.65s single-threaded down to 0.83s.

 13. Poisson 2-D Relaxation: Memory footprint
     ◮ Why poor speedup on the 10Kx10K test?
     ◮ Dyalog APL 13.0, 10Kx10K grid: 8.5GB footprint
     ◮ APEX/SAC 18091, 10Kx10K grid: 3.4GB footprint
     ◮ Memory subsystem bandwidth: 4464MB/s
     ◮ Grid is 800MB → 4464÷800 ≈ 5 full-grid transfers to/from memory per second
     ◮ Therefore, speedup is eventually memory-limited on a cheapo system
     ◮ Scholz sees linear speedup on a 48-core system
     ◮ Lesson: High memory bandwidth is good for you.
     ◮ Lesson: Array optimizations are VERY good for you.

 14. Mama Bear Motivation
     ◮ Why is interpreted APL faster than compiled code for some tests?
     ◮ Figure: APL vs. APEX CPU time performance (2012-09-15) — speedup (APL/APEX w/AWLF, log scale 0.1 to 1,000; higher is better for APEX), Dyalog APL 13.0 vs. SAC 18,221:MODIFIED, across ~50 AKS benchmarks (buildv, compiota, csbench, floyd, histop, ipdd, logd, poisson, primes, ulam, unirand, the upgrade/downgrade family, waver, and others).

 15. Mama Bear Motivation
     Some reasons for poor performance of compiled SAC code:
     ◮ Index vector generation for indexed assign
     ◮ Shape vector generation for variable result shapes
     ◮ Generation of small arrays, e.g., complex scalars
     ◮ No SaC FOR-loop analog to with-loop

 16. Mama Bear - Small Array Scalarization
     ◮ Replace small arrays by their scalarized form
     ◮ Optimization: Primitive Function Unrolling (Classic)
     ◮ Optimization: Index Vector Elimination (IVE) (sacdev)
     ◮ Optimizations: LS, LACSI, LACSO (S.B. Scholz, R. Bernecky)
     ◮ 2–16X speedup observed

 17. Mama Bear - Small Array Scalarization
     ◮ Mandelbrot set computation performance (a runnable C comparison of the two kernels follows)
     ◮ mandelbrot: Uses complex numbers:

       int calc( complex z, int maxdepth)
       { . . .
         while( real(z)*real(z) + imag(z)*imag(z) <= 4.0) . . .

     ◮ Complex scalars, under the covers:
         complex z ↔ double(2) z
         real(z)   ↔ z[0]
         imag(z)   ↔ z[1]
     ◮ mandelbrot_opt: Hand-scalarized - pair of scalars:

       int calc( double zr, double zi, int maxdepth)
       { . . .
         while( zr * zr + zi * zi <= 4.0) . . .
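
     A self-contained C analogue of the two kernels; the escape-time loop body
     and the two-element-array representation of a complex value are
     illustrative assumptions, not the SAC source.

       /* "mandelbrot": the complex value kept as a 2-element array,
        * mimicking the representation the compiler must scalarize. */
       int calc_array(const double c[2], int maxdepth) {
           double z[2] = { 0.0, 0.0 };
           int d = 0;
           while (d < maxdepth && z[0]*z[0] + z[1]*z[1] <= 4.0) {
               double zr = z[0]*z[0] - z[1]*z[1] + c[0];  /* real(z*z+c) */
               z[1] = 2.0 * z[0] * z[1] + c[1];           /* imag(z*z+c) */
               z[0] = zr;
               d++;
           }
           return d;
       }

       /* "mandelbrot_opt": hand-scalarized into a pair of doubles. */
       int calc_scalar(double cr, double ci, int maxdepth) {
           double zr = 0.0, zi = 0.0;
           int d = 0;
           while (d < maxdepth && zr*zr + zi*zi <= 4.0) {
               double t = zr*zr - zi*zi + cr;
               zi = 2.0 * zr * zi + ci;
               zr = t;
               d++;
           }
           return d;
       }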

 18. Mama Bear - Small Array Scalarization
     ◮ Execution times, with LS, LACSI, LACSO opts enabled/disabled:

       Test            Opts  -mt 1     -mt 2    -mt 3    -mt 4    -mt 5    -mt 6
       mandelbrot      off   1508.9s   956.0s   828.7s   676.8s   655.7s   635.2s
       mandelbrot      on    69.9s     46.1s    34.6s    28.1s    23.0s    19.8s
       mandelbrot_opt  off   71.8s     48.4s    35.2s    28.1s    23.0s    21.9s
       mandelbrot_opt  on    70.7s     46.7s    34.7s    28.2s    22.9s    19.6s

     ◮ Lesson: No more suffering for being elegant
     ◮ Well, less suffering for being elegant . . .

 19. GPU (CUDA) Support Without Suffering
     ◮ SaC generates CUDA code automatically: -target cuda
     ◮ Physics experiment:
     ◮ Figure: Lattice Boltzmann, CUDA vs. SaC speedups (8800GT) — speedup (0 to 70) vs. problem size (256 to 1536) for runs of 10, 25, 50, 100, and 200 steps.

 20. Goldilocks - Nested Arrays in APEX/SAC
     ◮ Nested arrays are alive and living in SAC! (R. Douma)
     ◮ APL convolution kernel using EACH:

       convn←{fi←⍺ ⋄ (⍳⍴⍵)con¨⊂⍵}
       con←{fi+.×(⍴fi)↑⍺↓⍵}

     ◮ SAC convolution kernel using EACH:

       nested double NDS;
       nested double[.] NDV;
       pt = trace ++ (filter * 0.0);   NB. No overtake in SAC
       z = convn( iota( shape( tr)[0]), fi, enclose NDV( pt));
       convn: z = with {
                ( . <= iv <= .) : con( dc[iv], fi, disclose NDV( tr));
              } : genarray( shape( dc), 0.0);
       con: matmul( fi, take( shape( fi), drop( [dc], tr)))

     ◮ Performance is so-so: Optimistic optimizations required

 21. Summary and Future Work
     ◮ Status:

       Bear  Array size  Optimizers   Serial speedup  Parallel speedup
       Baby  scalars     mature       up to 1300X     none
       Mama  small       developing   up to 20X       enables other opts
       Papa  large       nearly done  2X-50X          up to 10X

     ◮ All optimizations are critical for getting excellent performance
     ◮ Array-based algorithms will win, and scale well
     ◮ Nested arrays: APEX, SAC both require work
     ◮ Small arrays: Needs scalarized index-vector-to-offset primitive (see the sketch below)
     ◮ Small arrays: Perhaps (likely!), additional work will be needed
     ◮ And, they lived more or less happily ever after! Thank you!
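
     For context on that wished-for primitive, a minimal C sketch of
     index-vector-to-offset conversion (row-major; the name vect2offset is
     mine): once the index vector's elements are scalarized into registers,
     no index array need ever be materialized.

       #include <stddef.h>

       /* Row-major offset of index vector iv into an array of shape shp,
        * in Horner form: ((iv[0]*shp[1] + iv[1])*shp[2] + ...) + iv[rank-1]. */
       size_t vect2offset(size_t rank, const size_t *iv, const size_t *shp) {
           size_t off = 0;
           for (size_t d = 0; d < rank; d++)
               off = off * shp[d] + iv[d];   /* shp[0] never multiplies in */
           return off;
       }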
