by their fruits shall ye know them

  1. BY THEIR FRUITS SHALL YE KNOW THEM: A DATA ANALYST’S PERSPECTIVE ON MASSIVELY PARALLEL SYSTEM DESIGN. Holger Pirk, Sam Madden, Mike Stonebraker

  2. A CRUCIAL DISTINCTION ≠

  3. INSPIRATION

  4. MY PLEDGE OF LOYALTY

  5. SCIENTIFIC RATIONALE

  6. GENE AMDAHL TAUGHT US THAT SYSTEMS NEED TO BE BALANCED [Plot: processed instructions per second (1T to 1E) against processed bytes per instruction (1 to 100); line labeled "Processing 500 GB/s".]
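
The balance argument on this slide comes down to simple arithmetic: instructions per second times bytes per instruction gives the memory bandwidth a device needs to stay busy. The sketch below is illustrative only. The function names are hypothetical, the 500 GB/s figure is the plot label as read from the slide, and the 4 bytes per instruction is an arbitrary example value.

```python
def required_bandwidth(instructions_per_second, bytes_per_instruction):
    """Bytes per second the memory system must deliver to keep the cores busy."""
    return instructions_per_second * bytes_per_instruction


def sustainable_instruction_rate(bandwidth_bytes_per_second, bytes_per_instruction):
    """Instruction rate that a given memory bandwidth can feed."""
    return bandwidth_bytes_per_second / bytes_per_instruction


TERA = 10**12
GIGA = 10**9

# One tera-instruction per second at 4 bytes per instruction needs 4 TB/s.
print(required_bandwidth(1 * TERA, 4) / TERA, "TB/s required")

# A 500 GB/s memory system feeds 125 G instructions/s at 4 bytes each.
print(sustainable_instruction_rate(500 * GIGA, 4) / GIGA, "G instructions/s sustainable")
```

A device sitting far above the bandwidth line in the plot can execute more instructions than its memory system can feed, so it is out of balance in Amdahl's sense.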

  7. NVIDIA AND AMD PROCESS LOTS OF SMALL DATA WORDS [Plot: processed instructions per second (1T to 1E) against processed bytes per instruction (1 to 100); points for AMD and Nvidia; line labeled "Processing 500 GB/s".]

  8. SIMT [Diagram labels: Instruction Scheduler, SIMT Cores, Memory.]

  9. INTEL PROCESSES FEWER LARGE DATA WORDS [Plot: processed instructions per second (1T to 1E) against processed bytes per instruction (1 to 100); points for AMD, Nvidia and Intel; line labeled "Processing 500 GB/s".]

  10. MANY-CORE SIMD [Diagram labels: Pentium Cores, SIMD Core (repeated), 512 Bits, Memory.]

  11. SIMD WITH SCATTER/GATHER [Diagram labels: SIMD Core (repeated), Scatter/Gather Unit, Memory.]

  12. ALL OF THEM CAN PROCESS WAY MORE DATA THAN THEY CAN LOAD [Plot: processed instructions per second (1T to 1E) against processed bytes per instruction (1 to 100); points for AMD, Nvidia and Intel; line labeled "Processing 500 GB/s".]

  13. SPEC BANDWIDTH-WISE, PHI OUTPERFORMS CURRENT GPUS [Bar chart: memory bandwidth in GB/s (axis 0 to 400) for Phi and GTX 780.]

  14. OUR QUESTION: DOES IT MATTER? DOES PHI CHANGE ANYTHING? [Plot: processed instructions per second (1T to 1E) against processed bytes per instruction (1 to 100); points for AMD, Nvidia and Intel; line labeled "Processing 500 GB/s".]

  15. THE OBSTACLE COURSE

  16. DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS [Diagram labels: Ɣ, π, Facts, Dimension. Chokepoints: Synchronization, Computation, Bandwidth, Capacity.]

  17. DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS [Diagram labels: Ɣ, π, Facts, Dimension. Parameters: Hash Complexity, # of Conflicts, Tuple Width, Access Locality.]

  18. PHI VS. GTX 780

  19. FIRST CHOKEPOINT: BANDWIDTH [Diagram labels: Ɣ, π, Facts, Dimension.]

  20. BANDWIDTH OF PHI LOOKS SIMILAR TO GPU AT FIRST GLANCE [Plot: time per access in ns (0.04 to 1.28) against stride in bytes (4 to 512), for GTX 780 and Xeon Phi.]

  21. A SECOND GLANCE REVEALS SOMETHING ODD… [Same plot of time per access in ns against stride in bytes for GTX 780 and Xeon Phi, annotated: "A Non-Linear Cost Function".]

  22. A SECOND GLANCE REVEALS SOMETHING ODD… [Same plot, annotated: "Not Dominated (only) by Cache Misses".]
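
As a point of comparison for the stride plots, the sketch below computes what a cache-miss-only model would predict: the average number of new cache lines fetched per access in a strided scan. It is a toy model, not the benchmark behind the slides. The function name is hypothetical and the 64-byte line size is an assumption; real devices differ in line and sector sizes.

```python
def lines_touched_per_access(stride_bytes, line_bytes=64, accesses=4096):
    """Average number of distinct cache lines fetched per access in a strided scan."""
    # Access i reads the byte offset i * stride_bytes; integer division maps
    # that offset to the index of the cache line containing it.
    touched = {(i * stride_bytes) // line_bytes for i in range(accesses)}
    return len(touched) / accesses


# Same stride values as the x-axis of the plots (4 to 512 bytes).
for stride in (4, 8, 16, 32, 64, 128, 256, 512):
    print(stride, lines_touched_per_access(stride))
```

In this model the cost per access rises in proportion to the stride until the stride reaches the line size and stays constant afterwards. A measured curve that departs from that shape points to a cost that cache misses alone do not explain, which is the observation the slide annotations make about the Phi.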

  23. SECOND CHOKEPOINT: CAPACITY [Diagram labels: Ɣ, π, Facts, Dimension.]

  24. PHI BENEFITS FROM LARGER CACHES [Plot: time per access in ns (0.02 to 1.28) against size of lookup table in bytes (64 to 16M); series: GTX 780, Xeon Phi, Xeon Phi Lower Bound, GTX 780 Lower Bound.]

  25. THIRD CHOKEPOINT: COMPUTATION [Diagram labels: Ɣ, π, Facts, Dimension.]

  26. COMPUTATION PERFORMANCE IS VERY SIMILAR… [Plot: time per hash in ns (0.05 to 0.80) against number of Murmur rehashes (1 to 32), for Xeon Phi and GTX 780.]
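
The computation benchmark varies the number of Murmur rehashes per value. The slides do not say which Murmur variant was used, so the sketch below assumes the 32-bit finalizer of MurmurHash3 as the hash step; `rehash` is a hypothetical helper that applies it repeatedly to scale the arithmetic work per key without touching more memory.

```python
MASK32 = 0xFFFFFFFF


def fmix32(h):
    """MurmurHash3 32-bit finalizer (assumed variant, not confirmed by the slides)."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def rehash(key, rounds):
    """Apply the finalizer `rounds` times to vary the compute cost per key."""
    h = key
    for _ in range(rounds):
        h = fmix32(h)
    return h


# Same rehash counts as the x-axis of the plot (1 to 32).
for rounds in (1, 2, 4, 8, 16, 32):
    print(rounds, hex(rehash(42, rounds)))
```

Because each round depends on the result of the previous one, the work per key grows with the number of rounds while the memory traffic stays fixed, which isolates the computation chokepoint.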

  27. FOURTH CHOKEPOINT: SYNCHRONIZATION [Diagram labels: Ɣ, π, Facts, Dimension.]

  28. …AND SO IS HASH-BUILDING [Plot: time per access in ns (0.0 to 15.0) against number of values per bucket (1 to 31), for GTX 780 and Xeon Phi.]
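
The hash-building plot varies the number of values per bucket, which controls how often inserts collide. The sequential sketch below only counts those collisions. It assumes a chained table and a plain modulo hash, neither of which the slides specify, and all names are hypothetical. On parallel hardware each colliding insert would need some form of synchronization, which is presumably what the benchmark stresses.

```python
from collections import defaultdict


def build_hash_table(keys, num_buckets):
    """Insert keys into chained buckets; count inserts that hit a non-empty bucket."""
    buckets = defaultdict(list)
    conflicts = 0
    for key in keys:
        slot = key % num_buckets  # hypothetical hash function: plain modulo
        if buckets[slot]:
            conflicts += 1
        buckets[slot].append(key)
    return buckets, conflicts


def keys_with_bucket_load(num_buckets, values_per_bucket):
    """Keys chosen so that every bucket receives exactly `values_per_bucket` values."""
    return list(range(num_buckets * values_per_bucket))


# Odd bucket loads, a subset of the plot's x-axis values.
for load in (1, 3, 5, 7):
    _, conflicts = build_hash_table(keys_with_bucket_load(16, load), 16)
    print(load, conflicts)
```

With this key layout the number of conflicting inserts is `num_buckets * (values_per_bucket - 1)`, so the conflict rate per insert rises steadily with the bucket load.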

  29. RECAP
      • Phi & GPU mostly on par in
        • Computation
        • Synchronization
        • Cache-Utilization
      • But what is up with the memory access?

  30. PHI IN DEPTH

  31. SCATTER/GATHER

  32. LET’S LOOK AT THE DOCUMENTATION

      CHAPTER 6. INSTRUCTION DESCRIPTIONS

      VGATHERDPD - Gather Float64 Vector With Signed Dword Indices

      Opcode: MVEX.512.66.0F38.W1 92 /r /vsib
      Instruction: vgatherdpd zmm1 {k1}, U_f64(mv_t)
      Description: Gather float64 vector U_f64(mv_t) into float64 vector zmm1 using doubleword indices and k1 as completion mask.

      Description

      A set of 8 memory locations pointed by base address BASE_ADDR and doubleword index vector VINDEX with scale SCALE are converted to a float64 vector. The result is written into float64 vector zmm1.

      Note the special mask behavior as only a subset of the active elements of write mask k1 are actually operated on (as denoted by function SELECT_SUBSET). There are only two guarantees about the function: (a) the destination mask is a subset of the source mask (identity is included), and (b) on a given invocation of the instruction, at least one element (the least significant enabled mask bit) will be selected from the source mask.

      Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero).

      Note that accessed element by will always access 64 bytes of memory. The memory region accessed by each element will always be between element_linear_address & (~0x3F) and (element_linear_address & (~0x3F)) + 63 boundaries.

      This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element before up-conversion.

      Note also the special mask behavior as the corresponding bits in write mask k1 are reset with each destination element being updated according to the subset of write mask k1. This is useful to allow conditional re-trigger of the instruction until all the elements from a given write mask have been successfully loaded.

      The instruction will #GP fault if the destination vector zmm1 is the same as index vector VINDEX.

      Operation [pseudocode not recoverable]

      Page 297. Reference Number: 327364-001
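
The re-execution loop that the documentation asks for can be simulated in a few lines. The documentation only guarantees that each invocation handles a subset of the active elements that includes the least significant one. The sketch below assumes one plausible SELECT_SUBSET that is consistent with those guarantees and with the 64-byte region rule: each invocation serves every active element that falls into the same 64-byte line as the least significant active element. That choice, the dictionary used as memory and all function names are assumptions made for illustration.

```python
LINE_MASK = ~0x3F  # clears the low 6 bits: start address of the 64-byte region


def gather_step(memory, base, indices, scale, mask, dest):
    """One simulated invocation: serve the elements of one 64-byte line, return the new mask."""
    active = [i for i in range(len(indices)) if (mask >> i) & 1]
    if not active:
        return mask
    # Guarantee (b): the least significant enabled element is always selected.
    line = (base + indices[active[0]] * scale) & LINE_MASK
    for i in active:
        addr = base + indices[i] * scale
        if (addr & LINE_MASK) == line:  # assumed SELECT_SUBSET: same line only
            dest[i] = memory[addr]
            mask &= ~(1 << i)  # completed elements are cleared from the mask
    return mask


def gather(memory, base, indices, scale):
    """Re-execute gather_step until the mask is zero; return values and invocation count."""
    dest = [None] * len(indices)
    mask = (1 << len(indices)) - 1
    invocations = 0
    while mask:
        mask = gather_step(memory, base, indices, scale, mask, dest)
        invocations += 1
    return dest, invocations


# Toy memory: one float64 every 8 bytes, holding its own address as the value.
memory = {addr: float(addr) for addr in range(0, 1024, 8)}

print(gather(memory, 0, [0, 1, 2, 3, 4, 5, 6, 7], 8)[1])       # all in one line
print(gather(memory, 0, [0, 8, 16, 24, 32, 40, 48, 56], 8)[1])  # one line each
```

Under this assumption the number of invocations equals the number of distinct 64-byte lines the indices touch, so the cost of a gather depends on the access pattern and not only on the number of elements.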

  33. LET’S LOOK AT THE DOCUMENTATION [Same VGATHERDPD documentation page as on slide 32 (page 297, Reference Number: 327364-001), now annotated with "???" after the instruction summary "... using doubleword indices and k1 as completion mask."]
