The mystery of the computer: bits and data
Mikko Kivelä
Department of Computer Science, Aalto University
CS-A1120 Programming 2
2 March 2020
Lecture notes based on material created by Petteri Kaski
Billion computations per second

    def test(m: Long) = {
      var i = 1L
      var s = 0L
      while (i <= m) { // s = 1 + 2 + ... + m
        s = s + i
        i = i + 1
      }
      s
    }
    val NANOS_PER_SEC = 1e9
    val test_start_time = System.nanoTime
    test(4000000000L)
    val test_end_time = System.nanoTime
    val test_duration = test_end_time - test_start_time
    println("test took %.2f seconds".format(test_duration / NANOS_PER_SEC))
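To connect the measured time to the slide title, the duration can be turned into a rate of loop iterations per second. The following is a minimal sketch, not part of the original slides: the names `iterations`, `seconds`, and `billionsPerSecond` are made up for illustration, and a smaller `m` is used so that the result can be checked against the closed form m(m+1)/2 without the intermediate product overflowing a `Long`.

```scala
// Same summation loop as on the slide.
def test(m: Long): Long = {
  var i = 1L
  var s = 0L
  while (i <= m) { // s = 1 + 2 + ... + m
    s = s + i
    i = i + 1
  }
  s
}

val iterations = 100000000L // hypothetical, smaller than on the slide
val start = System.nanoTime
val sum = test(iterations)
val seconds = (System.nanoTime - start) / 1e9

// Sanity check against the closed form; safe here because
// iterations * (iterations + 1) still fits in a Long.
assert(sum == iterations * (iterations + 1) / 2)

// Each iteration performs two additions and one comparison,
// so the rate of elementary operations is a small multiple of this.
val billionsPerSecond = iterations / seconds / 1e9
println("%.2f billion loop iterations per second".format(billionsPerSecond))
```

The measured rate depends on the machine and on JIT warm-up, so a single run should be treated as a rough estimate only.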
Intel Skylake – machine code example (***)

Example: The innermost loop of a matrix multiplication subroutine implemented with Intel X86-64 machine code with AVX2 and FMA extensions supported by the Skylake architecture

    1029: c4 e2 7d 19 02       vbroadcastsd (%rdx),%ymm0
    102e: c4 e2 7d 19 0c 0a    vbroadcastsd (%rdx,%rcx,1),%ymm1
    1034: c4 e2 7d 19 14 4a    vbroadcastsd (%rdx,%rcx,2),%ymm2
    103a: 48 83 c2 08          add $0x8,%rdx
    103e: c5 fd 28 18          vmovapd (%rax),%ymm3
    1042: c4 e2 fd b8 e3       vfmadd231pd %ymm3,%ymm0,%ymm4
    1047: c4 e2 f5 b8 eb       vfmadd231pd %ymm3,%ymm1,%ymm5
    104c: c4 e2 ed b8 f3       vfmadd231pd %ymm3,%ymm2,%ymm6
    1051: c5 fd 28 58 20       vmovapd 0x20(%rax),%ymm3
    1056: c4 e2 fd b8 fb       vfmadd231pd %ymm3,%ymm0,%ymm7
    105b: c4 62 f5 b8 c3       vfmadd231pd %ymm3,%ymm1,%ymm8
    1060: c4 62 ed b8 cb       vfmadd231pd %ymm3,%ymm2,%ymm9
    1065: c5 fd 28 58 40       vmovapd 0x40(%rax),%ymm3
    106a: c4 62 fd b8 d3       vfmadd231pd %ymm3,%ymm0,%ymm10
    106f: c4 62 f5 b8 db       vfmadd231pd %ymm3,%ymm1,%ymm11
    1074: c4 62 ed b8 e3       vfmadd231pd %ymm3,%ymm2,%ymm12
    1079: c5 fd 28 58 60       vmovapd 0x60(%rax),%ymm3
    107e: c4 62 fd b8 eb       vfmadd231pd %ymm3,%ymm0,%ymm13
    1083: c4 62 f5 b8 f3       vfmadd231pd %ymm3,%ymm1,%ymm14
    1088: c4 62 ed b8 fb       vfmadd231pd %ymm3,%ymm2,%ymm15
    108d: 48 01 c8             add %rcx,%rax
    1090: 48 ff cb             dec %rbx
    1093: 75 94                jne 1029

https://github.com/pkaski/cluster-play/blob/master/haswell-mm-test/libmynative.c
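As a rough reading aid for the listing above, the loop body can be mirrored in Scala. This is a sketch under assumptions, not the author's source code: judging from the instructions, each iteration appears to broadcast three scalars (`vbroadcastsd`), load four vectors of four doubles (`vmovapd`), and update twelve accumulator registers (`ymm4` to `ymm15`) with fused multiply-adds (`vfmadd231pd`), which amounts to a 3 x 16 block update. The name `microKernel`, the array layout, and the zero-initialized accumulators are hypothetical.

```scala
// Hypothetical sketch: accumulate a 3 x 16 block of the product A * B,
// where a is 3 x k and b is k x 16.
def microKernel(a: Array[Array[Double]],
                b: Array[Array[Double]],
                k: Int): Array[Array[Double]] = {
  // 3 x 16 doubles = 12 registers of 4 doubles each (the role of ymm4..ymm15).
  // Assumed to start at zero; the listing does not show the initialization.
  val acc = Array.ofDim[Double](3, 16)
  var p = 0
  while (p < k) {           // one pass = one iteration of the machine code loop
    var i = 0
    while (i < 3) {
      val s = a(i)(p)       // scalar that vbroadcastsd copies to all four lanes
      var j = 0
      while (j < 16) {
        // vfmadd231pd handles four values of j at once and rounds only once;
        // this plain multiply-then-add rounds twice, so the last bits of the
        // result may differ from the hardware FMA.
        acc(i)(j) += s * b(p)(j)
        j += 1
      }
      i += 1
    }
    p += 1
  }
  acc
}
```

The machine code does in 23 instructions per iteration what the three nested loops above express with 48 scalar multiply-add steps, which is one reason hand-tuned kernels run much faster than straightforward loops.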
NVIDIA Volta – machine code example (***)

Example: Part of an inner loop of an algorithm (vertex-localized graph motif search) with NVIDIA GV100 GPU machine code (Compute Capability 7.0)

    LOP3.LUT R8, R6, R8, R19, 0x96, !PT;                  /* 0x0000000806087212 */
                                                          /* 0x000fe400078e9613 */
    LOP3.LUT R64, R11, R64, RZ, 0x3c, !PT;                /* 0x000000400b407212 */
                                                          /* 0x000fc400078e3cff */
    LOP3.LUT R62, R62, R5, R4.reuse, 0x96, !PT;           /* 0x000000053e3e7212 */
                                                          /* 0x100fe400078e9604 */
    LOP3.LUT R17, R17, R15.reuse, R7.reuse, 0x78, !PT;    /* 0x0000000f11117212 */
                                                          /* 0x180fe400078e7807 */
    LOP3.LUT R8, R8, R15, R7, 0x78, !PT;                  /* 0x0000000f08087212 */
                                                          /* 0x000fe400078e7807 */
    LOP3.LUT R18, R19.reuse, R18, R4.reuse, 0x96, !PT;    /* 0x0000001213127212 */
                                                          /* 0x140fe400078e9604 */
    LOP3.LUT R5, R19, R10, R4, 0x96, !PT;                 /* 0x0000000a13057212 */
                                                          /* 0x000fe400078e9604 */
    LOP3.LUT R9, R6, R9, R19, 0x96, !PT;                  /* 0x0000000906097212 */
                                                          /* 0x000fc400078e9613 */
    LOP3.LUT R7, R64, R15, R7, 0x78, !PT;                 /* 0x0000000f40077212 */
                                                          /* 0x000fe400078e7807 */
    LOP3.LUT R61, R61, R12, R19, 0x96, !PT;               /* 0x0000000c3d3d7212 */
                                                          /* 0x000fe400078e9613 */
    LOP3.LUT R59, R17, R59, RZ, 0x3c, !PT;                /* 0x0000003b113b7212 */
                                                          /* 0x000fe400078e3cff */
    LOP3.LUT R60, R60, R5, R6, 0x96, !PT;                 /* 0x000000053c3c7212 */
                                                          /* 0x000fe400078e9606 */
    LOP3.LUT R58, R9, R58, RZ, 0x3c, !PT;                 /* 0x0000003a093a7212 */
                                                          /* 0x000fe400078e3cff */
    LOP3.LUT R51, R18, R51, RZ, 0x3c, !PT;                /* 0x0000003312337212 */
                                                          /* 0x000fc400078e3cff */
    LOP3.LUT R50, R8, R50, RZ, 0x3c, !PT;                 /* 0x0000003208327212 */
                                                          /* 0x000fe400078e3cff */
    LOP3.LUT R57, R7, R57, RZ, 0x3c, !PT;                 /* 0x0000003907397212 */
                                                          /* 0x000fe200078e3cff */
    @P0 BRA 0x8d0;                                        /* 0xfffff96000000947 */
                                                          /* 0x000fee000383ffff */

https://github.com/pkaski/motif-localized
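The constants `0x96`, `0x3c`, and `0x78` in the listing are 8-bit lookup tables: `LOP3.LUT` computes an arbitrary three-input Boolean function bit by bit, using the three input bits as an index into the table. The sketch below emulates this in Scala for 32-bit values. The function name `lop3` and its parameter order are chosen here for illustration; the indexing convention follows the usual description of the instruction, where the tables for the plain inputs a, b, and c are `0xF0`, `0xCC`, and `0xAA`.

```scala
// Emulates a three-input bitwise logic operation selected by an 8-bit table.
def lop3(a: Int, b: Int, c: Int, lut: Int): Int = {
  var result = 0
  var bit = 0
  while (bit < 32) {
    // Build a 3-bit index from the current bit of a, b and c.
    val index = (((a >>> bit) & 1) << 2) |
                (((b >>> bit) & 1) << 1) |
                ((c >>> bit) & 1)
    // Look up the output bit in the table and place it in the result.
    result |= ((lut >>> index) & 1) << bit
    bit += 1
  }
  result
}

val x = 0x12345678
val y = 0x0F0F0F0F
val z = 0xDEADBEEF

// The three tables that occur in the listing:
assert(lop3(x, y, z, 0x96) == (x ^ y ^ z))   // three-way XOR
assert(lop3(x, y, 0, 0x3c) == (x ^ y))       // XOR of the first two inputs
assert(lop3(x, y, z, 0x78) == (x ^ (y & z))) // a XOR (b AND c)
```

Under this reading, the inner loop consists almost entirely of XOR and AND operations on 32-bit words, which is consistent with an algorithm built on bitwise arithmetic.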
The mystery of the computer
What are the principles of how computers work?
What is computing?
Why is it important that a programmer understand the central principles of computers?
• A computer is a machine: understanding the basic principles of how this machine works is a fundamental part of a programmer's professional competence
• Skills for applications where the computer needs to be used at the limits of its performance
• The physical device (“hardware”) and the programs (“software”) interact all the way from design to execution
• Curiosity and the joy of finding out how things work
dgx01.triton.aalto.fi (***)
(NVIDIA DGX-1, 8 x Tesla V100 GPU, 40960 cores, 3.2 kW, 170 teraflops)
Finland: Mahti & Puhti (***)
New Finnish supercomputer @ CSC Kajaani (Atos BullSequana)
Puhti: 320 Nvidia V100 Volta GPUs (2.7 petaflops), ~27 000 Intel Xeon cores (2.5 petaflops)
Mahti: ~180 000 AMD EPYC cores (7.5 petaflops)
https://research.csc.fi/techspecs/
Summit: #1 top500.org (***)
~4600 computational nodes, every node has six NVIDIA Volta V100s, ~200 petaflops
(~4600 x 6 x 5120 32-bit cores = ~140 million cores, 1312 MHz clock rate, 15 MW)
https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
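The core count on the slide is a plain product, and the performance and power figures can be combined into an energy-efficiency estimate. The following is a small sketch of that arithmetic using only the approximate numbers quoted above; the variable names are made up for illustration.

```scala
// Approximate figures from the slide.
val nodes = 4600L
val gpusPerNode = 6L
val coresPerGpu = 5120L

// Total number of 32-bit GPU cores: 4600 * 6 * 5120.
val totalCores = nodes * gpusPerNode * coresPerGpu
println("%d cores in total".format(totalCores)) // 141312000, about 140 million

// Energy efficiency: floating-point operations per second per watt.
val flops = 200e15 // ~200 petaflops
val watts = 15e6   // 15 MW
val gigaflopsPerWatt = flops / watts / 1e9
println("about %.1f gigaflops per watt".format(gigaflopsPerWatt))
```

Since the inputs are rounded, the results are order-of-magnitude estimates rather than exact specifications.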