THE WHOLE TAMALE Mostly
ASSEMBLY/MEMORY IMAGE

    # Execution begins at address 0
        .pos 0
        irmovq stack, %rsp      # Set up stack pointer
        rrmovq %rsp, %rbp       # Initialize the base pointer
        call main               # Execute main program
        halt                    # Terminate program

    # Array of 4 elements
        .align 8
    array:
        .quad 0x000d000d000d
        .quad 0x00c000c000c0
        .quad 0x0b000b000b00
        .quad 0xa000a000a000

    main:
        pushq %rbp
        rrmovq %rsp, %rbp
        irmovq array, %rdi
        irmovq $4, %rsi
        call sum                # sum(array, 4)
        popq %rbp               # Restore the caller's base pointer
        ret

    # long sum(long *start, long count)
    # start in %rdi, count in %rsi
    sum:
        pushq %rbp
        rrmovq %rsp, %rbp
        irmovq $8, %r8          # Constant 8
        irmovq $1, %r9          # Constant 1
        xorq %rax, %rax         # sum = 0
        andq %rsi, %rsi         # Set CC
        jmp test                # Goto test
    loop:
        mrmovq (%rdi), %r10     # Get *start
        addq %r10, %rax         # Add to sum
        addq %r8, %rdi          # start++
        subq %r9, %rsi          # count--. Set CC
    test:
        jne loop                # Stop when 0
        popq %rbp               # Restore the caller's base pointer
        ret                     # Return

    # Stack starts here and grows to lower addresses
        .pos 0xf8
    stack:

[Figure: process memory image, from address 0 upward. Process virtual memory: code (.text) starting at 0x400000, initialized data (.data), uninitialized data (.bss), the runtime heap (via malloc) growing up to brk, the memory-mapped region for shared libraries, and the user stack (top at %rsp). Kernel virtual memory above it: kernel code and data and physical memory (identical for each process), and process-specific data structures such as page tables, task and mm structs, and the kernel stack (different for each process).]
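For comparison, here is a minimal C sketch of what `sum` computes, following the signature given in the program's comment (`long sum(long *start, long count)`) and assuming a 64-bit `long` as on x86-64 Linux. The loop shape mirrors the assembly (test `count` first, then add `*start` and advance); it is an illustration, not compiler output for the Y86-64 code.

```c
#include <assert.h>

/* long sum(long *start, long count), as in the Y86-64 comment.
   Assumes a 64-bit long so the .quad constants fit. */
long sum(long *start, long count)
{
    long total = 0;          /* xorq %rax,%rax : sum = 0            */
    assert(count >= 0);
    while (count != 0) {     /* andq / jne     : stop when count is 0 */
        total += *start;     /* mrmovq + addq  : add *start to sum   */
        start++;             /* addq %r8,%rdi  : advance 8 bytes     */
        count--;             /* subq %r9,%rsi  : count--             */
    }
    return total;            /* result is returned in %rax */
}
```

With the program's four array elements, `sum(array, 4)` returns 0xabcdabcdabcd, the value the Y86-64 program leaves in %rax.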
BASIC ARCHITECTURE
[Figure: hardware organization. The CPU (register file, ALU, bus interface) connects over the system bus to the I/O bridge, which connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a host bus adapter (SCSI/SATA), a disk controller with a disk drive, a solid state disk, and expansion slots for other devices such as network adapters.]
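BASIC ARCHITECTURE
[Figure: control unit functionality. The CPU consists of a control unit and a data path. The control unit reads the processor state (IR and status flags S, Z, O) and sends control signals (irw, pcw, pcr, cr, alu, cw, rd, rw, aw, rs, rr, maw, mdr, mdw) to the data path, which exchanges data with main memory and input/output.]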
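DATA PATH
[Figure: the 16-bit data path. Registers R0 through R3 (2-bit register selects rd and rs, enables rw and rr), PC (pcw, pcr), IR (irw), memory address register MA (maw), memory data register MD (mdw, mdr), and ALU registers A (aw) and C (cw, cr) are connected by a shared bus. The ALU takes inputs A and B, is controlled by the 2-bit alu signal, and produces status outputs; MA and MD connect to main memory.]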
add r1, r2  # Register Transfer Notation

    time  RTN                    alu rd  rs  rw  rr  aw  cw  cr  pcr pcw maw mdr mdw irw
    0     MA ← PC; C ← PC+2      11  00  00  0   0   0   1   0   1   0   1   0   0   0
    1     MD ← M[MA]; PC ← C     00  00  00  0   0   0   0   1   0   1   0   0   0   0
    2     IR ← MD                00  00  00  0   0   0   0   0   0   0   0   1   0   1
    3     A ← R[src]             00  00  01  0   1   1   0   0   0   0   0   0   0   0
    4     C ← A + R[dest]        00  00  10  0   1   0   1   0   0   0   0   0   0   0
    5     R[dest] ← C            00  10  00  1   0   0   0   1   0   0   0   0   0   0

This is the register transfer notation for a register-to-register add, where the src and dest codes (here r1 and r2) specify the register numbers. These codes come from the instruction register (IR). Note that time step 1 is unusual: MD is loaded directly from memory, so that transfer does not use the signals controlling the data path; only PC ← C does (cr, pcw).
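One way to read the table is as data: each time step is a control word, and a simple consistency check is that exactly one unit drives the shared bus in each step. The sketch below encodes the six rows above and counts bus drivers, under the assumption (read from the data path figure, not stated on the slide) that `rr`, `cr`, `pcr` and `mdr` are the only signals that place a value on the bus. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the control table: 2-bit selects and 1-bit enables. */
struct control_word {
    const char *rtn;
    unsigned alu, rd, rs;                   /* 2-bit fields (11b = 3, 10b = 2, 01b = 1) */
    unsigned rw, rr, aw, cw, cr;            /* register file, A, C */
    unsigned pcr, pcw, maw, mdr, mdw, irw;  /* PC, MA, MD, IR */
};

/* Rows for "add r1, r2", copied from the table (src = r1, dest = r2). */
static const struct control_word add_reg[] = {
    /* rtn                    alu rd rs rw rr aw cw cr pcr pcw maw mdr mdw irw */
    {"MA <- PC; C <- PC+2",   3, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0},
    {"MD <- M[MA]; PC <- C",  0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0},
    {"IR <- MD",              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1},
    {"A <- R[src]",           0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0},
    {"C <- A + R[dest]",      0, 0, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0},
    {"R[dest] <- C",          0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0},
};

#define ADD_REG_STEPS (sizeof add_reg / sizeof add_reg[0])

/* Assumed bus sources: register file (rr), C (cr), PC (pcr), MD (mdr). */
static unsigned bus_drivers(const struct control_word *w)
{
    return w->rr + w->cr + w->pcr + w->mdr;
}

/* Print each step; return how many steps break the one-driver rule. */
static unsigned check_table(const struct control_word *t, unsigned n)
{
    unsigned bad = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned d = bus_drivers(&t[i]);
        printf("%u: %-22s bus drivers = %u\n", i, t[i].rtn, d);
        if (d != 1)
            bad++;
    }
    return bad;
}
```

Calling `check_table(add_reg, ADD_REG_STEPS)` prints one line per step and returns 0, since every row enables exactly one of the assumed bus sources. Note also that `rs` in step 4 and `rd` in step 5 carry the same code (10), which is why both refer to the dest register.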
add (r1), r2(r3)  # Register Transfer Notation
What are the steps?
ADD (R1), R2(R3)
[Figure: the 16-bit data path from the DATA PATH slide, used to work out the steps.]
add (r1), r2(r3)  # Register Transfer Notation
What are the steps?

1. Get the instruction
    0: MA ← PC; C ← PC+2
    1: MD ← M[MA]; PC ← C
    2: IR ← MD
2. Get the value referenced by the destination
    3: A ← R[r3]
    4: C ← A + R[r2]
    5: MA ← C
    6: MD ← M[MA]
    7: A ← MD
3. Add the value referenced by the source
    8: MA ← R[r1]
    9: MD ← M[MA]
    10: C ← A + MD
4. Store the result to the location referenced by the destination
    11: MD ← C
    12: A ← R[r3]
    13: C ← A + R[r2]
    14: MA ← C
    15: M[MA] ← MD
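The sixteen steps can be traced with a small register-transfer simulator. This is a sketch under stated assumptions: 16-bit words, byte addresses with PC advancing by 2 as in step 0, memory modeled as an array of 16-bit words, and the effective address of r2(r3) taken as R[r2] + R[r3], exactly as steps 3 and 4 compute it. Instruction decoding is skipped (register numbers are passed in rather than extracted from IR), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 256u

/* Hypothetical machine state mirroring the data path registers. */
struct machine {
    uint16_t R[4];               /* R0..R3 */
    uint16_t PC, IR, MA, MD, A, C;
    uint16_t mem[MEM_WORDS];     /* 16-bit words, indexed by byte address / 2 */
};

static uint16_t read_mem(const struct machine *m, uint16_t addr)
{
    assert(addr % 2 == 0 && addr / 2 < MEM_WORDS);
    return m->mem[addr / 2];
}

static void write_mem(struct machine *m, uint16_t addr, uint16_t value)
{
    assert(addr % 2 == 0 && addr / 2 < MEM_WORDS);
    m->mem[addr / 2] = value;
}

/* add (r1), r2(r3): M[R[r2] + R[r3]] <- M[R[r2] + R[r3]] + M[R[r1]] */
static void add_mem_indexed(struct machine *m, int r1, int r2, int r3)
{
    assert(r1 >= 0 && r1 < 4 && r2 >= 0 && r2 < 4 && r3 >= 0 && r3 < 4);
    /* 1. Get the instruction */
    m->MA = m->PC; m->C = (uint16_t)(m->PC + 2);   /* 0 */
    m->MD = read_mem(m, m->MA); m->PC = m->C;      /* 1 */
    m->IR = m->MD;                                 /* 2 */
    /* 2. Get the value referenced by the destination */
    m->A = m->R[r3];                               /* 3 */
    m->C = (uint16_t)(m->A + m->R[r2]);            /* 4 */
    m->MA = m->C;                                  /* 5 */
    m->MD = read_mem(m, m->MA);                    /* 6 */
    m->A = m->MD;                                  /* 7 */
    /* 3. Add the value referenced by the source */
    m->MA = m->R[r1];                              /* 8 */
    m->MD = read_mem(m, m->MA);                    /* 9 */
    m->C = (uint16_t)(m->A + m->MD);               /* 10 */
    /* 4. Store the result at the destination address */
    m->MD = m->C;                                  /* 11 */
    m->A = m->R[r3];                               /* 12 */
    m->C = (uint16_t)(m->A + m->R[r2]);            /* 13 */
    m->MA = m->C;                                  /* 14 */
    write_mem(m, m->MA, m->MD);                    /* 15 */
}
```

Tracing it shows why steps 12 through 14 recompute the destination address: MA was overwritten at step 8 and C at step 10, so the address from steps 4 and 5 is no longer held anywhere.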
MEMORY OPERATION

[Figure: a processor package with cores 0 through 3. Each core has its own registers, L1 d-cache, L1 i-cache and L2 unified cache; an L3 unified cache is shared by all cores and connects to main memory. Alongside it, the system diagram from the basic architecture slide: CPU, system bus, I/O bridge, memory bus, main memory and I/O devices.]

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Memory is pulled into the caches in chunks (cache lines) to take advantage of locality. C stores a[M][N] in row-major order, so this column-wise loop steps N ints through memory between consecutive accesses and makes poor use of each line it pulls in.
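For contrast, a row-wise version of the same loop visits a[i][0], a[i][1], ... in the order they are laid out in memory (stride 1), so each cache line is used fully before the loop moves on. The sketch below fixes M and N at small illustrative values (an assumption, since the slide leaves them symbolic); both functions compute the same sum and differ only in access order.

```c
#include <assert.h>

#define M 4   /* illustrative sizes; the slide leaves M and N symbolic */
#define N 3

/* Column-wise traversal from the slide: N ints between consecutive accesses. */
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Row-wise traversal: consecutive accesses are adjacent in memory. */
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

In row-major order, &a[i][j] lies (i * N + j) * sizeof(int) bytes past &a[0][0], which is why the inner loop over j in `sumarrayrows` has stride 1 while the inner loop over i in `sumarraycols` has stride N.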
BASIC ARCHITECTURE

The memory hierarchy, from smaller, faster and costlier (per byte) storage devices at the top to larger, slower and cheaper (per byte) storage devices at the bottom:

L0: Registers. CPU registers hold words retrieved from cache memory.
L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). Holds cache lines retrieved from memory.
L4: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L6: Remote secondary storage (distributed file systems, Web servers).
PRINCIPLE OF LOCALITY

Spatial locality: once a given address has been referenced, addresses near it are likely to be referenced within a short period of time.

Temporal locality: once a particular memory item has been referenced, it is likely to be referenced again within a short period of time.
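Both kinds of locality show up when counting misses in a toy cache model. The sketch below assumes a hypothetical direct-mapped cache of 256 lines of 32 bytes (8 KB) and 4-byte ints, and replays an array scan with a given stride and number of passes. Stride 1 shows spatial locality (one miss per 32-byte line, then hits on the neighboring ints); a second pass over data that fits in the cache shows temporal locality (all hits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache geometry, chosen only for illustration. */
#define LINE_BYTES 32u
#define NUM_LINES  256u

struct dm_cache {
    int valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
};

/* Look up one byte address: 1 on a hit, 0 on a miss (the line is then filled). */
static int cache_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t block = addr / LINE_BYTES;
    uint32_t line = block % NUM_LINES;
    uint32_t tag = block / NUM_LINES;

    if (c->valid[line] && c->tag[line] == tag)
        return 1;
    c->valid[line] = 1;
    c->tag[line] = tag;
    return 0;
}

/* Scan `count` ints from address 0, `stride` ints apart, `passes` times;
   return the total number of misses starting from an empty cache. */
static unsigned count_misses(size_t count, size_t stride, unsigned passes)
{
    struct dm_cache c;
    unsigned misses = 0;

    memset(&c, 0, sizeof c);
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < count; i++) {
            uint32_t addr = (uint32_t)(i * stride * sizeof(int));
            if (!cache_access(&c, addr))
                misses++;
        }
    return misses;
}
```

With these parameters, `count_misses(1024, 1, 1)` gives 128 misses (1 in 8 accesses) while `count_misses(1024, 8, 1)` misses on every access; a second stride-1 pass adds no misses because the 4 KB scanned fits in the 8 KB cache.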
ASSEMBLY/CACHING

[Figure: the Y86-64 program from the ASSEMBLY/MEMORY IMAGE slide, shown next to the cache it runs against.]

The cache has 256 groups, each holding one 8-byte cache line, plus a tag memory with a 5-bit tag and a valid bit per group. A 16-bit memory address is divided into three fields:

    Tag (5 bits) | Group (8 bits) | Byte (3 bits)

Main memory holds 8192 blocks of 8 bytes. Block b belongs to group b mod 256 and has tag b / 256 (integer division), so each group can hold any one of 32 blocks (tags 0 through 31):

    Group 0:    blocks 0, 256, 512, ..., 7680, 7936
    Group 1:    blocks 1, 257, 513, ..., 2305, ..., 7681, 7937
    Group 2:    blocks 2, 258, 514, ..., 7682, 7938
    ...
    Group 255:  blocks 255, 511, 767, ..., 8191

Example tag memory contents:

    Group  Tag  Valid  Block held
    0      30   1      7680
    1      9    1      2305
    2      1    1      258
    255    1    1      511
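The field split and the tag lookup can be checked with a few lines of C. This sketch assumes exactly the geometry on the slide (16-bit addresses, 5-bit tag, 8-bit group, 3-bit byte offset, one 8-byte line per group); the function and struct names are hypothetical. A hit requires the group's valid bit to be set and its stored tag to match the address's tag field.

```c
#include <assert.h>
#include <stdint.h>

#define BYTE_BITS  3u   /* 8-byte cache lines */
#define GROUP_BITS 8u   /* 256 groups */
#define TAG_BITS   5u   /* 32 possible tags */
#define NUM_GROUPS (1u << GROUP_BITS)

struct addr_fields {
    unsigned tag, group, byte;
};

/* Split a 16-bit address into Tag | Group | Byte. */
static struct addr_fields split_address(uint16_t addr)
{
    struct addr_fields f;
    f.byte  = addr & ((1u << BYTE_BITS) - 1);
    f.group = (addr >> BYTE_BITS) & (NUM_GROUPS - 1);
    f.tag   = (unsigned)addr >> (BYTE_BITS + GROUP_BITS);
    assert(f.tag < (1u << TAG_BITS));
    return f;
}

/* Tag memory: one valid bit and one tag per group. */
struct tag_memory {
    unsigned char valid[NUM_GROUPS];
    unsigned char tag[NUM_GROUPS];
};

/* Return 1 if the block containing addr is currently cached. */
static int is_hit(const struct tag_memory *tm, uint16_t addr)
{
    struct addr_fields f = split_address(addr);
    return tm->valid[f.group] && tm->tag[f.group] == f.tag;
}
```

For example, any byte of block 2305 (byte addresses 2305 × 8 through 2305 × 8 + 7) splits into tag 9 and group 1, matching the tag memory entry for group 1 above.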