memory
play

Memory FIFOs for uncommitted writes Consistency Invalidate queues - PowerPoint PPT Presentation

Sistemi operativi Operating Systems Universit degli studi di Udine Sistemi operativi Operating Systems Universit degli studi di Udine Sources of out-of-order memory accesses Compiler optimizations Store buffers Memory


  1. Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Sources of out-of-order memory accesses � Compiler optimizations � Store buffers Memory � FIFOs for uncommitted writes Consistency � Invalidate queues (for cache coherency) � Data prefetch Models � Banked cache architectures � Networked interconnect � Non-uniform memory access (NUMA) architectures: different accesses to memory have different latencies � ... Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine int add3 (int x) { add3: int add3 (int x) { add3: int i; mov r0, r0, asl #3 int i; mov r0, r0, asl #3 Compiler optimizations for (i=0; i<3; i++) x += x; mov pc, lr for (i=0; i<3; i++) x += x; mov pc, lr return x; return x; } } C code ARM assembly code This function always returns 8·x: compiler can optimize code � Language semantic does not consider int add_vals (int *vec) { add_vals: int add_vals (int *vec) { add_vals: 1. Side-effects of memory accesses int y = vec[1]; ldr r3, [r0] int y = vec[1]; ldr r3, [r0] y += vec[0]; ldr r0, [r0, #4] y += vec[0]; ldr r0, [r0, #4] 2. Multi-threading return y; add r0, r0, r3 return y; add r0, r0, r3 } mov pc, lr } mov pc, lr 3. Asynchronous execution C code ARM assembly code Result does not depend on access order: compiler can change loads order � Compiler can: waitval: waitval: � Reorder instructions ldr r3, [r0] ldr r3, [r0] void waitval (int *ptr) { void waitval (int *ptr) { cmp r3, #0 cmp r3, #0 while (*ptr == 0) � Eliminate operations while (*ptr == 0) movne pc, lr movne pc, lr continue; continue; loop: loop: } } � Some compiler optimization can be controlled by the b loop b loop volatile qualifier C code ARM assembly code Compiler does not need to consider that someone else can change *ptr

  2. Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Volatile Examples int *ptr; /* pointer to int */ int *ptr; /* pointer to int */ � Semantic volatile int *ptr_to_vol; /* pointer to volatile int */ volatile int *ptr_to_vol; /* pointer to volatile int */ � Each read from a volatile variable requires an actual load int * volatile vol_ptr; /* volatile pointer to int */ int * volatile vol_ptr; /* volatile pointer to int */ and may return a different value volatile int * volatile vol_ptr_to_vol; /* volatile pointer to volatile int */ volatile int * volatile vol_ptr_to_vol; /* volatile pointer to volatile int */ � Compiler optimization cannot merge reads from the same address � Beware the semantic: � Each write to a volatile variable requires an actual store � Compiler optimization cannot cancel stores � a = *ptr_to_vol; � is a volatile access � Required to access I/O address space � a = *vol_ptr; � is not a volatile access Note: this is the C/C++ semantic the Java semantic differs (it also implies atomicity) Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Volatile Volatile � Inconsistent qualification causes errors volatile int A; volatile int A; volatile int B; volatile int B; volatile int A; volatile int A; A=1; /* these two lines won't be */ A=1; /* these two lines won't be */ � Volatile does not enforce ordering with non-volatile volatile int B; volatile int B; B=1; /* reordered by compiler */ B=1; /* reordered by compiler */ accesses A=1; /* these two lines won't be */ A=1; /* these two lines won't be */ B=1; /* reordered by compiler but */ B=1; /* reordered by compiler but */ int A; int A; /* accesses can be reordered */ /* accesses can be reordered */ volatile int B; volatile int B; /* by HW */ /* by HW */ � Volatile does not enforce order on how access are A=1; /* these two lines can be */ A=1; /* these two lines can be */ actually performed B=1; /* reordered by compiler */ B=1; /* reordered by compiler */ � Volatile does not mean atomic volatile int X; volatile int X; X=1; /* this assignment can be interrupted or preempted */ X=1; /* this assignment can be interrupted or preempted */

  3. Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Memory barrier Store Buffer � Implementation on GCC � Record the store in buffer until is actually performed asm volatile ("" : : : "memory"); asm volatile ("" : : : "memory"); � Hide memory latency � Cache latency � Cache-miss on write � This inline assembly code: � Processor can execute other instructions 1. contains no instructions � Data dependency (RAW) 2. may read or write all of RAM � Wait until the write is actually performed in memory or in cache � Read the data from the store buffer (store forwarding) � Hence: � Data dependency (WAW) compiler memory accesses reordering is not allowed � Add a new entry in the store buffer around the barrier in either direction � Replace the previous write in the store buffer Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Example Example � Execution: � Processor P1 executes � 1: store A : cache miss P1 P2 � write the updated value in store buffer � 1) store A � send a read request (data will come from P2 cache) � 2) store B � several clock cycles needed Store Store � P1 can proceed, (the new value is in the store buffer) buffer buffer � A and B are shared with P2: � P2 does not see the write Cache Cache � 2: store B : cache hit � A is in P2 cache B B A � data is written in cache � B is in both caches � a coherence message is sent to P2 � P2 sees the write Interconnect � 3: A is loaded in P1 cache � 4: A is updated in P1 cache � a coherence message is sent to P2 � P2 sees the write � � P2 sees the store on B first, then the store on A

  4. Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Consequence Cache coherency � Cache coherency can require cache line invalidation Note : initially: A=0 and B=0 A and B are volatiles � A processor send an invalidate message to another one � Target processor must invalidate cache line A = 1 while (B==0) continue; � Invalidate Queue B = 1 assert (A==1); /* this can fail! */ � Store invalidate requests while the cache is busy � Invalidate the line when the cache is ready P1 P2 If P2 sees the stores performed by P1 in reverse order, the assertion fails Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Data prefetch Banked cache architectures � Processor can read data before the actual load � Caches split in several banks instruction � While accesses to busy banks must wait, accesses to idle banks can proceed � Hide memory latency � Preload data in cache Processor � Speculative execution Store � Execute instructions after a branch before the branch buffer Cache Cache \ Interconnect

  5. Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Definitions Definitions � Performed � Program order � Write � a write by processor i is performed with respect to processor k when: � The order of operations as specified by software � a read issued by k to the same address returns the value stored by i � Execution order � Read � a read by processor i is performed with respect to processor k when: � The order of operations as executed by a processor � a write issued by k to the same address cannot affect the value read by i � Perceived order � Globally Performed � The order of operations as seen by processors and memories � globally performed: is performed with respect to all processors � Memory consistency model � Write � A write is globally performed when its modi cation has been � Rules that specify the allowed behavior of programs in terms fi propagated to all processors of memory accesses � Read � Rules: order restrictions � A read is globally performed when the value it returns is bound and the write that wrote this value is globally performed Sistemi operativi – Operating Systems Università degli studi di Udine Sistemi operativi – Operating Systems Università degli studi di Udine Memory consistency models Memory consistency models � Rules on access ordering can regard: � Uniform consistency models � Location (address of access) � Rules do no concern category of accesses � Direction � read, write, read-write � Value � Causality � Hybrid consistency models � behavior of an access depends on the behavior of another one � Category of accesses matters � Category � shared / private � synchronizing / not synchronizing

Recommend


More recommend