Figure: two many-core processors, the Intel 48-core SCC processor and the Tilera 100-core processor.
• Introduction
• Parallel program structure and behavior
  – Case study of fluidanimate
  – Thread criticality problem
  – Communication impact on thread criticality
• Thread-criticality support in on-chip network
  – Bypass flow control
  – Priority-based arbitration
• Methodology & results
• Conclusion
• fluidanimate in PARSEC
  – Particle-based fluid simulation that solves the Navier-Stokes equations
  – Particles are spatially sorted into a uniform grid, and each thread covers subgrids of the entire simulation domain.
  – Each thread executes the AdvanceFrameMT() function: 8 sub-functions with 8 barriers.
  – The provided input sets process 5 frames.

void AdvanceFrameMT(int i)
{
    ClearParticlesMT(i);
    pthread_barrier_wait(&barrier);
    RebuildGridMT(i);
    pthread_barrier_wait(&barrier);
    InitDensitiesAndForcesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensitiesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensities2MT(i);
    pthread_barrier_wait(&barrier);
    ComputeForcesMT(i);
    pthread_barrier_wait(&barrier);
    ProcessCollisionsMT(i);
    pthread_barrier_wait(&barrier);
    AdvanceParticlesMT(i);
    pthread_barrier_wait(&barrier);
}
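For context, a minimal sketch of how a barrier-synchronized function like this is typically driven, one thread per core; NUM_THREADS, NUM_FRAMES, thread_func, and main are illustrative names and bounds, not the PARSEC source:

#include <pthread.h>

#define NUM_THREADS 64                 /* assumed: one thread per core */
#define NUM_FRAMES  5                  /* provided input sets process 5 frames */

void AdvanceFrameMT(int i);            /* defined in fluidanimate, as above */
pthread_barrier_t barrier;

static void *thread_func(void *arg) {
    int i = (int)(long)arg;            /* thread index = subgrid owner */
    for (int f = 0; f < NUM_FRAMES; f++)
        AdvanceFrameMT(i);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, thread_func, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}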
Figure: execution timeline of AdvanceFrameMT() across N threads (one thread per core). Barriers 0 through 7 separate the eight phases; every thread waits at each pthread_barrier_wait() call until all threads arrive.
Figure: per-thread execution time (10^8 cycles) of ComputeDensitiesMT and ComputeForcesMT, the phases ending at Barrier 3 and Barrier 5, respectively. If we accelerate half of the threads, the execution time of ComputeForcesMT can be reduced by 29%.
• Variation of executed instructions
  – Different execution paths in the same control flow graph
  ⇒ Different computation time
• Variation of memory accesses
  – Different cache behavior on the L2 cache
  – Thread criticality predictor based on per-core L2 hits and misses (Bhattacharjee et al., ISCA '09): larger total L1 miss penalties ⇒ higher thread criticality (a sketch follows this list)
  ⇒ Different memory stall time
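As a sketch of the predictor idea (the counter names and penalty values are illustrative assumptions, not the exact hardware of the ISCA '09 design; the L2-hit cost reuses the 42-cycle latency quoted on the next slide):

/* Per-core event counts collected by hardware performance counters. */
struct miss_counters {
    unsigned long l1_misses_l2_hit;    /* L1 misses served by the L2 */
    unsigned long l1_misses_l2_miss;   /* L1 misses that also miss in L2 */
};

/* Assumed average miss penalties in cycles. */
enum { L2_HIT_COST = 42, L2_MISS_COST = 200 };

/* Accumulated L1 miss penalty: a larger value means the thread stalls
 * longer on memory and is therefore predicted to be more critical. */
unsigned long predicted_criticality(const struct miss_counters *c) {
    return c->l1_misses_l2_hit  * L2_HIT_COST +
           c->l1_misses_l2_miss * L2_MISS_COST;
}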
• A large portion of the cache miss penalty is communication latency incurred by the on-chip network.
  – The L2 cache is distributed across multiple banks interconnected by the network.
• L2 cache access latency in an 8x8 mesh network
  – 3-cycle hop latency, 6-cycle bank access latency, 12 hops per round trip (uniform random traffic)
  – 36 cycles (86%) of the total 42-cycle latency are communication latency (the arithmetic is spelled out below).
• Our work aims to reduce the communication latency of high-criticality threads to accelerate their execution.
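The arithmetic behind these numbers, as a tiny model (parameter values are the ones quoted above):

/* Round-trip L2 access latency in the mesh: network hops plus one bank
 * access. With the slide's numbers: 12 hops * 3 cycles + 6 cycles = 42,
 * of which 12 * 3 = 36 cycles (36/42 ~= 86%) are communication. */
int l2_round_trip_cycles(int hops, int hop_cycles, int bank_cycles) {
    return hops * hop_cycles + bank_cycles;
}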
• Low-latency support
  – Express virtual channels (ISCA '07)
    • Router pipeline skipping by flow control
  – Single-cycle router (ISCA '04)
    • All dependent operations are handled in a single cycle through speculation.
• Quality-of-service support
  – Globally synchronized frames (ISCA '09)
    • Per-flow bandwidth guarantee within a time window
  – Application-aware prioritization (MICRO '09)
    • High system throughput across many single-threaded applications by exploiting the different stall cycles per packet in each application
• Bypass flow control
  – Reduce per-hop latency for critical threads.
  – Preserve internal router state so packets can skip the router pipeline.
  – Find a state that maximizes bypassing opportunities.
• Priority-based arbitration
  – Reduce the stall time caused by router resource arbitration for critical threads.
  – Assign high priority to critical threads and low priority to non-critical threads.
  – Allocate VCs and switch ports through priority-based arbitration.
• The router preserves a bypass (default) state between input ports and output ports.
• When a packet follows the same path as the router's bypass state, it skips the router pipeline and goes directly to the link (a sketch of the bypass check follows).
• The bypass state corresponds to preserved router resources:
  – Bypass VC: the VC reserved for bypassing
  – State-preserving switch crossbar: the switch input/output ports reserved for bypassing
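A minimal sketch of the bypass check at an input port (the structure and field names are illustrative, not the exact microarchitecture):

#define PORTS 4   /* e.g., a 4x4 switch */

/* Preserved bypass state: for each input port, the output port that a
 * bypassing packet may take without going through allocation. */
struct port_state_table {
    int out_for_in[PORTS];
};

struct flit {
    int in_port;   /* port the flit arrived on */
    int vc;        /* virtual channel it occupies */
    int out_port;  /* output port chosen by route computation */
};

/* A flit bypasses the pipeline only if it sits on the bypass VC and its
 * route matches the preserved input->output pair; otherwise it goes
 * through the normal VC and switch allocation stages. */
int can_bypass(const struct port_state_table *t, const struct flit *f,
               int bypass_vc) {
    return f->vc == bypass_vc && t->out_for_in[f->in_port] == f->out_port;
}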
Figure: router microarchitecture with bypass support. A port state table (e.g., IN 0→OUT 1, IN 1→OUT 0, IN 2→OUT 2, IN 3→OUT 3) sits alongside the routing logic, VC allocator, and switch allocator; each input port (Input 0 through Input 3) holds a bypass VC, and the crossbar switch connecting to Output 0 through Output 3 is a state-preserving crossbar switch.
Figure: implementation of the state-preserving crossbar switch. The port state table (IN 0→OUT 1, 1→0, 2→2, 3→3) feeds a 2x4 decoder that sets the crossbar configuration between Input 0 through Input 3 and Output 0 through Output 3. The state is preserved when no switch allocation occurred in the previous cycle.
• Each router has switch usage counters.
  – Each counter is incremented once per packet, but only for packets from critical threads.
  – Each counter tracks the usage of one input port and one output port of the switch: n² counters for an n × n switch.
  – This trades additional monitoring resources for improved performance.
• The counters are used to update the port state table periodically.
• Each port state table therefore reflects the switch usage pattern of critical threads during the previous time interval (see the sketch below).
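A simplified sketch of the periodic update, assuming a greedy per-input choice (the actual state-selection policy that maximizes bypassing opportunities may differ):

#define PORTS 4

/* n*n usage counters: usage[in][out] counts packets from critical
 * threads that crossed the switch from input 'in' to output 'out'. */
static unsigned usage[PORTS][PORTS];

void count_critical_packet(int in, int out) {
    usage[in][out]++;                  /* one increment per packet */
}

/* At the end of each time interval, preserve for every input port the
 * output port most used by critical threads, then reset the counters. */
void update_port_state_table(int out_for_in[PORTS]) {
    for (int in = 0; in < PORTS; in++) {
        int best = 0;
        for (int out = 1; out < PORTS; out++)
            if (usage[in][out] > usage[in][best])
                best = out;
        out_for_in[in] = best;
        for (int out = 0; out < PORTS; out++)
            usage[in][out] = 0;
    }
}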
• When multiple packets request the same resource, arbitration is necessary.
  – VC arbitration, switch arbitration, and speculative-switch arbitration
• Higher-priority packets win arbitration over lower-priority packets.
  – A packet's priority equals the criticality level of its thread.
• Aging provides starvation freedom (see the sketch below).
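A minimal sketch of one such arbiter; the additive aging rule is an illustrative assumption, while the priority-equals-criticality and aging ideas come from the slide:

struct request {
    int priority;   /* criticality level of the requesting packet's thread */
    int age;        /* cycles spent waiting at this arbiter */
};

/* Grant the request with the highest effective priority. Aging raises a
 * waiting packet's effective priority, so low-criticality packets cannot
 * starve behind a stream of critical ones. */
int arbitrate(const struct request *req, int n) {
    int winner = -1, best = -1;
    for (int i = 0; i < n; i++) {
        int eff = req[i].priority + req[i].age;
        if (eff > best) {
            best = eff;
            winner = i;
        }
    }
    return winner;   /* index of the granted request, or -1 if n == 0 */
}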
• 64-core system modeled with SIMICS
  – 8x8 mesh network
  – 2-stage router pipeline + 1-cycle link
    • 3-cycle hop latency (no bypass)
    • 1-cycle hop latency (bypass)
  – 6-cycle bank access for a 16MB L2 cache
• PARSEC benchmarks
• Thread criticality predictor based on accumulated L1 miss penalty
  – Switch usage counters are updated only for the top four critical threads.
• Each thread can have different performance due to different memory behavior.
• Accelerating the slowest (critical) threads reduces the execution time of parallel applications.
• The on-chip network is designed to support thread criticality through bypass flow control and priority-based arbitration techniques.