Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011
Motivation
- The number of cores doubles every 18 months
- Expected: performance ∝ number of cores
- One of the bottlenecks is shared-resource contention
- For multi-threaded workloads, contention is unavoidable
- To reduce contention, it is necessary to understand where and how the contention is created
Shared Resource Contention in Chip-Multiprocessors
[Diagram: Intel Quad Core Q9550 — threads from Application 1 and Application 2 on cores C0–C3, each core with a private L1 cache, two L2 caches each shared by a pair of cores, a front-side bus, and memory]
Scenario 1: Multi-threaded applications with a co-runner
[Diagram: threads of Application 1 and Application 2 together occupy cores C0–C3, sharing the L1 and L2 caches and memory]
Scenario 2: Multi-threaded applications without a co-runner
[Diagram: a single application's threads run alone on cores C0–C3 with the L1 and L2 caches and memory]
Shared-Resource Contention
- Intra-application contention: contention among threads from the same application (no co-runners)
- Inter-application contention: contention among threads from a co-running application
Contributions
- A general methodology to evaluate a multi-threaded application's performance under intra-application and inter-application contention for shared resources in the memory hierarchy
- Characterizing applications facilitates better understanding of an application's resource sensitivity
- Thorough performance analysis and characterization of the multi-threaded PARSEC benchmarks
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention
- Measuring inter-application contention
- Related Work
- Summary
Methodology
- Designed to measure both intra- and inter-application contention for a targeted shared resource: L1-cache, L2-cache, Front Side Bus (FSB)
- Each application is run in two configurations:
  - Baseline: threads do not share the targeted resource (multiple instances of the targeted resource are used)
  - Contention: threads share the targeted resource
- Determine contention by comparing performance across the two configurations (gathering hardware performance counters' values)
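The baseline/contention split above is purely a matter of thread placement. A minimal sketch (an assumption for illustration, not taken from the slides) of how the two L2-cache configurations for two co-running applications could be set up on Linux, using a hypothetical Q9550-like topology and the standard `taskset` utility:

```python
import subprocess

# Assumed core topology of an Intel Q9550-like part:
# cores (0, 1) share one L2 cache, cores (2, 3) share the other.
L2_DOMAINS = [(0, 1), (2, 3)]

def core_sets(config):
    """Return the core sets for two co-running applications.

    'baseline'  : each application gets its own L2 domain (no sharing)
    'contention': the applications interleave across both L2 domains,
                  so their threads share each L2 cache.
    """
    if config == "baseline":
        return set(L2_DOMAINS[0]), set(L2_DOMAINS[1])
    if config == "contention":
        return ({L2_DOMAINS[0][0], L2_DOMAINS[1][0]},
                {L2_DOMAINS[0][1], L2_DOMAINS[1][1]})
    raise ValueError(config)

def launch_pinned(cmd, cores):
    # taskset pins the process (and all threads it spawns) to the cores.
    mask = ",".join(str(c) for c in sorted(cores))
    return subprocess.Popen(["taskset", "-c", mask] + cmd)
```

With this sketch, running both applications under each configuration and comparing counter readings (e.g. via `perf stat`) yields the baseline-vs-contention comparison the slide describes.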
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention (see paper)
- Measuring inter-application contention
- Related Work
- Summary
Measuring inter-application contention: L1-cache
[Diagram: baseline configuration — the two applications' threads are placed on cores C0–C3 so they do not share L1 caches; contention configuration — threads from both applications share the L1 caches]
Measuring inter-application contention: L2-cache
[Diagram: baseline configuration — each application's threads stay within their own L2 cache; contention configuration — threads from both applications share each L2 cache]
Measuring inter-application contention: FSB
[Diagram: baseline configuration — the two applications' threads are placed on cores C0–C7 so that each application uses its own FSB]
Measuring inter-application contention: FSB
[Diagram: contention configuration — the two applications' threads are placed on cores C0–C7 so that both applications share the FSBs]
PARSEC Benchmarks

Application Domain | Benchmark(s)
Financial Analysis | Blackscholes (BS), Swaptions (SW)
Computer Vision    | Bodytrack (BT)
Engineering        | Canneal (CN)
Enterprise Storage | Dedup (DD)
Animation          | Facesim (FA), Fluidanimate (FL)
Similarity Search  | Ferret (FE)
Rendering          | Raytrace (RT)
Data Mining        | Streamcluster (SC)
Media Processing   | Vips (VP), X264 (X2)
Experimental platform — Platform 1: Yorkfield
- Intel Quad Core Q9550
- 32 KB L1-D and L1-I cache per core
- 6 MB L2 cache shared by each pair of cores
- 2 GB memory
[Diagram: cores C0–C3 with private L1 caches and hardware prefetchers (HW-PF), two shared L2 caches with HW-PF, a common FSB interface, and the Memory Controller Hub (Northbridge) to memory]
Experimental platform — Platform 2: Harpertown
- Eight cores (C0–C7)
[Diagram: per-core L1 caches with hardware prefetchers (HW-PF), four L2 caches each shared by a pair of cores, two FSB interfaces and two FSBs to the Memory Controller Hub (Northbridge) and memory]
Performance Analysis
- Inter-application contention: for the i-th co-runner,
  PercentPerformanceDifference_i = (PerformanceBase_i − PerformanceContend_i) × 100 / PerformanceBase_i
- Absolute performance difference sum:
  APDS = Σ_i | PercentPerformanceDifference_i |
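As a sketch, the two metrics above compute as follows; the function names mirror the slide, and the sample numbers are invented for illustration:

```python
def percent_performance_difference(base, contend):
    """Per-co-runner performance change, as a percentage of baseline.

    Positive => slowdown under contention; negative => speedup.
    """
    return (base - contend) * 100.0 / base

def apds(pairs):
    """Absolute performance difference sum over all co-runners."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in pairs)

# Example: three co-runners with (baseline, contention) performance.
runs = [(100.0, 95.0), (200.0, 210.0), (50.0, 50.0)]
print(apds(runs))  # 5% slowdown + 5% speedup + 0% => 10.0
```

Taking the absolute value means APDS counts speedups and slowdowns alike, so it measures overall sensitivity to contention rather than net gain or loss.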
Inter-application contention: L1-cache — for Streamcluster
[Chart: percent performance difference (roughly −8% to +8%) of each co-running benchmark — Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264 — when sharing the L1-cache with Streamcluster]
Inter-application L1-cache contention — Streamcluster
[Chart: Streamcluster's percent performance difference (roughly −8% to +8%) with each co-running benchmark, including Streamcluster itself]
Inter-application contention: L1-cache
[Results chart]
Inter-application contention: L2-cache
[Results chart]
Inter-application contention: FSB
[Results chart]
Characterization

Benchmark     | L1-cache | L2-cache     | FSB
Blackscholes  | none     | none         | none
Bodytrack     | inter    | inter        | intra
Canneal       | intra    | inter        | intra
Dedup         | inter    | intra, inter | intra, inter
Facesim       | inter    | inter        | intra
Ferret        | intra    | intra, inter | intra
Fluidanimate  | inter    | inter        | intra
Raytrace      | none     | none         | intra
Streamcluster | inter    | inter        | intra
Swaptions     | none     | none         | none
Vips          | intra    | inter        | inter
X264          | inter    | intra, inter | intra
Summary
- The methodology generalizes contention analysis of multi-threaded applications
- A new approach to characterize applications
- Useful for performance analysis of existing and future architectures and benchmarks
- Helpful for creating new workloads with diverse properties
- Provides insights for designing improved contention-aware scheduling methods
Related Work
- Cache contention
  - Knauerhase et al., IEEE Micro 2008
  - Zhuravlev et al., ASPLOS 2010
  - Xie et al., CMP-MSI 2008
  - Mars et al., HiPEAC 2011
- Characterizing parallel workloads
  - Jin et al., NASA Technical Report 2009
- PARSEC benchmark suite
  - Bienia et al., PACT 2008
  - Bhadauria et al., IISWC 2009
Thank you!