

1. Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
   Department of Computer Science, University of Virginia
   ISPASS 2011

2. Motivation
   - The number of cores doubles every 18 months
   - Expected: performance ∝ number of cores
   - One of the bottlenecks is shared-resource contention
   - For multi-threaded workloads, contention is unavoidable
   - To reduce contention, it is necessary to understand where and how the contention is created

3. Shared-Resource Contention in Chip Multiprocessors
   [Diagram: Intel Quad Core Q9550. Threads from Application 1 and Application 2 run on cores C0-C3; each core has a private L1 cache, each pair of cores shares an L2 cache, and both L2s reach memory over the front-side bus.]

4. Scenario 1: Multi-threaded applications, with co-runner
   [Diagram: threads of Application 1 and Application 2 run together on cores C0-C3, sharing the L1/L2 caches and memory of one chip.]

5. Scenario 2: Multi-threaded applications, without co-runner
   [Diagram: a single application's threads run alone on cores C0-C3.]

6. Shared-Resource Contention
   - Intra-application contention: contention among threads from the same application (no co-runners)
   - Inter-application contention: contention among threads from the co-running application

7. Contributions
   - A general methodology to evaluate a multi-threaded application's performance:
     - Intra-application contention
     - Inter-application contention
     - Contention in the memory-hierarchy shared resources
   - Characterizing applications facilitates better understanding of the application's resource sensitivity
   - Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

8. Outline
   - Motivation
   - Contributions
   - Methodology
   - Measuring intra-application contention
   - Measuring inter-application contention
   - Related Work
   - Summary

9. Methodology
   - Designed to measure both intra- and inter-application contention for a targeted shared resource:
     - L1-cache, L2-cache
     - Front Side Bus (FSB)
   - Each application is run in two configurations:
     - Baseline: threads do not share the targeted resource
     - Contention: threads share the targeted resource
     - Requires multiple instances of the targeted resource
   - Determine contention by comparing performance across the two runs (gathered from hardware performance counters' values); a driver sketch follows below
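A minimal sketch of how such a pair of runs could be driven on Linux, assuming `taskset` for core pinning and `perf stat` for the hardware counters; the slides do not name specific tools, and the core IDs and application command here are illustrative:

```python
import subprocess

# Cores for each configuration. These IDs assume the quad-core Yorkfield
# layout described later in the deck (C0/C1 share one L2, C2/C3 share the
# other); adjust them for the machine and the targeted resource at hand.
CONFIGS = {
    "baseline":   "0,2",  # threads on different L2s: targeted resource not shared
    "contention": "0,1",  # threads on the same L2: targeted resource shared
}

def measure(app_cmd, cores):
    """Run app_cmd pinned to `cores` under perf stat; return the counter report."""
    cmd = ["perf", "stat", "-e", "cycles,instructions,LLC-load-misses",
           "taskset", "-c", cores, *app_cmd]
    # perf stat prints its counter report on stderr.
    return subprocess.run(cmd, capture_output=True, text=True).stderr

for name, cores in CONFIGS.items():
    print(f"=== {name} ===")
    print(measure(["./app", "2"], cores))  # "./app 2": hypothetical two-thread run
```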

10. Outline
   - Motivation
   - Contributions
   - Methodology
   - Measuring intra-application contention (see paper)
   - Measuring inter-application contention
   - Related Work
   - Summary

11. Measuring inter-application contention: L1-cache
   [Diagram: baseline and contention configurations on the quad-core platform. In the baseline configuration, the two applications' threads are placed so they never share an L1 cache; in the contention configuration, they are placed so they do.]

12. Measuring inter-application contention: L2-cache
   [Diagram: baseline configuration, with each application's threads on a core pair with its own L2 cache, next to the contention configuration, with each L2 cache shared between the two applications.]

13. Measuring inter-application contention: FSB
   [Diagram: baseline configuration on the eight-core platform. Application 1's threads run on C0, C2, C4, C6 and Application 2's on C1, C3, C5, C7, so each application uses a separate front-side bus.]

14. Measuring inter-application contention: FSB
   [Diagram: contention configuration on the eight-core platform; the two applications' threads are placed so they share the front-side buses. Core placements for these configurations are sketched below.]
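Each of these configuration figures boils down to a choice of core sets for the two applications. One plausible reading of the diagrams, expressed as data; the assignments, especially the FSB contention one, are our assumption rather than something the slides state explicitly:

```python
# Core placements implied by the configuration diagrams (illustrative only).
# Yorkfield (Q9550): C0/C1 share one L2, C2/C3 share the other.
# Harpertown: C0,C2,C4,C6 sit behind one FSB; C1,C3,C5,C7 behind the other.
PLACEMENTS = {
    "L2-cache": {  # Yorkfield
        "baseline":   {"app1": [0, 1], "app2": [2, 3]},  # one L2 per application
        "contention": {"app1": [0, 2], "app2": [1, 3]},  # each L2 shared by both apps
    },
    "FSB": {  # Harpertown
        "baseline":   {"app1": [0, 2, 4, 6], "app2": [1, 3, 5, 7]},  # one FSB each
        "contention": {"app1": [0, 2, 1, 3], "app2": [4, 6, 5, 7]},  # both FSBs shared
    },
}
```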

15. PARSEC Benchmarks
   Application Domain | Benchmark(s)
   Financial Analysis | Blackscholes (BS), Swaptions (SW)
   Computer Vision    | Bodytrack (BT)
   Engineering        | Canneal (CN)
   Enterprise Storage | Dedup (DD)
   Animation          | Facesim (FA), Fluidanimate (FL)
   Similarity Search  | Ferret (FE)
   Rendering          | Raytrace (RT)
   Data Mining        | Streamcluster (SC)
   Media Processing   | Vips (VP), X264 (X2)

16. Experimental platform
   - Platform 1: Yorkfield
     - Intel Quad Core Q9550
     - 32 KB L1-D and L1-I caches
     - 6 MB L2-cache (one per pair of cores)
     - 2 GB memory
     - Common FSB interface
   [Diagram: cores C0-C3, each with a private L1 cache and hardware prefetcher (HW-PF); two L2 caches with prefetchers, one per core pair; a common FSB interface to the memory controller hub (Northbridge) and memory.]

17. Experimental platform
   - Platform 2: Harpertown
   [Diagram: eight cores (C0, C2, C4, C6 and C1, C3, C5, C7), each with a private L1 cache and hardware prefetcher (HW-PF); four shared L2 caches with prefetchers; two FSB interfaces to the memory controller hub (Northbridge) and memory. The sketch below shows one way to read such a topology from the OS.]
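On a Linux host, the cache-sharing topology these diagrams depict can be verified from sysfs; a generic sketch using the kernel's standard cache entries (the printing and grouping are ours):

```python
from pathlib import Path

# For every CPU, list which CPUs share each of its cache levels, using the
# kernel's /sys/devices/system/cpu/cpuN/cache/indexM entries.
for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    for idx in sorted((cpu / "cache").glob("index[0-9]*")):
        level = (idx / "level").read_text().strip()
        ctype = (idx / "type").read_text().strip()    # Data / Instruction / Unified
        shared = (idx / "shared_cpu_list").read_text().strip()
        print(f"{cpu.name}: L{level} {ctype}: shared with CPUs {shared}")
```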

18. Performance Analysis
   - Inter-application contention, for the i-th co-runner:
     PercentPerformanceDifference_i = (PerformanceBase_i - PerformanceContend_i) * 100 / PerformanceBase_i
   - Absolute performance difference sum:
     APDS = Σ_i | PercentPerformanceDifference_i |
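The two metrics are straightforward to compute once baseline and contention performance numbers are collected; a minimal sketch (function names and the example values are ours, not from the slides):

```python
def percent_performance_difference(base, contend):
    """Performance change of a co-runner relative to its baseline run, in percent."""
    return (base - contend) * 100.0 / base

def apds(pairs):
    """Absolute Performance Difference Sum over all (base, contend) co-runner pairs."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in pairs)

# Hypothetical performance numbers for three co-runners.
print(apds([(10.0, 11.2), (8.5, 8.4), (12.0, 13.5)]))  # ≈ 25.68
```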

19. Inter-application contention: L1-cache, for Streamcluster
   [Chart: Streamcluster's percent performance difference (y-axis, roughly -8% to 8%) when co-running with each other benchmark: Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264.]

20. Inter-application L1-cache contention
   [Chart: Streamcluster inter-application L1-cache contention; percent performance difference (roughly -8% to 8%) with co-running benchmarks Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264 on the x-axis.]

21. Inter-application contention: L1-cache
   [Chart: L1-cache inter-application contention results across all benchmark pairs.]

22. Inter-application contention: L2-cache
   [Chart: L2-cache inter-application contention results across all benchmark pairs.]

23. Inter-application contention: FSB
   [Chart: FSB inter-application contention results across all benchmark pairs.]

24. Characterization
   Benchmark     | L1-cache | L2-cache     | FSB
   Blackscholes  | none     | none         | none
   Bodytrack     | inter    | inter        | intra
   Canneal       | intra    | inter        | intra
   Dedup         | inter    | intra, inter | intra, inter
   Facesim       | inter    | inter        | intra
   Ferret        | intra    | intra, inter | intra
   Fluidanimate  | inter    | inter        | intra
   Raytrace      | none     | none         | intra
   Streamcluster | inter    | inter        | intra
   Swaptions     | none     | none         | none
   Vips          | intra    | inter        | inter
   X264          | inter    | intra, inter | intra

25. Summary
   - The methodology generalizes contention analysis of multi-threaded applications
   - New approach to characterizing applications
   - Useful for performance analysis of existing and future architectures or benchmarks
   - Helpful for creating new workloads with diverse properties
   - Provides insights for designing improved contention-aware scheduling methods

26. Related Work
   - Cache contention:
     - Knauerhase et al., IEEE Micro 2008
     - Zhuravlev et al., ASPLOS 2010
     - Xie et al., CMP-MSI 2008
     - Mars et al., HiPEAC 2011
   - Characterizing parallel workloads:
     - Jin et al., NASA Technical Report 2009
   - PARSEC benchmark suite:
     - Bienia et al., PACT 2008
     - Bhadauria et al., IISWC 2009

27. Thank you!
