
The PAPI libpfm4 Transition and unrelated Software Prefetch



  1. The PAPI libpfm4 Transition and unrelated Software Prefetch Research. Vince Weaver, ICL Lunch Talk, 11 February 2011

  2. Part I: The PAPI libpfm4 Transition

  3. Layers of Abstraction: PAPI_TOT_INS → PAPI → INSTRUCTION_RETIRED → Event Name Translator → 0x5300c0 → Operating System → Hardware
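The raw value at the bottom of that stack can be unpacked into its fields. The following is a minimal sketch, assuming 0x5300c0 follows the x86 architectural event-select register layout (event select in bits 7:0, unit mask in bits 15:8, then the USR, OS, INT, and EN control bits); the slide itself does not state the encoding, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 event-select value. Assumption: the architectural
 * IA32_PERFEVTSELx layout; the names here are illustrative only. */
struct evtsel {
    uint32_t event;   /* bits 7:0   event select            */
    uint32_t umask;   /* bits 15:8  unit mask               */
    uint32_t usr;     /* bit 16     count in user mode      */
    uint32_t os;      /* bit 17     count in kernel mode    */
    uint32_t intr;    /* bit 20     interrupt on overflow   */
    uint32_t enable;  /* bit 22     counter enabled         */
};

/* Split a raw event-select value into its fields. */
static struct evtsel decode_evtsel(uint32_t raw)
{
    struct evtsel e;
    e.event  = raw & 0xffu;
    e.umask  = (raw >> 8) & 0xffu;
    e.usr    = (raw >> 16) & 1u;
    e.os     = (raw >> 17) & 1u;
    e.intr   = (raw >> 20) & 1u;
    e.enable = (raw >> 22) & 1u;
    return e;
}

/* Pack the fields back into a raw value (inverse of decode_evtsel). */
static uint32_t encode_evtsel(struct evtsel e)
{
    assert(e.event <= 0xffu && e.umask <= 0xffu);
    return e.event | (e.umask << 8) |
           ((e.usr & 1u) << 16) | ((e.os & 1u) << 17) |
           ((e.intr & 1u) << 20) | ((e.enable & 1u) << 22);
}
```

Under that assumption, 0x5300c0 decodes to event select 0xc0 with unit mask 0x00, counting in both user and kernel mode with the counter enabled and overflow interrupts on.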

  4. libpfm3
     • Used by PAPI since version 3.0 for Linux: perfctr, perfmon2, perf events
     • No longer supported
     • No support for newer chips
     • Not really designed for perf events

  5. libpfm4
     • Still under development
     • Supports newest processors
     • Designed for perf events
     • Just incompatible enough with libpfm3 to be annoying

  6. Features not supported by Linux 2.6.38 that PAPI/libpfm4 will support once the kernel does
     • AMD Lightweight Profiling (LWP)
     • Intel HW Cycle-count Register
     • Uncore Events (Intel, AMD 15h, Power)
     • Nehalem Offcore Response
     • Sampling Interfaces (IBS / PEBS)
     • Newer Processors (Sandy Bridge, Bulldozer)

  7. Can Current PAPI Handle All of These New Features?

  8. Original Event Layout. 32-bit PAPI event (bits 31 to 0); field: PAPI_PRESET_MASK. Example: PAPI_L1_TCM

  9. PAPI 3.0 (2004). 32-bit PAPI event (bits 31 to 0); fields: PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Example: LAST_LEVEL_CACHE_REFERENCES

  10. PAPI 3.5 (2006). PAPI event with bit boundaries at 31, 15, 0; fields: UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Example: L2_RQSTS:SELF_DEMAND_MESI

  11. PAPI 3.6 (2008). PAPI event with bit boundaries at 31, 15, 11, 7, 0; fields: UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Example: BRANCH_RETIRED:MMNP:MMNM:MMTP:MMTM

  12. PAPI 4.0 (2010). PAPI event with bit boundaries at 31, 25, 15, 11, 7, 0; fields: COMPONENT, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Example: LM_SENSORS.applesmc-isa-0300.temp10.temp10_input

  13. PAPI 5.0??? PAPI event with bit boundaries at 31, 25, 15, 11, 7, 0; fields: COMPONENT, PMU, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Examples:
     • nhm::BR_INST_RETIRED:ALL_BRANCHES
     • nhm_unc::UNC_DRAM_PAGE_MISS
     • ix86arch::UNHALTED_CORE_CYCLES
     • perf::ext4:ext4_discard_blocks

  14. Not Enough Bits! PAPI event with bit boundaries at 31, 25, 15, 11, 7, 0; fields: COMPONENT, PMU, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK. Examples that do not fit:
     • PM_DC_PMC_9:lpid_mask=0xff:lpid=0x22:pid_mask=0x3fff:pid=0x1b2d:marking
     • nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO:REMOTE_DRAM:LOCAL_DRAM

  15. Move to libpfm4 and String-based Events
     • Have a dynamically updated table containing the event names in use as full strings
     • A 32-bit PAPI native event is assigned to each string, allowing backward compatibility with the current PAPI interface
     • Must make sure that event name lookup is not on the critical path, to avoid performance regressions
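The string-table idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not PAPI's actual implementation: the `event_table` type and both functions are hypothetical, and `NATIVE_MASK` is assumed to mirror PAPI_NATIVE_MASK as a flag bit in the 32-bit code. The name-to-code path does a linear string search and is meant to run only at setup; the code-to-name path is a plain array index, so nothing on the measurement path compares strings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: a flag bit marking native events, mirroring PAPI_NATIVE_MASK. */
#define NATIVE_MASK 0x40000000u

/* Hypothetical growable table of full event-name strings. */
struct event_table {
    char **names;
    size_t count;
    size_t capacity;
};

/* Return the 32-bit code for `name`, adding the string if it is new.
 * Slow path (linear search): call at setup time, not while counting.
 * Returns 0 if memory allocation fails. */
static uint32_t event_table_intern(struct event_table *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return NATIVE_MASK | (uint32_t)i;

    if (t->count == t->capacity) {
        size_t cap = t->capacity ? t->capacity * 2 : 16;
        char **grown = realloc(t->names, cap * sizeof *grown);
        if (grown == NULL)
            return 0;
        t->names = grown;
        t->capacity = cap;
    }

    size_t len = strlen(name) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return 0;
    memcpy(copy, name, len);
    t->names[t->count] = copy;
    return NATIVE_MASK | (uint32_t)t->count++;
}

/* Fast path: map a code back to its string with one array index.
 * Returns NULL for a code that was never assigned. */
static const char *event_table_name(const struct event_table *t, uint32_t code)
{
    assert(code & NATIVE_MASK);
    size_t idx = code & ~NATIVE_MASK;
    return idx < t->count ? t->names[idx] : NULL;
}
```

Because callers hold only the 32-bit code after the first lookup, an interface that already passes integer event codes would not need to change.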

  16. Part II: Investigating Prefetching Using Hardware Performance Counters

  17. Quick Look at Core2 HW Prefetch
     • Instruction prefetcher
     • L1 Data Cache Unit (DCU) Prefetcher (streaming). Ascending data accesses prefetch the next line
     • L1 Instruction Pointer Strided Prefetcher. Looks for strided access from particular load instructions, forward or backward, up to 2k apart
     • L2 Data Prefetch Logic. Fetches to L2 based on the L1 DCU

  18. x86 SW Prefetch Instructions (AMD)
     • PREFETCHNTA – SSE1, non-temporal (use once)
     • PREFETCHT0 – SSE1, prefetch to all levels
     • PREFETCHT1 – SSE1, prefetch to L2 + higher
     • PREFETCHT2 – SSE1, prefetch to L3 + higher
     • PREFETCH – AMD 3DNow!, prefetch to L1
     • PREFETCHW – AMD 3DNow!, prefetch for write
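From C, these instructions are usually reached through a compiler builtin rather than written by hand. The sketch below uses GCC's `__builtin_prefetch(addr, rw, locality)`, which clang also accepts; on x86, GCC typically lowers locality 3 to PREFETCHT0 and locality 0 to PREFETCHNTA, though the exact instruction depends on the compiler and target. The function name and the `PREFETCH_AHEAD` distance are assumptions chosen for illustration, not values from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: how many elements ahead to prefetch; a real value would
 * need tuning per machine and per loop. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting the hardware about upcoming reads.
 * A prefetch is only a hint: the result is the same with or without it. */
static long sum_with_prefetch(const long *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels) */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

A simple ascending walk like this is also the pattern a streaming hardware prefetcher detects by itself, so an explicit hint here may add little or nothing.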

  19. Investigating Adding a PAPI_PRF_SW Predefined Event
     • Can multiple machines count SW prefetches?
     • Does the behavior of the events match expectations?
     • Will people use the preset?

  20. Core2
     • SSE_PRE_EXEC:NTA – counts NTA
     • SSE_PRE_EXEC:L1 – counts T0 (also incremented by fxsave, +2, and fxrstor, +5)
     • SSE_PRE_EXEC:L2 – counts T1/T2
     • Problem: only 2 counters available on Core2

  21. AMD (Istanbul and Later)
     • PREFETCH_INSTRUCTIONS_DISPATCHED:NTA
     • PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD
     • PREFETCH_INSTRUCTIONS_DISPATCHED:STORE
     • These events appear to be speculative, and won’t count SW prefetches that conflict with HW prefetches

  22. Atom
     • PREFETCH:PREFETCHNTA
     • PREFETCH:PREFETCHT0
     • PREFETCH:SW_L2
     • These events will count SW prefetches, but the numbers counted vary in complex ways

  23. Does anyone use SW Prefetch?
     • gcc disables SW prefetch by default unless you specify -fprefetch-loop-arrays
     • icc disables it unless you specify -xsse4.2 -opt-prefetch=4
     • glibc has hand-coded SW prefetch in memcpy()
     • Prefetch can hurt behavior:
       – Can throw out good cache lines
       – Can bring lines in too soon
       – Can interfere with the HW prefetcher

  24. SW Prefetch Distribution: SPEC CPU 2000, Core2, gcc -fprefetch-loop-arrays. [Figure: two "Load Distribution" bar-chart panels; y-axis: Load Instructions (billions); series: Loads, T0, T1/T2, NTA; x-axis: SPEC CPU 2000 benchmarks; three benchmarks in the first panel are marked N/A.]

  25. Normalized SW Prefetch Runtime on Core2 (Smaller is Better). [Figure: two bar-chart panels, "Integer SPEC CPU 2000 Normalized Runtime when SW Prefetch Enabled with -fprefetch-loop-arrays" and "FP SPEC CPU 2000 Normalized Runtime when SW Prefetch Enabled with -fprefetch-loop-arrays"; y-axis: Normalized Runtime; x-axis: benchmarks; three integer benchmarks are marked N/A.]

  26. The HW Prefetcher on Core2 can be Disabled

  27. Runtime with HW Prefetcher Disabled, Normalized against Runtime with HW Prefetcher Enabled, on Core2 (Smaller is Better). [Figure: two bar-chart panels titled "Normalized Runtime when HW Prefetch Disabled"; series: plain, w/ SW Prefetch; y-axis: Normalized Runtime; x-axis: benchmarks; three benchmarks in the first panel are marked N/A; off-scale bars are labeled 1.82 and 1.84 in the first panel and 2.47, 2.58, 3.82, and 3.66 in the second.]
