Hardware Transactional Memory on Haswell EP
Viktor Leis
Technische Universität München
Introduction
◮ Intel’s new mid-level server platform: Haswell EP
◮ up to 18 cores per socket (up to 72 hardware threads with 2 sockets)
◮ supports hardware transactional memory (TSX)
Experimental Setup
◮ global fallback lock
  ◮ built-in Hardware Lock Elision (HLE)
  ◮ lock elision implemented using RTM, restarts and re-speculation
◮ workload
  ◮ Adaptive Radix Tree (trie, fanout 2-256), designed for main-memory database systems
  ◮ random lookups in tree with 64M entries
  ◮ 64M random inserts into (initially empty) tree
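The RTM-based lock elision named above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: `SpinLock`, `cpu_has_rtm`, `run_elided` and `parallel_count` are hypothetical names, "re-speculation" is interpreted here as waiting for the fallback lock to become free and then speculating again instead of acquiring it right away, and the default of 20 restarts is taken from the conclusions slide. Only `_xbegin`, `_xend`, `_xabort` and `_XBEGIN_STARTED` are real intrinsics (GCC/Clang on x86-64 assumed); on other platforms, or on CPUs without RTM, the sketch simply takes the lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <immintrin.h>
#define HAVE_RTM_INTRINSICS 1
#define RTM_TARGET __attribute__((target("rtm")))  // enable RTM for one function
#else
#define HAVE_RTM_INTRINSICS 0
#define RTM_TARGET
#endif

// Hypothetical global fallback lock: a test-and-test-and-set spin lock.
struct SpinLock {
  std::atomic<bool> locked{false};
  bool is_locked() const { return locked.load(std::memory_order_acquire); }
  void lock() {
    while (locked.exchange(true, std::memory_order_acquire)) {
      while (is_locked()) std::this_thread::yield();
    }
  }
  void unlock() { locked.store(false, std::memory_order_release); }
};

// True if CPUID leaf 7 (subleaf 0) reports RTM in EBX bit 11.
inline bool cpu_has_rtm() {
#if HAVE_RTM_INTRINSICS
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
  return ((ebx >> 11) & 1u) != 0;
#else
  return false;
#endif
}

static const bool kHasRtm = cpu_has_rtm();  // checked once at startup

// Runs critical_section speculatively, with up to max_restarts restarts,
// then falls back to actually acquiring the global lock.
template <class F>
RTM_TARGET void run_elided(SpinLock& lock, F&& critical_section,
                           int max_restarts = 20, bool use_rtm = kHasRtm) {
  assert(max_restarts >= 0);
#if HAVE_RTM_INTRINSICS
  if (use_rtm) {
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
      // Re-speculation (assumed meaning): wait until the lock is free,
      // then try another transaction instead of taking the lock.
      while (lock.is_locked()) std::this_thread::yield();
      if (_xbegin() == _XBEGIN_STARTED) {
        // Reading the lock puts it in the read set, so a later fallback
        // acquisition by another thread aborts this transaction.
        if (lock.is_locked()) _xabort(0xff);
        critical_section();
        _xend();
        return;
      }
      // Aborted: hardware rolled back all writes; loop and restart.
    }
  }
#else
  (void)use_rtm;
#endif
  lock.lock();  // fallback path: serialize on the global lock
  critical_section();
  lock.unlock();
}

// Hypothetical demo: several threads increment one shared counter.
inline std::uint64_t parallel_count(int threads, int iters_per_thread) {
  SpinLock lock;
  std::uint64_t counter = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&] {
      for (int i = 0; i < iters_per_thread; ++i)
        run_elided(lock, [&] { ++counter; });
    });
  }
  for (auto& th : pool) th.join();
  return counter;
}
```

Calling `parallel_count(4, 100000)` from a `main` should return 400000 whether or not the CPU supports RTM, since every aborted attempt is either restarted or serialized on the fallback lock. Varying `max_restarts` corresponds to the restart counts compared on the "Lookups with HTM" slide.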
Intel Xeon E5-2697 v3
◮ 14 cores (28 threads), 2.6GHz-3.6GHz, 35MB LLC
◮ 2 sockets
[Figure: die layout with cores 0-13, each paired with an L3 slice, arranged on two rings joined by an internal link; two memory controllers; QPI interconnect to the other socket]
Lookups with Locking
[Figure: lookup throughput (M ops/s, axis 0-75) vs. threads (1, 14, 28, 42, 56); series: no sync, atomic, rw_spin_lock]
Lookups with HTM
[Figure: lookup throughput (M ops/s, axis 0-75) vs. threads (1, 14, 28, 42, 56); series: no sync, 7 or more restarts, 3 restarts, 2 restarts, 1 restart, 0 restarts, built-in HLE]
Random Inserts with HTM
[Figure: insert throughput (M ops/s, axis 0-120) vs. threads (1, 14, 28, 42, 56); series with annotated speedups, top to bottom: pre-allocate + memset (26.0x), pre-allocate (16.1x), tcmalloc (12.2x), malloc (0.8x)]
HTM and NUMA
◮ lookup:
               1 thread   7 threads   speedup
  1 cluster       9.2        53.0       5.8×
  1 socket        5.4        36.0       6.7×
  2 sockets       3.6        24.5       6.8×
◮ insert:
               1 thread   7 threads   speedup
  1 cluster       5.3        30.6       5.8×
  1 socket        4.3        26.8       6.2×
  2 sockets       3.0        20.2       6.7×
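The speedup column is the 7-thread value divided by the 1-thread value, rounded to one decimal. The short sketch below recomputes it from the numbers on this slide; the type and function names (`Row`, `speedup`, `round1`, `print_speedups`) are made up for illustration, and the slide does not state the unit of the throughput values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One table row: placement of the threads and the two measured values.
struct Row {
  const char* scope;
  double one_thread;
  double seven_threads;
};

// Speedup of 7 threads over 1 thread for the same placement.
inline double speedup(const Row& r) { return r.seven_threads / r.one_thread; }

// Round to one decimal place, as on the slide.
inline double round1(double x) { return std::round(x * 10.0) / 10.0; }

// Values copied from the slide.
static const Row kLookup[] = {
    {"1 cluster", 9.2, 53.0}, {"1 socket", 5.4, 36.0}, {"2 sockets", 3.6, 24.5}};
static const Row kInsert[] = {
    {"1 cluster", 5.3, 30.6}, {"1 socket", 4.3, 26.8}, {"2 sockets", 3.0, 20.2}};

// Prints one table with the recomputed speedup column.
inline void print_speedups(const char* title, const Row* rows, int n) {
  std::printf("%s\n", title);
  for (int i = 0; i < n; ++i) {
    std::printf("  %-10s %5.1f %6.1f  %.1fx\n", rows[i].scope,
                rows[i].one_thread, rows[i].seven_threads,
                round1(speedup(rows[i])));
  }
}
```

Calling `print_speedups("lookup", kLookup, 3)` and `print_speedups("insert", kInsert, 3)` reproduces the speedup column. Spreading the threads across more of the machine lowers absolute throughput in both columns while the relative speedup rises slightly.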
Conclusions
◮ Intel’s HTM implementation can scale to NUMA systems with many cores
◮ pitfalls at higher thread counts:
  ◮ built-in HLE does not scale
  ◮ lock elision with 20 restarts and re-speculation should be used instead
  ◮ even infrequent kernel traps or system calls can be a problem at higher thread counts (Amdahl’s Law)
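To make the Amdahl's Law remark concrete, here is a small worked example. The serial fractions s = 1% and s = 0.1% are illustrative assumptions, not measurements from the talk; n = 56 matches the highest thread count on the plots.

```latex
S(n) = \frac{1}{s + \frac{1 - s}{n}}

s = 0.01,\; n = 56:\quad S = \frac{1}{0.01 + \frac{0.99}{56}} \approx 36.1

s = 0.001,\; n = 56:\quad S = \frac{1}{0.001 + \frac{0.999}{56}} \approx 53.1
```

Under these assumptions, work that is serialized only 1% of the time (for example by kernel traps or system calls) already caps the speedup at about 36 on 56 threads.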