
Latr: Lazy Translation Coherence

Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Ján Veselý‡, Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna. Georgia Institute of Technology; ‡ Rutgers University. * Co-first authors.


  1. Latr: Lazy Translation Coherence. Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Ján Veselý‡, Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna. Georgia Institute of Technology; ‡ Rutgers University. * Co-first authors. Presented by Mohan Kumar, March 28, 2018.

  5. Motivation. Large NUMA machines; terabytes of memory; microsecond latency. ⇒ Problem of microsecond latency in system services. ⇒ TLB coherence is a contributor in an important subset.

  7. Impact of TLB coherence on applications. Multi-core MapReduce application: prior research reports a 10x increase in shootdown time with increasing core counts. Web servers (e.g., Apache): prior research and our findings show ≈ 35% of time spent in TLB shootdown. Die-stacked memory: swapping between on-chip and off-chip memory. Disaggregated memory: swapping between local and remote memory. ⇒ Can we mitigate this costly TLB shootdown?

  8. Table of contents: (1) TLB Shootdown Background; (2) Latr: Asynchronous TLB Shootdowns; (3) Evaluation; (4) Conclusion.

  10. Translation lookaside buffer: Introduction. A cache for virtual → physical mappings, kept in per-core structures. Accessed on every load/store. Unlike data caches (L3, etc.), its coherence is managed by the OS. TLB coherence significantly impacts application performance. [Figure: a virtual address is looked up in the TLB; a hit yields the physical address directly, a miss triggers a page table walk (PGD, PUD, PMD, PTE) that yields the physical address.]
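
The hit and miss paths described on this slide can be sketched with a toy model. This is a minimal illustration, not how hardware or Linux implements a TLB: the class `ToyTLB`, the 4 KiB page size, the LRU eviction, and the flat dictionary standing in for the multi-level page table are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class ToyTLB:
    """Tiny per-core translation cache with LRU eviction (illustrative only)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # flat dict standing in for the page table walk
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page number -> physical frame number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        if vpn in self.entries:        # hit: the page table is not consulted
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:                          # miss: walk the page table, then cache the result
            self.misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"page fault at {vaddr:#x}")
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
        return (self.entries[vpn] << PAGE_SHIFT) | offset

    def invalidate(self, vpn):
        """Drop one cached translation, as a local shootdown would."""
        self.entries.pop(vpn, None)


page_table = {0x10: 0x80, 0x11: 0x81}
tlb = ToyTLB(page_table)
first = tlb.translate(0x10123)     # miss, resolved by the page table
second = tlb.translate(0x10456)    # hit on the same page

del page_table[0x10]               # the OS removes the mapping ...
stale = tlb.translate(0x10123)     # ... but the TLB still serves the old translation
tlb.invalidate(0x10)               # only an explicit invalidation removes it
```

The last three lines show why coherence matters: because the OS, not the hardware, keeps TLBs coherent, a removed mapping stays usable on a core until that core's entry is invalidated.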

  12. TLB coherence: Background. Hardware-based approaches (⇒ more hardware complexity): providing cache coherence to TLBs; ISA-level instruction support (ARM); microcode-based approaches. Software-based approaches (⇒ TLB shootdowns still significant): current commodity OS designs use inter-processor interrupts (IPIs); optimizations reduce the number of shootdowns through better tracking; multikernel designs use message passing.

  14. TLB shootdown internals in Linux, step ❶: munmap() is called on core 1 while the application runs on cores 1, 2, and 5; cores 3, 4, 6, 7, and 8 are idle.

  15. TLB shootdown internals in Linux, step ❷: context switch on core 1, followed by the local TLB shootdown.

  16. TLB shootdown internals in Linux, step ❸: core 1 notifies cores 2 and 5 via IPIs and spin-waits, so the application is blocked on core 1. Timeline label: 2.2 µs.

  17. TLB shootdown internals in Linux, step ❹: cores 2 and 5 execute a context switch and the remote TLB shootdown while core 1 keeps spin-waiting.

  18. TLB shootdown internals in Linux, step ❺: cores 2 and 5 respond with an ACK via shared memory.

  19. TLB shootdown internals in Linux, step ❻: control is returned on all cores and the TLB shootdown is complete; munmap() returns. Timeline labels: 2.2 µs and 5.9 µs, with a brace marking the savings potential for an asynchronous approach with Latr.
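
Steps ❶ to ❻ of the synchronous shootdown can be mimicked with threads. The sketch below is a simulation under stated assumptions, not kernel code: `Core`, `sync_munmap`, and the page number are hypothetical, a `queue.Queue` stands in for IPI delivery, and a `threading.Event` stands in for the ACK in shared memory (the kernel spin-waits, while this sketch blocks on the event).

```python
import queue
import threading

PAGE = 0x7F00  # hypothetical virtual page number being unmapped


class Core(threading.Thread):
    """One core with a toy TLB; its queue stands in for IPI delivery."""

    def __init__(self, core_id, cached_pages):
        super().__init__(daemon=True)
        self.core_id = core_id
        self.tlb = set(cached_pages)
        self.ipi_queue = queue.Queue()

    def run(self):
        while True:
            message = self.ipi_queue.get()   # the "interrupt" arrives
            if message is None:              # sentinel: stop the simulated core
                return
            page, acks = message
            self.tlb.discard(page)           # step 4: remote shootdown
            acks[self.core_id].set()         # step 5: ACK via shared memory


def sync_munmap(initiator, remote_cores, page):
    """Run the synchronous protocol and return the steps in order."""
    steps = ["munmap()"]                                     # step 1
    initiator.tlb.discard(page)
    steps.append("local shootdown")                          # step 2
    acks = {core.core_id: threading.Event() for core in remote_cores}
    for core in remote_cores:                                # unicast: one IPI at a time
        core.ipi_queue.put((page, acks))
    steps.append("send IPIs")                                # step 3
    for ack in acks.values():                                # initiator is blocked here
        ack.wait()
    steps.append("IPI ACK")                                  # steps 4 and 5 done remotely
    steps.append("munmap() complete")                        # step 6
    return steps


core1 = Core(1, {PAGE})          # initiator, driven by the main thread
core2 = Core(2, {PAGE})
core5 = Core(5, {PAGE})
for core in (core2, core5):
    core.start()

steps = sync_munmap(core1, [core2, core5], PAGE)

for core in (core2, core5):      # shut the simulated cores down
    core.ipi_queue.put(None)
    core.join()
```

The initiator cannot return until every remote core has acknowledged, which is the wait the slide marks as savings potential.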

  20. Observation. Synchronous TLB shootdown is expensive: up to 6 µs delay with two sockets. Processing IPIs is expensive: an interrupt handler runs on each remote core, and the initiating core waits a long time. IPI send-and-wait delay: IPIs are delivered by unicast, one at a time.

  23. TLB shootdown: A necessary evil. Cost of a simple memory unmap operation (munmap()) for 1 page on 16 cores with 2 sockets: up to 8 µs, ≈ 70% of it from the TLB shootdown alone. It gets more expensive with more sockets. [Figure: munmap() latency in µs (0 to 8) versus number of cores (2 to 16), with series for 1 socket, 2 sockets, and the TLB shootdown share.]

  26. In this talk: Latr (Lazy Translation Coherence). Perform asynchronous TLB shootdown: remove the remote shootdown from the critical path, taking advantage of a change in ABI without affecting applications' correctness. Use shared memory instead of IPIs: eliminate the send-and-wait delay of IPIs. Scope: free operations (in this talk) and migration operations (see our paper). ⇒ But: how to perform an asynchronous shootdown?
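
This part of the deck stops at the question, so the following is only one way to picture "shared memory instead of IPI", not the authors' mechanism as presented here. Everything in it is an assumption made for illustration: the names `LazyCore` and `LazyShootdown`, the idea that each remote core sweeps a shared pending list at some periodic point of its own, and the rule that a page becomes reclaimable only after every affected core has swept.

```python
PAGE = 0x7F00  # hypothetical virtual page number being unmapped


class LazyCore:
    """One core with a toy TLB, modelled as a set of cached page numbers."""

    def __init__(self, core_id, cached_pages):
        self.core_id = core_id
        self.tlb = set(cached_pages)


class LazyShootdown:
    """Pending invalidations kept in shared memory (hypothetical design)."""

    def __init__(self, cores):
        self.cores = {core.core_id: core for core in cores}
        self.pending = []       # entries: [page, set of core ids still to invalidate]
        self.reclaimable = []   # pages that no TLB can reference any more

    def munmap(self, initiator_id, page, app_core_ids):
        """Invalidate locally, record the remote work, and return without waiting."""
        self.cores[initiator_id].tlb.discard(page)
        waiting = set(app_core_ids) - {initiator_id}
        if waiting:
            self.pending.append([page, waiting])
        else:
            self.reclaimable.append(page)

    def sweep(self, core_id):
        """Called by a core at a time of its own choosing, with no interrupt."""
        core = self.cores[core_id]
        still_pending = []
        for page, waiting in self.pending:
            if core_id in waiting:
                core.tlb.discard(page)
                waiting.discard(core_id)
            if waiting:
                still_pending.append([page, waiting])
            else:
                self.reclaimable.append(page)   # safe to reuse only now
        self.pending = still_pending


# Same setup as the walkthrough: the application runs on cores 1, 2, and 5.
cores = [LazyCore(i, {PAGE} if i in (1, 2, 5) else set()) for i in range(1, 9)]
lazy = LazyShootdown(cores)

lazy.munmap(1, PAGE, app_core_ids=[1, 2, 5])        # core 1 returns immediately
stale_after_munmap = PAGE in lazy.cores[2].tlb      # core 2 has not swept yet
lazy.sweep(2)
lazy.sweep(5)                                       # last sweep makes the page reclaimable
```

The design choice the sketch highlights is the trade: the initiating core no longer waits, but remote TLBs hold a stale entry for a while, so the freed page must not be reused until the last affected core has swept.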
