Low-Level Memory Optimisations at the High-Level with Ownership-like Annotations

Juliana Franco, Martin Hagelin, Tobias Wrigstad, Sophia Drossopoulou
The OHMM framework: Low-Level Memory Optimisations at the High-Level with Ownership-like Annotations

Do you want fast programs?
• More cores? More threads? Write better parallel and concurrent code?
• Data layout in memory can have a great impact on your program’s performance!
  • Reduce cache misses
  • or help the prefetcher
Example: array[N] of arrays[N] vs array[N*N]
  array[N] of arrays[N]: 1,325 × 10^6 cache misses, 28.04 seconds
  array[N*N]: 833 × 10^6 cache misses, 20.49 seconds
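
The two layouts in the example can be sketched as follows. This is a minimal Python illustration of the indexing only, not the benchmark behind the numbers on the slide; the names `make_nested`, `make_flat` and `flat_index` and the size `N = 4` are assumptions made for the sketch.

```python
from array import array

N = 4  # small size for illustration; the slide does not state the benchmark's N


def make_nested(n):
    """array[N] of arrays[N]: n separate row buffers, one allocation each."""
    return [array("i", [0] * n) for _ in range(n)]


def make_flat(n):
    """array[N*N]: one contiguous buffer holding all n*n elements."""
    return array("i", [0] * (n * n))


def flat_index(row, col, n):
    """Row-major position of element (row, col) in the flat buffer."""
    return row * n + col


nested = make_nested(N)
flat = make_flat(N)
for r in range(N):
    for c in range(N):
        nested[r][c] = 10 * r + c
        flat[flat_index(r, c, N)] = 10 * r + c

# Both layouts hold the same logical matrix.
same = all(nested[r][c] == flat[flat_index(r, c, N)]
           for r in range(N) for c in range(N))
```

In the nested version each row is its own allocation, so consecutive rows need not be adjacent in memory; in the flat version the end of one row is directly followed by the start of the next.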

A little bit of context on hardware http://mechanical-sympathy.blogspot.co.uk/2013/02/cpu-cache-flushing-fallacy.html

A little bit of context on hardware
• Core reads purple data: cache miss, purple data is fetched from memory (65ns)
• Core reads purple data again: cache hit (3ns)
• Core reads red data: cache hit (3ns)
(Diagram: Core, Cache, Memory)
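
The latencies on this slide can be turned into a small worked example. The sketch below is a toy model, not a description of any real cache: it assumes 64-byte cache lines, no eviction, and that the red data sits on the same line as the purple data (which would explain why reading it is a hit). `total_latency_ns` and the byte addresses are hypothetical.

```python
LINE_SIZE = 64   # bytes per cache line (assumed; common on current x86 CPUs)
MISS_NS = 65     # latency of a read served from memory, from the slide
HIT_NS = 3       # latency of a read served from cache, from the slide


def total_latency_ns(addresses):
    """Sum read latencies for a cache that never evicts (a simplification)."""
    cached_lines = set()
    total = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cached_lines:
            total += HIT_NS
        else:
            total += MISS_NS
            cached_lines.add(line)
    return total


# "purple" at byte 0, read twice, then "red" at byte 8 on the same line.
same_line = total_latency_ns([0, 0, 8])      # 65 + 3 + 3
# If "red" lived on a different line (byte 128), its read would miss too.
other_line = total_latency_ns([0, 0, 128])   # 65 + 3 + 65
```

Under these assumptions the three reads cost 71ns when the data shares a line and 133ns when it does not, which is the gap that layout optimisations try to exploit.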

Existing techniques
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      // iterates over all videos
(Diagram labels: V1, V2, V3, V4)

Existing techniques
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      // iterates over all videos
(Diagram labels: Foo, Bar)

Existing techniques: Object Pooling
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      // iterates over all videos
(Diagram labels: vs, video, pool)

Existing techniques
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      foreach v in this.vs do
        if v.views > pivot then
          print(v.id, v.views, v.likes)
(Callout: "I’m loading data to cache that will never be used")

Existing techniques: Object Splitting
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      foreach v in this.vs do
        if v.views > pivot then
          print(v.id, v.views, v.likes)
(Diagram labels: vs, video, subpool, subpool)

• It is known that these techniques can improve performance
• And programmers use them a lot
  • Example: array of structs vs struct of arrays
• However:
  • they are too low level
  • the concept of struct or object is lost
  • the code becomes difficult to write and to modify

Before (array of Video objects):
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      foreach v in this.vs do
        if v.views > pivot then
          print(v.id, v.views, v.likes)
After (one array per field):
  class VideoList
    ids: int[N]
    views: int[N]
    likes: int[N]
    def popularVideos(pivot: int): void
      for (int i = 0; i < N; i++) do
        if this.views[i] > pivot then
          print(this.ids[i], this.views[i], this.likes[i])
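
A runnable sketch of the same hand-written transformation, in Python. It is only an illustration of the layout change: `VideoListSoA`, `popular_videos_aos` and the sample data are hypothetical, and the functions return the matching tuples instead of printing them so the two versions can be compared.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Video:
    id: int
    views: int
    likes: int


class VideoListSoA:
    """One contiguous int array per field instead of an array of objects."""

    def __init__(self, videos):
        self.ids = array("i", (v.id for v in videos))
        self.views = array("i", (v.views for v in videos))
        self.likes = array("i", (v.likes for v in videos))

    def popular_videos(self, pivot):
        # The filter scans only the views array; ids and likes are touched
        # only for the entries that pass it.
        return [(self.ids[i], self.views[i], self.likes[i])
                for i in range(len(self.views))
                if self.views[i] > pivot]


def popular_videos_aos(videos, pivot):
    """The original object-based loop, for comparison."""
    return [(v.id, v.views, v.likes) for v in videos if v.views > pivot]


videos = [Video(1, 500, 20), Video(2, 50, 3), Video(3, 900, 70)]
soa = VideoListSoA(videos)
```

Both versions return the same result, but the per-field version no longer has a `Video` object to pass around, which is the loss of abstraction the slides point out.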

  class VideoList
    id_likes: (int, int)[N]
    views: int[N]
    def popularVideos(pivot: int): void
      for (int i = 0; i < N; i++) do
        if this.views[i] > pivot then
          print(this.id_likes[i].fst, this.views[i], this.id_likes[i].snd)

Our solution
We want to provide a high-level way of specifying data structures that does not affect the way they are used.
(Martin)

This code for…
  class Video
    id: int
    views: int
    likes: int
  class VideoList
    vs: Array[Video]
    def popularVideos(pivot: int): void
      foreach v in this.vs do
        if v.views > pivot then
          print(v.id, v.views, v.likes)
… this behaviour
  class VideoList
    ids: int[N]
    views: int[N]
    likes: int[N]
    def popularVideos(pivot: int): void
      for (int i = 0; i < N; i++) do
        if this.views[i] > pivot then
          print(this.ids[i], this.views[i], this.likes[i])

Layout annotations
  class Video<o>
    id: int
    views: int
    likes: int
  class VideoList<o, o’>
    vs: Array[Video<o’>]
Pool and Object Allocation
  new VideoList<none, none>

Layout annotations
  class Video<o>
    id: int
    views: int
    likes: int
  class VideoList<o, o’>
    vs: Array[Video<o’>]
Pool and Object Allocation
  Pool pool of Video in
    new VideoList<none, pool>
(Diagram labels: vs, video, pool)

Clustering annotations
  Pool pool of Video in
    new VideoList<none, pool>
(Diagram labels: vs, video, pool)
  Pool pool of Video =
    cluster {id, likes}
    + cluster {views}
  in
    new VideoList<none, pool>
(Diagram labels: vs, video, subpool, subpool)

How do we use this data structure?
  def popularVideos(pivot: int): void
    foreach v in this.vs do
      if v.views > pivot then
        print(v.id, v.views, v.likes)
  let vl = new VideoList<none, pool> in
    vl.vs[45678].likes++
  Pool pool of Video =
    cluster {id} + cluster {likes, views}
  let vl = new VideoList<none, pool> in
    vl.vs[45678].likes++
    print(vl.vs[45678].views)
  let vl = new VideoList<none, none> in
    vl.vs[45678].likes++
    print(vl.vs[45678].views)
  Pool pool of Video =
    cluster {id, likes, views}
  let vl = new VideoList<none, pool> in
    vl.vs[45678].likes++
    print(vl.vs[45678].views)
How is this possible?
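
One way to picture how identical access code can run against different cluster layouts is the Python sketch below. It is not the OHMM implementation: the `Pool` class, its `alloc`/`read`/`write` methods and `bump_and_read` are hypothetical, and the field lookup that OHMM leaves to its compiler is done here at run time through a dictionary.

```python
from array import array

FIELDS = ("id", "likes", "views")


class Pool:
    """Stores Video records split into clusters, one array per cluster.

    Within a cluster, the fields of record i sit next to each other.
    """

    def __init__(self, clusters):
        self.clusters = [tuple(c) for c in clusters]
        self.data = [array("i") for _ in self.clusters]
        # field name -> (cluster index, position inside the cluster)
        self.where = {f: (ci, fi)
                      for ci, c in enumerate(self.clusters)
                      for fi, f in enumerate(c)}
        if set(self.where) != set(FIELDS):
            raise ValueError("clusters must cover every field exactly once")
        self.count = 0

    def alloc(self):
        """Reserve zeroed space for one record in every cluster."""
        for c, buf in zip(self.clusters, self.data):
            buf.extend([0] * len(c))
        self.count += 1
        return self.count - 1

    def read(self, index, field):
        ci, fi = self.where[field]
        return self.data[ci][index * len(self.clusters[ci]) + fi]

    def write(self, index, field, value):
        ci, fi = self.where[field]
        self.data[ci][index * len(self.clusters[ci]) + fi] = value


def bump_and_read(pool):
    """The same client code, whatever the pool's layout."""
    v = pool.alloc()
    pool.write(v, "views", 7)
    pool.write(v, "likes", pool.read(v, "likes") + 1)
    return pool.read(v, "views"), pool.read(v, "likes")


split = Pool([("id",), ("likes", "views")])
joined = Pool([("id", "likes", "views")])
```

Calling `bump_and_read` on `split` and on `joined` yields the same values even though `likes` lives in a different array and at a different offset in each pool.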

1. A low-level language that does all the hard work
2. A compiler that uses the annotations to compile high-level (HL) code to equivalent low-level (LL) code
(Martin)

A little bit on the low-level language Instructions: Example:

A little bit on the compiler
High-level code:
  x = new Video<none>
  y = x.likes
  x.likes = y + 10
Low-level code:
  x = alloc(Video)
  y = read(x, likes)
  z = y + 10
  write(x, likes, z)
High-level code:
  Pool p1 of Video =
    cluster {id, likes} + cluster {views}
  x = new Video<p1>
  y = x.likes
  x.likes = y + 10
Low-level code:
  p1 = pcreate(Video, [id, likes], [views])
  x = palloc(p1)
  y = pread(x, 0, 1)
  z = y + 10
  write(x, 0, 1, z)
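
The translation step on this slide can be sketched as a toy code generator in Python. It assumes that the two numbers in `pread(x, 0, 1)` are the cluster index and the field's position inside that cluster, which is a reading of the slide rather than something it states. `compile_access` and `compile_increment` are hypothetical names, and the emitted instruction text simply mirrors the slide.

```python
def compile_access(layout, field):
    """Resolve a field name to (cluster index, index within cluster)."""
    for cluster_index, cluster in enumerate(layout):
        if field in cluster:
            return cluster_index, cluster.index(field)
    raise KeyError(field)


def compile_increment(var, field, amount, layout=None):
    """Emit low-level instructions for `var.field = var.field + amount`."""
    if layout is None:
        # Object outside any pool: fields are addressed by name.
        return [f"y = read({var}, {field})",
                f"z = y + {amount}",
                f"write({var}, {field}, z)"]
    # Pooled object: the field name is replaced by its position in the layout.
    c, i = compile_access(layout, field)
    return [f"y = pread({var}, {c}, {i})",
            f"z = y + {amount}",
            f"write({var}, {c}, {i}, z)"]


layout = [["id", "likes"], ["views"]]
pooled = compile_increment("x", "likes", 10, layout)
unpooled = compile_increment("x", "likes", 10)
```

The high-level statement is the same in both cases; only the layout passed to the generator changes what is emitted.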

Contributions
• Separation of functional concerns from layout concerns
• At a higher level: an object is still a single unit that is somewhere in memory.
• Layout annotations describe how pools are organised, but object access does not need to reflect that.
• Therefore, the code is easier to write and modify, and also efficient.
• But also much more:
  • The high-level language is type sound and, given that we compile it correctly, we know that the low-level program's behaviour is equivalent to the high-level behaviour.

• Garbage Collection
• Sub-typing
• Value Semantics
• Iterators
• Concurrency and parallelism
• Benchmarks, benchmarks …

Conclusion
OHMM HL
• OO sequential language
• Ownership-like annotations
• Splitting annotations
Compilation
• Translation using the layout annotations
OHMM LL
• Interface for the low-level framework with instructions to work with pools
C Framework
• Pooling
• Splitting
• Pointer Compression
• Pool iterators
• Copying GC

Thank you! Questions?
