SLIDE 1

Say Goodbye to Off-heap Caches! On-heap Caches Using Memory-Mapped I/O

Iacovos G. Kolokasis¹, Anastasios Papagiannis¹, Foivos Zakkak², Polyvios Pratikakis¹, and Angelos Bilas¹

¹University of Crete & Foundation for Research and Technology – Hellas (FORTH), Greece
²University of Manchester (currently at Red Hat, Inc.)

SLIDE 2

Outline

  • Motivation
  • TeraCache design for multiple heaps with different properties
  • How we reduce GC time
  • How we grow TeraCache over a device
  • Evaluation
  • Conclusions


SLIDE 3

Increasing Memory Demands!

  • Big data systems cache large intermediate results in-memory
  • Speeds up iterative workloads
  • Analytics datasets grow at a high rate
  • Today ~50 ZB; by 2025 ~175 ZB [Source: www.seagate.com | Seagate]
  • Big data systems request TBs of memory per server
SLIDE 4

Spark: Caching Impacts Performance

  • Jobs cache intermediate data in memory
  • Subsequent jobs reuse cached data
  • Caching reduces execution time by orders of magnitude (up to 90%)
  • Naively, caching data needs large heaps, which implies a lot of DRAM

SLIDE 5

Caching Beyond Physical DRAM

  • DRAM capacity scaling reaches its limit [Mutlu-IMW 2013]
  • DRAM scales to GBs per DIMM
  • DRAM capacity is limited by DIMM slots per server
  • NVMe SSDs scale to TBs per PCIe slot at lower cost
  • Already today: Spark uses an off-heap store on fast devices
SLIDE 6

Between a Rock and a Hard Place! GC vs Serialization Overhead

[Figure: Executor memory = execution memory + storage memory; the on-heap cache lives in storage memory, while an off-heap cache spills to disk through serialize/deserialize]

  Cache type       Pros               Cons
  On-heap cache    No serialization   High GC
  Off-heap cache   Low GC             High serialization

Merge the benefits from both worlds!
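The serialization tax in the table above can be seen in plain Java. This is an illustration using the standard java.io serialization API (class and method names here are ours, and Spark normally uses its own serializers):

```java
import java.io.*;
import java.util.Arrays;
import java.util.List;

/** Sketch of the cost every off-heap/disk cache pays on each access. */
public class SerializationCost {
    /** Serialize an object graph to bytes, as an off-heap store must on every put. */
    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deserialize on every get, rebuilding the whole object graph. */
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Integer> cached = Arrays.asList(1, 2, 3);
        byte[] wire = serialize((Serializable) cached);  // store: CPU cost + copy
        Object back = deserialize(wire);                 // load: CPU cost + copy
        // An on-heap cache instead hands back the same reference at zero copy cost,
        // but every cached object then adds GC pressure.
        System.out.println(back.equals(cached) && back != cached);
    }
}
```

The roundtrip produces an equal but distinct object graph, which is exactly the work an on-heap cache avoids.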

SLIDE 7

Outline

  • Motivation
  • TeraCache design for multiple heaps with different properties
  • How we reduce GC time
  • How we grow TeraCache over a device
  • Evaluation
  • Conclusions


SLIDE 8

Different Heaps for Different Object Types

  • Analytics computations generate mainly two types of objects
  • Short-lived (runtime-managed)
  • Long-lived, with similar lifetimes (application-managed)
  • JVM-heap on DRAM, which is garbage collected
  • Holds short-lived objects
  • Used for computation (task memory usage)
  • TeraCache-heap, which is never garbage collected
  • Contains groups of objects with similar lifespans (e.g., cached data)
  • Grows over a storage device (no serialization)
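The two-heap split might be sketched as follows. `TeraCacheHeap` and `Region` are hypothetical names for illustration; the real TeraCache manages raw JVM memory inside the collector rather than Java lists:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical region: objects cached together (e.g., one RDD) share a region. */
class Region {
    final List<Object> objects = new ArrayList<>();
    void allocate(Object o) { objects.add(o); }
    int size() { return objects.size(); }
}

/** Hypothetical sketch of a cache heap that GC never scans. */
class TeraCacheHeap {
    private final List<Region> regions = new ArrayList<>();

    /** New region for a group of objects with similar lifespans. */
    Region newRegion() {
        Region r = new Region();
        regions.add(r);
        return r;
    }

    /** Bulk free: drop a whole region at once instead of tracing objects one by one. */
    void freeRegion(Region r) { regions.remove(r); }

    int regionCount() { return regions.size(); }
}
```

Because objects in a region die together, reclamation is a constant-time region drop with no object-by-object tracing.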

SLIDE 9

Split Executor Memory In Two Heaps

[Figure: Executor memory split into a JVM-heap (GC) and a TeraCache heap (non-GC) organized as regions region0 … regionN]

  • JVM-heap (GC)
  • TeraCache (non-GC)

Organize TeraCache in regions

  • Bulk free: objects with similar lifetimes go into the same region
  • Dynamic size

We make the JVM aware of cached data

  • Spark notifies the JVM
  • The JVM finds the transitive closure of the object
  • Moves the object and its closure into a region
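Finding the transitive closure of a cached object amounts to a graph traversal over its references. A minimal sketch, assuming a toy `Node` class (the real system walks actual heap objects inside the modified JVM):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy object-graph node; TeraCache traverses real heap references instead. */
class Node {
    final String name;
    final List<Node> refs = new ArrayList<>();
    Node(String name) { this.name = name; }
}

public class Closure {
    /** BFS from the cached root: everything reachable must migrate with it. */
    static Set<Node> transitiveClosure(Node root) {
        Set<Node> seen = new HashSet<>();
        Deque<Node> work = new ArrayDeque<>();
        seen.add(root);
        work.add(root);
        while (!work.isEmpty()) {
            Node n = work.poll();
            for (Node ref : n.refs) {
                if (seen.add(ref)) work.add(ref);  // follow each reference once
            }
        }
        return seen;
    }
}
```

The visited set makes the traversal terminate even on cyclic object graphs, which cached RDD structures can contain.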
SLIDE 10

We Preserve Java Memory Safety

[Figure: JVM-heap (New Gen, Old Gen, GC) alongside the TeraCache-heap (regions, no GC)]

Avoid pointer corruption between objects in the two heaps

No backward pointers: TeraCache → JVM-heap

  • Otherwise GC could reclaim JVM-heap objects still used by TeraCache objects
  • Instead, move the transitive closure of the object into TeraCache
SLIDE 11

We Preserve Java Memory Safety

[Figure: JVM-heap (New Gen, Old Gen, GC) alongside the TeraCache-heap (regions, no GC)]

Avoid pointer corruption between objects in the two heaps

No backward pointers: TeraCache → JVM-heap

  • Otherwise GC could reclaim JVM-heap objects still used by TeraCache objects
  • Instead, move the transitive closure of the object into TeraCache

Allow forward pointers: JVM-heap → TeraCache

  • But stop GC from traversing TeraCache

Allow internal pointers: TeraCache ↔ TeraCache
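The pointer rules above can be encoded as a store-time check, in the spirit of a GC write barrier. This is a hypothetical simplification (`Heap`, `HeapObject`, and `Barrier.store` are our names; the modified JVM enforces the invariant by migrating closures, not by throwing):

```java
/** Which heap an object lives in; a real JVM derives this from the object's address. */
enum Heap { JVM_HEAP, TERACACHE }

/** Toy object with a single reference field, for illustration. */
class HeapObject {
    final Heap heap;
    HeapObject field;
    HeapObject(Heap heap) { this.heap = heap; }
}

public class Barrier {
    /**
     * Backward pointers (TeraCache -> JVM-heap) are illegal: GC never traces
     * TeraCache, so it could reclaim the target while it is still in use.
     * Forward (JVM-heap -> TeraCache) and internal pointers are allowed.
     */
    static void store(HeapObject src, HeapObject dst) {
        if (src.heap == Heap.TERACACHE && dst.heap == Heap.JVM_HEAP) {
            throw new IllegalStateException("backward pointer: TeraCache -> JVM-heap");
        }
        src.field = dst;
    }
}
```

In the real design, moving the whole transitive closure into TeraCache up front means such backward stores should never arise; the check merely states the invariant.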

SLIDE 12

Outline

  • Motivation
  • TeraCache design for multiple heaps with different properties
  • How we reduce GC time
  • How we grow TeraCache over a device
  • Evaluation
  • Conclusions


SLIDE 13

Dividing DRAM Between Heaps

[Figure: DRAM divided into DR1 (JVM-heap) and DR2 (TeraCache heap, mapped onto an NVMe SSD via mmap())]

How to divide DRAM between the two heaps?

  • Iterative jobs → reuse cached data → need a large DR2
  • Shuffle jobs → short-lived data → need a large DR1
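The mmap() in the figure is standard memory-mapped I/O. A Java-level illustration using `MappedByteBuffer` (the actual TeraCache heap is mapped by the modified JVM in native code, not through NIO, and the file name below is a stand-in):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    /** Map a temp file (stand-in for an NVMe-backed heap file) and store/load a value. */
    static long roundTrip(long value) {
        try {
            Path backing = Files.createTempFile("teracache", ".heap");
            try (FileChannel ch = FileChannel.open(backing,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Map 4 KiB of the file: loads/stores hit mapped memory, and the
                // kernel pages data between DRAM and the device on demand.
                MappedByteBuffer heap = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                heap.putLong(0, value);   // plain store, no serialization
                return heap.getLong(0);   // plain load
            } finally {
                Files.deleteIfExists(backing);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the mapping is larger than the DRAM set aside for it (DR2), the OS transparently caches hot pages in DRAM and evicts cold ones to the device.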
SLIDE 14

Deal With DRAM Resources For Multi-Heaps

  • We propose dynamic resizing of DR1 and DR2
  • Based on the page-fault rate in MMIO
  • Based on the minor-GC rate
  • LR jobs reuse a large amount of cached data
  • More page faults/s → more space for DR2
  • KM jobs produce more short-lived data
  • More minor GCs/s → more space for DR1
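The resizing policy might be sketched as a simple feedback rule; the step size and the direct rate comparison below are our assumptions for illustration, not the paper's actual policy:

```java
/** Sketch of the DR1/DR2 balancing policy; rates are per-second samples. */
public class DramBalancer {
    static final long STEP = 256;   // MB moved per decision (assumed granularity)

    long dr1;   // DRAM for the GC'd JVM-heap (MB)
    long dr2;   // DRAM caching pages of the mmap'd TeraCache heap (MB)

    DramBalancer(long dr1, long dr2) { this.dr1 = dr1; this.dr2 = dr2; }

    /** Shift DRAM toward whichever heap shows more pressure. */
    void rebalance(double pageFaultsPerSec, double minorGcsPerSec) {
        if (pageFaultsPerSec > minorGcsPerSec && dr1 > STEP) {
            dr1 -= STEP; dr2 += STEP;   // LR-like jobs: grow the mmap page cache
        } else if (minorGcsPerSec > pageFaultsPerSec && dr2 > STEP) {
            dr2 -= STEP; dr1 += STEP;   // KM-like jobs: grow the GC'd heap
        }
    }
}
```

Total DRAM stays constant; only the split between the two heaps moves in response to the observed fault and GC rates.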
SLIDE 15

Outline

  • Motivation
  • TeraCache design for multiple heaps with different properties
  • How we reduce GC time
  • How we grow TeraCache over a device
  • Evaluation
  • Conclusions


SLIDE 16

Prototype Implementation

  • We implement an early prototype of TeraCache based on ParallelGC
  • Place the New generation on DRAM
  • Place the Old generation on the fast storage device
  • Explicitly disable GC on the Old generation
  • We evaluate
  • GC overhead
  • Serialization overhead
  • No support yet for reclamation of cached RDDs or dynamic resizing

SLIDE 17

Preliminary Evaluation

  • TC improves performance by up to 37% for LR (25% on average)
  • TC improves performance by up to 2x compared to Linux swap (LR)
  • TC reduces GC time by up to 50% for LGR (46% on average)

SLIDE 18

Conclusions

  • TeraCache: a JVM/Spark co-design
  • Supports very large heaps
  • Reduces GC time using two heaps
  • Eliminates serialization/deserialization
  • Shares DRAM resources dynamically across heaps
  • Improves Spark ML workload performance by 25% on average
  • Applicable to other analytics runtimes

SLIDE 19

Contact

Iacovos G. Kolokasis
kolokasis@ics.forth.gr
www.csd.uoc.gr/~kolokasis

Institute of Computer Science (ICS), Foundation for Research and Technology – Hellas (FORTH)
Department of Computer Science, University of Crete