Memory Saving Techniques
Annual Concurrency Forum Meeting, Fermilab
February 5, 2013

1 Session Introduction
2 Kernel-compressed Memory

Introduction

Assumptions
∙ The number of cores grows faster than the amount of memory
∙ Event-level parallelism: memory consumption per event has to decrease
∙ Orthogonal to other parallelization efforts

∙ On the Grid: 2 GB per core
∙ ARM servers (or: hyper-threading): 1 GB per core/thread
∙ GPUs: an order of magnitude less
∙ Xeon Phi (MIC): 100 MB of memory per core, change of memory model

Explored Memory Saving Techniques

Summary of the memory saving techniques explored so far:

Memory Sharing
∙ Fork and copy-on-write
  Fork should be done reasonably late
∙ Kernel SamePage Merging
  Sharing is done automatically at the cost of speed
∙ Multi-threaded application (Geant4-MT)
  Can go beyond page-wise sharing in the fork model

Reduction of Memory Consumption
∙ Kernel Compressed Memory (zRam, frontswap, cleancache)
  Virtual swap area used to compress unused memory
∙ X32 ABI: x86_64 semantics with 32-bit pointers
  Restricts address space to 4 GB (which should be acceptable)

These techniques are all (relatively) non-intrusive
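The "fork reasonably late" point can be illustrated with a minimal sketch, assuming a Unix system. The function name `run_forked_workers` and the synthetic `data` buffer are hypothetical and not part of the original slides. The parent builds the read-only data first and forks afterwards, so children that only read it can keep sharing its pages copy-on-write. In CPython, reference counting still writes to object headers, so sharing is good for the bulk of a large buffer but not perfect.

```python
import os

def run_forked_workers(shared, n_workers):
    """Fork n_workers children that only read `shared`; return their exit codes."""
    expected = sum(shared[::4096])  # checksum computed once, before forking
    pids = []
    for _ in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Child process: read the inherited data without modifying it,
            # so the bulk of its pages can stay shared with the parent.
            ok = sum(shared[::4096]) == expected
            os._exit(0 if ok else 1)
        pids.append(pid)
    # Parent process: collect the exit code of every child.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]

# Hypothetical stand-in for large read-only data that is loaded once,
# before the fork.
data = bytes(range(256)) * 4096  # 1 MiB
codes = run_forked_workers(data, 3)
```

Forking before `data` is built would instead give every child its own private copy, which is the situation the slide advises against.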

Discussion Items

Job scheduling
∙ For memory sharing: jobs with similar input data should be co-scheduled
∙ In general: a good mix of jobs should be scheduled

Techniques provided by the Linux kernel
∙ Many of the new features are not available in SL6
∙ Virtual machines can be used to couple a new kernel with an SL6 userland
∙ Automatically adjusting kernel parameters can be difficult

New platforms
∙ There might be a need to recompile (and verify) the software stack for ARM and/or X32

Kernel-compressed Memory – Principle

∙ Kernel module compcache / zram provides a virtual block device for swapping
∙ Originally developed for “small” devices (netbooks, phones, ...)
∙ Part of kernel >= 2.6.34, can be compiled for SLC6 (with drawbacks)

[Diagram: Application Pages, Kernel Pages, LZO Compression, /dev/zram0]

Change in strategy: not swap at all → swap whenever possible (/proc/sys/vm/swappiness)
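Why page-wise compression pays off can be sketched as follows. This is an illustration only: zram uses LZO inside the kernel, whereas the sketch uses `zlib` from the Python standard library as a stand-in, and the 4096-byte page size and the function name `compressed_page_sizes` are assumptions.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes

def compressed_page_sizes(data, level=1):
    """Compress `data` page by page and return the compressed size of each page."""
    sizes = []
    for offset in range(0, len(data), PAGE_SIZE):
        page = data[offset:offset + PAGE_SIZE]
        sizes.append(len(zlib.compress(page, level)))
    return sizes

zero_page = bytes(PAGE_SIZE)          # a page that contains only zeros
random_page = os.urandom(PAGE_SIZE)   # a page with incompressible content
sizes = compressed_page_sizes(zero_page + random_page)
```

The zero page shrinks to a few dozen bytes, while the random page does not shrink at all, so the gain from a compressed swap device depends strongly on what the swapped-out pages contain.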

Kernel-compressed Memory and cgroups

∙ The system memory pressure and the swappiness are not fine-grained enough handles for measurements
∙ Linux cgroups allow the application to be put into a memory-limited container:

  # Create a memory cgroup named "restricted"
  $ mkdir /sys/fs/cgroup/memory/restricted
  # Limit its memory to 150 MiB
  $ echo $((150*1024*1024)) > \
      /sys/fs/cgroup/memory/restricted/memory.limit_in_bytes
  # Move the process with ID $PID into the cgroup
  $ echo $PID > /sys/fs/cgroup/memory/restricted/tasks

Kernel-compressed Memory – Figures

AliRoot reconstruction of 2 simulated pp events (v5-04-25-AN)
[Figure: memory (MB, 0 to 3500) versus time (s, 0 to 450); curves: vss, rss, physical mem, swapped, 0-pages, rss+swapped]
∙ Normal run
∙ cgroup memory restriction to 950 MB
∙ cgroup memory restriction to 240 MB

AliRoot reconstruction of 10 PbPb events (v5-03-62-AN)
[Figure: memory (MB, 0 to 3500) versus time (s, 0 to 1600); curves: vss, rss, physical mem, swapped, 0-pages, rss+swapped]
∙ Normal run
∙ cgroup memory restriction to 900 MB

Zero Pages

X-Check: scan through a core dump of the application

Can we get rid of these hundreds of megabytes of contiguous zeros?
∙ No change by using automatic garbage collection (Boehm’s GC)
∙ Zero pages in LHCb DaVinci: ≈ 700 MB out of 2.3 GB
∙ Zero pages in CMS reconstruction
  • 180 MB out of 900 MB without output
  • 280 MB out of 1.4 GB with output
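A cross-check of this kind can be sketched as a scan for all-zero pages in a file. This is a simplified illustration, not the tool used for the numbers above: it treats the file as a flat sequence of 4096-byte pages (an assumed page size) and ignores the ELF structure of a real core dump. The function name `count_zero_pages` and the synthetic test file are hypothetical.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size in bytes

def count_zero_pages(path, page_size=PAGE_SIZE):
    """Return (zero_pages, total_pages) for the file at `path`."""
    zero = bytes(page_size)
    zero_pages = total_pages = 0
    with open(path, "rb") as f:
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break  # end of file; a trailing partial page is ignored
            total_pages += 1
            if page == zero:
                zero_pages += 1
    return zero_pages, total_pages

# Synthetic stand-in for a core dump: 3 zero pages followed by 2 non-zero pages.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(3 * PAGE_SIZE) + b"\x01" * (2 * PAGE_SIZE))
result = count_zero_pages(tmp.name)
os.remove(tmp.name)
```

Multiplying the first element of the result by the page size gives the volume of zero pages, which is the quantity quoted in the bullets above.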

Forensics: First Results

Idea: Inspect memset() calls >4 kB

Dead pages (AliRoot reco)
∙ ≈ 40 % zero pages traced back to source code
∙ Breaks down to half a dozen memsets with high impact
∙ No hits after detector initialization
∙ Scattered over uses of TClonesArray

Remaining zero pages
∙ Excluded: read(), mmap()
∙ Excluded: ROOT buffers
∙ Measurement uncertainties at memset boundaries
∙ Only literal memset() covered, standard constructors:
  int *a = new int[1024*1024]();

Next Steps

1 Forensics: Track back large zero-runs to a malloc()
2 How to choose zram parameters for an optimal tradeoff with respect to throughput?

Perhaps zram can also be used as an “overflow” mechanism to make sure that a job finishes
