An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures
Konstantinos Nikas, Computing Systems Laboratory, National Technical University of Athens, Greece
Matthew Horsnell, Jim Garside, Advanced Processor Technologies Group, University of Manchester
Introduction • Cores in CMPs typically share some level of the memory hierarchy • Applications compete for the limited shared space • Need for efficient use of the shared cache – Requests to off-chip memory are expensive (latency and power)
Introduction • LRU (or approximations of it) is typically employed • Partitions the cache implicitly on a demand basis – The application with the highest demand gets the majority of the cache resources – Could be suboptimal (e.g. streaming applications) • Thread-blind policy – Cannot detect and deal with inter-thread interference
Motivation • Applications can be classified into 3 different categories [Qureshi and Patt (MICRO ’06)] • High Utility • Applications that continue to benefit significantly as the cache space is increased
Motivation • Low Utility • Applications that do not benefit significantly as the cache space is gradually increased
Motivation • Saturating Utility • Applications that initially benefit as the cache space is increased, but whose benefit saturates beyond a certain allocation Target : Exploit the differences in the cache utility of concurrently executed applications
Static Cache Partitioning
Static Cache Partitioning • Two major drawbacks • The system must be aware of each application’s profile • Partitions remain the same throughout the execution – Programs are known to have distinct phases of behaviour • Need for a scheme that can partition the cache dynamically – Acquire the applications’ profile at run-time – Repartition when the phase of an application changes
Dynamic Cache Partitioning • LRU's “stack property” [Mattson et al. 1970] – “An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to hit also if the cache had more than N ways, provided that the number of sets remains the same.” – This property allows the hits an application would score with more (or fewer) ways to be estimated at run-time from a single profile
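A minimal sketch (an illustration of the classic technique, not code from the paper) of how the stack property is typically exploited: recording the LRU stack distance of every access to a set yields one histogram from which the hit count for any associativity can be read off.

```python
from collections import Counter

def stack_distance_profile(accesses):
    """Histogram: LRU stack distance -> number of accesses at that distance."""
    lru_stack = []          # index 0 holds the most-recently-used tag
    histogram = Counter()
    for tag in accesses:
        if tag in lru_stack:
            histogram[lru_stack.index(tag) + 1] += 1   # 1-based distance
            lru_stack.remove(tag)
        lru_stack.insert(0, tag)                       # promote to MRU
    return histogram

def hits_with_ways(histogram, ways):
    """Stack property: an access hits with N ways iff its distance <= N."""
    return sum(c for dist, c in histogram.items() if dist <= ways)

hist = stack_distance_profile(["A", "B", "A", "C", "B", "A"])
print(hits_with_ways(hist, 2))   # 1 hit with 2 ways
print(hits_with_ways(hist, 4))   # 3 hits -- never fewer as ways grow
```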
ABFCP : Overview • Adaptive Bloom Filter Cache Partitioning (ABFCP) • A Partitioning Module per core – Tracks misses and hits – Runs the partitioning algorithm – Provides replacement support to enforce the partitions • [Diagram: each core with its private I-Cache, D-Cache and Partitioning Module, connected to the shared L2 cache and to DRAM]
ABFCP : Tracking system • Far Misses – Misses that would have been hits had the application been allowed to use more cache ways – Tracked by Bloom filters
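A hedged sketch of the far-miss tracking idea (the filter size and hash functions below are assumptions for illustration, not the paper's hardware parameters): when a line leaves a core's partition, its tag is inserted into that core's Bloom filter; a later miss whose tag queries positive is counted as a far miss.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; size and hash count are illustrative assumptions."""
    def __init__(self, bits=1024, hashes=2):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, tag):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{tag}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, tag):
        for pos in self._positions(tag):
            self.array[pos] = True

    def may_contain(self, tag):          # may report false positives
        return all(self.array[pos] for pos in self._positions(tag))

# On eviction from this core's partition, remember the line's tag
far_misses = BloomFilter()
far_misses.insert(0x4A2)

# On a later miss: a positive query counts as a far miss
c_farmiss = 1 if far_misses.may_contain(0x4A2) else 0
```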
ABFCP : Partitioning Algorithm • 2 counters per core per cache set – C_LRU – C_FarMiss • Each core’s allocation can be changed by ±1 way • Estimate performance loss/gain – −1 way : hits in the LRU position will become misses → perf. loss = C_LRU – +1 way : a portion of the far misses will become hits → perf. gain = a × C_FarMiss, where a = (1 − ways/assoc)
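The two estimates above translate directly into a couple of lines of code; this sketch just restates the slide's formulas (the surrounding bookkeeping is assumed):

```python
def estimated_loss(c_lru):
    """Taking one way away: hits in the LRU position become misses."""
    return c_lru

def estimated_gain(c_farmiss, ways, assoc):
    """Granting one more way: a fraction a of the far misses become hits."""
    a = 1 - ways / assoc
    return a * c_farmiss

# Example: a core holding 8 of 32 ways that recorded 40 far misses
print(estimated_gain(c_farmiss=40, ways=8, assoc=32))   # 30.0 expected extra hits
```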
ABFCP : Partitioning Algorithm • Select the best partition that maximises performance (hits) • Complexity – cores = 2 → possible partitions = 3 – cores = 4 → possible partitions = 19 – cores = 8 → possible partitions = 1107 – cores = 16 → possible partitions = 5196627 • Linear algorithm that selects the best partition or a good approximation thereof. – N/2 comparisons (worst case) → O(N)
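One way such a linear selection could work, as a heavily hedged software sketch (the paper's exact hardware algorithm is not reproduced here): greedily pair the cores with the largest estimated gains against the cores with the smallest estimated losses, moving a way whenever the move increases total hits; at most N/2 pairings are attempted.

```python
def repartition(ways, gains, losses, min_ways=1):
    """Greedy one-pass reallocation: each core changes by at most +/-1 way."""
    n = len(ways)
    receivers = sorted(range(n), key=lambda i: gains[i], reverse=True)
    donors = sorted(range(n), key=lambda i: losses[i])
    new_ways, moved = list(ways), set()
    for r, d in zip(receivers, donors):
        if r == d or r in moved or d in moved:
            continue
        if gains[r] > losses[d] and new_ways[d] > min_ways:
            new_ways[r] += 1        # core r gains the way it benefits most from
            new_ways[d] -= 1        # core d gives up the way it misses least
            moved.update((r, d))
    return new_ways

print(repartition(ways=[16, 16], gains=[30, 5], losses=[20, 10]))   # [17, 15]
```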
ABFCP : Way Partitioning • Way Partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06] • Each line has a core-id field • On a miss, the ways in the set occupied by the miss-causing application are counted – ways_occupied < partition_limit → the victim is the LRU line of another application – Otherwise the victim is the miss-causing application’s own LRU line
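The victim-selection rule stated above, as a minimal sketch (the data layout is an assumption: each line carries its owner's core-id and the set is ordered MRU to LRU):

```python
def select_victim(set_owners, core_id, partition_limit):
    """set_owners: core-id per line, index 0 = MRU, last index = LRU."""
    occupied = sum(1 for owner in set_owners if owner == core_id)
    if occupied < partition_limit:
        # Below its limit: evict the LRU line belonging to another core
        for idx in range(len(set_owners) - 1, -1, -1):
            if set_owners[idx] != core_id:
                return idx
    # At its limit (or no other core's line present): evict its own LRU line
    for idx in range(len(set_owners) - 1, -1, -1):
        if set_owners[idx] == core_id:
            return idx

# Core 0 holds 2 of 4 ways with a limit of 3: core 1's LRU line (index 3) goes
print(select_victim([0, 1, 0, 1], core_id=0, partition_limit=3))   # 3
```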
Evaluation • Configuration – 2,4,8 single-issue, in-order cores – Private L1 I and D caches (32KB, 4-way associative, 32B line size, 1 cycle access latency) – Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16 cycle access latency) – Main memory (32 outstanding requests, 100 cycle access latency) • Benchmarks – 9 apps from JavaGrande + NAS – One application per processor – Simulation stops when one of the benchmarks finishes
Results (Dual core system)
Results (Dual core system)
Results (Quad core system)
Results (Eight core system)
Evaluation • Increasing promise as the number of cores increases • Hardware cost per core – BF arrays (4096 sets × 32b) → 16KB – Counters (4096 sets × 2 counters × 8b) → 8KB – L2 cache (240KB tags + 4MB data) → 4336KB – 24KB per core → 0.55% increase in area • 8-core system – 48KB for the per-line core-id fields – Total overhead 240KB → 5.5% increase over L2
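To make the slide's arithmetic easy to check, a quick calculation using only the sizes quoted in the bullets above:

```python
KB = 8192   # bits per kilobyte

bf_arrays = 4096 * 32              # Bloom filter arrays: 16 KB per core
counters  = 4096 * 2 * 8           # C_LRU + C_FarMiss:    8 KB per core
per_core  = bf_arrays + counters   # 24 KB per core

l2_bits  = (240 + 4 * 1024) * KB               # 240 KB tags + 4 MB data
lines    = 4 * 1024 * KB // (32 * 8)           # 4 MB of 32-byte lines
core_ids = lines * 3                           # 3-bit core-id for 8 cores

print(per_core / KB, per_core / l2_bits)       # 24.0 KB, ~0.55% of L2
print((8 * per_core + core_ids) / KB)          # 240.0 KB total for 8 cores
```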
Evaluation
Related Work • Cache Partitioning Aware Replacement Policy (CPARP) [Dybdahl et al. HiPC ’06] – Cannot deal with applications with non-convex miss-rate curves • Utility-Based Cache Partitioning (UCP) [Qureshi and Patt MICRO ’06] – Smaller overhead – Enforces the same partition over all the cache sets
Conclusions • It is important to share the cache efficiently in CMPs • LRU does not achieve optimal sharing of the cache • Cache partitioning can alleviate its consequences • ABFCP – shows increasing promise as the number of cores increases – provides better performance than LRU at a reasonable cost (a 5.5% area increase for an 8-core system achieves results similar to using LRU with a 50% bigger L2 cache)
Any Questions? Thank you!
Utility-Based Cache Partitioning
Utility-Based Cache Partitioning • High hardware overhead, reduced by: – Dynamic Set Sampling (monitoring only 32 sets) → smaller UMONs – Enforcing the same partition for the whole cache → fewer counters
Utility-Based Cache Partitioning
ABFCP Comparison with UCP • UCP has a lower storage overhead (70KB for an 8-core system) • If UCP attempted to partition on a per-line basis, it would require 11MB per processor • ABFCP is more robust • ABFCP performs better as the number of cores increases
ABFCP Comparison with UCP
CPARP
Conclusions
Evaluation • UCP acquires a more accurate profile than CPARP • Example – curr_hits = 135 – If app2 gets 6 ways, then hits = 145, so UCP repartitions – CPARP does not modify the partition