Memory Hierarchy for Web Search
Grant Ayers*§, Jung Ho Ahn†§, Christos Kozyrakis*, Partha Ranganathan‡
*Stanford University  †Seoul National University  ‡Google
§Work performed while the authors were at Google
The world is headed toward cloud-based services... and we’re still optimizing for SPEC
Research Objective
Design tomorrow’s CPU architectures for OLDI (online, data-intensive) workloads like web search:
1. Provide the first public in-depth study of the microarchitecture and memory-system behavior of commercial web search
2. Propose new performance optimizations with a focus on the memory hierarchy
Results show a 27% performance improvement today, and 38% with future devices.
Understanding Web Search on Current Architectures
Google’s web search is scalable
Scalability:
● Linear core scaling
● Not bandwidth- or I/O-bound
Hardware optimizations:
● SMT (+37%), huge pages (+11%), hardware prefetching (+5%)
Architects can assume excellent software scaling.
Google web search performance on Intel Haswell
[Figure: web search leaf-node CPU utilization, broken down by stalls]
Memory Hierarchy Characterization
Challenges and Methodology
Challenges:
1. No known timing simulator can run search for a non-trivial amount of virtual time
2. Performance counters are limited and often broken
Methodology:
● Measurements from real machines
● Trace-driven functional cache simulation (Intel Pin, 135 billion instructions; see the sketch below)
● Analytical performance modeling
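Where a timing simulator is infeasible, a functional (hit/miss-only) cache simulator driven by an address trace is the workhorse. Below is a minimal sketch of that kind of simulator; the class, trace format, and sizes are illustrative assumptions, not the study's actual Pin-based tool.

```python
# Minimal functional (hit/miss-only, no timing) set-associative cache
# simulator. Illustrative sketch: sizes, names, and the trace format are
# assumptions, not the actual Pin-based tool used in the study.
from collections import OrderedDict

class FunctionalCache:
    """Set-associative cache with LRU replacement, tracking hits/misses only."""
    def __init__(self, size_bytes, assoc, line_bytes=64):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (line_bytes * assoc)
        # One LRU-ordered dict of tags per set.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        set_idx = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[set_idx]
        if tag in ways:
            ways.move_to_end(tag)       # refresh LRU position
            self.hits += 1
            return True
        if len(ways) >= self.assoc:
            ways.popitem(last=False)    # evict the least-recently-used line
        ways[tag] = True
        self.misses += 1
        return False

# Replay a (stand-in) address trace through a 16 MiB, 16-way cache.
l3 = FunctionalCache(size_bytes=16 * 2**20, assoc=16)
for addr in [0x1000, 0x1040, 0x1000, 0x200000, 0x1040]:
    l3.access(addr)
print(f"hits={l3.hits} misses={l3.misses}")
```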
Working set scaling
[Figure: memory accessed in steady state]
● The shard footprint is constant, but the touched footprint grows with cores and time (little data locality in the shard)
● The heap working set converges around 1 GiB, which suggests sharing and cold structures
Overall cache effectiveness
● The L1 and L2 caches experience significant misses of all types
● The L3 cache virtually eliminates code misses but is insufficient for the heap and the shard
What’s the ideal L3 size?
L3 cache scaling
[Figure: L3 hit rate and L3 MPKI vs. capacity; annotations mark 16 MiB as sufficient for instructions, 1 GiB as sufficient for the heap, and the shard in a region of diminishing returns]
● 16 MiB sufficiently removes code misses
● 1 GiB would capture the heap
● Not even 2 GiB captures the shard
Large shared caches are highly effective for heap accesses, but today’s L3 cache is in a region of diminishing returns (a back-of-the-envelope relation between hit rate and MPKI follows below).
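For reference, the hit-rate and MPKI curves in plots like these are tied together by the L3 access rate. A minimal sketch, assuming an illustrative APKI (accesses per kilo-instruction) value:

```python
def l3_mpki(apki, hit_rate):
    """Misses per kilo-instruction from the L3 access rate and hit rate."""
    return apki * (1.0 - hit_rate)

# Example: at an assumed 40 L3 accesses per kilo-instruction, raising the
# hit rate from 60% to 90% cuts MPKI from 16 to 4.
print(l3_mpki(40, 0.60), l3_mpki(40, 0.90))
```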
Memory Hierarchy for Hyper-scale SoCs
Optimization strategy
The analysis indicates diminishing returns for L3 caches, but potential for much larger caches, which leads to two contrasting optimizations:
1. Repurpose expensive on-chip transistors in the L3 cache for cores
2. Exploit the locality in the heap with cheaper, higher-capacity DRAM incorporated into a latency-optimized L4 cache
Cache vs. Cores Trade-off
Intel Haswell¹:
● 18 cores
● 2.5 MiB of L3 per core
● One core costs roughly the die area of 4 MiB of L3
¹ “The Xeon Processor E5-2600 v3: A 22nm 18-core product family” (ISSCC ’15)
Trading cache for cores
Sweep core count and L3 capacity in terms of chip area used:
● Each core costs the area of 4 MiB of L3
● Use Intel’s Cache Allocation Technology (CAT) to vary the L3 from 4.5 to 45 MiB
● Some L3 transistors could be better used for cores (9 cores at 2.5 MiB/core is worse than 11 cores at 1.23 MiB/core)
● Core count is not all that matters: all 18-core configurations with under 1 MiB/core perform poorly
Trading cache for cores
[Figure: cache-for-cores performance]
What’s the right cache-per-core balance? Incorporate the sweep data into a linear model (sketched below):
● Performance is linear with respect to core count
● There are two measurements for each cache ratio
1 MiB of L3 per core allows 5 extra cores and a 14% performance improvement.
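A minimal sketch of a linear model in this spirit: performance scales with core count, each core costs the die area of 4 MiB of L3, and per-core throughput is a saturating function of cache per core. The fit coefficients below are assumptions chosen for illustration, not the measured sweep data, although they land near the slide's 23-core, ~1 MiB/core optimum:

```python
# Sketch of the cache-for-cores trade-off under a fixed die-area budget.
# Area is measured in "MiB of L3" units; one core costs the area of 4 MiB
# of L3 (per the ISSCC '15 Haswell paper). The per-core performance fit is
# an illustrative assumption, not the study's measured data.
AREA_PER_CORE = 4.0
AREA_BUDGET = 18 * AREA_PER_CORE + 45.0   # baseline chip: 18 cores + 45 MiB L3

def per_core_perf(mib_per_core):
    # Assumed saturating fit: more cache per core helps, with diminishing returns.
    return 1.0 - 0.25 / (mib_per_core + 0.25)

def chip_perf(cores):
    l3_mib = AREA_BUDGET - cores * AREA_PER_CORE   # leftover area becomes L3
    return cores * per_core_perf(l3_mib / cores) if l3_mib > 0 else 0.0

# Sweep core counts under the fixed area budget and pick the best balance.
baseline = chip_perf(18)
best = max(range(18, 28), key=chip_perf)
print(f"best: {best} cores, "
      f"{(AREA_BUDGET - best * AREA_PER_CORE) / best:.2f} MiB/core, "
      f"+{chip_perf(best) / baseline - 1:.0%} over 18 cores")
```

With these assumed coefficients the sweep prints "best: 23 cores, 1.09 MiB/core, +14% over 18 cores", matching the shape of the trade-off described on the slide.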
Latency-optimized L4 cache
Target the available locality in the fixed ~1 GiB heap:
● Not feasible with an on-chip SRAM cache
● Requires an off-chip, on-package eDRAM cache
  ○ eDRAM provides lower latency
  ○ A multi-chip package allows the use of existing 128 MiB dies
● Less than 1% die-area overhead
● Uses an existing high-bandwidth interface such as Intel’s OPIO
Latency-optimized L4 cache
Proposed L4 cache based on eDRAM (sketched below):
● 1 GiB of on-package eDRAM
● 40-60 ns hit latency
● Based on the Alloy cache
● Lookups proceed in parallel with memory
● Direct-mapped
● No coherence
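A minimal sketch of the direct-mapped, Alloy-style lookup described above. In the real design the tag is stored alongside the data in eDRAM so one access yields both, and the DRAM request is launched in parallel with the lookup; the structures and names here are illustrative assumptions.

```python
# Sketch of a direct-mapped, Alloy-style L4 lookup. All sizes and names are
# illustrative assumptions. A hardware implementation would issue the DRAM
# read in parallel with the L4 lookup ("parallel lookups with memory") and
# squash it on a hit; this sketch only issues it after a detected miss.
LINE_BYTES = 64
L4_BYTES = 1 << 30                       # 1 GiB of on-package eDRAM
NUM_LINES = L4_BYTES // LINE_BYTES

l4_tags = [None] * NUM_LINES             # tag stored with data in eDRAM
l4_data = [None] * NUM_LINES

def l4_access(addr, read_dram):
    line = addr // LINE_BYTES
    idx = line % NUM_LINES               # direct-mapped: one candidate line
    tag = line // NUM_LINES
    if l4_tags[idx] == tag:
        return l4_data[idx], True        # hit: 40-60 ns in the proposed design
    data = read_dram(addr)               # miss: fill from DRAM, evict in place
    l4_tags[idx], l4_data[idx] = tag, data
    return data, False

# Usage with a stand-in DRAM read function:
data, hit = l4_access(0x1234, read_dram=lambda a: b"line-data")
print(hit)
```

The direct-mapped organization is what makes the parallel lookup cheap: there is exactly one candidate location per address, so tag check and data fetch collapse into a single eDRAM access.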
L4 cache miss profile
[Figure: L4 hit rate and L4 MPKI]
The baseline is the optimized 23-core design with 1 MiB of L3 cache per core (iso-area to the 18-core design).
L4 cache + cache-for-cores performance
[Figure: combined L4 and cache-for-cores performance]
● 27% overall performance improvement
● 22% in a “pessimistic” scenario (60 ns hit latency, 5 ns additional miss penalty)
● 38% in a “future” scenario (+10% latency and misses)
A simple sanity check of how such scenarios move the bottom line follows below.
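One way to see how hit latency and hit rate trade off in results like these is a simple average-memory-access-time (AMAT) model. All latencies and hit rates below are illustrative assumptions, not the study's inputs:

```python
# Back-of-the-envelope AMAT model for an L3 + L4 + DRAM hierarchy.
# Because the L4 lookup runs in parallel with memory, an L4 miss costs
# roughly the DRAM latency plus a small additional penalty. Every number
# here is an illustrative assumption, not a measured input from the study.
def amat(l3_hit, l3_lat, l4_hit, l4_lat, dram_lat):
    """Average memory access time (ns) for accesses that reach the L3."""
    return (l3_hit * l3_lat
            + (1 - l3_hit) * (l4_hit * l4_lat
                              + (1 - l4_hit) * dram_lat))

# Without an L4, every L3 miss pays the full DRAM latency; with an assumed
# 80%-hit, 50 ns L4 (and a 5 ns extra miss penalty), AMAT drops sharply.
no_l4   = amat(l3_hit=0.5, l3_lat=15, l4_hit=0.0, l4_lat=0,  dram_lat=100)
with_l4 = amat(l3_hit=0.5, l3_lat=15, l4_hit=0.8, l4_lat=50, dram_lat=105)
print(f"AMAT without L4: {no_l4:.1f} ns, with L4: {with_l4:.1f} ns")
```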
Ongoing work
1. Shard memory misses
2. Instruction misses
3. Branch stalls and BTB misses
4. New system-balance ratios
[Figure: web search leaf-node CPU utilization]
Conclusions
1. OLDI is an important class of applications about which little public data is available
2. Web search is a canary application for OLDI, and it runs inefficiently on today’s hardware
3. Through a careful rebalancing of the memory hierarchy, we can improve Google’s web search by 27% today, and by 38% in the future
4. There is high potential for new SoCs specifically designed for OLDI workloads