CS 6958 LECTURE 11 CACHES February 12, 2014
Fancy Machines
- baetis.cs.utah.edu
- heptagenia.cs.utah.edu
- Quad-core Haswell Xeon @ 3.5GHz
- 8 threads
Box Intersection

Box::intersect(HitRecord& hit, const Ray& ray) const {
  float tnear, t2;
  Vector inv = 1.f / ray.direction();
  Vector p1 = (c1 - ray.origin()) * inv;
  Vector p2 = (c2 - ray.origin()) * inv;
  Vector mins = p1.vecMin(p2);
  Vector maxs = p1.vecMax(p2);
  tnear = max(mins.x(), max(mins.y(), mins.z()));
  t2 = min(maxs.x(), min(maxs.y(), maxs.z()));

  if (tnear < t2)
    // Hit
}

Make sure to account for inside hits!
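One way to account for inside hits is sketched below; it assumes the same Box members (c1, c2) and Vector helpers as above, and the hit-recording call at the end is illustrative rather than the actual HitRecord API:

// Sketch only: box test that also handles a ray origin inside the box.
void Box::intersect(HitRecord& hit, const Ray& ray) const {
  Vector inv = 1.f / ray.direction();
  Vector p1 = (c1 - ray.origin()) * inv;
  Vector p2 = (c2 - ray.origin()) * inv;
  Vector mins = p1.vecMin(p2);
  Vector maxs = p1.vecMax(p2);
  float tnear = max(mins.x(), max(mins.y(), mins.z()));
  float t2 = min(maxs.x(), min(maxs.y(), maxs.z()));

  if (tnear > t2 || t2 < 0.f)
    return;                    // missed, or the box is entirely behind the ray

  // tnear < 0 means the origin is inside the box: it still counts as a hit,
  // and the first positive crossing is the exit distance t2.
  float t = (tnear >= 0.f) ? tnear : t2;
  hit.hit(t);                  // illustrative; record t with your HitRecord API
}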
BVH layout

int start_bvh = GetBVH();

Single BVH node (8 words):
  box corner c_min (3 floats) | box corner c_max (3 floats) | num children (-1 indicates interior node) | child ID

Example: the first node (at start_bvh) is an interior node with child ID 1; the next node (at start_bvh + 8) is an interior node with child ID 3.
BVH layout
- Sibling nodes are next to each other in memory
- Right child's ID is always left_id + 1
- Example: node 2 (at start_bvh + (2 * 8)) has child 13, so its left child is node 13 (at start_bvh + (13 * 8)) and its right child is implicitly node 14, immediately after
BVH layout
BVH Nodes
- As with all data held in global memory, loading each node on demand through a small constructor is recommended:

BVHNode::BVHNode(int addr) {
  box.c1 = loadVectorFromMemory(addr + 0);
  box.c2 = loadVectorFromMemory(addr + 3);
  num_children = loadi(addr + 6);
  child = loadi(addr + 7);
}
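If loadVectorFromMemory isn't already part of your Vector code, a helper along these lines keeps the constructor readable. This is a sketch that assumes a TRaX-style loadf(addr) intrinsic for reading one float from global memory (the integer analogue, loadi, appears above):

inline Vector loadVectorFromMemory(const int addr) {
  // Assumes loadf(addr) reads one float from global memory.
  return Vector(loadf(addr + 0),   // x
                loadf(addr + 1),   // y
                loadf(addr + 2));  // z
}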
Leaf Nodes
- Identified differently:
  - num_children > 0
  - child_ID = address of the node's first triangle
    - Not the ID of the first triangle!
    - A leaf node's triangles are consecutive in memory
Leaf Nodes

Leaf node layout: box corner c_min (3 floats) | box corner c_max (3 floats) | child = 682 | num children = 2
- 682 is an address, not an ID! The remaining BVH nodes are followed in memory by the triangles, and this leaf's triangles (T1, T2) sit consecutively starting at address 682.
Example

inline void intersect(HitRecord& hit, const Ray& ray) const {
  int stack[32];
  int node_id = 0;
  int sp = 0;
  while (true) {
    int node_addr = start_bvh + node_id * 8;
    BVHNode node(node_addr);
    HitRecord boxHit;
    node.box.intersect(boxHit, ray);
    if (boxHit.didHit())
      // and so on...
  }
}
Example (continued)

left_id = node.child;
if (node.num_children < 0) { // interior node
  node_id = left_id;          // descend into the left child
  stack[sp++] = left_id + 1;  // push the right child for later
  continue;
}
// leaf node
tri_addr = left_id;
for (int i = 0; i < node.num_children; ++i) {
  // intersect triangles
}
// ... finish outer loop, manage stack
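Putting the two fragments together, an unoptimized traversal might look like the sketch below. It assumes the BVHNode constructor from earlier, a Tri(addr) constructor that loads one triangle from global memory, and a TRI_SIZE_WORDS constant matching your triangle layout; none of those details are spelled out on these slides.

inline void BoundingVolumeHierarchy::intersect(HitRecord& hit,
                                               const Ray& ray) const {
  int stack[32];
  int sp = 0;
  int node_id = 0;

  while (true) {
    BVHNode node(start_bvh + node_id * 8);

    HitRecord boxHit;                  // scratch record just for the box test
    node.box.intersect(boxHit, ray);

    if (boxHit.didHit()) {
      if (node.num_children < 0) {     // interior: descend left, push right
        node_id = node.child;
        stack[sp++] = node.child + 1;
        continue;
      }
      // leaf: child is the address of the first of num_children triangles
      for (int i = 0; i < node.num_children; ++i) {
        Tri tri(node.child + i * TRI_SIZE_WORDS);
        tri.intersect(hit, ray);
      }
    }

    if (sp == 0)                       // stack empty: traversal finished
      return;
    node_id = stack[--sp];             // otherwise pop the next node to visit
  }
}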
BVH Implementation
- My bvh class contains just a pointer to start_bvh:

BoundingVolumeHierarchy(int _start_bvh) {
  start_bvh = _start_bvh;
}

- Nodes are loaded one at a time as needed
- Don't pre-load all the nodes!
  - They will not fit on each thread's stack
BVH Implementation

inline void intersect(HitRecord& hit, const Ray& ray) const

- Note that the hit record passed in is for the final hit triangle (or none if the ray only hits the background)
- Don't use the same one for testing against boxes!
Big Picture

for each pixel...
  Ray ray;
  camera.makeRay(ray, x, y);
  HitRecord hit;
  scene.bvh.intersect(hit, ray);
  result = shade(...);
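As a sketch, with width, height, and writePixel() standing in for whatever image setup you already have (the shade signature is also just illustrative):

for (int y = 0; y < height; ++y) {
  for (int x = 0; x < width; ++x) {
    Ray ray;
    camera.makeRay(ray, x, y);               // primary ray for this pixel

    HitRecord hit;                           // final hit: triangle or background
    scene.bvh.intersect(hit, ray);

    Color result = shade(scene, ray, hit);   // illustrative signature
    writePixel(x, y, result);                // placeholder framebuffer write
  }
}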
Updated Scene
- The Scene class (or struct) should no longer hold typed pointers to a hard-coded scene:

int start_materials;
PointLight the_light;           // only one light now
BoundingVolumeHierarchy bvh;

- Make sure you pass the scene by reference to any shade functions
Performance
- Remember, there are some optimizations:
  - Traverse down the closer child first
  - Don't traverse a subtree if a closer triangle has already been found
- The pseudo-code I've shown doesn't do this
- Can be tricky!
  - What if boxes overlap, and the intersection is inside a box?
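For reference, here is a sketch of the closer-child-first idea, written as a replacement for the interior-node branch of the traversal sketch above. It assumes the box test can report its entry distance (boxHit.minT()) and that hit.minT() holds the closest triangle hit so far; both method names are illustrative, and the overlapping/inside-box case still needs the extra care mentioned above.

// Sketch: interior node, visit the nearer child first and skip subtrees
// whose boxes start beyond the closest triangle hit found so far.
BVHNode left(start_bvh + node.child * 8);
BVHNode right(start_bvh + (node.child + 1) * 8);

HitRecord leftHit, rightHit;
left.box.intersect(leftHit, ray);
right.box.intersect(rightHit, ray);

bool goLeft  = leftHit.didHit()  && leftHit.minT()  < hit.minT();
bool goRight = rightHit.didHit() && rightHit.minT() < hit.minT();

if (goLeft && goRight) {
  if (leftHit.minT() <= rightHit.minT()) {   // left is nearer
    node_id = node.child;
    stack[sp++] = node.child + 1;
  } else {                                   // right is nearer
    node_id = node.child + 1;
    stack[sp++] = node.child;
  }
  continue;
} else if (goLeft) {
  node_id = node.child;
  continue;
} else if (goRight) {
  node_id = node.child + 1;
  continue;
}
// neither child is worth visiting: fall through and pop from the stack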
Program 3
Caches
- Why?
- Most naïve option: transfer a single word from DRAM when needed
  - This is one model a programmer can assume
[Figure: memory hierarchy from DRAM through L2 and L1 down to a thread (PC, stack, register file, local RAM)]
Access Patterns (RT example)
- If we load c_min.x, what is the likelihood we will load c_min.y?
  (recall the BVH node layout: c_min, c_max, ...)
- Spatial locality
  - Almost all data/workloads have this property
Temporal Locality
- In general:
  - If we needed some data at time T, we will likely need it again at T + epsilon
- If ray1 takes a certain path through the BVH, ray2 will likely take a similar path
- Becomes even more important with shared caches
Temporal Locality
[Figure: ray1 and ray2 following similar paths through the scene]
Amortization
- Temporal and spatial locality are just the assumptions that allow us to amortize DRAM accesses
- Activating DRAM for a read has huge overhead
  - The read itself is somewhat insensitive to the amount of data
- "DRAM exists to refill cache lines" - Erik
  - Or: cache lines exist to hold DRAM bursts
Slightly More to It
- DRAM is extremely slow
- Caches are extremely fast
  - But they have to be small
- Ideal:
  - Hold a piece of data in cache for as long as it is possibly needed
In TRaX
- 65nm process
- 1GHz

DRAM: ~20 - 200 cycles (depends on pressure, access patterns), ~20 - 70 nJ / read, ~4GB
L2:   3 cycles, ~1.2 nJ / read, ~64KB - 4MB
L1:   1 cycle, ~0.13 nJ / read, ~4KB - 32KB
Life of a Read
- LOAD r2 r0, ...
- The address in r0 (from the register file) is mapped to an L1 line number; check the tag: hit?
- If not, map the address to an L2 line and check that tag: hit?
- If L2 also misses, the request goes to the DRAM channel and waits; when the data returns, an old line is evicted to make room
Address → Line
- Assuming 64B lines
- Address space >> cache size
  - A physical cache line holds many different address ranges

Line 0: addresses 0 - 63 (tag = 0), addresses 256 - 319 (tag = 1), ...
Line 1: addresses 64 - 127 (tag = 0), ...
Line 2: addresses 128 - 191 (tag = 0), ...
Line 3: addresses 192 - 255 (tag = 0), ...
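In code, that mapping is just a divide and a modulo (a shift and a mask when sizes are powers of two); the 4-line, 64-byte-line numbers below simply mirror the example above:

#include <cstdint>
#include <cstdio>

const uint32_t LINE_BYTES = 64;
const uint32_t NUM_LINES  = 4;

int main() {
  for (uint32_t addr = 0; addr < 320; addr += 64) {
    uint32_t line = (addr / LINE_BYTES) % NUM_LINES;   // which physical line
    uint32_t tag  = addr / (LINE_BYTES * NUM_LINES);   // which address range
    printf("addr %3u -> line %u, tag %u\n", addr, line, tag);
  }
  return 0;
}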
Associativity
- A 2-way set-associative cache checks 2 possible lines for a given address
  - Why?
Associativity
- A 2-way set-associative cache checks 2 possible lines for a given address
  - Why?
- When evicting, we can now make an intelligent choice
  - Evict the oldest line, LRU, etc.
Associativity
- The TRaX cache model is "direct-mapped"
  - Only one address → line mapping
- Direct-mapped caches are smaller, cheaper, and lower power
  - For RT specifically, this seems to work well (91-94% hit rate)
Parallel Accesses (shared cache)
[Figure: Thread A reads address 193, which maps to L1 line 3: miss. The request goes to the next level, and line 3 is marked as incoming with recipient list [A].]
Parallel Accesses
[Figure: Thread B then reads address 197, which maps to the same L1 line 3: also a miss, but the line is already incoming, so B is added to the recipient list [A, B].]
Parallel Accesses
[Figure: When the line arrives, both reads complete and line 3 is now cached.]
MSHR
- "Miss Status Handling Register" (one per line)
  - Tracks status/recipients for an incoming line
- Thread B incurs a "hit under miss"
  - Difference?
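A minimal sketch of the bookkeeping one MSHR holds, just to make the mechanism concrete; the field names and fixed-size recipient list are illustrative, not the actual TRaX hardware structure:

#include <cstdint>

struct MSHR {
  bool     valid;            // is a fill outstanding for this line?
  uint32_t tag;              // which address range is being fetched
  int      recipients[8];    // threads waiting on the line, e.g. [A, B]
  int      num_recipients;
};

When thread B's read targets the same in-flight line, it is appended to recipients instead of generating another L2/DRAM request; that is the "hit under miss".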
Hit Under Miss
- Thread B incurs a "hit under miss"
  - Difference?
- A: one L1 access, one L2 access
- B: one L1 access
Single Thread HUM

LOAD r7, r3, 0
LOAD r9, r3, 1
LOAD r11, r3, 2
LOAD r6, r3, 3
LOAD r13, r3, 4
LOAD r8, r3, 5

- Assume the relevant lines are initially uncached
- Generates:
  - 6 L1 accesses
  - 1 L2 access
  - 1 DRAM access
Many-Core
- All processed simultaneously
- Suppose each of these nodes maps to the same cache line (but a different tag)
Ray Coherence
- Processing coherent rays simultaneously results in data locality
  - Lots of research involving collecting coherent rays
  - More on this later
[Figure: coherent vs. incoherent ray bundles]