A Reconfigurable Architecture for Load-Balanced Rendering
Jiawen Chen, Michael I. Gordon, William Thies, Matthias Zwicker, Kari Pulli, Frédo Durand
Graphics Hardware, July 31, 2005, Los Angeles, CA

The Load Balancing Problem
• GPUs: fixed resource allocation
  – Fixed number of functional units per task
  – Horizontal load balancing achieved via data parallelism
  – Vertical load balancing impossible for many applications
• Our goal: flexible allocation
  – Both vertical and horizontal
  – On a per-rendering-pass basis
[Figure: Parallelism in multiple graphics pipelines — data parallelism (horizontal) across units within each stage, task parallelism (vertical) across the V/R/T/F/D pipeline stages]

Application-Specific Load Balancing
[Figure: Simplified graphics pipeline — Input → Vertex × 2 → Sync → Triangle Setup → Pixel × 2]
[Figure: Screenshot from Counterstrike]

Application-Specific Load Balancing
[Figure: Simplified graphics pipeline — Input → Vertex × 2 → Sync → Triangle Setup → Rasterizer × 2 → rest of pixel pipeline × 2]
[Figure: Screenshot from Doom 3]

Our Approach: Hardware
• Use a general-purpose multi-core processor
  – With a programmable communications network
  – Map pipeline stages to one or more cores
• MIT Raw Processor
  – 16 general-purpose cores
  – Low-latency programmable network
[Figure: Diagram of a 4x4 Raw processor; die photo of the 16-tile Raw chip]

Our Approach: Software
• Specify the graphics pipeline in software as a stream program
  – Easily reconfigurable
• Static load balancing
  – Stream graph specifies the resource allocation
  – Tailor the stream graph to each rendering pass
• StreamIt programming language
[Figure: Sort-middle graphics pipeline stream graph — Input → split → Vertex × 2 → join → Triangle Setup → split → Pixel × 2]

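The sort-middle stream graph above might be written in StreamIt roughly as follows. This is a sketch, not the talk's actual code: the filter names, the splitjoin widths, and the split/join modes are illustrative.

```
// Hypothetical sketch of the sort-middle pipeline as a StreamIt program.
// Filter bodies are elided; the splitjoin widths ARE the resource allocation.
void->void pipeline SortMiddle {
    add Input();
    add splitjoin {              // horizontal parallelism: 2 vertex units
        split roundrobin;        // distribute vertices across the units
        add VertexShader();
        add VertexShader();
        join roundrobin;
    };
    add TriangleSetup();
    add splitjoin {              // 2 pixel pipelines
        split duplicate;         // each pipeline sees every triangle...
        add PixelPipeline(0);    // ...and rasterizes only its screen region
        add PixelPipeline(1);
        join roundrobin;
    };
    add FrameBuffer();
}
```

Reallocating resources between passes then amounts to adding or removing `add` lines inside a splitjoin and recompiling.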
Benefits of Programmable Approach
• Compile the stream program to a multi-core processor
• Flexible resource allocation
• Fully programmable pipeline
  – Pipeline specialization
• Nontraditional configurations
  – Image processing
  – GPGPU
[Figure: Stream graph for the graphics pipeline, compiled by StreamIt to a layout on 8x8 Raw]

Related Work
• Scalable architectures
  – Pomegranate [Eldridge et al., 2000]
• Streaming architectures
  – Imagine [Owens et al., 2000]
• Unified shader architectures
  – ATI Xenos

Outline
• Background
  – Raw Architecture
  – StreamIt programming language
• Programmer Workflow
  – Examples and Results
• Future Work

The Raw Processor
• A scalable computation fabric
  – Mesh of identical tiles
  – No global signals
• Programmable interconnect
  – Integrated into the bypass paths
  – Register-mapped
  – Fast neighbor communication
  – Essential for flexible resource allocation
• Raw tiles
  – Compute processor
  – Programmable switch processor
[Figure: A 4x4 Raw chip — each tile pairs computation resources with a switch processor]

The Raw Processor
• Current hardware
  – 180 nm process
  – 16 tiles at 425 MHz
  – 6.8 GFLOPS peak
  – 47.6 GB/s memory bandwidth
• Simulation results based on an 8x8 configuration
  – 64 tiles at 425 MHz
  – 27.2 GFLOPS peak
  – 108.8 GB/s memory bandwidth (32 ports)
[Figure: Die photo of the 16-tile Raw chip — 180 nm process, 331 mm²]

StreamIt
• High-level stream programming language
  – Architecture-independent
• Structured stream model
  – Computation organized as filters in a stream graph
  – FIFO data channels
  – No global notion of time
  – No global state
[Figure: Example stream graph]

StreamIt Graph Constructs
• filter — basic unit of computation
• pipeline — sequential composition
• splitjoin — parallel computation: a splitter fans out to parallel streams, a joiner merges them
• feedback loop — cyclic composition via a joiner and a splitter
• Each child may be any StreamIt language construct
[Figure: Graphics pipeline stream graph built from these constructs]

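The constructs above can be illustrated with a small StreamIt fragment (a sketch with invented identifiers; only the shapes matter):

```
// filter: the basic unit of computation, with a steady-state work function
float->float filter Scale(float s) {
    work push 1 pop 1 {
        push(s * pop());
    }
}

// pipeline composes streams sequentially; splitjoin composes them in parallel
float->float pipeline Example {
    add Scale(2.0);
    add splitjoin {
        split duplicate;     // copy each item to both branches
        add Scale(0.5);
        add Scale(3.0);
        join roundrobin;     // interleave the branch outputs
    };
}
```

Because any child of a pipeline or splitjoin may itself be any construct, whole pipeline stages nest cleanly inside the graph.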
Automatic Layout and Scheduling
• StreamIt compiler performs layout and scheduling on Raw
  – Simulated-annealing layout algorithm
  – Generates code for the compute processors
  – Generates routing schedules for the switch processors
[Figure: Stream graph (Input → Vertex Processor → Sync → Triangle Setup → Rasterizer → Pixel Processor → Frame Buffer), compiled by the StreamIt compiler to a layout on 8x8 Raw]

Outline
• Background
  – Raw Architecture
  – StreamIt programming language
• Programmer Workflow
  – Examples and Results
• Future Work

Programmer Workflow
• For each rendering pass:
  – Estimate resource requirements
  – Implement the pipeline in StreamIt
  – Adjust splitjoin widths
  – Compile with the StreamIt compiler
  – Profile the application
[Figure: Sort-middle stream graph — Input → split → Vertex × 2 → join → Triangle Setup → split → Pixel × 2]

Switching Between Multiple Configurations
• Multi-pass rendering algorithms
  – Switch configurations between passes
  – A pipeline flush is required anyway (e.g., shadow volumes)
[Figure: Configuration 1 vs. Configuration 2]

Experimental Setup
• Compare the reconfigurable pipeline against a fixed resource allocation
• Use the same inputs on the Raw simulator
• Compare throughput and utilization
[Figure: Fixed resource allocation — manual layout on Raw with 6 vertex units and 15 pixel pipelines (Input → Vertex Processor → Sync → Triangle Setup → Rasterizer → Pixel Processor → Frame Buffer)]

Example: Phong Shading
• Per-pixel Phong-shaded polyhedron
• 162 vertices, 1 light
• Covers a large area of the screen
• Allocate only 1 vertex unit
• Exploit task parallelism
  – Devote 2 tiles to the pixel shader
  – 1 for computing the lighting direction and normal
  – 1 for shading
• Pipeline specialization
  – Eliminate texture-coordinate interpolation, etc.
[Figure: Output, rendered using the Raw simulator]

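The two-tile pixel stage can be expressed as a two-filter StreamIt pipeline, sketched below. The type and filter names are hypothetical; the real filters operate on the application's fragment format.

```
// Sketch (hypothetical names): the specialized pixel stage as a two-filter
// StreamIt pipeline. The compiler can place each filter on its own tile,
// so lighting setup and shading run task-parallel, pipelined per fragment.
Fragment->Color pipeline PixelShader {
    add LightingSetup();   // tile 1: per-fragment light direction and normal
    add PhongShade();      // tile 2: evaluate the Phong lighting model
}
```

Specialization happens at the same level: filters for unused work (e.g., texture-coordinate interpolation) are simply never added to the graph.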
Phong Shading Stream Graph
[Figure: Stream graph (Input → Vertex Processor → Triangle Setup → Rasterizer → Pixel Processor A → Pixel Processor B → Frame Buffer) and its automatic layout on Raw]

Utilization Plot: Phong Shading
[Figure: Utilization plots — reconfigurable pipeline vs. fixed pipeline]

Example: Shadow Volumes
• 4 textured triangles, 1 point light
• Very large shadow volumes cover most of the screen
• Rendered in 3 passes
  – Initialize the depth buffer
  – Draw extruded shadow-volume geometry with the Z-fail algorithm
  – Draw textured triangles with stencil testing
• Different configuration for each pass
  – Adjust the ratio of vertex to pixel units
  – Eliminate unused operations
[Figure: Output, rendered using the Raw simulator]

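Because passes 1 and 2 write only depth and stencil, the pixel-shading stage can be dropped from the stream graph entirely and the freed tiles given to vertex work. A hedged sketch of such a specialized pass (illustrative names and widths, not the talk's exact code):

```
// Sketch: specialized pipeline for the depth-only / Z-fail passes.
// No pixel-shading filters at all; shadow-volume extrusion is
// vertex-heavy, so the vertex splitjoin gets the extra tiles.
void->void pipeline ShadowPass12 {
    add Input();
    add splitjoin {
        split roundrobin;
        add VertexShader();    // wider than in the textured pass
        add VertexShader();
        join roundrobin;
    };
    add TriangleSetup();
    add Rasterizer();          // updates depth and stencil only
    add FrameBuffer();
}
```

Pass 3 restores the pixel stage (texture lookup and filtering) and shifts tiles back toward pixel work.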
Shadow Volumes Stream Graph: Passes 1 and 2
[Figure: Input → Vertex Processor → Triangle Setup → Rasterizer → Frame Buffer]

Shadow Volumes Stream Graph: Pass 3
[Figure: Stream graph (Input → Vertex Processor → Triangle Setup → Rasterizer → Texture Lookup → Texture Filtering → Frame Buffer) and its automatic layout on Raw]

Utilization Plot: Shadow Volumes
[Figure: Utilization plots for the fixed and reconfigurable pipelines across passes 1–3]

Limitations
• Software rasterization is extremely slow
  – 55 cycles per fragment
• Memory system
  – The technique does not optimize for texture access

Future Work
• Augment Raw with special-purpose hardware
• Explore the memory hierarchy
  – Texture prefetching
  – Cache performance
• Single-pass rendering algorithms
  – Load imbalances may occur within a pass
  – Decompose the scene into multiple passes
  – Trade off the throughput gained from better load balance against the cost of a flush
• Dynamic load balancing

Summary
• Reconfigurable architecture
  – Application-specific static load balancing
  – Increased throughput and utilization
• Ideas:
  – General-purpose multi-core processor
  – Programmable communications network
  – Streaming characterization

Acknowledgements
• Mike Doggett, Eric Chan
• David Wentzlaff, Patrick Griffin, Rodric Rabbah, and Jasper Lin
• John Owens
• Saman Amarasinghe
• Raw group at MIT
• DARPA, NSF, MIT Oxygen Alliance