
SLIDE 1

The programmer's view of a dynamically reconfigurable architecture

Luciano Lavagno
Politecnico di Torino
lavagno@polito.it

Joint work with:
Fabio Campi, Roberto Guerrieri, Andrea Lodi, Claudio Mucci, Mario Toma (Università di Bologna)
Francesco Gregoretti, Alberto La Rosa, Mihai Lazarescu, Claudio Passerone (Politecnico di Torino)

SLIDE 2

Outline

  • Motivations
  • The Target Reconfigurable Processor (XiRisc)
  • Design Space Exploration
    – Design flow
    – Optimizations and limitations
  • Turbo-decoder example
    – Memory optimizations
    – Dynamic instruction selection
    – Mapping
    – Experimental results
  • Conclusions

SLIDE 3

Motivations
The reconfiguration landscape

[Chart: performance (GOPS) trend, Jan 1990 to Jan 2010, for ASIC, FPGAs (Altera, Xilinx), DSP, MIPS, and Intel MPU]

[Figure: reconfiguration granularity (fine to coarse) plotted against reconfiguration frequency (reset, context, clock). FPGAs (e.g. Xilinx Virtex) are fine-grained and reconfigured at reset; processors (sub-word ops, loop buffers, embedded MAC) are coarse-grained and "reconfigured" every clock; processors with dynamically reconfigurable HW lie in between. Source: Philips]

SLIDE 4

Past work

  • Reconfigurable array as co-processor:
    – GARP (Callahan), Nimble compiler (Li)
  • Reconfigurable array as functional unit:
    – PRISC (Razdan), Chimaera (Hauck), ConCISe (Kastrup)
  • Key issues:
    – path to memory and I/O limitations (co-processor better)
    – ease of integration into ISA and compiler (FU better)
    – row-based architecture for good arithmetic op mapping
    – efficient HW synthesis onto non-standard architecture

SLIDE 5

The XiRisc Architecture

  • 2-channel VLIW elaboration
  • Shared DSP-like function units
  • Embedded pGA device

SLIDE 6

Dynamic Instruction Set Extension

SLIDE 7

Dynamic Instruction Set Extension

[Figure: an instruction stream containing pgaload, pgaop $3,$4,$5, and add $8,$3. The pgaload instruction fetches a configuration from the Configuration Memory; pgaop then reads its operands from and writes its result to the Register File like any other instruction]

SLIDE 8

Computing on the PiCoGA

[Figure: the PiCoGA and its Control Unit. A data flow graph is partitioned and mapped onto the array as Pga_op1 and Pga_op2, with data flowing in and out of the array]

SLIDE 9

Multi-context Array

[Figure: a Configuration Cache holding functions 1 to n feeds the PiCoGA, which holds four of them at a time in its configuration planes]

  • Four configuration planes are available
  • Plane switching takes one clock cycle
  • While one plane is loading, the others can work undisturbed
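The plane behaviour described above can be sketched as a toy cycle-counting model. This is purely illustrative (the type and function names are not part of the XiRisc toolchain): a plane switch is charged one cycle, and loading an inactive plane is modelled as overlapping with computation, so it costs nothing.

```c
#include <assert.h>

#define NUM_PLANES 4  /* the PiCoGA has four configuration planes */

/* Hypothetical sketch of the multi-context behaviour. */
typedef struct {
    int active;               /* plane currently computing */
    long cycles;              /* elapsed clock cycles */
    int func[NUM_PLANES];     /* function id held by each plane */
} picoga_model;

/* (Re)load an inactive plane; overlaps with computation, so no cycle
   cost is charged in this model (real loading takes many cycles, but
   the other planes keep working undisturbed). */
void start_load(picoga_model *p, int plane, int func_id) {
    assert(plane != p->active);   /* cannot overwrite the active plane */
    p->func[plane] = func_id;
}

/* Switching the active plane takes one clock cycle. */
void switch_plane(picoga_model *p, int plane) {
    p->active = plane;
    p->cycles += 1;
}

/* Run a pGA op of the given latency on the active plane. */
long run_op(picoga_model *p, long latency) {
    p->cycles += latency;
    return p->cycles;
}
```

In this model, loading a new function while another runs, then switching to it, costs only the single switch cycle on top of the op latencies.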

SLIDE 10

Design Space Exploration

  • Software developer's perspective:
    – wants only the speed-up (cc -O foo.c)
    – does not want to see the architecture
  • Reconfigurable processor compilers enable the transparent use of the reconfigurable instruction set via:
    – pseudo-function calls ("intrinsics")
    – language extensions (pragmas)
  • Design flow:
    – identify compute-intensive kernels
    – group instructions into sets of user-defined pGA instructions
    – use cost figures to compare costs and performance of different HW/SW partitions
    – refine cost figures by manual or automatic synthesis
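The intrinsic mechanism in the first bullet can be sketched in plain C. Everything here is hypothetical (pga_sad, HAVE_PGA, and the kernel are invented for illustration): the point is that the source keeps a portable fallback body, while a reconfigurable-processor compiler could replace the call with a single user-defined pGA instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical intrinsic: absolute difference, a candidate pGA op. */
#ifdef HAVE_PGA
int32_t pga_sad(int32_t a, int32_t b);   /* maps to one pGA instruction */
#else
static int32_t pga_sad(int32_t a, int32_t b) {
    int32_t d = a - b;                   /* portable software fallback */
    return d < 0 ? -d : d;
}
#endif

/* Compute-intensive kernel: sum of absolute differences. The compiler
   sees an ordinary function call; only the build configuration decides
   whether it runs in software or on the reconfigurable array. */
int32_t sad_kernel(const int32_t *x, const int32_t *y, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += pga_sad(x[i], y[i]);
    return acc;
}
```

This is the "transparent use" idea: the same source compiles unchanged with and without the reconfigurable instruction set.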

SLIDE 11

XiRisc Design Flow

[Figure: design flow. C source enters the front-end, which performs pGA instruction identification and produces a HIR; the compiler scheduler lowers it to a LIR; the back-end emits assembler (fed to simulation and profiling) and Griffy-C (compiled to object code and pGA bitstream). Design space exploration iterates around the whole flow]

SLIDE 12

Manual pGAop identification: example

Source annotated with a pragma:

    int i;

    int bar (int a, int b) {
        int c;
    #pragma pgaop sa 0x12 5 1 2 c a b
        c = (a << 2) + b;
    #pragma end
        return c + a;
    }

    main() {
        i = bar(2,3);
        return;
    }

Resulting code (pGA instruction when compiled with PGA defined, software emulation otherwise):

    int bar (int a, int b) {
        int c;
    #if defined(PGA)
        asm ("pga5 0x12,%0,%1,%2" : "=r"(c) : "r"(a), "r"(b));
    #else
        asm ("topga %1, %2, $0" : : "r"(a), "r"(b));
        asm ("jal _sa");
        asm ("fmpga %0, $0, $0" : "=r"(c) : );
    #endif
        return c + a;
    }
    ...
    #if !defined(PGA)
    void _sa () {
        int c, a, b;
        asm ("move %0,$2; move %1,$3" : "=r"(a), "=r"(b) : "r"(c) : "$2","$3","$4");
        c = (a << 2) + b;   /* delay by 5 cycles */
        asm ("move $2,%0; li $4,5" : : "r"(c) : "$2","$3","$4");
    }
    #endif

SLIDE 13

Griffy-C Compiler

[Figure: the high-level C compiler hands Griffy-C code (a DFG-based, single-assignment description obtained by manual dismantling) to the Griffy compiler; mapping and place & route produce the configuration bits, and the back-end also generates an emulation function with latency and issue delay]

SLIDE 14

Design Space Exploration
Contributions to Power Consumption

[Chart: per-component contributions to power consumption (roughly 5-40% range): instruction memory, data memory, bus architecture, register file, ALU, shifter, multiplier, exception handling, instruction decode, pipeline control]

Optimizations for the Reconfigurable Array:
  • Increase performance: increase concurrency, minimize memory accesses, customize data-width, optimize data structures
  • Reduce energy: reduce instruction fetches, reduce data fetches

SLIDE 15

Optimizations for the Reconfigurable Array

  • Exploit concurrency
    – within the reconfigurable array
      • horizontally: operate on multiple data
      • vertically: pipelined implementation
    – with respect to the standard data-path
  • Optimize data memory
    – internal storage reduces register spills
    – reordering and shifting are free
    – pack data into a single word (SIMD operation)
  • Optimize instruction memory
    – reduced instruction fetches

[Recap diagram, as on SLIDE 14: increase performance via concurrency, fewer memory accesses, custom data-width, and optimized data structures; reduce energy via fewer instruction and data fetches]
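The "pack data into a single word" point can be illustrated with a small sketch (not the deck's actual code; names and widths are chosen for illustration): two 16-bit values share one 32-bit word, so a single operand read feeds two parallel lanes, which is exactly what reduces register-file traffic on the array.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit lanes into one 32-bit word. */
static uint32_t pack16(int16_t hi, int16_t lo) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFF); }

/* Two independent 16-bit additions on packed operands; a reconfigurable
   array can evaluate both lanes concurrently from a single register
   pair, halving operand fetches versus unpacked data. */
static uint32_t add16x2(uint32_t a, uint32_t b) {
    return pack16(hi16(a) + hi16(b), lo16(a) + lo16(b));
}
```

On the PiCoGA the lane split would be wired into the fabric rather than computed with shifts, but the data layout idea is the same.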

SLIDE 16

Design Space Exploration
Limitations of the Reconfigurable Array

  • No direct access to memory
    – processor memory access unit is a bottleneck
  • Finite number of read/write register ports (operands)
    – 4 read, 2 write
  • Finite chip area
  • Number of custom instructions
  • Reconfiguration time
    – 4 configuration caches
  • Limited control flow
    – can implement data-dependent loops and if-then-else

SLIDE 17

UMTS Turbo-decoder

  • UMTS (3GPP) Turbo Code Specification:
    – 8-state trellis, RSC (1, 15/13)_8, rate = 1/3
    – variable frame size, 40 ≤ K ≤ 5114
  • BCJR algorithm
    – max-log-MAP + linear correction
    – 16-bit fixed-point precision
  • BPSK modulation over an AWGN channel

SLIDE 18

UMTS Turbo-decoder
Block diagram

[Figure: decoder block diagram. Two SISO units, connected through the interleaver (Int.) and de-interleaver (Int.^-1), iterate on the received streams X, Y, Z, Z'; each SISO computes the γ, α, β metrics and the output LLR]

SLIDE 19

UMTS Turbo-decoder
Memory layout optimizations

[Figure: the branch metrics γ0-γ3 over trellis stages 1-7 are laid out in memory in processing order and then reordered, while the state metrics α0/α1 are combined through data-width reduction and packing]

SLIDE 20

UMTS Turbo-decoder
pGA Instruction Selection

[Figure: the selected pGA instruction computes a full trellis-stage update, transforming all eight state metrics α0-α7 into α'0-α'7 through banks of adders fed by the branch metrics γ0-γ3]

SLIDE 21

UMTS Turbo-decoder
Mapping pGA instructions

max*: y = ln(e^a + e^b)

    if (abs(a - b) <= T)
        corr = (T - abs(a - b));
    else
        corr = 0;
    return (max(a, b) + corr);

  • Speculative execution
  • Two concurrent max*
    – input/output data packed
  • Pipelined implementation
    – latency: 4 + 1 clock cycles
    – issue delay: 1 clock cycle
  • 2-segment piece-wise linear approximation

[Figure: 4-stage datapath computing a - b and b - a in parallel, muxes selecting max(a, b) and corr = T - |a - b|, and a final adder producing max(a, b) + corr, followed by saturation]
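The max* pseudocode above can be written as a small runnable C function. This is a sketch of the slide's 2-segment piece-wise linear approximation only: the threshold T is a tunable parameter (the deck gives no value), and the 16-bit fixed-point scaling and saturation of the real implementation are omitted for clarity.

```c
#include <assert.h>

/* max*(a, b) approximates ln(e^a + e^b) as max(a, b) plus a linear
   correction term that is active only when |a - b| <= T (2-segment
   piece-wise linear approximation from the slide). */
static int max_star(int a, int b, int T) {
    int diff = a > b ? a - b : b - a;          /* |a - b| */
    int corr = (diff <= T) ? (T - diff) : 0;   /* linear correction */
    int m = a > b ? a : b;                     /* max(a, b) */
    return m + corr;
}
```

The hardware version on this slide evaluates a - b and b - a speculatively in parallel and selects both max(a, b) and the correction with muxes, which is what makes the 1-cycle issue delay possible.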

SLIDE 22

Experimental Results

UMTS Turbo Decoder, K=40, single iteration:

    Step                          Execution cycles   Saved cycles   Speedup
    Original (profile)            177834             -              1x
    Gamma (1)                     173706             4128           1.02x
    LLR (2)                       96913              80921          1.83x
    Butterfly (3)                 115816             62018          1.53x
    Reorder (4)                   161826             15972          1.10x
    (1)+(2)                       92785              85049          1.91x
    (1)+(2)+(3)                   30767              147067         5.78x
    (1)+(2)+(3)+(4) estimation    15157              162677         11.73x
    (1)+(2)+(3)+(4) final         14795              162907         11.90x
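The table's columns are related by simple arithmetic, which makes it easy to sanity-check a row. Using the estimation row as an example (the helper names below are illustrative):

```c
/* Saved cycles = original cycles minus remaining cycles. */
static long saved_cycles(long original, long remaining) {
    return original - remaining;
}

/* Speedup = original cycles divided by remaining cycles. */
static double speedup_ratio(long original, long remaining) {
    return (double)original / (double)remaining;
}
```

For the (1)+(2)+(3)+(4) estimation row: 177834 - 15157 = 162677 saved cycles, and 177834 / 15157 ≈ 11.73x, matching the table.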

SLIDE 23

Experimental Results: Speed-up and Energy Reduction

    Algorithm           Speed-up (vs. std. XiRisc)   Energy reduction (vs. std. XiRisc)
    DES encryption      13.5x                        89%
    Turbo decoder       11.7x                        75%
    Motion estimation   10x                          80%
    Median filter       7.7x                         60%
    CRC                 4.3x                         49%

As a comparison, Tensilica achieves:
  • 10X better speed-up (50-100X) on similar examples
  • using 10X more gates (i.e. similar area)
  • without the reconfiguration flexibility
SLIDE 24

Conclusions

  • Reconfigurable computing dramatically speeds up highly data- and control-intensive applications
  • Traditional software and hardware design flows do not support reconfigurable architectures
  • Software-oriented design flow:
    – fast exploration of HW and SW alternatives
    – no detailed knowledge of the underlying architecture required
  • Future work
    – automatic kernel extraction
    – fully automated path to implementation