von Neumann von Neumann vs. Harvard von Neumann Same memory - PDF document

Computer Architecture Microprocessor Architecture in a nutshell • Alternative approaches • Separation of CPU and memory distinguishes programmable computer. • Two opposite examples • CPU fetches instructions from memory. • SHARC • Registers help out: program counter (PC), • ARM7 instruction register (IR), general-purpose registers, etc. Chenyang Lu CSE 467S 1 Chenyang Lu CSE 467S 2 von Neumann von Neumann vs. Harvard • von Neumann • Same memory holds data, instructions. address • A single set of address/data buses between CPU and memory 200 PC data memory • Harvard CPU • Separate memories for data and instructions. 200 ADD r5,r1,r3 ADD r5,r1,r3 IR • Two sets of address/data buses between CPU and memory Chenyang Lu CSE 467S 3 Chenyang Lu CSE 467S 4 Harvard Architecture von Neumann vs. Harvard • Harvard allows two simultaneous memory fetches. address data memory • Most DSPs use Harvard architecture for data streaming data: CPU address • greater memory bandwidth; IR • more predictable bandwidth. program memory data PC Chenyang Lu CSE 467S 5 Chenyang Lu CSE 467S 6

RISC vs. CISC Microprocessors • Reduced Instruction Set Computer (RISC) • Compact, uniform instructions � facilitate pipelining RISC ARM7 ARM9 • More lines of code � large memory footprint • Allow effective compiler optimization • Complex Instruction Set Computer (CISC) Pentium SHARC CISC • Many addressing modes and long instructions (DSP) • High code density • Often require manual optimization of assembly code von Neumann Harvard for embedded systems Chenyang Lu CSE 467S 7 Chenyang Lu CSE 467S 8 Digital Signal Processor = Harvard + CISC DSP Optimizations • Streaming data • Signal processing � Need high data throughput • Support floating point operation � Harvard architecture • Efficient loops (matrix, vector operations) • Ex. Finite Impulse Response (FIR) filters • Real-time requirements • Memory footprint • Execution time must be predictable � opportunistic � Require high code density optimization in general purpose processors may not � Need CISC instead of RISC work (e.g., caching, branch prediction) Chenyang Lu CSE 467S 9 Chenyang Lu CSE 467S 10 SHARC Architecture Registers • Register files • Modified Harvard architecture. • 40 bit R0-R15 (aliased as F0-F15 for floating point) • Separate data/code memories. • Data address generator registers. • Program memory can be used to store data. • Loop registers. • Two pieces of data can be loaded in parallel • Register files connect to: • Support for signal processing • multiplier • Powerful floating point operations • shifter; • ALU. • Efficient loop • Parallel instructions Chenyang Lu CSE 467S 11 Chenyang Lu CSE 467S 12

Assembly Language SHARC Assembly • 1-to-1 representation of binary instructions • Algebraic notation terminated by semicolon: • Why do we need to know? R1=DM(M0,I0), R2=PM(M8,I8); ! comment • Performance analysis label: R3=R1+R2; • Manual optimization of critical code program memory access data memory access • Focus on architecture characteristics • NOT specific syntax Chenyang Lu CSE 467S 13 Chenyang Lu CSE 467S 14 Computation Data Types • Floating point operations • 32-bit IEEE single-precision floating-point. • 40-bit IEEE extended-precision floating-point. • Hardware multiplier • 32-bit integers. • Parallel computation • 48-bit instructions. Chenyang Lu CSE 467S 15 Chenyang Lu CSE 467S 16 Rounding and Saturation Parallel Operations • Floating-point can be: • Can issue some computations in parallel: • Rounded toward zero; • dual add-subtract; • Rounded toward nearest. • multiplication and dual add/subtract • floating-point multiply and ALU operation • ALU supports saturation arithmetic • Overflow results in max value, not rollover. R6 = R0*R4, R9 = R8 + R12, R10 = R8 - R12; • CLIP Rx within range [-Ry,Ry] • Rn = CLIP Rx by Ry; Chenyang Lu CSE 467S 17 Chenyang Lu CSE 467S 18

Memory Access Example: Exploit Parallelism if (a>b) y = c-d; else y = c+d; • Parallel load/store Compute both cases, then choose which one to store. • Circular buffer ! Load values R1=DM(_a); R2=DM(_b); R3=DM(_c); R4=DM(_d); ! Compute both sum and difference R12 = R2+R4, R0 = R2-R4; ! Choose which one to save COMP(R1,R2); IF LE R0 = R12; DM(_y) = R0 ! Write to y Chenyang Lu CSE 467S 19 Chenyang Lu CSE 467S 20 Load/Store Basic Addressing • Load/store architecture • Immediate value: • Can use direct addressing R0 = DM(0x20000000); • Two set of data address generators (DAGs): • Direct load: • program memory; R0 = DM(_a); ! Load contents of _a • data memory. • Direct store: • Can perform two load/store per cycle DM(_a)= R0; ! Stores R0 at _a • Must set up DAG registers to control loads/stores. Chenyang Lu CSE 467S 21 Chenyang Lu CSE 467S 22 DAG1 Registers Post-Modify w ith Update • I register holds base address. I0 M0 L0 B0 • M register/immediate holds modifier value. I1 M1 L1 B1 I2 M2 L2 B2 I3 M3 L3 B3 R0 = DM(I3,M3) ! Load DM(I2,1) = R1 ! Store I4 M4 L4 B4 I5 M5 L5 B5 I6 M6 L6 B6 I7 M7 L7 B7 Chenyang Lu CSE 467S 23 Chenyang Lu CSE 467S 24

Zero-Overhead Loop Circular Buffer • L: buffer size • No cost for jumping back to start of loop • B: buffer base address • Decrement counter, cmp, and jump back • I, M in post-modify mode LCNTR=30, DO L UNTIL LCE; • I is automatically wrapped around the circular buffer when it reaches B+L R0=DM(I0,M0), F2=PM(I8,M8); • Example: FIR filter R1=R0-R15; L: F4=F2+F3; Chenyang Lu CSE 467S 25 Chenyang Lu CSE 467S 26 FIR Filter on SHARC Nested Loop ! Init: Set up circular buffers for x[] and c[]. B8=PM(_x); ! I8 is automatically set to _x • PC Stack L8=4; ! Buffer size M8=1; ! Increment of x[] • Loop start address B0=DM(_c); L0=4; M0=1; ! Set up buffer for c • Return addresses for subroutines • Interrupt service routines ! Executed after new sensor data is stored in xnew • Max depth = 30 R1=DM(_xnew); • Loop Address Stack ! Use post-increment mode PM(I8,M8)=R1; • Loop end address ! Loop body • Max depth = 6 LCNTR=4, DO L UNTIL LCE; • Loop Counter Stack ! Use post-increment mode R1=DM(I0,M0), R2=PM(I8,M8); • Loop counter values L:R8=R1*R2, R12=R12+R8; • Max depth = 6 Chenyang Lu CSE 467S 27 Chenyang Lu CSE 467S 28 Example: Nested Loop SHARC • CISC + Harvard architecture S1: LCNTR=3, DO LP2 UNTIL LCE; • Computation S2: LCNTR=2, DO LP1 UNTIL LCE; • Floating point operations R1=DM(I0,M0), R2=PM(I8,M8); LP1: R8=R1*R2; • Hardware multiplier R12=R12+R8; • Parallel operations LP2: R11=R11+R12; • Memory Access • Parallel load/store • Circular buffer • Zero-overhead and nested loop Chenyang Lu CSE 467S 29 Chenyang Lu CSE 467S 30

Microprocessors ARM7 • von Neumann + RISC • Compact, uniform instruction set RISC ARM7 ARM9 • 32 bit or 12 bit • Usually one instruction/cycle • Poor code density DSP Pentium CISC • No parallel operations (SHARC) • Memory access von Neumann Harvard • No parallel access • No direct addressing Chenyang Lu CSE 467S 31 Chenyang Lu CSE 467S 32 FIR Filter on ARM7 Sample Prices ; loop initiation code • ARM7: $14.54 MOV r0, #0 ; use r0 for loop counter MOV r8, #0 ; use separate index for arrays • SHARC: $51.46 - $612.74 LDR r1, #4 ; buffer size MOV r2, #0 ; use r2 for f ADR r3, c ; load r3 with base of c[ ] ADR r5, x ; load r5 with base of x[ ] ; loop; instructions for circular buffer are not shown L: LDR r4, [r3, r8] ; get c[i] LDR r6, [r5, r8] ; get x[i] MUL r4, r4, r6 ; compute c[i]x[i] ADD r2, r2, r4 ; add into sum ADD r8, r8, #4 ; add one word to array index ADD r0, r0, #1 ; add 1 to i CMP r0, r1 ; exit? BLT L ; if i < 4, continue Chenyang Lu CSE 467S 33 Chenyang Lu CSE 467S 34 Evaluating DSP Speed MIPS/FLOPS Metrics • Do not indicate how much work is accomplished • Implement, manually optimize, and compare complete application on multiple DSPs by each instruction. • Time consuming • Depend on architecture and instruction set. • Benchmarks: a set of small pieces (kernel) of • Especially unsuitable for DSPs due to the representative code diversity of architecture and instruction sets. • Ex. FIR filter • Circular buffer load • Inherent to most embedded systems • Small enough to allow manual optimization on multiple DSPs • Zero-overhead loop • Application profile + benchmark testing • Assign relative importance of each kernel Chenyang Lu CSE 467S 35 Chenyang Lu CSE 467S 36

Other Important Metrics • Power consumption • Cost • Code density • … … Chenyang Lu CSE 467S 37 Chenyang Lu CSE 467S 38 Reading • Chapter 2 (only the sections related to slides) • Optional: J. Eyre and J. Bier, DSP Processors Hit the Mainstream, IEEE Micro, August 1998. • Optional: More about SHARC • http://www.analog.com/processors/processors/sharc/ • Nested loops: Pages (3-37) – (3-59) http://www.analog.com/UploadedFiles/Associated_Docs/476124 543020432798236x_pgr_sequen.pdf Chenyang Lu CSE 467S 39

von Neumann von Neumann vs. Harvard von Neumann Same memory - PDF document

Computer Architecture Microprocessor Architecture in a nutshell Alternative approaches Separation of CPU and memory distinguishes programmable computer. Two opposite examples CPU fetches instructions from memory. SHARC

The von Neumann Architecture The von Neumann Architecture of Computer Systems of Computer

Computer Architecture Review CS 562 1 The von Neumann Model John von Neumann (1946)

Set Theory and von Neumann algebras Rom an Sasyk ENS Lyon & Universidad de Buenos Aires

von Neumann's bottleneck von Neumann machine One control unit that connects memory and

Ozawas class S for locally compact groups and unique prime factorization of group von Neumann

Von Neumann algebras, countable groups and ergodic theory Workshop Young Researchers in

G odel, Von Neumann and the origins of theoretical computer science Alasdair Urquhart

Lecture 4: Von Neumann algebraic Hardy spaces David Blecher University of Houston December 2016

Die Entwicklung der Spielprogrammierung: Von John von Neumann bis zu den hochparallelen

A survey of the model theory of tracial von Neumann algebras Isaac Goldbring University of

Buds, Leaves and Global Warming John OKeefe Harvard Forest jokeefe@fas.harvard.edu

Creative Destruction and Subjective Well-Being Philippe Aghion Ufuk Akcigit Harvard UPenn

Buds, Leaves and Global Warming John OKeefe Harvard Forest jokeefe@fas.harvard.edu

BEEP BROOKLINE EARLY EDUCATION PROGRAM View Driving South on Harvard St. View Driving North on

Factorization and dilation problems for completely positive maps on von Neumann algebras

The Existence theorem of the Stokes-Neumann Problem Nasrin Arab CASA Tu / e 28 April 2010 Nasrin

Modeling Digital Systems with VHDL Reference: Roth & John text Chapter 2 Michael Smith

Cyber-Physical Systems Embedded Architecture IECE 553/453 Fall 2019 Prof. Dola Saha 1

Lecture Notes on Discrete-Time Signal Processing EE424 Course @ Bilkent University September 26,

Detection and Estimation Theory Introduction to ECE 531 Mojtaba Soltanalian- UIC The course

SPL A Language and Compiler for DSP Algorithms Jiangxing Jong, Jeremy Johnson, Robert Johnson and

Digital Signal Processing for Radio Detection (e.g. of Cosmic Rays) KSETA Workshop Roman Hiller,

Overview of course content Specification of digital functions and system

Administrivia Fall 2013 August 21: We removed the permit and major requirements for GR

von Neumann von Neumann vs. Harvard von Neumann Same memory - PDF document

Computer Architecture Microprocessor Architecture in a nutshell Alternative approaches Separation of CPU and memory distinguishes programmable computer. Two opposite examples CPU fetches instructions from memory. SHARC

The von Neumann Architecture The von Neumann Architecture of Computer Systems of Computer

Computer Architecture Review CS 562 1 The von Neumann Model John von Neumann (1946)

Set Theory and von Neumann algebras Rom an Sasyk ENS Lyon &amp; Universidad de Buenos Aires

von Neumann's bottleneck von Neumann machine One control unit that connects memory and

Ozawas class S for locally compact groups and unique prime factorization of group von Neumann

Von Neumann algebras, countable groups and ergodic theory Workshop Young Researchers in

G odel, Von Neumann and the origins of theoretical computer science Alasdair Urquhart

Lecture 4: Von Neumann algebraic Hardy spaces David Blecher University of Houston December 2016

Die Entwicklung der Spielprogrammierung: Von John von Neumann bis zu den hochparallelen

A survey of the model theory of tracial von Neumann algebras Isaac Goldbring University of

Buds, Leaves and Global Warming John OKeefe Harvard Forest jokeefe@fas.harvard.edu

Creative Destruction and Subjective Well-Being Philippe Aghion Ufuk Akcigit Harvard UPenn

Buds, Leaves and Global Warming John OKeefe Harvard Forest jokeefe@fas.harvard.edu

BEEP BROOKLINE EARLY EDUCATION PROGRAM View Driving South on Harvard St. View Driving North on

Factorization and dilation problems for completely positive maps on von Neumann algebras

The Existence theorem of the Stokes-Neumann Problem Nasrin Arab CASA Tu / e 28 April 2010 Nasrin

Modeling Digital Systems with VHDL Reference: Roth &amp; John text Chapter 2 Michael Smith

Cyber-Physical Systems Embedded Architecture IECE 553/453 Fall 2019 Prof. Dola Saha 1

Lecture Notes on Discrete-Time Signal Processing EE424 Course @ Bilkent University September 26,

Detection and Estimation Theory Introduction to ECE 531 Mojtaba Soltanalian- UIC The course

SPL A Language and Compiler for DSP Algorithms Jiangxing Jong, Jeremy Johnson, Robert Johnson and

Digital Signal Processing for Radio Detection (e.g. of Cosmic Rays) KSETA Workshop Roman Hiller,

Overview of course content Specification of digital functions and system

Administrivia Fall 2013 August 21: We removed the permit and major requirements for GR

Set Theory and von Neumann algebras Rom an Sasyk ENS Lyon & Universidad de Buenos Aires

Modeling Digital Systems with VHDL Reference: Roth & John text Chapter 2 Michael Smith