
Active Cells: A Programming Model for Configurable Multicore Systems
CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM

Acknowledgements
Results originate in the project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morzov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems, funded by Microsoft Research (2009-2014). We are particularly indebted to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele for consulting. Thanks to the Xilinx Academic Program for support. CRBM implementation and exercise 12 by Stephan Koster.

Vision
[Figure: General Purpose Shared Memory Computer versus Application Specific Multicore Network On Chip. Labels on the shared-memory side: core, cache, bus, memory. Labels on the network-on-chip side: P, core, engine, NIL.]

Motivation: Multicore Systems Challenges
• Cache coherence
• Shared memory communication bottleneck
• Thread synchronization overhead
• Hard to predict the performance of a program
• Difficult to scale the design to massively multi-core architectures

Operating System Challenges
• Processor time sharing
• Interrupts
• Context switches
• Thread synchronisation
• Memory sharing
• Inter-process: paging
• Intra-process, inter-thread: monitors

Focus
Academia: Education
• holistic design of computing systems
• simplicity
• consistency
Industry: High Performance Sensor Driven Medical IT
• streaming applications: ultrasound, tomography, hemodynamics, etc.

Focus: Streaming Applications
• Stream parallelism: pipelining
• Task parallelism: parallel execution
• Data parallelism: vector computing, loop-level parallelism

4.1. HARDWARE BUILDING BLOCKS: TRM AND INTERCONNECTS

TRM: Tiny Register Machine*
• Extremely simple processor on FPGA with Harvard architecture.
• Two-stage pipeline.
• Each TRM contains:
  • Arithmetic-logic unit (ALU) and a shifter.
  • 32-bit operands and results stored in a bank of 2*8 registers.
  • Local data memory: d*512 words of 32 bits.
  • Local program memory: i*1024 instructions of 18 bits.
  • 7 general-purpose registers.
  • Register H for storing the high 32 bits of a product, and 4 condition registers C, N, V, Z.
• No caches.
* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth.
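The memory parameters on this slide turn into concrete sizes with simple arithmetic. The following is a minimal sketch, assuming that d and i are integer configuration multipliers as the slide suggests; the function name and the returned keys are hypothetical and not part of any TRM toolchain.

```python
def trm_memory_sizes(d: int, i: int) -> dict:
    """Local memory sizes of one TRM for multipliers d (data) and i (program)."""
    if d < 0 or i < 0:
        raise ValueError("multipliers must be non-negative")
    data_words = d * 512             # data memory: d*512 words
    data_bits = data_words * 32      # each data word has 32 bits
    instructions = i * 1024          # program memory: i*1024 instructions
    program_bits = instructions * 18 # each instruction has 18 bits
    return {
        "data_words": data_words,
        "data_bits": data_bits,
        "instructions": instructions,
        "program_bits": program_bits,
    }


if __name__ == "__main__":
    # Example: d = 4 gives 2048 data words, i = 2 gives 2048 instructions.
    print(trm_memory_sizes(4, 2))
```

Because both memories are local to the core and there are no caches, these numbers are the complete memory budget a single TRM program has to fit into.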

TRM Machine Language
• Machine language: binary representation of instructions
• 18-bit instructions
• Three instruction types:
  • Type a: arithmetic and logical operations
  • Type b: load and store instructions
  • Type c: branch instructions (for jumping)
From: Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich

Encoding Overview
[Figure: instruction formats of the 18-bit TRM instruction word. Bit positions are written high..low.]
• Register operations: op (17..14), Rd (13..11), bit 10, operand field (9..0).
  • Bit 10 = 0: the operand field holds an immediate imm; imm is zero extended to 32 bits.
  • Bit 10 = 1: the operand field holds a source register Rs and further mode bits.
• Special instructions: variants of the bit 10 = 1 format, including forms with vector registers VRd and VRs.
• Load and store: (a) op, Rd, 0, Rs, off; (b) op, VRd, 1, Rs, off. off is zero extended.
• Branch and link: 1111 (17..14), off (13..0); off is a 14-bit offset.
• Conditional branches: 1110 (17..14), cond (13..10), off (9..0); off is sign extended.
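The field positions in the encoding overview can be illustrated with a small field extractor. This is only a sketch under the bit positions that can be read off the slide (op in bits 17..14, Rd in 13..11, a selector in bit 10, a 10-bit low field, and the two branch formats); opcode values other than 1110 and 1111 are not given in the slides, so the sketch does not try to name individual operations. All function and key names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_fields(instr: int) -> dict:
    """Split an 18-bit instruction word into the fields shown on the slide."""
    if not 0 <= instr < (1 << 18):
        raise ValueError("instruction must fit in 18 bits")
    op = (instr >> 14) & 0xF
    if op == 0b1111:
        # Branch and link: 14-bit offset in bits 13..0 (returned as raw bits).
        return {"kind": "branch_and_link", "off": instr & 0x3FFF}
    if op == 0b1110:
        # Conditional branch: cond in bits 13..10, sign-extended 10-bit offset.
        return {
            "kind": "conditional_branch",
            "cond": (instr >> 10) & 0xF,
            "off": sign_extend(instr & 0x3FF, 10),
        }
    # Every other opcode: generic fields only. How low10 is interpreted
    # (immediate, Rs plus mode bits, or offset) depends on opcodes that
    # the slides do not list.
    return {
        "kind": "other",
        "op": op,
        "rd": (instr >> 11) & 0x7,
        "bit10": (instr >> 10) & 0x1,
        "low10": instr & 0x3FF,
    }


if __name__ == "__main__":
    word = (0b1110 << 14) | (0b0101 << 10) | 0x3FF
    print(decode_fields(word))
```

Keeping the branch formats separate from the generic case mirrors the slide, where only the two branch opcodes are spelled out as bit patterns.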

TRM architecture
[Figure] From: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010. http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf

Variants of TRM
• FTRM
  • includes a floating point unit
• VTRM (Master's thesis, Dan Tecu)
  • includes a vector processing unit
  • supports 8 x 8-word registers
  • available with / without FP unit
• TRM with software-configurable instruction width (Master's thesis, Stefan Koster, 2015)

Initial Experiments: TRM12 Bus and TRM12 Ring
[Figure: two 12-core prototypes. TRM12 Bus: processor cores C0 to C11 with network controllers N0 to N11, arranged in columns 0 to 3 with inbound and outbound arbiters (H0 to H3); peripheral labels DDR2, RS232, RS232TR, Timer, LCD and LEDs. TRM12 Ring: TRM0 to TRM11, each attached to a TRMRing node (node0 to node11).]
Legend: RS232TR: RS232 transmitter/receiver; Ci: processor core; Ni: network controller.

4.2. HARDWARE SOFTWARE CODESIGN

Software / Hardware Co-design
Vision: Custom System on Button Push
[Figure: a system design written as high-level program code is turned into electronic circuits. Labels: computing model, programming language, compiler, synthesizer, hardware library, simulator, programmable hardware (FPGA).]

Traditional HW/SW co-design for embedded systems
• System specification
• HW/SW partitioning
• Program the microcontroller in C/C++; program the system-specific hardware in HDL
• Compilation; synthesis
• Result: microcontroller + machine code + system-specific hardware (e.g. DSP)
Goal: Active Cells approach for embedded systems development
• One program
• One toolchain
• Result: system on FPGA

Active Cells Computing Model
Inspired by:
• Kahn Process Networks
• Dataflow Programming
• CSP (e.g. Google's Go)
• Actor Model (e.g. Erlang)
On-chip distributed system
Cell
• Scope and environment for a running isolated process
• Integrated control thread(s)
• Provides communication ports
Net
• Network of communicating cells
• Cells connected via channels (FIFOs)
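The cell and channel concepts can be mimicked on an ordinary computer to get a feel for the model. The sketch below is an analogy only: it uses Python threads as cells and bounded `queue.Queue` objects as FIFO channels, whereas Active Cells maps cells to processors or circuits. The `None` end-of-stream marker, the cell functions and `run_pipeline` are all assumptions made for this illustration.

```python
import queue
import threading


def producer(out_port: queue.Queue, items) -> None:
    """Source cell: sends every item, then an end-of-stream marker."""
    for x in items:
        out_port.put(x)      # blocks only while the FIFO is full
    out_port.put(None)       # end-of-stream marker (assumption of this sketch)


def doubler(in_port: queue.Queue, out_port: queue.Queue) -> None:
    """Worker cell: no shared state, talks to the world through ports only."""
    while True:
        x = in_port.get()    # blocking receive
        if x is None:
            out_port.put(None)
            return
        out_port.put(2 * x)


def run_pipeline(items) -> list:
    """Build a two-cell net, connect it with FIFOs, and collect the output."""
    a = queue.Queue(maxsize=4)   # channel producer -> doubler
    b = queue.Queue(maxsize=4)   # channel doubler -> caller
    cells = [
        threading.Thread(target=producer, args=(a, list(items))),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for c in cells:
        c.start()
    results = []
    while True:
        y = b.get()
        if y is None:
            break
        results.append(y)
    for c in cells:
        c.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(10)))
```

Each cell keeps its state private and the FIFOs are the only coupling, which is the property the slide lists under Cell and Net.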

Software → Hardware Map
[Figure labels: cell, channel, fifo]

Consequences of the approach
• No global memory
• No processor sharing
• No peculiarities of a specific processor
• No predefined topology (NoC)
• No interrupts
• No operating system

Cell
type BernoulliSampler* = cell (probIn: port in; valOut: port out);  (* non-typed communication ports *)
var r: Random.Generator; p: real;
begin
    new(r);
    loop  (* cell activity *)
        p << probIn;  (* blocking receive *)
        valOut << r.Bernoulli(p);  (* asynchronous send *)
    end
end BernoulliSampler;
[Figure: probIn → Bernoulli Sampler → valOut]
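The behaviour of the BernoulliSampler cell can be approximated in a few lines of Python. This sketch is not the Active Cells implementation: a generator stands in for the cell activity, iterating over the input stands in for the blocking receive, and `yield` stands in for the send. The function name and the use of `random.Random` are assumptions.

```python
import random


def bernoulli_sampler(prob_in, rng: random.Random):
    """For every probability p received, emit 1 with probability p, else 0."""
    for p in prob_in:                    # stands in for: p << probIn
        if not 0.0 <= p <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        # rng.random() is uniform on [0.0, 1.0), so p = 0 never fires
        # and p = 1 always fires.
        yield 1 if rng.random() < p else 0   # stands in for: valOut << ...


if __name__ == "__main__":
    rng = random.Random(42)              # fixed seed for repeatable runs
    print(list(bernoulli_sampler([0.0, 0.5, 1.0], rng)))
```

The cell holds its random generator as private state and is driven purely by its input stream, matching the loop on the slide.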

Properties
Properties can influence both the generation of hardware and the generation of software code.
type Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18}
    (in: port in (64); result: port out);  (* (64): port width *)
...
begin
    (* ... controller action ... *)
end Controller;
....
[Figure: Controller (TRM) with FPU]

Configurable Processor on PL
Tiny Register Machine configuration options:
• FPU: on / off
• Vector Unit: on / off
• Multiplier: on / off
• Data Memory: 0.5k, 1k, 1.5k
• Program Memory: 0k, 1k, 2k, 3k, 4k
• Instruction Width: 14, 16, 18, 20, 22

Engine
Engines are prebuilt components instantiated as electronic circuits on the target hardware.
type Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
var data, i: longint;
begin
    loop
        …
        for i := 0 to len(ind)-1 do
            data << ind[i];
            outd << data;
        end
    end
end Merger;
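One pass of the Merger loop can be sketched in Python as follows. This is a software analogy under stated assumptions: `queue.Queue` objects stand in for the ports, and `merger_pass` is a hypothetical name; the real Merger is a prebuilt hardware engine, not a program.

```python
import queue


def merger_pass(inputs, out: queue.Queue) -> None:
    """Read one item from each input port in index order and forward it."""
    for port in inputs:
        data = port.get()    # blocking receive: waits until this port has data
        out.put(data)        # forward on the single output port


if __name__ == "__main__":
    left, right, merged = queue.Queue(), queue.Queue(), queue.Queue()
    for x in (1, 3):
        left.put(x)
    for x in (2, 4):
        right.put(x)
    merger_pass([left, right], merged)
    merger_pass([left, right], merged)
    print([merged.get() for _ in range(4)])
```

Because each receive blocks, the output order is fixed by the port index and not by arrival time, so a slow input stalls the whole merger.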

Unit of Deployment: (Terminal) Cellnet
LearnTest = cellnet;
var learner: CRBMNet.CRBMLearner; reader: MLUtil.imageReader; ims0, ims1: MLUtil.imshow;
…
begin
    (* dynamic construction *)
    new(learner);
    new(reader);
    new(ims0{name='v0debug',posx=0,posy=100});
    new(ims1{name='v1debug',posx=300,posy=100});
    …
    (* connection *)
    reader.imageOUT >> learner.imgIN;
    learner.v0DebugOUT >> ims0.imageIN;
    learner.v1DebugOUT >> ims1.imageIN;
    …
end LearnTest;
[Figure: imageReader → CRBMLearner → imShow, imShow]

Hierarchical Composition
[Figure: imageReader, CRBMLearner and two imShow cells, with CRBMLearner expanded into its inner parts. Labels: UpNet (split, UpStep), DownNet, Delay, GetGradients, Sample P(h|v); streams img in, visible, hidden, kernels, debug, result.]

Hierarchical Composition: non-terminal Cellnet
UpNet = cellnet {vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'}  (* properties *)
    (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);  (* ports *)
var i, hr, hc: longint; upstep: CRBMUpstepCell; split: MLFunctions.SplitterCell;
begin
    hr := vr-kr+1; hc := vc-kc+1;
    new(split {dataSize=vr*vc,numOut=2});
    new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});
    pvIN >> split.dataIN;  (* port delegation *)
    split.dataOUT[0] >> pvSideOUT;
    split.dataOUT[1] >> upstep.vIN;
    kerIN >> upstep.kerIN;
    bIN >> upstep.bIN;
    for i := 0 to k-1 do
        …
        upstep.phOUT[i] >> phOUT[i];  (* port delegation *)
    end;
end UpNet;
[Figure: UpNet containing split and UpStep; labels img in, pvOut, kernIn, biasIn, phOut]
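The size computation hr := vr-kr+1; hc := vc-kc+1 in UpNet can be checked with a worked example. The sketch below assumes the usual reading that a kr x kc kernel is slid over a vr x vc visible layer without padding; that reading is an interpretation of the slide, and the function names are hypothetical.

```python
def hidden_shape(vr: int, vc: int, kr: int, kc: int) -> tuple:
    """Hidden layer size as computed in UpNet: (vr-kr+1, vc-kc+1)."""
    if kr > vr or kc > vc:
        raise ValueError("kernel must not be larger than the visible layer")
    return vr - kr + 1, vc - kc + 1


def count_positions(v: int, k: int) -> int:
    """Count the start offsets at which a window of k fits inside v cells."""
    return sum(1 for start in range(v) if start + k <= v)


if __name__ == "__main__":
    # Property values from the slide: vr=28, vc=28, kr=5, kc=5.
    hr, hc = hidden_shape(28, 28, 5, 5)
    print(hr, hc)                    # hidden rows and columns
    print(28 * 28)                   # dataSize = vr*vc passed to the splitter
    print(count_positions(28, 5))    # brute-force check of hr
```

With the slide's values this gives a 24 x 24 hidden layer per kernel and a dataSize of 784 words for the splitter.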

Software → Hardware Map
[Figure labels: cell, thread, port, engine; FIFO, softcore, TRM, ARM, ZYNQ PL, PS, AXI4, hw engine, AXI4 Stream Interconnect]
