Active Cells: A Programming Model for Configurable Multicore Systems
CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM
Acknowledgements
Results originate in the project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morzov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems, funded by Microsoft Research (2009-2014). We are particularly indebted to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele for consulting. Thanks to the Xilinx Academic Program for support. CRBM implementation and exercise 12 by Stephan Koster.
Vision
General Purpose Shared Memory Computer versus Application Specific Multicore Network On Chip
[Figure: left, cores with caches attached via a bus to a shared memory; right, a network of cores and engines on chip]
Motivation: Multicore Systems Challenges
• Cache Coherence
• Shared Memory Communication Bottleneck
• Thread Synchronization Overhead
• Hard to predict performance of a program
• Difficult to scale the design to massive multi-core architecture
Operating System Challenges
• Processor Time Sharing
  • Interrupts
  • Context Switches
  • Thread Synchronisation
• Memory Sharing
  • Inter-process: Paging
  • Intra-process, Inter-Thread: Monitors
Focus
• Academia: Education
  • holistic design of computing systems
  • simplicity
  • consistency
• Industry: High Performance Sensor Driven Medical IT
  • streaming applications: ultrasound, tomography, hemodynamics, etc.
Focus: Streaming Applications
• Stream Parallelism: Pipelining
• Task Parallelism: Parallel Execution
• Data Parallelism: Vector Computing, Loop-level parallelism
4.1. HARDWARE BUILDING BLOCKS: TRM AND INTERCONNECTS
TRM: Tiny Register Machine*
• Extremely simple processor on FPGA with Harvard architecture
• Two-stage pipeline
• Each TRM contains:
  • Arithmetic-logic unit (ALU) and a shifter
  • 32-bit operands and results stored in a bank of 2*8 registers
  • Local data memory: d*512 words of 32 bits
  • Local program memory: i*1024 instructions of 18 bits
  • 7 general purpose registers
  • Register H for storing the high 32 bits of a product, and 4 conditional registers C, N, V, Z
• No caches
* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth
TRM Machine Language
• Machine language: binary representation of instructions
• 18-bit instructions
• Three instruction types:
  • Type a: arithmetical and logical operations
  • Type b: load and store instructions
  • Type c: branch instructions (for jumping)
From: Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich
Encoding Overview
Register operations, special instructions, and load/store instructions share the field boundaries 17..14 | 13..11 | 10 | 9..0. The last column lists the fields and selector bits that occupy bits 9..0.

Register Operations
(a) op | Rd  | 0 | imm                   imm is zero extended to 32 bits
(b) op | Rd  | 1 | Rs, 0 0 0 x x 0
(c) op | Rd  | 1 | Rs, 1 0 x x x x
(d) op | Rd  | 1 | Rs, 0 1 x x x x
(e) op | Rd  | 1 | Rs, 1 0 1 x x x x

Special Instructions
(a) op | Rd  | 1 | x x x 0 0 0 0 0 1
(b) op | VRd | 1 | x x x 1 0 0 0 0 1
(c) op | VRd | 1 | VRs, 1 0 0 x x x

Load and Store
(a) op | Rd  | 0 | Rs, off               off is zero extended
(b) op | VRd | 1 | Rs, off

Branch and Link (bits 17..14 | 13..0)
1111 | off                               off is 14-bit offset

Conditional Branches (bits 17..14 | 13..10 | 9..0)
1110 | cond | off                        off is sign extended
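The field boundaries in the encoding overview can be illustrated with a small decoder. This is a minimal sketch in Python, not part of the TRM tool chain: the function names `decode` and `sign_extend` and the dictionary keys are hypothetical, and only the formats whose bit positions are fully given on the slide (branch and link, conditional branch, and the common op / Rd / bit 10 / bits 9..0 split) are handled. The meaning of individual `op` values is not stated here and is left uninterpreted.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode(word):
    """Split an 18-bit TRM instruction word into its fields (illustrative only)."""
    if not 0 <= word < (1 << 18):
        raise ValueError("TRM instructions are 18 bits wide")
    op = (word >> 14) & 0xF                      # bits 17..14
    if op == 0b1111:                             # branch and link: 14-bit offset
        return {"kind": "branch_and_link", "off": word & 0x3FFF}
    if op == 0b1110:                             # conditional branch
        return {
            "kind": "conditional_branch",
            "cond": (word >> 10) & 0xF,          # bits 13..10
            "off": sign_extend(word & 0x3FF, 10),  # bits 9..0, sign extended
        }
    # Register, special, and load/store instructions share this split.
    return {
        "kind": "other",
        "op": op,
        "rd": (word >> 11) & 0x7,                # bits 13..11
        "bit10": (word >> 10) & 0x1,             # 0: immediate/offset form
        "low10": word & 0x3FF,                   # bits 9..0, left uninterpreted
    }
```

For example, `decode((0b1110 << 14) | (0b0011 << 10) | 0x3FF)` yields a conditional branch with `cond` 3 and a sign-extended offset of -1.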
TRM architecture
Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010
http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf
Variants of TRM
• FTRM
  • includes floating point unit
• VTRM (Master Thesis Dan Tecu)
  • includes a vector processing unit
  • supports 8 x 8-word registers
  • available with / without FP unit
• TRM with software-configurable instruction width (Master Thesis Stefan Koster, 2015)
Initial Experiments: TRM12 Bus and TRM12 Ring
[Figure: block diagrams of two 12-core systems. TRM12 Bus: processor cores C0..C11 with network controllers N0..N11, inbound and outbound arbiters, and peripherals (DDR2, RS232, Timer, LCD, LEDs) arranged in columns 0..3. TRM12 Ring: TRM0..TRM11 attached to ring nodes node0..node11 via TRMRing.]
Legend: RS232TR: RS232 transmitter receiver; Ci: processor core; Ni: network controller
4.2. HARDWARE SOFTWARE CODESIGN
Software / Hardware Co-design
Vision: Custom System on Button Push
• System design as high-level program code
• Electronic circuits
• Computing model, Programming Language
• Compiler, Synthesizer, Hardware Library, Simulator
• Programmable Hardware (FPGA)
Traditional HW/SW co-design for embedded systems
• System specification, HW/SW partitioning
• Program microcontroller in C/C++; program system specific hardware in HDL
• Compilation; Synthesis
• Microcontroller + machine code + system specific hardware (e.g. DSP)

Goal: Active Cells approach for embedded systems development
• One Program
• One Toolchain
• System on FPGA
Active Cells Computing Model
On-chip distributed system
Inspired by
• Kahn Process Networks
• Dataflow Programming
• CSP (e.g. Google's Go)
• Actor Model (e.g. Erlang)
Cell
• Scope and environment for a running isolated process
• Integrated control thread(s)
• Provides communication ports
Net
• Network of communicating cells
• Cells connected via channels (FIFOs)
Software Hardware Map
[Figure: software cells and channels mapped to hardware; channels become FIFOs]
Consequences of the approach
• No global memory
• No processor sharing
• No peculiarities of specific processor
• No predefined topology (NoC)
• No interrupts
• No operating system
Cell

type BernoulliSampler* = cell (probIn: port in; valOut: port out);  (* non-typed communication ports *)
var r: Random.Generator; p: real;
begin  (* cell activity *)
  new(r);
  loop
    p << probIn;                 (* blocking receive *)
    valOut << r.Bernoulli(p);    (* asynchronous send *)
  end
end BernoulliSampler;

[Figure: Bernoulli Sampler cell with input port probIn and output port valOut]
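The behaviour of the BernoulliSampler cell (an isolated activity that blocks on its input port and sends without waiting on its output port) can be mimicked on a conventional machine with a thread and two queues. This is a minimal Python sketch and not Active Cells code: the names `bernoulli_sampler` and `run_sampler`, the fixed seed, and the `None` end-of-stream marker are assumptions added so the example terminates, whereas the original cell loops forever.

```python
import queue
import random
import threading


def bernoulli_sampler(prob_in, val_out, seed=0):
    """Cell activity: receive a probability, send one Bernoulli sample."""
    rng = random.Random(seed)          # private state, like the cell's local `r`
    while True:
        p = prob_in.get()              # blocking receive on the input port
        if p is None:                  # assumed end-of-stream marker (not in the original)
            break
        # Unbounded queue: put() returns immediately, i.e. an asynchronous send.
        val_out.put(1 if rng.random() < p else 0)


def run_sampler(probs):
    """Connect the cell to two FIFO channels and feed it a list of probabilities."""
    prob_in, val_out = queue.Queue(), queue.Queue()
    cell = threading.Thread(target=bernoulli_sampler, args=(prob_in, val_out))
    cell.start()
    for p in probs:
        prob_in.put(p)
    prob_in.put(None)
    cell.join()
    return [val_out.get() for _ in probs]
```

The cell shares no state with its environment; everything it sees arrives through `prob_in`, which is the property the Active Cells model enforces by construction.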
Properties
Properties can influence both the generation of hardware and the generation of software code.

type Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18} (in: port in (64); result: port out);  (* (64): port width *)
...
begin
  (* ... controller action ... *)
end Controller;
....

[Figure: Controller cell realized as a TRM with FPU]
Configurable Processor on PL: Tiny Register Machine
• FPU: on / off
• Vector Unit: on / off
• Multiplier: on / off
• Data Memory: 0.5k, 1k, 1.5k
• Program Memory: 0k, 1k, 2k, 3k, 4k
• Instruction Width: 14, 16, 18, 20, 22
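To get a feeling for what a memory configuration costs, the sizes can be computed from the per-unit figures given for the TRM (data memory in blocks of 512 words of 32 bits, program memory in blocks of 1024 instructions). The following Python sketch is only back-of-the-envelope arithmetic; the function name `trm_memory_bits` and its parameters are hypothetical, and how these bits map onto FPGA block RAM is not covered.

```python
def trm_memory_bits(d, i, instr_width=18):
    """Return (data_bits, program_bits) for a TRM configuration.

    d: number of 512-word data memory blocks (32-bit words)
    i: number of 1024-instruction program memory blocks
    instr_width: instruction width in bits (18 for the standard TRM)
    """
    if d < 0 or i < 0 or instr_width <= 0:
        raise ValueError("block counts must be non-negative, width positive")
    data_bits = d * 512 * 32
    program_bits = i * 1024 * instr_width
    return data_bits, program_bits
```

A standard TRM with d = 1 and i = 1 thus needs 16384 bits of data memory and 18432 bits of program memory.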
Engine
Engines are prebuilt components instantiated as electronic circuits on a target hardware.

type Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
var data, i: longint;
begin
  loop
    …
    for i := 0 to len(ind)-1 do
      data << ind[i];    (* receive from input i *)
      outd << data;      (* forward to the single output *)
    end
  end
end Merger;
Unit of Deployment: (Terminal) Cellnet

LearnTest = cellnet;
var
  learner: CRBMNet.CRBMLearner;
  reader: MLUtil.imageReader;
  ims0, ims1: MLUtil.imshow;
  …
begin
  (* dynamic construction *)
  new(learner);
  new(reader);
  new(ims0{name='v0debug',posx=0,posy=100});
  new(ims1{name='v1debug',posx=300,posy=100});
  …
  (* connection *)
  reader.imageOUT >> learner.imgIN;
  learner.v0DebugOUT >> ims0.imageIN;
  learner.v1DebugOUT >> ims1.imageIN;
  …
end LearnTest;

[Figure: imageReader connected to CRBMLearner, which is connected to two imShow cells]
Hierarchical Composition
[Figure: imageReader feeds CRBMLearner, whose debug outputs drive two imShow cells. CRBMLearner is itself a network of UpNet, DownNet, Delay, GetGradients and Sample P(h|v) cells exchanging visible, hidden, kernel and result data. UpNet in turn contains split and UpStep.]
Hierarchic Composition: non-terminal Cellnet

(* ports and properties *)
UpNet = cellnet {vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'} (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);
var
  i, hr, hc: longint;
  upstep: CRBMUpstepCell;
  vSplit: MLFunctions.SplitterCell;
begin
  hr := vr-kr+1; hc := vc-kc+1;
  new(vSplit {dataSize=vr*vc,numOut=2});
  new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});
  pvIN >> vSplit.dataIN;               (* port delegation *)
  vSplit.dataOUT[0] >> pvSideOUT;
  vSplit.dataOUT[1] >> upstep.vIN;
  kerIN >> upstep.kerIN;
  bIN >> upstep.bIN;
  for i := 0 to k-1 do
    …
    upstep.phOUT[i] >> phOUT[i];       (* port delegation *)
  end;
end UpNet;

[Figure: UpNet containing split and UpStep, with inputs img in, kernIn, biasIn and outputs pvOut, phOut]
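The assignments `hr := vr-kr+1` and `hc := vc-kc+1` in UpNet are presumably the number of positions at which a kr x kc kernel fits inside a vr x vc visible layer; this interpretation is an assumption, as the slide gives only the formula. A short Python sketch (the function names are hypothetical) checks the formula against a brute-force count for the property values used above.

```python
def hidden_size(vr, vc, kr, kc):
    """Rows and columns of the hidden layer: hr = vr-kr+1, hc = vc-kc+1."""
    if kr > vr or kc > vc:
        raise ValueError("kernel must fit inside the visible layer")
    return vr - kr + 1, vc - kc + 1


def count_kernel_positions(vr, vc, kr, kc):
    """Brute-force count of top-left corners where the kernel fits entirely."""
    rows = sum(1 for r in range(vr) if r + kr <= vr)
    cols = sum(1 for c in range(vc) if c + kc <= vc)
    return rows, cols
```

With vr = vc = 28 and kr = kc = 5, both functions give 24 x 24, and the splitter's `dataSize=vr*vc` is 784 values per image.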
Software Hardware Map
[Figure: mapping of cells, ports, and engines onto a ZYNQ device: TRM softcores and hardware engines on the PL, threads on the ARM PS, connected through FIFOs and the AXI4 / AXI4 Stream interconnect]