
Active Cells: A Programming Model for Configurable Multicore Systems (Case Study 4: Custom Designed Multi-Processor System)



  1. Active Cells: A Programming Model for Configurable Multicore Systems. CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM

  2. Acknowledgements
     Results originate in the project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morozov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems, funded by Microsoft Research (2009-2014).
     We are particularly indebted to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele for their advice.
     Thanks to the Xilinx Academic Program for its support.
     CRBM implementation and exercise 12 by Stephan Koster.

  3. Vision
     From a General Purpose Shared Memory Computer to an Application Specific Multicore Network On Chip.
     [Figure: left, cores with caches attached via a bus to a shared memory; right, a network of processors (P) and engines connected directly to each other.]

  4. Motivation: Multicore Systems Challenges
     - Cache Coherence
     - Shared Memory Communication Bottleneck
     - Thread Synchronization Overhead
     - Hard to predict performance of a program
     - Difficult to scale the design to massive multi-core architecture

  5. Operating System Challenges
     - Processor Time Sharing
     - Interrupts
     - Context Switches
     - Thread Synchronisation
     - Memory Sharing
       - Inter-process: Paging
       - Intra-process, Inter-Thread: Monitors

  6. Focus
     - Academia: Education
       - holistic design of computing systems
       - simplicity
       - consistency
     - Industry: High Performance Sensor Driven Medical IT
       - streaming applications: ultrasound, tomography, hemodynamics, etc.

  7. Focus: Streaming Applications
     - Stream Parallelism: Pipelining
     - Task Parallelism: Parallel Execution
     - Data Parallelism: Vector Computing, Loop-level parallelism

  8. 4.1. HARDWARE BUILDING BLOCKS: TRM AND INTERCONNECTS

  9. TRM: Tiny Register Machine*
     - Extremely simple processor on FPGA with Harvard architecture.
     - Two-stage pipelined
     - Each TRM contains
       - Arithmetic-logic unit (ALU) and a shifter.
       - 32-bit operands and results stored in a bank of 2*8 registers.
       - local data memory: d*512 words of 32 bits.
       - local program memory: i*1024 instructions with 18 bits.
       - 7 general purpose registers
       - Register H for storing the high 32 bits of a product, and 4 condition registers C, N, V, Z.
     - No caches
     * Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth
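The memory sizes above follow directly from the two scaling parameters. A minimal sketch of that arithmetic, assuming `d` and `i` are the small integer multipliers from the slide (the function name is hypothetical, not part of the TRM tooling):

```python
WORD_BITS = 32    # data words are 32 bits wide (from the slide)
INSTR_BITS = 18   # instructions are 18 bits wide (from the slide)


def trm_memory_bits(d, i):
    """Return (data_bits, program_bits) for a TRM with parameters d and i.

    Hypothetical helper: data memory holds d*512 words of 32 bits,
    program memory holds i*1024 instructions of 18 bits.
    """
    data_bits = d * 512 * WORD_BITS
    program_bits = i * 1024 * INSTR_BITS
    return data_bits, program_bits


data_bits, program_bits = trm_memory_bits(d=1, i=1)
print(data_bits, program_bits)
```

For `d = 1` and `i = 1` this gives 16384 data bits and 18432 program bits.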

  10. TRM Machine Language
      - Machine language: binary representation of instructions
      - 18-bit instructions
      - Three instruction types:
        - Type a: arithmetical and logical operations
        - Type b: load and store instructions
        - Type c: branch instructions (for jumping)
      From: Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich

  11. Encoding Overview
      Bit positions: op = bits 17..14, Rd / VRd = bits 13..11, bit 10 = the 0/1 field following Rd, bits 9..0 = remaining fields.
      - Register Operations
        - (a) op Rd 0 imm (imm is zero extended to 32 bits)
        - (b) op Rd 1 Rs 0 0 0 x x 0
        - (c) op Rd 1 Rs 1 0 x x x x
        - (d) op Rd 1 Rs 0 1 x x x x
        - (e) op Rd 1 Rs 101 xxxx
      - Special Instructions
        - (a) op Rd 1 x x x 0 0 0 0 0 1
        - (b) op VRd 1 x x x 1 0 0 001
        - (c) op VRd 1 VRs 1 0 0 x x x
      - Load and Store
        - (a) op Rd 0 Rs off (off is zero extended)
        - (b) op VRd 1 Rs off
      - Branch and Link: 1111 (bits 17..14), off (bits 13..0); off is a 14-bit offset
      - Conditional Branches: 1110 (bits 17..14), cond (bits 13..10), off (bits 9..0); off is sign extended
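The field layout can be illustrated with a small decoder. This is a sketch under stated assumptions, not the TRM toolchain: it only separates the formats whose bit ranges are legible on the slide (immediate form, branch and link, conditional branch), it leaves the branch-and-link offset as a raw 14-bit value because the slide does not say whether it is signed, and it does not try to tell register operations from load/store instructions, since the slide does not show which `op` values select which. The names `decode` and `sign_extend` are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def decode(word):
    """Split an 18-bit TRM instruction word into its fields (illustrative only)."""
    if not 0 <= word < (1 << 18):
        raise ValueError("TRM instructions are 18 bits wide")
    top = (word >> 14) & 0xF                 # bits 17..14
    if top == 0b1111:                        # branch and link
        return {"kind": "branch_and_link", "off": word & 0x3FFF}
    if top == 0b1110:                        # conditional branch
        return {
            "kind": "cond_branch",
            "cond": (word >> 10) & 0xF,      # bits 13..10
            "off": sign_extend(word & 0x3FF, 10),
        }
    rd = (word >> 11) & 0x7                  # bits 13..11
    if (word >> 10) & 1 == 0:
        # Assumption: treated as format (a), op Rd 0 imm. Load/store (a) shares
        # bit 10 = 0 and would need the op value to be told apart.
        return {"kind": "imm_form", "op": top, "rd": rd, "imm": word & 0x3FF}
    # Bit 10 = 1: register, special or vector forms; low bits left undecoded.
    return {"kind": "reg_form", "op": top, "rd": rd, "low": word & 0x3FF}


print(decode((0b0011 << 14) | (5 << 11) | 42))
print(decode((0b1110 << 14) | (1 << 10) | 0x3FF))
```

The second call shows the sign extension: a 10-bit offset field of all ones decodes to -1.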

  12. TRM architecture
      [Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010, http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf]

  13. Variants of TRM
      - FTRM
        - includes floating point unit
      - VTRM (Master's thesis, Dan Tecu)
        - includes a vector processing unit
        - supports 8 x 8-word registers
        - available with / without FP unit
      - TRM with software-configurable instruction width (Master's thesis, Stefan Koster, 2015)

  14. Initial Experiments: TRM12 Bus and TRM12 Ring
      [Figure: two twelve-core systems. TRM12 Bus: cores TRM0 to TRM11 arranged in columns together with DDR2, RS232, Timer, LCD and LEDs. TRM12 Ring: node0 to node11, each with a processor core Ci and a network controller Ni (TRMRing), linked through inbound and outbound arbiters.]
      Legend: RS232TR: RS232 transmitter receiver; Ci: processor core; Ni: network controller

  15. 4.2. HARDWARE SOFTWARE CODESIGN

  16. Software / Hardware Co-design
      Vision: Custom System on Button Push
      [Figure: system design as high-level program code is turned into electronic circuits on programmable hardware (FPGA). Ingredients shown: Computing model, Programming Language, Compiler, Synthesizer, Hardware Library, Simulator.]

  17. Traditional HW/SW co-design vs. Goal
      - Traditional HW/SW co-design for embedded systems: system specification → HW/SW partitioning → program for the microcontroller in C/C++ and system-specific hardware in HDL → compilation and synthesis → microcontroller + machine code + system-specific hardware (e.g. DSP)
      - Goal (Active Cells approach for embedded systems development): One Program → One Toolchain → System on FPGA

  18. Active Cells Computing Model
      Inspired by
      - Kahn Process Networks
      - Dataflow Programming
      - CSP (e.g. Google's Go)
      - Actor Model (e.g. Erlang)
      On-chip distributed system
      - Cell
        - Scope and environment for a running isolated process.
        - Integrated control thread(s)
        - Provides communication ports
      - Net
        - Network of communication cells
        - Cells connected via channels (FIFOs)
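The cell and net concepts can be mimicked in ordinary software. The following is a rough analogy only, assuming Python threads stand in for cells and bounded `queue.Queue` objects stand in for FIFO channels; the `STOP` marker and all function names are hypothetical, and nothing here reflects how the Active Cells toolchain generates hardware.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker, not part of Active Cells


def source(out, values):
    """Cell that emits a fixed sequence on its output port."""
    for v in values:
        out.put(v)           # blocks only when the FIFO is full
    out.put(STOP)


def scale(inp, out, factor):
    """Cell that multiplies every received value by a constant."""
    while True:
        v = inp.get()        # blocking receive
        if v is STOP:
            out.put(STOP)
            return
        out.put(v * factor)


def run_net(values, factor, depth=4):
    """Build the net source -> scale, run it, and collect the results."""
    a = queue.Queue(maxsize=depth)   # channel between source and scale
    b = queue.Queue(maxsize=depth)   # channel from scale to the collector
    cells = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, factor)),
    ]
    for c in cells:
        c.start()
    results = []
    while True:
        v = b.get()
        if v is STOP:
            break
        results.append(v)
    for c in cells:
        c.join()
    return results


print(run_net([1, 2, 3], 10))
```

Each cell touches only its own ports; no state is shared between the two threads apart from the channels, which is the property the model relies on.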

  19. Software → Hardware Map
      [Figure: cells and channels on the software side; channels are realized as FIFOs on the hardware side.]

  20. Consequences of the approach
      - No global memory
      - No processor sharing
      - No peculiarities of specific processor
      - No predefined topology (NoC)
      - No interrupts
      - No operating system

  21. Cell
        type BernoulliSampler* = cell (probIn: port in; valOut: port out);  (* non-typed communication ports *)
        var r: Random.Generator; p: real;
        begin
          new(r);
          loop  (* cell activity *)
            p << probIn;                (* blocking receive *)
            valOut << r.Bernoulli(p);   (* asynchronous send *)
          end
        end BernoulliSampler;
      [Figure: cell "Bernoulli Sampler" with input port probIn and output port valOut.]
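The behaviour of the BernoulliSampler cell can be imitated with a thread and two queues. This is a sketch under assumptions: `rng.random() < p` stands in for `r.Bernoulli(p)`, unbounded queues model the asynchronous send, and the `CLOSE` marker is a hypothetical addition so the example terminates (the original cell loops forever).

```python
import queue
import random
import threading

CLOSE = None  # hypothetical end-of-stream marker, not part of the original cell


def bernoulli_sampler(prob_in, val_out, rng):
    """Receive probabilities and send one Bernoulli sample for each."""
    while True:
        p = prob_in.get()                # blocking receive, like `p << probIn`
        if p is CLOSE:
            val_out.put(CLOSE)
            return
        val_out.put(rng.random() < p)    # send, like `valOut << r.Bernoulli(p)`


def sample(probabilities, seed=0):
    """Run one sampler cell over a list of probabilities and return the draws."""
    prob_in, val_out = queue.Queue(), queue.Queue()
    cell = threading.Thread(
        target=bernoulli_sampler,
        args=(prob_in, val_out, random.Random(seed)),
    )
    cell.start()
    for p in probabilities:
        prob_in.put(p)
    prob_in.put(CLOSE)
    cell.join()
    results = []
    while True:
        v = val_out.get()
        if v is CLOSE:
            break
        results.append(v)
    return results


print(sample([0.0, 1.0, 1.0, 0.0]))
```

With probabilities 0.0 and 1.0 the draws are deterministic, which makes the receive/send pairing easy to check.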

  22. Properties
      Properties can influence both the generation of hardware and the generation of software code.
        type Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18} (in: port in (64); result: port out);
        ...
        begin
          (* ... controller action ... *)
        end Controller;
        ....
      Callouts: FPU (property); port width (the 64 in "port in (64)"). [Figure: Controller (TRM)]

  23. Configurable Processor on PL: Tiny Register Machine
      [Figure: configuration panel with switches and sliders.]
      - FPU: on / off
      - Vector Unit: on / off
      - Multiplier: on / off
      - Data Memory: 0.5k, 1k, 1.5k
      - Program Memory: 0k, 1k, 2k, 3k, 4k
      - Instruction Width: 14, 16, 18, 20, 22

  24. Engine
        type Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
        var data: longint;
        begin
          loop
            …
            for i := 0 to len(ind)-1 do
              data << ind[i];
              outd << data;
            end
          end
        end Merger;
      Engines are prebuilt components instantiated as electronic circuits on a target hardware.
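The merging loop of the Merger engine reads one item from each input port in turn and forwards it to the single output port. A minimal software sketch of that round-robin behaviour, assuming Python queues in place of ports and a finite number of rounds so the example ends (the function names and the `rounds` parameter are hypothetical):

```python
import queue


def merger(inputs, out, rounds):
    """Forward one item per input port per round, in port order."""
    for _ in range(rounds):
        for port in inputs:
            data = port.get()    # blocking receive from input port i
            out.put(data)        # forward to the single output port


def demo():
    """Merge three pre-filled input streams and return the output order."""
    streams = [[1, 2], [10, 20], [100, 200]]
    inputs = []
    for stream in streams:
        q = queue.Queue()
        for v in stream:
            q.put(v)
        inputs.append(q)
    out = queue.Queue()
    merger(inputs, out, rounds=2)
    return [out.get() for _ in range(6)]


print(demo())
```

The output interleaves the inputs in port order: 1, 10, 100, then 2, 20, 200. Because each receive blocks, a slow input port stalls the whole merger, which is the price of this fixed ordering.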

  25. Unit of Deployment: (Terminal) Cellnet
        LearnTest = cellnet;
        var learner: CRBMNet.CRBMLearner;
          reader: MLUtil.imageReader;
          ims0, ims1: MLUtil.imshow;
          …
        begin
          (* dynamic construction *)
          new(learner);
          new(reader);
          new(ims0{name='v0debug',posx=0,posy=100});
          new(ims1{name='v1debug',posx=300,posy=100});
          …
          (* connection *)
          reader.imageOUT >> learner.imgIN;
          learner.v0DebugOUT >> ims0.imageIN;
          learner.v1DebugOUT >> ims1.imageIN;
          …
        end LearnTest;
      [Figure: imageReader feeding CRBMLearner, which feeds two imShow cells.]

  26. Hierarchical Composition
      [Figure: three nesting levels. Top level: imageReader, CRBMLearner (debug and result outputs) and two imShow cells. Inside CRBMLearner: UpNet, DownNet, Delay, GetGradients and "Sample P(h|v)" cells exchanging visible, hidden and kernels data. Inside UpNet: split and UpStep, fed by "img in".]

  27. Hierarchic Composition: non-terminal Cellnet
        UpNet = cellnet {vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'}
          (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);
        var i, hr, hc: longint;
          upstep: CRBMUpstepCell;
          vSplit: MLFunctions.SplitterCell;
        begin
          hr := vr-kr+1; hc := vc-kc+1;
          new(vSplit {dataSize=vr*vc,numOut=2});
          new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});
          pvIN >> vSplit.dataIN;
          vSplit.dataOUT[0] >> pvSideOUT;
          vSplit.dataOUT[1] >> upstep.vIN;
          kerIN >> upstep.kerIN;
          bIN >> upstep.bIN;
          for i := 0 to k-1 do
            …
            upstep.phOUT[i] >> phOUT[i];
          end;
        end UpNet;
      Callouts: ports and properties (cellnet header); port delegation (connections between the cellnet's own ports and the ports of its inner cells).
      [Figure: UpNet with ports img in, kernIn, biasIn, pvOut and phOut, containing the cells split and UpStep.]
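The assignments `hr := vr-kr+1` and `hc := vc-kc+1` in the UpNet body can be checked with the property values from the cellnet header. A small sketch, assuming these expressions give the size of the hidden layer produced by sliding a `kr` x `kc` kernel over a `vr` x `vc` visible layer without padding (that interpretation and the function name are assumptions; the slide only shows the arithmetic):

```python
def hidden_size(vr, vc, kr, kc):
    """Return (hr, hc) as computed in the UpNet body: hr = vr-kr+1, hc = vc-kc+1."""
    if kr > vr or kc > vc:
        raise ValueError("kernel must not be larger than the visible layer")
    return vr - kr + 1, vc - kc + 1


# Property values taken from the UpNet header: vr=28, vc=28, kr=5, kc=5.
hr, hc = hidden_size(vr=28, vc=28, kr=5, kc=5)
print(hr, hc)
```

With the default properties this yields a 24 x 24 hidden layer for each of the `k` output ports.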

  28. Software → Hardware Map
      [Figure: mapping onto a ZYNQ device. Labels shown: cell → softcore (TRM) in the PL, thread on the ARM PS, or hardware engine; port and FIFO → AXI4 / AXI4 Stream Interconnect.]
