
Multiprocessor Allocation and Scheduling using Advanced Optimization Technology
L. Benini, M. Milano, D. Bertozzi*, M. Lombardi, A. Guerri, M. Ruggiero
Università di Bologna, *Università di Ferrara


  1. Multiprocessor Allocation and Scheduling using Advanced Optimization Technology
     L. Benini, M. Milano, D. Bertozzi*, M. Lombardi, A. Guerri, M. Ruggiero
     Università di Bologna, *Università di Ferrara
     Embedded Application pull
     [Figure (source: IMEC): embedded applications plotted by year of introduction, 2005 to 2015, against efficiency levels marked at 5 GOPS/W, 100 GOPS/W and 1 TOPS/W. Applications shown include H264 encoding and decoding, 802.11n, UWB, A/V streaming, Gbit radio, gesture, expression and emotion recognition, collision avoidance, 3D gaming and 3D TV, among others.]

  2. MPSoC – 2005 ITRS roadmap [Martin06]
     [Figure: chart for 2005 to 2020 showing Number of Processing Engines (right axis, 0 to 1200) together with Total Logic Size and Total Memory Size (both normalized to 2005, left axis, 0 to 60). The data labels run from 16 in 2005 to 878 in 2020: 16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878.]
     MPSoC Platform Evolution
     [Figure: platform stack. Software layers: Middleware, RTOS, API, Applications; software optimization; run-time controller; mapping & scheduling; V, Vt, Fclk control. Hardware labels: 45 nm, 3D stacked main memory, I/O, router, bus-based multiprocessor, network interface, peripherals, local memory hierarchy, power management, test, <4 mm2, 30 Mtr, <1 GHz.]

  3. Design as optimization
     - Design space: the set of "all" possible design choices
     - Constraints: solutions that we are not willing to accept
     - Cost function: a property we are interested in (execution time, power, reliability, ...)
     When & Why Offline Optimization?
     - Plenty of design-time knowledge
     - Applications pre-characterized at design time
     - Dynamic transitions between different pre-characterized scenarios
     - Aggressive exploitation of system resources
     - Reduces overdesign (lowers cost)
     - Strong performance guarantees
     Applicable for many embedded applications

  4. Application Mapping
     [Figure: a task graph (T1 to T8) is allocated to processors Proc. 1, Proc. 2, ..., Proc. N, each with a private memory and connected by an interconnect; the tasks are then scheduled on the resources over time against a deadline.]
     - The problem of allocating and scheduling task graphs on multiprocessors in a distributed real-time system is NP-hard.
     - New tool flows for efficient mapping of multi-task applications onto hardware platforms
     Scheduling & Voltage Scaling
     - Energy/speed trade-offs: varying the voltages (Vdd, Vbs) gives different frequencies (f1, f2, f3).
     [Figure: CPU power over time for tasks τ1, τ2, τ3 against a deadline. Top: mapping and scheduling given at the fastest frequency, leaving slack before the deadline. Bottom: the same tasks after voltage and frequency scaling.]
     - Voltage and frequency scaling make the problem even harder!
     - Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially).
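The energy/speed trade-off on the voltage scaling slide can be illustrated with a minimal sketch. It assumes the standard CMOS dynamic-energy model (energy per cycle proportional to Vdd squared) and picks, for a single task, the lowest-energy operating point that still meets the deadline. The operating points, the capacitance constant `C_EFF` and the function names are hypothetical and do not come from the slides.

```python
from typing import NamedTuple, Optional


class OperatingPoint(NamedTuple):
    vdd: float      # supply voltage in volts (hypothetical value)
    freq_hz: float  # clock frequency in Hz (hypothetical value)


# Hypothetical discrete (Vdd, f) pairs; not taken from the slides.
POINTS = [
    OperatingPoint(1.2, 200e6),
    OperatingPoint(1.0, 100e6),
    OperatingPoint(0.8, 50e6),
]

C_EFF = 1e-9  # assumed effective switched capacitance per cycle, in farads


def exec_time(cycles: int, point: OperatingPoint) -> float:
    """Execution time in seconds for a task of `cycles` clock cycles."""
    return cycles / point.freq_hz


def dyn_energy(cycles: int, point: OperatingPoint) -> float:
    """Dynamic energy in joules: C_eff * Vdd^2 per cycle."""
    return C_EFF * point.vdd ** 2 * cycles


def cheapest_point(cycles: int, deadline_s: float) -> Optional[OperatingPoint]:
    """Lowest-energy operating point that meets the deadline, or None."""
    feasible = [p for p in POINTS if exec_time(cycles, p) <= deadline_s]
    if not feasible:
        return None
    return min(feasible, key=lambda p: dyn_energy(cycles, p))


if __name__ == "__main__":
    # A 1,000,000-cycle task with 12 ms available can use its slack to run slower.
    print(cheapest_point(1_000_000, 0.012))
```

With several tasks sharing a deadline, the choice for one task changes the slack left for the others, which is why selecting voltages after mapping and scheduling are already fixed can miss better solutions.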

  5. Target architecture
     - Homogeneous computation tiles:
       - ARM cores (including instruction and data caches);
       - Tightly coupled software-controlled scratch-pad memories (SPM);
       - AMBA AHB;
       - DMA engine;
       - RTEMS OS;
       - Power models for 0.13 µm (STM)
     - Variable voltage/frequency cores with discrete (Vdd, f) pairs
     - Frequency dividers scale down the baseline 200 MHz system clock
     - Cores use non-cacheable shared memory to communicate
     - Semaphore and interrupt facilities are used for synchronization
     [Figure: a system clock generator and clock tree deliver CLOCK 1 ... CLOCK N to the tiles through programmable registers and synchronizers; the tiles connect over the AMBA AHB interconnect to their private memories, a shared memory and an interrupt slave.]
     Application model
     - Task graph
       - A group of tasks T
       - Task dependencies
       - Execution times expressed in clock cycles: WCN(Ti)
       - Communication time (writes & reads) expressed as WCN(W_TiTj) and WCN(R_TiTj)
       - These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt)
     - Node type
       - Normal; Fork, And; Branch, Or
     [Figure: example task graph with edges Task1→Task2, Task1→Task3, Task2→Task4, Task3→Task5, Task4→Task6 and Task5→Task6. Each task Ti is annotated with WCN(Ti), and each edge with WCN(W_TiTj) on the producer side and WCN(R_TiTj) on the consumer side.]
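The application model can be sketched as a small data structure. The snippet below encodes the six-task example graph from the slide, converts a cycle count WCN(Ti) into seconds for a given frequency divider applied to the 200 MHz base clock, and computes the longest path in cycles. The cycle counts are hypothetical, communication cycles are ignored, and the function names are illustrative only.

```python
from graphlib import TopologicalSorter  # Python 3.9+

BASE_CLOCK_HZ = 200e6  # baseline system clock stated on the slide

# Hypothetical worst-case cycle counts WCN(Ti); not taken from the slides.
WCN = {"T1": 4000, "T2": 6000, "T3": 3000, "T4": 5000, "T5": 9000, "T6": 2000}

# Predecessors of each task, following the example graph on the slide.
PREDS = {
    "T1": [],
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T2"],
    "T5": ["T3"],
    "T6": ["T4", "T5"],
}


def duration_s(task: str, divider: int) -> float:
    """Execution time of a task when its core runs at 200 MHz / divider."""
    return WCN[task] * divider / BASE_CLOCK_HZ


def critical_path_cycles() -> int:
    """Longest path through the graph in cycles, ignoring communication."""
    finish = {}
    for task in TopologicalSorter(PREDS).static_order():
        start = max((finish[p] for p in PREDS[task]), default=0)
        finish[task] = start + WCN[task]
    return max(finish.values())


if __name__ == "__main__":
    print(critical_path_cycles())  # lower bound on the makespan, in cycles
    print(duration_s("T1", 2))     # T1 on a core clocked at 100 MHz
```

The critical path is only a lower bound on the makespan: with fewer processors than parallel branches, or with inter-processor communication, the real schedule is longer.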

  6. Task memory requirements
     [Figure: two tiles (#1 and #2), each with an ARM core, SPM, private memory, semaphores and interrupt controller, connected by the system bus.]
     Each task has three kinds of memory requirements:
     - Program Data;
     - Internal State;
     - Communication queues.
     Task storage can be allocated by the Optimizer:
     - On the local SPM
     - On the remote Private Memory
     Communicating tasks might run:
     - On the same processor → negligible communication cost
     - On different processors → costly message exchange procedure
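The effect of allocation on communication cost can be sketched as follows. The function charges the write and read cycles of an edge only when producer and consumer sit on different processors, and treats same-processor communication as free, as the slide states. The edge list and cycle counts are hypothetical, and `comm_cycles` is an illustrative name.

```python
# Hypothetical (write cycles, read cycles) per edge; not taken from the slides.
EDGES = {
    ("T1", "T2"): (300, 250),
    ("T1", "T3"): (200, 150),
    ("T2", "T4"): (400, 350),
}


def comm_cycles(alloc: dict) -> int:
    """Total communication cycles for an allocation {task: processor index}."""
    total = 0
    for (src, dst), (write, read) in EDGES.items():
        if alloc[src] != alloc[dst]:
            # Different processors: pay for the message exchange.
            total += write + read
        # Same processor: cost treated as negligible (zero).
    return total


if __name__ == "__main__":
    print(comm_cycles({"T1": 0, "T2": 0, "T3": 0, "T4": 0}))  # everything local
    print(comm_cycles({"T1": 0, "T2": 0, "T3": 1, "T4": 1}))  # two edges cross
```

Placing everything on one processor minimizes this cost but serializes all tasks, so the allocation that is cheapest in communication is often the hardest to schedule within the deadline.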

  7. Application Development Flow
     [Figure: application development support with a simulator and an optimizer. In the characterization phase the simulator produces application profiles for the CTG; in the optimization phase the optimizer computes allocation and scheduling, yielding the optimal SW application implementation for platform execution.]
     Optimization framework
     - Deterministic & stochastic task graphs
     - Constraints
       - Resources: computation, communication, storage
       - Timing: task deadlines, makespan
     - Objective functions
       - Performance (e.g. makespan)
       - Power (energy)
       - Bus utilization
     - General modeling framework: highly unstructured optimization problems
     - No black-box/generic optimizer can solve them efficiently
     - We developed a flexible algorithmic framework which is tuned to specific problems

  8. Logic Based Benders Decomposition
     [Figure: the master problem (allocation & frequency assignment, INTEGER PROGRAMMING; memory constraints; objective function: communication cost & energy consumption) passes a valid allocation to the subproblem (scheduling, CONSTRAINT PROGRAMMING; timing constraint), which returns a no-good as a linear constraint.]
     - Decomposes the problem into 2 sub-problems:
       - Allocation & assignment (& freq. setting) → IP
         Objective function, e.g.: minimizing energy consumption during execution and communication of tasks
       - Scheduling → CP
         Objective function, e.g.: minimizing energy consumption during frequency switching
     Computational scalability
     Deterministic task graphs, mapping & scheduling
     [Figure: scalability plot; only the tick labels 16, 25, 36, 49, 64, 81, 100 and 1 to 7 survive.]
     - Simplified CP and IP formulations
     - Hybrid approach clearly outperforms pure CP and IP techniques
     - Search time bounded to 1000 sec.
     - Pure CP and IP found a solution in only 50%- of the instances
     - Hybrid approach always found a solution
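The interaction between master problem and subproblem can be sketched on a toy instance. This is only an illustration of the loop, not the formulation from the slides: brute-force enumeration stands in for the IP solver, a per-processor load check stands in for the CP scheduler, tasks are assumed independent and executed sequentially on their processor, and every number and name is hypothetical. A no-good here is a set of tasks that cannot share one processor without missing the deadline.

```python
from itertools import product

DURATION = {"T1": 4, "T2": 3, "T3": 5}     # hypothetical time units
COMM = {("T1", "T2"): 2, ("T2", "T3"): 1}  # hypothetical cost if the edge is split
N_PROCS = 2
DEADLINE = 8


def master(cuts):
    """Cheapest allocation that violates no cut (brute force stands in for IP)."""
    tasks = sorted(DURATION)
    best = None
    for procs in product(range(N_PROCS), repeat=len(tasks)):
        alloc = dict(zip(tasks, procs))
        groups = [frozenset(t for t in tasks if alloc[t] == p)
                  for p in range(N_PROCS)]
        # Reject allocations that co-locate a task set already proven infeasible.
        if any(cut <= group for cut in cuts for group in groups):
            continue
        cost = sum(c for (a, b), c in COMM.items() if alloc[a] != alloc[b])
        if best is None or cost < best[0]:
            best = (cost, alloc)
    return best


def subproblem(alloc, deadline):
    """Return an overloaded task group as a no-good, or None if schedulable."""
    for p in range(N_PROCS):
        group = frozenset(t for t, q in alloc.items() if q == p)
        if sum(DURATION[t] for t in group) > deadline:
            return group
    return None


def solve(deadline=DEADLINE):
    """Iterate master and subproblem until an allocation is schedulable."""
    cuts = []
    while True:
        found = master(cuts)
        if found is None:
            return None  # every allocation has been excluded: infeasible
        cost, alloc = found
        nogood = subproblem(alloc, deadline)
        if nogood is None:
            return cost, alloc, cuts
        cuts.append(nogood)


if __name__ == "__main__":
    print(solve())
```

On this instance the master first proposes placing all tasks on one processor (zero communication cost), the subproblem rejects it, and the next proposal splits off T3. Because the master always returns the cheapest allocation not yet excluded, the first schedulable allocation is also optimal for the master's cost.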

  9. Computational Scalability
     - Deterministic task graphs: mapping & scheduling & (V, f) selection
     - Stochastic task graphs: mapping & scheduling & min bus usage
     - Hundreds of decision variables
     - Far beyond the capability of an ILP solver or a CP solver
     Optimality gap
     - Comparison with a heuristic 2-phase solution (GA)
     - "Timing barrier": the gap is significant when constraints are tight

  10. Challenge: the Abstraction Gap
     [Figure: optimization development flow. Platform modelling and optimization analysis lead from a starting implementation to an optimal solution and a final implementation executed on the platform, with an abstraction gap between the optimization model and the platform.]
     - The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
     - Programmers must be aware of the simplifying assumptions made by optimization tools.
     Validation of optimizer solutions
     [Figure: the optimizer's optimal allocation & schedule is validated on the virtual platform. Histogram over 250 instances of throughput difference (%), from -5% to 11%, against probability (0 to 0.25).]
     - MAX error lower than 10%
     - AVG error equal to 4.51%, with standard deviation of 1.94
     - All deadlines are met
