

  1. NETWORK‐ON‐CHIP‐ASSISTED ADAPTIVE PARTITIONING AND ISOLATION FOR “DYNAMIC” HOMOGENEOUS MANYCORES. Davide Bertozzi, MPSoC Research Group, University of Ferrara, Italy. Email: davide.bertozzi@unife.it. A collaboration with José Flich, Universidad Politecnica de Valencia (Spain), and with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece).

  2. Workload Consolidation Consolidation of multiple computation workloads onto the same high‐end embedded computing platform is well underway in many domains: aggregation of ECUs, multimedia home gateways, IoT platforms. Embedded system virtualization is one relevant branch of this trend.

  3. Heterogeneous Parallel Computer Architecture [Figure: template of a heterogeneous parallel SoC. A top‐level NoC connects a general‐purpose host multi‐core processor, programmable heterogeneous parallel accelerators relying on massive multithreading for data‐parallelism, hardware accelerators heavily dependent on local data content, graphics accelerators, a programmable and customizable reconfigurable fabric, high‐speed I/O, a DMA engine and the DRAM memory controller. GOPS/Watt grows from general‐purpose computing (SMPs) with coarse‐grain parallelism, through throughput computing (GPGPUs), to HW IPs with the highest GOPS/W ("the accelerator store").] Specialization and parallelism are THE design paradigm for embedded SoCs: proliferation of more or less programmable computing acceleration resources; dark silicon will be harnessed through specialization. A multi‐programmed workload on parallel hardware yields mixed‐criticality platforms.

  4. Concurrent Acceleration Requests Multiple applications may concurrently need to offload computation to a manycore accelerator, each application being unaware of the existence of the others. [Figure: running processes on the host processor contending for the programmable manycore accelerator.] HOW TO SHARE THE ACCELERATOR? One option: leverage fine‐grained temporal multiplexing, relying on dedicated hardware support for fast and lightweight context switching, as in high‐end GPGPUs. But the same full‐fledged HW solutions proposed in high‐end GPGPUs won't be affordable in low‐power SoCs!

  5. Concurrent Acceleration Requests Multiple applications may concurrently need to offload computation to a manycore accelerator, each application being unaware of the existence of the others. HOW TO SHARE THE ACCELERATOR? The common‐sense alternative: use a coarser form of accelerator time‐sharing, executing offload requests in a run‐to‐completion, first‐come first‐served manner. But this leads to overly long waiting times: latency‐critical requests may have to resort to host execution.

  6. Concurrent Acceleration Requests Multiple applications may concurrently need to offload computation to a manycore accelerator, each application being unaware of the existence of the others. HOW TO SHARE THE ACCELERATOR? To shorten time‐to‐completion, use up all of the available cores! But embedded applications exhibit a limited amount of data parallelism, alternated with task‐level parallelism: performance is likely to saturate as the core allocation grows.

  7. Concurrent Acceleration Requests OK, I got it: SPATIAL‐DIVISION MULTIPLEXING (SDM) is the solution! SDM is trivial. Where is the challenge?
   Current accelerator architectures are at odds with SDM
   Partition just the cores? Or also the memory?
   Program parallelism should be matched to the execution environment
   Software‐only solutions cannot provide complete isolation
   Designing SDM for predictability? For security? For both?
  Much more than a concept: a design philosophy!

  8. Does SDM make sense at all? Image processing benchmarks run on a gem5‐based general‐purpose many‐core platform simulator. To emphasize the computation speedup, communication and memory access effects are minimized: ideal crossbar and 1‐cycle memory access latency. [Plot: normalized speedup vs. number of clusters (1 to 9) for an IDEAL curve and the FAST, ROD, Convert, DetectUniScaleResize, Distance, GaussianBlur, ComputeKeypoints and rBrief kernels.] With some exceptions, the trend is confirmed: real applications cannot exploit the whole parallelism provided by the hardware! By relying on a space‐division multiplexing approach we relinquish the maximum parallelism, but:
   such parallelism is actually not needed
   non‐uniform memory access (NUMA) effects can be minimized
   interference from other applications is avoided inside the partition

  9. What about TDM? A batch of applications (8 requests for each app, 9 apps in total) is run and evaluated with several memory configurations using the SDM and coarse‐grain TDM approaches. [Plot: execution time under SDM with L2 partitioning vs. a global L2, each with the best schedule and with a random schedule.] Memory partitioning helps smooth out NUMA effects. SDM outperforms TDM, speeding up the whole execution by 35% in the best case (i.e., full knowledge of the incoming request pattern) and by 19% with random scheduling of acceleration requests.

  10. SDM Technology • SDM is OK, but how? • The Mapping Challenge • The Reconfiguration Challenge • The Adaptivity Challenge

  11. The Isolation Property The traffic generated by different applications collides in the accelerator NoC, since NoC paths are shared between nodes assigned to different applications, even for smart allocation schemes! It might be a good idea to prevent traffic from different applications from mixing. Or not? At the very least, it buys better composability and analyzability. Smart task allocation cannot guarantee the isolation property: you need NoC support for that!

  12. Our Approach: Routing Restrictions A deterministic (or partially adaptive) routing algorithm without cyclic dependencies among links or buffers can be represented by the set of routing restrictions it imposes, for irregular topologies as well.  A routing restriction forbids any packet from using two specific consecutive channels. Can we design a routing mechanism that finds a packet's way to its destination by interpreting such routing restrictions? (A minimal sketch of this representation follows.)
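As an illustration (not taken from the slides), here is a minimal C sketch of the restriction‐set representation; the names dir_t, restriction_t and turn_allowed, and the choice of dimension‐order XY routing as the encoded algorithm, are all hypothetical:

    #include <stdbool.h>
    #include <stddef.h>

    /* Directions a channel can point to at a switch. */
    typedef enum { DIR_N, DIR_E, DIR_W, DIR_S } dir_t;

    /* A routing restriction: a forbidden pair of consecutive channels
     * (direction of travel -> next output direction) at a switch. */
    typedef struct { dir_t in, out; } restriction_t;

    /* Hypothetical example: XY routing expressed as the four turns it
     * forbids at every switch (a packet already traveling along the Y
     * dimension must never turn back into the X dimension). */
    static const restriction_t xy_restrictions[] = {
        { DIR_N, DIR_E }, { DIR_N, DIR_W },
        { DIR_S, DIR_E }, { DIR_S, DIR_W },
    };

    /* True if taking 'out' after traveling in direction 'in' is allowed. */
    static bool turn_allowed(dir_t in, dir_t out)
    {
        for (size_t i = 0;
             i < sizeof xy_restrictions / sizeof xy_restrictions[0]; i++)
            if (xy_restrictions[i].in == in && xy_restrictions[i].out == out)
                return false;
        return true;
    }

Any deterministic routing algorithm without cyclic channel dependencies could be encoded the same way by swapping in a different restriction table.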

  13. Logic‐Based Distributed Routing [Figure: LBDR example on a 2D mesh; the current switch (Xcurr, Ycurr) routes towards a destination switch in the north‐east quadrant, with one turn marked FORBIDDEN.] LBDR logic:
  1‐ compute the target quadrant (here, north‐east)
  2‐ take North if at the next hop I can turn East
  3‐ take East if at the next hop I can turn North
  4‐ go East... provided the East port is connected!
  The routing logic is assisted by a 26‐bit configuration register per switch: routing restrictions are coded at each switch by means of routing bits Rxy, and unconnected ports are coded by means of connectivity bits Cx. LBDR is more flexible than algorithmic routing, since it supports different routing algorithms and (not all) irregular 2D mesh topologies; it is more scalable than routing tables, since the configuration register stays the same regardless of the network size; its coverage, however, is lower than that of routing tables (~80%). A sketch of the per‐switch decision logic follows.
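The following C sketch mirrors the four LBDR steps above for the north‐east case and its symmetric cases. It uses the 8 routing bits (Rne...Rsw) and 4 connectivity bits (Cn...Cs) named on the slide; the type and function names are invented for this example, the remaining bits of the 26‐bit register are omitted, and North is assumed to be the direction of increasing y:

    #include <stdbool.h>

    /* Per-switch LBDR configuration (only the 12 bits discussed here). */
    typedef struct {
        bool Rne, Rnw, Ren, Res, Rwn, Rws, Rse, Rsw; /* routing bits      */
        bool Cn, Ce, Cw, Cs;                         /* connectivity bits */
    } lbdr_cfg;

    typedef enum { PORT_N, PORT_E, PORT_W, PORT_S, PORT_LOCAL } port_t;

    /* Hypothetical helper: pick the output port for a packet at switch
     * (xc,yc) headed to (xd,yd), interpreting the routing restrictions. */
    port_t lbdr_route(const lbdr_cfg *c, int xc, int yc, int xd, int yd)
    {
        /* 1- compute the target quadrant (relative position signals). */
        bool N = yd > yc, S = yd < yc, E = xd > xc, W = xd < xc;

        if (!N && !S && !E && !W) return PORT_LOCAL; /* arrived */

        /* 2/3/4- take a direction only if any still-needed turn is not
         * forbidden at the next hop (Rxy set) and the port exists (Cx set). */
        if (N && (!E || c->Rne) && (!W || c->Rnw) && c->Cn) return PORT_N;
        if (E && (!N || c->Ren) && (!S || c->Res) && c->Ce) return PORT_E;
        if (W && (!N || c->Rwn) && (!S || c->Rws) && c->Cw) return PORT_W;
        if (S && (!E || c->Rse) && (!W || c->Rsw) && c->Cs) return PORT_S;

        return PORT_LOCAL; /* unreachable under a valid configuration */
    }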

  14. Basic Partitioning Support Setting the connectivity bits to zero at partition boundaries prevents messages from escaping from their partition (see the sketch after this list). Additional benefits:
  • complexity in the order of algorithmic XY routing, in terms of LBDR configuration bits
  • no modification of the routing algorithm required
  • no additional provisioning to guarantee deadlock freedom
  • no virtual channels needed (yet)
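A minimal C sketch of this boundary configuration, reusing the lbdr_cfg type from the previous sketch; the mesh dimensions, the function name isolate_partition, and the convention that North corresponds to increasing y are assumptions of this example:

    #define MESH_X 8  /* hypothetical mesh width  */
    #define MESH_Y 8  /* hypothetical mesh height */

    /* Wall in a rectangular partition [x0..x1] x [y0..y1] by clearing the
     * connectivity bit of every port that crosses the partition boundary,
     * so that no intra-partition message can leave. */
    void isolate_partition(lbdr_cfg mesh[MESH_Y][MESH_X],
                           int x0, int y0, int x1, int y1)
    {
        for (int y = y0; y <= y1; y++)
            for (int x = x0; x <= x1; x++) {
                if (y == y1) mesh[y][x].Cn = false; /* top row: close North    */
                if (y == y0) mesh[y][x].Cs = false; /* bottom row: close South */
                if (x == x1) mesh[y][x].Ce = false; /* right column: close East */
                if (x == x0) mesh[y][x].Cw = false; /* left column: close West  */
            }
    }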

  15. The Flexibility Challenge With the basic approach, not all partition shapes are feasible: some destinations end up out of reach! There is a mismatch between partition shapes and the underlying routing algorithm... and in fact another routing algorithm works for the same partition shapes! TWO POSSIBLE SOLUTIONS:
   set up only those partition shapes that are “legal” for the chosen routing algorithm
   adapt the routing algorithms to the partition shapes

  16. What about Global Traffic? Not all network traffic is headed to switches inside the partition: global traffic to memory controllers and/or an unpartitioned L2 should be supported! Solution: provide two sets of LBDR bits, differing only in the connectivity bits, i.e., local Cx bits for intra‐partition messages and global Cx bits for global traffic. Underlying philosophy: there is ONE GLOBAL ROUTING ALGORITHM, for intra‐partition messages as well as for global traffic. The routing algorithm itself is unmodified, so there are no deadlock risks... but global traffic starts «invading» other partitions! (A sketch of the dual connectivity sets follows.)
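A possible C rendering of the dual connectivity sets, again with invented names (lbdr_dual_cfg, conn_n) and assuming the local/global flag travels in the packet header:

    #include <stdbool.h>

    /* One shared set of routing bits, two sets of connectivity bits:
     * local Cx bits are zeroed at partition boundaries as shown above,
     * global Cx bits stay unrestricted for traffic to memory controllers
     * or the unpartitioned L2. */
    typedef struct {
        bool Rne, Rnw, Ren, Res, Rwn, Rws, Rse, Rsw; /* shared routing bits */
        bool Cn_loc, Ce_loc, Cw_loc, Cs_loc;         /* local connectivity  */
        bool Cn_gbl, Ce_gbl, Cw_gbl, Cs_gbl;         /* global connectivity */
    } lbdr_dual_cfg;

    /* Select the effective North connectivity bit according to the
     * packet's traffic class (East/West/South are analogous). */
    static inline bool conn_n(const lbdr_dual_cfg *c, bool is_global)
    {
        return is_global ? c->Cn_gbl : c->Cn_loc;
    }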

  17. Why not Change the Philosophy? Unrelated per‐partition algorithms: a different algorithm is implemented in each partition, locally deadlock‐free but globally not, e.g., different instances of the Segment‐based Routing (SR) strategy applied on a per‐partition basis. Global traffic support is not straightforward any more, since the global routing function is not necessarily deadlock‐free any more!

  18. What about Global Traffic? Use two virtual channels to separate local from global traffic, each virtual channel with its own routing algorithm. ~2X INCREASE IN COMPLEXITY OF THE LBDR ROUTING MECHANISM! MOREOVER, YOU NOW HAVE VIRTUAL CHANNELS! What about isolation?
   VC0 traffic (local) suffers from link‐level interference with VC1 traffic (global); this can be solved by using two physical networks!
   VC1 traffic is a mix of global traffic originating from different partitions; this can be solved only through temporal isolation!
  (A sketch of the per‐VC routing selection follows this list.)
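A final C sketch of the per‐VC routing selection, reusing the lbdr_cfg and lbdr_route names from the earlier sketches; the convention VC0 = local, VC1 = global follows the slide, while the struct and function names are invented:

    /* Each virtual channel carries a full LBDR configuration of its own,
     * hence the ~2x complexity: vc[0] runs the per-partition algorithm,
     * vc[1] runs a globally deadlock-free algorithm for global traffic. */
    typedef struct {
        lbdr_cfg vc[2]; /* vc[0]: intra-partition, vc[1]: global traffic */
    } switch_cfg_t;

    static port_t route_on_vc(const switch_cfg_t *s, bool is_global,
                              int xc, int yc, int xd, int yd)
    {
        return lbdr_route(&s->vc[is_global ? 1 : 0], xc, yc, xd, yd);
    }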
