
Automatically Tuning Task-Based Programs for Multi-core Processors



  1. Automatically Tuning Task-Based Programs for Multi-core Processors Jin Zhou Brian Demsky Department of Electrical Engineering and Computer Science University of California, Irvine

  2. Motivation • Recent microprocessor trends – The number of cores has increased rapidly – Architectures vary widely • Challenges for software development – Parallelization is now key for performance – Current parallel programming model: threads + locks • Hard to develop correct and efficient parallel software • Hard to adapt software to changes in architecture

  3. Goals • Automatically generate parallel implementation • Automatically tune parallel implementation

  4. Overview [flow diagram of the Bamboo Compiler] Program, Profile Data, Processor Specification → Implementation Generator → Candidate implementations → Simulation-based Evaluator → Leading implementations → Implementation Optimizer → Tuned implementations, Optimized implementation → Code Generator → Optimized multi-core binary → Multi-core Processor

  5. Example • MonteCarlo Example – Partitions problem into several simulations – Executes the simulations in parallel – Aggregates results of all simulations

  6. Bamboo Language • A hybrid language that combines dataflow and Java – Programs are composed of tasks – Tasks compose with dataflow-like semantics – Tasks contain Java-like object-oriented code internally – Programs cannot explicitly invoke tasks – The runtime automatically invokes tasks • Supports standard object-oriented constructs including methods and classes

  7. Bamboo Language • Flags – Capture the current role (type state) of an object in the computation – Each flag captures an aspect of the object’s state – Flags change as the object’s role evolves in the program – Support orthogonal classifications of objects

  8. task startup(StartupObject s in initialstate) {
       Aggregator aggr = new Aggregator(s.args[0]){merge:=true};
       for(int i = 0; i < 4; i++)
         Simulator sim = new Simulator(aggr){run:=true};
       taskexit(s: initialstate:=false);
     }

     task simulate(Simulator sim in run) {
       sim.runSimulate();
       taskexit(sim: run:=false, submit:=true);
     }

     task aggregate(Aggregator aggr in merge, Simulator sim in submit) {
       boolean allprocessed = aggr.aggregateResult(sim);
       if (allprocessed)
         taskexit(aggr: merge:=false, finished:=true; sim: submit:=false, finished:=true);
       taskexit(sim: submit:=false, finished:=true);
     }

     class Simulator {
       flag run;
       flag submit;
       flag finished;
       ...
     }

     class Aggregator {
       flag merge;
       flag finished;
       …
     }

  9. Bamboo Program Execution [animation frame] Runtime initialization: the runtime creates a new StartupObject in the global flagged object space. Object states shown in every frame: StartupObject (initialstate, finished), Aggregator (merge, finished), Simulator (run, submit, finished).

  10. Bamboo Program Execution [animation frame] The runtime executes the startup task on the StartupObject.

  11. Bamboo Program Execution [animation frame] The startup task sets the flags of the StartupObject and creates a new Aggregator and four new Simulators.

  12. Bamboo Program Execution [animation frame] The runtime executes the simulate task on each of the four Simulators.

  13. Bamboo Program Execution [animation frame] Each simulate task sets the flags of its Simulator (run to submit).

  14. Bamboo Program Execution [animation frame] The runtime executes the aggregate task on the Aggregator and the first Simulator.

  15. Bamboo Program Execution [animation frame] The aggregate task sets the flags of the first Simulator (submit to finished).

  16. Bamboo Program Execution [animation frame] The runtime executes the aggregate task on the Aggregator and the second Simulator.

  17. Bamboo Program Execution [animation frame] The aggregate task sets the flags of the second Simulator.

  18. Bamboo Program Execution [animation frame] The runtime executes the aggregate task on the Aggregator and the third Simulator.

  19. Bamboo Program Execution [animation frame] The aggregate task sets the flags of the third Simulator.

  20. Bamboo Program Execution [animation frame] The runtime executes the aggregate task on the Aggregator and the fourth Simulator.

  21. Bamboo Program Execution [animation frame] The aggregate task sets the flags of the fourth Simulator and, since all Simulators are now processed, of the Aggregator (merge to finished, per the code on slide 8).
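
The execution sequence on slides 9 to 21 can be mimicked with a small sequential simulation. This is a Python sketch, not Bamboo and not the actual runtime: the `Obj` class, the `run_monte_carlo` function, the fixed dispatch priority, and the `merged` counter standing in for `aggregateResult` are all assumptions made for illustration. A real runtime would execute ready tasks in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical stand-in for a flagged Bamboo object."""
    kind: str
    flags: set = field(default_factory=set)


def run_monte_carlo(num_sims=4):
    """Repeatedly pick a task whose flag guards are satisfied and run it."""
    space = [Obj("StartupObject", {"initialstate"})]  # global flagged object space
    trace = []
    merged = 0  # assumed stand-in for the Aggregator's internal bookkeeping
    while True:
        start = [o for o in space
                 if o.kind == "StartupObject" and "initialstate" in o.flags]
        runnable = [o for o in space
                    if o.kind == "Simulator" and "run" in o.flags]
        aggrs = [o for o in space
                 if o.kind == "Aggregator" and "merge" in o.flags]
        submitted = [o for o in space
                     if o.kind == "Simulator" and "submit" in o.flags]
        if start:
            # task startup(StartupObject s in initialstate)
            space.append(Obj("Aggregator", {"merge"}))
            space.extend(Obj("Simulator", {"run"}) for _ in range(num_sims))
            start[0].flags.discard("initialstate")
            trace.append("startup")
        elif runnable:
            # task simulate(Simulator sim in run)
            sim = runnable[0]
            sim.flags.discard("run")
            sim.flags.add("submit")
            trace.append("simulate")
        elif aggrs and submitted:
            # task aggregate(Aggregator aggr in merge, Simulator sim in submit)
            aggr, sim = aggrs[0], submitted[0]
            merged += 1
            sim.flags.discard("submit")
            sim.flags.add("finished")
            if merged == num_sims:  # corresponds to allprocessed being true
                aggr.flags.discard("merge")
                aggr.flags.add("finished")
            trace.append("aggregate")
        else:
            break  # no task has all of its guards satisfied
    return trace, space
```

Calling `run_monte_carlo(4)` yields one `startup`, four `simulate`, and four `aggregate` entries in the trace, which matches the order of the animation frames. The point of the sketch is that no task is ever called explicitly: a task becomes eligible only because objects with the right flags exist.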

  22. Implementation Generation [flow diagram, within the Bamboo Compiler] Bamboo Program, Profile Data, Processor Specification → Implementation Generator → Candidate implementations

  23. Implementation Generation • Dependence Analysis: analyzes data dependences between tasks • Parallelism Exploration: extracts potential parallelism • Mapping to Cores: maps the program to a real processor

  24. Flag State Transition Graph (FSTG) [diagram] Simulator: run → (simulate: 32 Mcyc; 100%) → submit → (aggregate: 2 Mcyc; 100%) → finished

  25. Combined Flag State Transition Graph (CFSTG) [diagram] • StartupObject: initialstate → (startup: 3 Mcyc; 100%) → finished • Number of new objects created by startup: 4 Simulator, 1 Aggregator • Simulator: run → (simulate: 32 Mcyc; 100%) → submit → (aggregate: 2 Mcyc; 100%) → finished • Aggregator: merge → (aggregate: 2 Mcyc; 75%) → merge; merge → (aggregate: 2 Mcyc; 25%) → finished
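
The edge annotations on the CFSTG (task, cost in Mcyc, probability) are enough to estimate how much task time each object accumulates before it reaches `finished`. The following Python sketch encodes the numbers from slide 25; the dictionary layout and the `expected_cycles` helper are assumptions for illustration, as is the reading of the 75% `aggregate` edge as a self-loop on `merge`. The helper handles self-loops only, not longer cycles.

```python
# (class, state) -> list of (task, cost in Mcyc, probability, next state)
CFSTG = {
    ("StartupObject", "initialstate"): [("startup", 3, 1.00, "finished")],
    ("Simulator", "run"): [("simulate", 32, 1.00, "submit")],
    ("Simulator", "submit"): [("aggregate", 2, 1.00, "finished")],
    ("Aggregator", "merge"): [
        ("aggregate", 2, 0.75, "merge"),     # more Simulators still to merge
        ("aggregate", 2, 0.25, "finished"),  # last Simulator merged
    ],
}


def expected_cycles(cls, state):
    """Expected Mcyc of task execution until the object reaches 'finished'."""
    if state == "finished":
        return 0.0
    edges = CFSTG[(cls, state)]
    # Probability of returning to the same state after one task execution.
    p_self = sum(p for _, _, p, nxt in edges if nxt == state)
    # Expected cost of one task execution from this state.
    total = sum(p * cost for _, cost, p, _ in edges)
    # Expected remaining cost after leaving this state.
    total += sum(p * expected_cycles(cls, nxt)
                 for _, _, p, nxt in edges if nxt != state)
    # A self-loop repeats the state 1 / (1 - p_self) times on average.
    return total / (1.0 - p_self)
```

Under these assumptions an Aggregator accumulates 8 Mcyc, which is four executions of the 2 Mcyc `aggregate` task and is consistent with `startup` creating four Simulators. A Simulator accumulates 34 Mcyc (32 for `simulate` plus 2 for `aggregate`).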

  26. Initial Mapping [diagram: the CFSTG of slide 25 annotated with core groups; startup creates 4 Simulator objects and 1 Aggregator object]

  27. Preprocessing Phase • Identifies strongly connected components (SCCs) and merges each one into a single core group • Converts the CFSTG into a tree of core groups by replicating core groups as necessary
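
The first preprocessing step can be sketched with a standard SCC algorithm. The Python code below uses Tarjan's algorithm, which is one common choice; the slides do not say which algorithm the compiler uses. The graph representation (a dictionary from node name to successor list) and the function names are hypothetical. The second step, conversion to a tree by replication, is not covered.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to a list of successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def merge_sccs(graph):
    """Collapse every SCC into one node and return the resulting acyclic graph."""
    sccs = strongly_connected_components(graph)
    group_of = {v: comp for comp in sccs for v in comp}
    merged = {comp: set() for comp in sccs}
    for v, succs in graph.items():
        for w in succs:
            if group_of[v] != group_of[w]:
                merged[group_of[v]].add(group_of[w])
    return merged
```

For a hypothetical graph `{"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}`, nodes `B` and `C` form a cycle and are merged into one group, leaving the chain `{A}` → `{B, C}` → `{D}`.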

  28. Data Locality Rule • Default rule • Maximize data locality to improve performance – Minimizes inter-core communications – Improves cache behavior [diagram: the CFSTG of slide 25 and the resulting core-group tree, in which StartupObject has an edge labeled 4 to Simulator and an edge labeled 1 to Aggregator]

  29. Data Parallelization Rule • To explore potential data parallelism [diagram: the CFSTG of slide 25 and the core-group tree before and after the rule is applied; before, StartupObject has an edge labeled 4 to Simulator and an edge labeled 1 to Aggregator; after, the Simulator core group is replicated four times and every edge is labeled 1]

  30. Rate Matching Rule • If the producer executes multiple times in a cycle, how many consumers are required? • Match two rates to estimate the number of consumers – Peak new object creation rate – Object consumption rate [diagram: a Producer (states init and produce) whose produce task feeds several Consumer core groups (state run)]
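
The slide names the two rates but not the formula that relates them. One plausible reading, stated here as an assumption, is that the number of consumers is the peak creation rate divided by the per-consumer consumption rate, rounded up. The Python sketch below uses that reading; `consumers_needed` and its parameters are hypothetical names, and the numbers in the usage note are made up for illustration.

```python
from fractions import Fraction
from math import ceil


def consumers_needed(objects_per_burst, burst_mcyc, consume_mcyc):
    """Estimate how many consumers keep up with the producer at its peak.

    objects_per_burst: objects the producer creates during its busiest interval
    burst_mcyc:        length of that interval in Mcyc
    consume_mcyc:      Mcyc one consumer needs to process one object
    """
    creation_rate = Fraction(objects_per_burst, burst_mcyc)  # objects per Mcyc
    consumption_rate = Fraction(1, consume_mcyc)             # objects per Mcyc
    # Exact rational arithmetic avoids floating-point rounding before ceil.
    return ceil(creation_rate / consumption_rate)
```

For example, a producer that emits one object every 8 Mcyc and a consumer that needs 32 Mcyc per object give `consumers_needed(1, 8, 32)`, which evaluates to 4.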

  31. Mapping to Processor • Extended CFSTG • Constraint: limited cores • Map CFSTG core groups to physical cores [diagram: the extended CFSTG (StartupObject, Aggregator, and four Simulator core groups, edges labeled 1) to be mapped onto Core 1 and Core 2]

  32. Mapping to Cores • One possible mapping [diagram: the StartupObject, Aggregator, and four Simulator core groups divided between Core 1 and Core 2]

  33. Mapping to Cores • Isomorphic mappings have the same performance [diagram: two mappings of the same core groups onto Core 1 and Core 2 that differ only in which equivalent Simulator core groups are placed on which core] • Backtracking-based search generates only non-isomorphic implementations
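
A small Python sketch can show why filtering isomorphic mappings pays off. It is not the compiler's search: it assumes that all cores are identical and that core groups of the same class are interchangeable, enumerates placements by backtracking, and drops duplicates through a canonical key. The function name and the list-of-class-names input are hypothetical.

```python
def non_isomorphic_mappings(groups, num_cores):
    """Enumerate placements of core groups on cores, one per isomorphism class.

    groups: list of class names, one entry per core group
    Returns a list of mappings; each mapping is a list of per-core group lists.
    """
    seen = set()
    results = []
    cores = [[] for _ in range(num_cores)]

    def place(i):
        if i == len(groups):
            # Canonical key: ignores core order and the order within a core.
            key = tuple(sorted(tuple(sorted(c)) for c in cores))
            if key not in seen:
                seen.add(key)
                results.append([list(c) for c in cores])
            return
        tried_empty = False
        for c in cores:
            if not c:
                if tried_empty:
                    continue  # every empty core is equivalent; try only one
                tried_empty = True
            c.append(groups[i])
            place(i + 1)
            c.pop()  # backtrack

    place(0)
    return results
```

With the six core groups of the MonteCarlo example on two cores, `non_isomorphic_mappings(["StartupObject", "Aggregator"] + ["Simulator"] * 4, 2)` returns 10 mappings, whereas assigning each group to a core independently would give 2^6 = 64 placements.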

  34. Implementation Generation [flow diagram, within the Bamboo Compiler] Candidate implementations → Simulation-based Evaluator → Leading implementations → Implementation Optimizer → Tuned implementations, Optimized implementation
