Automatically Tuning Task-Based Programs for Multi-core Processors
Jin Zhou, Brian Demsky
Department of Electrical Engineering and Computer Science
University of California, Irvine
Motivation
• Recent microprocessor trends
  – Number of cores increased rapidly
  – Architectures vary widely
• Challenges for software development
  – Parallelization is now key for performance
  – Current parallel programming model: threads + locks
    • Hard to develop correct and efficient parallel software
    • Hard to adapt software to changes in architectures
Goals
• Automatically generate parallel implementation
• Automatically tune parallel implementation
Overview
[Figure: tool flow of the Bamboo compiler. Inputs: profile data, program, processor specification. Stages and artifacts as labeled: Implementation Generator → candidate implementations → Simulation-based Evaluator → leading implementations → Implementation Optimizer → optimized implementation and tuned implementations → Code Generator → optimized multi-core binary → multi-core processor.]
Example
• MonteCarlo Example
  – Partitions problem into several simulations
  – Executes the simulations in parallel
  – Aggregates results of all simulations
Bamboo Language
• A hybrid language that combines dataflow and Java
  – Programs are composed of tasks
  – Tasks compose with dataflow-like semantics
  – Tasks contain Java-like object-oriented code internally
  – Programs cannot explicitly invoke tasks
  – The runtime automatically invokes tasks
• Supports standard object-oriented constructs, including methods and classes
Bamboo Language
• Flags
  – Capture the current role (type state) of an object in the computation
  – Each flag captures an aspect of the object’s state
  – Change as the object’s role evolves in the program
  – Support orthogonal classifications of objects
task startup(StartupObject s in initialstate) {
  Aggregator aggr = new Aggregator(s.args[0]){merge:=true};
  for(int i = 0; i < 4; i++)
    Simulator sim = new Simulator(aggr){run:=true};
  taskexit(s: initialstate:=false);
}

task simulate(Simulator sim in run) {
  sim.runSimulate();
  taskexit(sim: run:=false, submit:=true);
}

task aggregate(Aggregator aggr in merge, Simulator sim in submit) {
  boolean allprocessed = aggr.aggregateResult(sim);
  if (allprocessed)
    taskexit(aggr: merge:=false, finished:=true; sim: submit:=false, finished:=true);
  taskexit(sim: submit:=false, finished:=true);
}

class Simulator {
  flag run;
  flag submit;
  flag finished;
  ...
}

class Aggregator {
  flag merge;
  flag finished;
  …
}
Bamboo Program Execution
[Animation: the runtime's global flagged object space for the MonteCarlo example. Each object is drawn with its flag states: StartupObject (initialstate, finished), Aggregator (merge, finished), Simulator (run, submit, finished).]
• Runtime initialization creates a new StartupObject in the initialstate state
• The runtime executes the startup task on the StartupObject
• The startup task creates one Aggregator (merge) and four Simulators (run), then clears the StartupObject's initialstate flag (drawn as the finished state)
• The runtime executes the simulate task on each of the four Simulators
• Each simulate task moves its Simulator from run to submit
• The runtime executes the aggregate task on the Aggregator and one submitted Simulator; the task then moves that Simulator to finished
• The aggregate step repeats for each of the remaining three Simulators; per the aggregate task's code, the final invocation also moves the Aggregator from merge to finished
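The execution steps above can be mimicked with a small sequential simulation. This is only a sketch, not the Bamboo runtime: the `Obj` class, the `run_monte_carlo` function, and the fixed task priority (startup, then simulate, then aggregate) are assumptions made for illustration, and the real runtime runs simulate tasks in parallel across cores. The `merged` counter stands in for whatever `aggr.aggregateResult(sim)` tracks internally.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Obj:
    """A flagged object: a class name plus the set of flags currently set."""
    cls: str
    flags: set = field(default_factory=set)


def run_monte_carlo(num_sims=4):
    """Hypothetical sequential scheduler for the MonteCarlo example."""
    space = [Obj("StartupObject", {"initialstate"})]  # global flagged object space
    trace = []   # task names in invocation order
    merged = 0   # assumed stand-in for the Aggregator's internal counter

    def find(cls, flag):
        # First object of class `cls` that has `flag` set, or None.
        return next((o for o in space if o.cls == cls and flag in o.flags), None)

    while True:
        startup_obj = find("StartupObject", "initialstate")
        sim_run = find("Simulator", "run")
        aggr = find("Aggregator", "merge")
        sim_sub = find("Simulator", "submit")
        if startup_obj is not None:                     # task startup
            space.append(Obj("Aggregator", {"merge"}))
            space.extend(Obj("Simulator", {"run"}) for _ in range(num_sims))
            startup_obj.flags.discard("initialstate")
            trace.append("startup")
        elif sim_run is not None:                       # task simulate
            sim_run.flags = {"submit"}
            trace.append("simulate")
        elif aggr is not None and sim_sub is not None:  # task aggregate
            merged += 1
            sim_sub.flags = {"finished"}
            if merged == num_sims:                      # all results processed
                aggr.flags = {"finished"}
            trace.append("aggregate")
        else:
            break                                       # no task is enabled
    return trace, space
```

Calling `run_monte_carlo()` returns the task trace and the final object space. The point of the sketch is that no task is ever called by program code: each loop iteration only asks which task's flag guards are currently satisfied.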
Implementation Generation
[Figure: the first stage of the Bamboo compiler. Inputs: profile data, Bamboo program, processor specification. Implementation Generator → candidate implementations.]
Implementation Generation
• Dependence Analysis: analyzes data dependence between tasks
• Parallelism Exploration: extracts potential parallelism
• Mapping to Cores: maps the program to real processor
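Flag State Transition Graph (FSTG)
[Figure: FSTG for class Simulator. Edges are labeled task: profiled time; probability.]
• run → submit (simulate: 32 Mcyc; 100%)
• submit → finished (aggregate: 2 Mcyc; 100%)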
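Combined Flag State Transition Graph (CFSTG)
[Figure: the FSTGs of all classes joined by new-object edges, each labeled with the number of new objects.]
• StartupObject: initialstate → finished (startup: 3 Mcyc; 100%); the startup task creates 4 Simulator objects and 1 Aggregator object
• Simulator: run → submit (simulate: 32 Mcyc; 100%); submit → finished (aggregate: 2 Mcyc; 100%)
• Aggregator: two edges leave merge, aggregate: 2 Mcyc; 75% (apparently remaining in merge) and aggregate: 2 Mcyc; 25% (to finished)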
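To make the edge labels concrete, the sketch below stores two of the FSTGs as plain dictionaries and computes the expected profiled time for an object to reach a state with no outgoing edges. The dictionary layout and the `expected_mcycles` helper are hypothetical, the 75% edge is assumed to be a self-loop on merge, and the formula handles self-loops only, not longer cycles.

```python
# state -> list of (task, Mcycles, probability, next_state); numbers from the slide
SIMULATOR = {
    "run": [("simulate", 32, 1.0, "submit")],
    "submit": [("aggregate", 2, 1.0, "finished")],
    "finished": [],
}
AGGREGATOR = {
    # Assumption: the 75% edge loops back to merge, the 25% edge exits.
    "merge": [("aggregate", 2, 0.75, "merge"), ("aggregate", 2, 0.25, "finished")],
    "finished": [],
}


def expected_mcycles(fstg, state):
    """Expected Mcycles spent from `state` until a state with no outgoing edges.

    Solves E[s] = sum(p * (cost + E[next])) for graphs whose only cycles
    are self-loops.
    """
    edges = fstg[state]
    if not edges:
        return 0.0
    p_self = sum(p for _, _, p, nxt in edges if nxt == state)
    if p_self >= 1.0:
        raise ValueError("state %r never exits" % state)
    cost = sum(p * cycles for _, cycles, p, _ in edges)
    onward = sum(p * expected_mcycles(fstg, nxt)
                 for _, _, p, nxt in edges if nxt != state)
    return (cost + onward) / (1.0 - p_self)
```

Under these assumptions a Simulator accounts for 32 + 2 = 34 Mcyc, and an Aggregator for 2 / 0.25 = 8 Mcyc, which corresponds to four aggregate invocations, one per Simulator.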
Initial Mapping
[Figure: the MonteCarlo CFSTG (StartupObject, Simulator, and Aggregator states with the same edge labels as before, and new-object edges labeled 1 and 4) annotated with core groups.]
Preprocessing Phase
• Identifies strongly connected components (SCCs) and merges each SCC into a single core group
• Converts the CFSTG into a tree of core groups by replicating core groups as necessary
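The first preprocessing step can be illustrated with Tarjan's algorithm, a standard way to find strongly connected components. The slides do not say which SCC algorithm Bamboo uses, so this choice, the function names, and the adjacency-list input format are assumptions. The sketch covers only SCC merging, not the conversion to a tree by replication.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps node -> list of successor nodes."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def merge_sccs(graph):
    """Collapse each SCC into one group; return (groups, edges between groups)."""
    sccs = strongly_connected_components(graph)
    group_of = {v: comp for comp in sccs for v in comp}
    edges = {(group_of[v], group_of[w])
             for v, succs in graph.items() for w in succs
             if group_of[v] != group_of[w]}
    return set(sccs), edges
```

For a hypothetical graph `{"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}`, nodes B and C form a cycle and end up in one group, while A and D stay in groups of their own. The merged graph is acyclic, which is what makes the later conversion to a tree possible.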
Data Locality Rule
• Default rule
• Maximize data locality to improve performance
  – Minimizes inter-core communications
  – Improves cache behavior
[Figure: the MonteCarlo CFSTG and the resulting core-group tree, in which StartupObject has an edge labeled 4 to a single Simulator core group and an edge labeled 1 to the Aggregator core group.]
Data Parallelization Rule
• To explore potential data parallelism
[Figure: the MonteCarlo CFSTG and two core-group trees. In the first, StartupObject has an edge labeled 4 to one Simulator core group and an edge labeled 1 to the Aggregator core group. In the second, the Simulator core group appears four times, each reached by an edge labeled 1.]
Rate Matching Rule
• If the producer executes multiple times in a cycle, how many consumers are required?
• Match two rates to estimate the number of consumers
  – Peak new object creation rate
  – Object consumption rate
[Figure: a Producer core group (init state, produce task) creating objects for several Consumer core groups (run state).]
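The slide does not give the formula used to match the two rates. One plausible reading is that enough consumers are needed for their combined consumption rate to reach the producer's peak creation rate, that is, n = ceil(m · t_c / t_p) for m objects created per producer invocation of length t_p and a consumer time of t_c per object. The function below is a sketch under that assumption; its name and parameters are hypothetical.

```python
def consumers_needed(objects_per_invocation, producer_mcycles, consumer_mcycles):
    """Estimate how many consumer core groups keep up with one producer.

    Assumed model: the producer creates `objects_per_invocation` objects every
    `producer_mcycles`, and one consumer needs `consumer_mcycles` per object.
    Integer arithmetic avoids floating-point rounding in the ceiling.
    """
    if min(objects_per_invocation, producer_mcycles, consumer_mcycles) <= 0:
        raise ValueError("all arguments must be positive integers")
    work = objects_per_invocation * consumer_mcycles     # consumer Mcycles per producer period
    return max(1, -(-work // producer_mcycles))          # ceiling division
```

With made-up numbers, a producer that creates 2 objects every 4 Mcyc feeding consumers that need 6 Mcyc per object would call for ceil(2 · 6 / 4) = 3 consumers.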
Mapping to Processor
• Extended CFSTG
• Constraint: limited cores
• Map CFSTG core groups to physical cores
[Figure: the extended core-group tree (StartupObject, Aggregator, and four Simulator core groups, edges labeled 1) beside two physical cores, Core 1 and Core 2.]
Mapping to Cores
• One possible mapping
[Figure: the StartupObject, Aggregator, and four Simulator core groups divided between Core 1 and Core 2.]
Mapping to Cores
• Isomorphic mappings: have same performance
[Figure: two mappings of the same core groups onto Core 1 and Core 2 that are isomorphic to each other.]
• Backtracking-based search: to generate non-isomorphic implementations
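The idea of skipping isomorphic mappings can be sketched with a small backtracking search. The slides do not describe Bamboo's actual search, so this is an illustration under two assumptions: physical cores are interchangeable, and core groups with the same name (such as the replicated Simulator groups) are interchangeable. The function name and data layout are hypothetical.

```python
def non_isomorphic_mappings(groups, num_cores):
    """Enumerate mappings of core groups onto interchangeable cores.

    `groups` is a list of core-group names, for example
    ["StartupObject", "Simulator", "Simulator"]. Each result is a sorted
    tuple of per-core sorted tuples; cores left empty are omitted.
    """
    results = set()
    cores = [[] for _ in range(num_cores)]

    def place(i):
        if i == len(groups):
            # Canonical form: collapses mappings that differ only by
            # swapping cores or swapping identical core groups.
            results.add(tuple(sorted(tuple(sorted(c)) for c in cores if c)))
            return
        for core in cores:
            core.append(groups[i])
            place(i + 1)
            core.pop()
            if not core:
                # This core was empty before the placement; every later
                # empty core would give an isomorphic mapping, so prune.
                break

    place(0)
    return results
```

For three distinct groups on two cores this yields 4 mappings instead of the 2³ = 8 labeled assignments. For the MonteCarlo core groups (StartupObject, Aggregator, four Simulators) on two cores it yields 10. The pruning rule removes core-swap duplicates during the search, and the canonical form removes duplicates caused by identical replicas.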
Implementation Generation
[Figure: the remaining stages of the Bamboo compiler. Candidate implementations → Simulation-based Evaluator → leading implementations → Implementation Optimizer → optimized implementation and tuned implementations.]