
CnC for Tuning Hints on OCR. Nick Vrvilo, Rice University



  1. CnC for Tuning Hints on OCR. Nick Vrvilo, Rice University. The 7th Annual CnC Workshop, September 8, 2015.

  2. Acknowledgements: This work was done as part of my internship with the OCR team, part of Intel Federal, LLC at Jones Farm (Hillsboro, OR).
     Mentors (Intel): Josh Fryman and Romain Cledat
     Habanero Team (Rice): Vivek Sarkar, Kath Knobe, Zoran Budimlić, and Sanjay Chatterjee

  3. Objective: Demonstrate the effectiveness of OCR tuning hints by way of code generation from a higher-level programming model (CnC).

  4. Objective
     [Diagram labels: CnC Graph, CnC Tunings, App Code, CnC-OCR, Scaffolding, hints, OCR, handler]

  5. Open Community Runtime (OCR)*
     OCR project goals:
     • Provide effective abstraction for diverse hardware
     • Typify future task-based execution models
     • Handle large-scale parallelism efficiently
     • Maintain a separation of concerns (application/scheduling/resources)
     • Open source (encourage collaboration)
     * OCR here refers to the X-Stack Traleika Glacier project's implementation.

  6. Outline
     • Introduction
     • OCR Hints API
     • CnC on OCR
     • Tuning Hints Implementation and Analysis

  7. CnC / OCR Concept Mapping
     Concept | OCR construct | CnC construct
     Task classes (code) | EDT template | Step collection
     Task instance | EDT | Step instance
     Data classes | All DBs have type void* (keeping track of individual DBs' types is the app programmer's responsibility) | Item collection
     Data instance | Datablock | Item instance
     Unique instance identifier | GUID | Tag (step tag / item key)
     Dependence registration | Event add dependence | Item get
     Dependence satisfaction | Event satisfy | Item put
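The dependence rows of the mapping above can be made concrete with a small model. The following is a minimal sketch in plain C, not the OCR or CnC API: `toy_task`, `toy_task_init`, and `toy_task_satisfy` are hypothetical names invented for illustration. It assumes a task declares a fixed number of input slots (the analogue of registering dependences, or item gets) and becomes ready once every slot has been satisfied (the analogue of event satisfy, or item puts).

```c
#include <stddef.h>

/* Toy model only: these names are hypothetical and are NOT part of OCR or CnC. */
#define MAX_SLOTS 8

typedef struct {
    const void *inputs[MAX_SLOTS]; /* data delivered to each slot */
    int filled[MAX_SLOTS];         /* 1 once the slot has been satisfied */
    size_t slot_count;             /* number of declared dependences */
    size_t remaining;              /* dependences still unsatisfied */
} toy_task;

/* Declare a task with slot_count dependences (analogue of registering one
 * dependence per input). Returns 0 on success, -1 if slot_count is too large. */
int toy_task_init(toy_task *t, size_t slot_count) {
    if (slot_count > MAX_SLOTS) return -1;
    for (size_t s = 0; s < MAX_SLOTS; s++) {
        t->inputs[s] = NULL;
        t->filled[s] = 0;
    }
    t->slot_count = slot_count;
    t->remaining = slot_count;
    return 0;
}

/* Satisfy one slot (analogue of putting the item the task is waiting for).
 * Returns 1 if the task is now ready to run, 0 if it is still waiting,
 * and -1 for an invalid slot or a slot that was already satisfied. */
int toy_task_satisfy(toy_task *t, size_t slot, const void *data) {
    if (slot >= t->slot_count || t->filled[slot]) return -1;
    t->inputs[slot] = data;
    t->filled[slot] = 1;
    t->remaining--;
    return t->remaining == 0;
}
```

In this sketch a step with two inputs only reports ready after the second `toy_task_satisfy` call, regardless of the order in which the two items arrive.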

  8. OCR Hints API: Example
     // Assume we have a template and a datablock
     ocrGuid_t edt;
     ocrEdtCreate(&edt, template, 0, NULL, 1, NULL, EDT_PROP_NONE, NULL_GUID, NULL);
     {
         // Set an OCR hint
         ocrHint_t stepHints;
         ocrHintInit(&stepHints, OCR_HINT_EDT_T);
         ocrGetHint(edt, &stepHints);
         ocrSetHintValue(&stepHints, OCR_HINT_EDT_PRIORITY, 100);
         ocrSetHint(edt, &stepHints);
     }
     ocrAddDependence(datablock, edt, 0, DB_DEFAULT_MODE);

  9. OCR Hints API
     Pros:
     • Generic
     • Conceptually decoupled
     • Light-weight
     Cons:
     • Verbose
     • Placed in app source code
     • Limited expressiveness

  10. Outline
     • Introduction
     • OCR Hints API
     • CnC on OCR
     • Tuning Hints Implementation and Analysis

  11. CnC-OCR Developer Workflow
     Debug loop:
     • Write graph spec
     • Run translator tool (produces skeleton project)
     • Flesh out skeleton code
     • Run program (functionality check)
     Fine-tuning loop:
     • Write tuning spec(s)
     • Re-run translator tool (updates scaffolding code)
     • Re-run program (performance check)

  12. CnC-OCR + Tuning
     [Diagram labels: CnC Graph, CnC Tunings, App Code, CnC-OCR, Scaffolding, hints, OCR, handler]

  13. Separation of Concerns in CnC
     • Graph specification can be written without implementation details
     • Step function implementations are written without knowledge of the external graph (only the step's own inputs and outputs)
     • Tuning specification is given in a separate file
     • Easy to mix in different tunings for performance testing
     • Try combinations of tunings until you find the ideal configuration

  14. Outline
     • Introduction
     • OCR Hints API
     • CnC on OCR
     • Tuning Hints Implementation and Analysis

  15. Tuning Hints Overview
     1. Step / item distribution
     2. Step affinity with input
     3. Step priority
     4. Scheduler throttling
     5. Partial item requests

  16. Hint #1: Step / Item Distribution Functions
     • What? Declare a function for mapping individual step / item instances from a collection onto the set of OCR policy domains.
     • Why?
       – Distributed OCR currently lacks advanced scheduling/placement heuristics.
       – Need control of distribution for a reasonable baseline.

  17. Smith-Waterman Sequence Alignment
     • Each input sequence length ~200k
     • Dynamic programming optimization on ~40-billion cell matrix
     • Tiles of 177x153 cells
     • Total of 1138x1322 tiles

  18. Smith-Waterman Specification
     Graph Specification:
         [ int above[] : i, j ];
         [ int left[] : i, j ];
         [ SeqData *data : () ];
         ( swStep: i, j )
             <- [ data: () ],
                [ above: i, j ] $when(i > 0),
                [ left: i, j ] $when(j > 0)
             -> [ below @ above: i+1, j ],
                [ right @ left: i, j+1 ],
                ( swStep: i+1, j ) $when(i+1 < #nth);
     Tuning Specification:
         [ above ]: {
             distfn: (i / 16) % $RANKS
         };
         [ left ]: {
             distfn: (i / 16) % $RANKS
         };
         ( swStep ): {
             distfn: (i / 16) % $RANKS
         };
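To make the `distfn: (i / 16) % $RANKS` expression concrete, here is a minimal sketch of the same arithmetic as a plain C function. The name `row_block_rank` and its parameters are hypothetical, chosen for illustration; they are not part of CnC-OCR. The sketch assumes a non-negative row index `i`, a positive block height, and a positive rank count.

```c
/* Hypothetical helper mirroring the tuning expression (i / 16) % $RANKS.
 * Rows are grouped into blocks of block_rows consecutive rows, and the
 * blocks are dealt out to ranks round-robin.
 * Assumes i >= 0, block_rows > 0, and ranks > 0. */
int row_block_rank(int i, int block_rows, int ranks) {
    return (i / block_rows) % ranks;
}
```

With blocks of 16 rows and 4 ranks, rows 0 through 15 map to rank 0, rows 16 through 31 to rank 1, and row 64 wraps back around to rank 0. Because the expression depends only on `i`, every tile in a given row lands on the same rank.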

  19. Smith-Waterman Sequence Alignment
     • Each input sequence length ~200k
     • Dynamic programming optimization on ~40-billion cell matrix
     • Tiles of 177x153 cells
     • Total of 1138x1322 tiles
     • Default: CnC default distribution
     • Row-block: Rows in blocks of 16
     • 10 runs per configuration
     [Chart: average execution time (seconds) vs. node count (1, 2, 4, 8) for CnC-OCR Default, CnC-OCR Row-Block, and iCnC Row-Block; two off-scale bars are labeled 115.40 and 141.49]

  20. Hint #2: Step Affinity with Input Item
     • What? Declare that a step instance be affinitized with one of its input items.
     • Why?
       – OCR can use this affinity to improve scheduling heuristics.
       – More expressive way to specify tunings like hint #1.

  21. Smith-Waterman Specification
     Graph Specification:
         [ int above[] : i, j ];
         [ int left[] : i, j ];
         [ SeqData *data : () ];
         ( swStep: i, j )
             <- [ data: () ],
                [ above: i, j ] $when(i > 0),
                [ left: i, j ] $when(j > 0)
             -> [ below @ above: i+1, j ],
                [ right @ left: i, j+1 ],
                ( swStep: i+1, j ) $when(i+1 < #nth);
     Tuning Specification:
         [ above ]: {
             distfn: (i / 16) % $RANKS
         };
         [ left ]: {
             distfn: (i / 16) % $RANKS
         };
         ( swStep ): {
             placeWith: above
         };

  22. Hint #3: Step Priority Weights
     • What? Express a priority weight for a given CnC step, such that steps with heavier weights should execute earlier.
     • Why?
       – Search problems: prioritize paths likely to find the answer sooner
       – Enable concurrency: prefer tasks with high-demand output (many consumers)

  23. N-Queens Puzzle
     • Board size: 13x13
     • Solutions possible: 73,312

  24. N-Queens Specification
     • Graph:
         [ u64 solutions[4]: i ];
         ( placeQueen: row, board )
             -> ( placeQueen: row+1, board_prime ),
                [ solutions: ? ];
     • Tuning:
         ( placeQueen /* row, board */ ): { priority: row };

  25. Implementation of Step Priority Weights
     Description | Default Scheduler | Priority Scheduler | Location
     Base data structure | deque | bin-heap | utils/
     Scheduler interface wrapper | deque | bin-heap | scheduler-object/
     Scheduler (aggregate) root object | wst | pr-wsh | scheduler-object/
     Scheduler heuristic behavior | hc | priority | scheduler-heuristic/
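The priority scheduler listed above is built on a binary heap in place of the default deque. As a rough illustration of why that gives "heavier weights execute earlier", here is a minimal fixed-capacity max-heap sketch in plain C. It is not the OCR implementation: `task_heap`, `heap_push`, `heap_pop`, and `HEAP_CAP` are hypothetical names, and the real scheduler's concurrency and work-stealing behavior are not modeled.

```c
#include <stddef.h>

/* Illustrative only: a single-threaded max-heap of (priority, task id) pairs.
 * All names here are hypothetical and do not come from the OCR sources. */
#define HEAP_CAP 64

typedef struct { int priority; int task_id; } heap_entry;
typedef struct { heap_entry data[HEAP_CAP]; size_t size; } task_heap;

static void heap_swap(heap_entry *a, heap_entry *b) {
    heap_entry tmp = *a;
    *a = *b;
    *b = tmp;
}

void heap_init(task_heap *h) { h->size = 0; }

/* Insert a task; returns 0 on success or -1 if the heap is full. */
int heap_push(task_heap *h, int priority, int task_id) {
    if (h->size == HEAP_CAP) return -1;
    size_t i = h->size++;
    h->data[i].priority = priority;
    h->data[i].task_id = task_id;
    /* Sift the new entry up while it outranks its parent. */
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->data[parent].priority >= h->data[i].priority) break;
        heap_swap(&h->data[parent], &h->data[i]);
        i = parent;
    }
    return 0;
}

/* Remove the highest-priority task into *out; returns 0, or -1 if empty. */
int heap_pop(task_heap *h, heap_entry *out) {
    if (h->size == 0) return -1;
    *out = h->data[0];
    h->data[0] = h->data[--h->size];
    /* Sift the moved entry down until both children have lower priority. */
    size_t i = 0;
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, largest = i;
        if (left < h->size && h->data[left].priority > h->data[largest].priority)
            largest = left;
        if (right < h->size && h->data[right].priority > h->data[largest].priority)
            largest = right;
        if (largest == i) break;
        heap_swap(&h->data[i], &h->data[largest]);
        i = largest;
    }
    return 0;
}
```

Under the N-Queens tuning `priority: row`, pushing steps with their row number as the priority would make this heap hand back the deepest rows first; negating the row would favor shallow rows instead.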

  26. N-Queens Puzzle
     • Board size: 13x13
     • Solutions possible: 73,312
     • Solutions sought: 5,000
     • DEQ: Default work-stealing deque
     • DFS: Prioritize deep rows
     • BFS: Prioritize shallow rows
     • 50 runs per configuration
     [Chart: average execution time (seconds) for the DEQ, DFS, and BFS configurations]

  27. Hint #4: Stoker Step (Scheduler Throttling)
     • What? Annotate the work-creating steps (which we call stokers) so that the runtime can differentiate them from non-work-creating steps (which we call quenchers).
     • Why?
       – If the scheduler has plenty of work to do, we can throttle by not running any more stoker steps for the time being.
       – For work stealing, we can prioritize stoker steps for stealing, which mitigates the need for more stealing in the near term.

  28. Task-Bomb (Synthetic Example)
     • Root step creates Z=32 stoker steps
     • Each stoker creates:
       – Y=100 quencher tasks
       – One stoker task
     • Recursion creates X=200 levels
     • Since the stoker is always created last, we would expect all of the stokers to run in a depth-first manner when using the standard work-stealing deque scheduler
     [Diagram: $initialize spawns stoker(0,0) … stoker(Z,0); each stoker(i,j) spawns quencher(i,j,0) … quencher(i,j,Y) and stoker(i,j+1)]

  29. Task-Bomb CnC Graph Spec
         [ void *done: () ];
         ( stoker: i, j )
             -> ( quencher: i, j, $rangeTo(Y) ),
                ( stoker: i, j+1 ) $when(j<X);
         ( quencher: i, j, k )
             -> [ done: () ] $when(i==0 && j==X && k==Y);
         ( $initialize: () )
             -> ( stoker: $range(Z), 0 );
         ( $finalize: () )
             <- [ done: () ];
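The effect of throttling stokers on a workload shaped like this graph can be explored with a toy model. The sketch below is an assumption-laden simplification, not the OCR scheduler: it uses a single worker, tracks quenchers only as a count, and applies a made-up policy in which a stoker runs only while fewer than `quencher_limit` quenchers are pending (or none are pending at all). For simplicity each stoker creates exactly `y` quenchers. The function name `task_bomb_peak` and its parameters are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy single-worker model of the Task-Bomb graph; NOT the OCR scheduler.
 * z root stokers start at level 0. Running a stoker at level j creates
 * y quenchers and, if j < x, one child stoker at level j + 1.
 * Policy (an assumption for illustration): run a stoker only while fewer
 * than quencher_limit quenchers are pending, or when none are pending.
 * Pass SIZE_MAX as quencher_limit to disable throttling.
 * Returns the peak number of pending tasks, or SIZE_MAX if allocation fails. */
size_t task_bomb_peak(size_t z, size_t x, size_t y, size_t quencher_limit) {
    if (z == 0) return 0;
    /* At most z stokers are pending at once: each run pushes at most one child. */
    size_t *levels = malloc(z * sizeof *levels);
    if (levels == NULL) return SIZE_MAX;

    size_t top = 0, quenchers = 0;
    for (size_t i = 0; i < z; i++) levels[top++] = 0;
    size_t peak = top;

    while (top > 0 || quenchers > 0) {
        int run_stoker = top > 0 && (quenchers == 0 || quenchers < quencher_limit);
        if (run_stoker) {
            size_t j = levels[--top];          /* take the newest stoker */
            quenchers += y;                    /* it creates y quenchers */
            if (j < x) levels[top++] = j + 1;  /* and possibly one child stoker */
        } else {
            quenchers--;                       /* run one quencher instead */
        }
        if (top + quenchers > peak) peak = top + quenchers;
    }
    free(levels);
    return peak;
}
```

Calling it with a small configuration, once with `SIZE_MAX` and once with a small limit, shows the pending-task peak dropping sharply when stokers are held back. The slide's parameters (Z=32, X=200, Y=100) can be plugged in the same way, keeping in mind that the numbers describe this toy policy only.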

  30. Task-Bomb CnC Tunings
     Alternative 1: Stoker / Quencher
         ( stoker ): {
             stoker: true
         };
     Alternative 2: Priorities
         ( stoker ): {
             priority: -1
         };

  31. Task-Bomb (Synthetic Example)
     • Root step creates Z=32 stoker steps
     • Each stoker creates:
       – Y=100 quencher tasks
       – One stoker task
     • Recursion creates X=200 levels
     • Default scheduler dies (deque overflow)
     • Stoker hint allows for throttling
     • Similar performance via priorities
     [Chart: average execution time (seconds) for the Default, Priority, and Stoker configurations]

  32. Hint #5: Partial Item Inputs
     • What? Allow the programmer to specify that a step only accesses a sub-range of the bytes of an input item.
     • Why?
       – For distributed memory, can transfer just the part that will be accessed when an item is an input to a remote step.
     • Work in progress
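Since this hint is described as work in progress, the following is only a sketch of the underlying idea, not of any CnC-OCR interface: given an item's bytes and a requested sub-range, copy just that range (the part that would need to travel to a remote step). The function name `copy_item_range` and its signature are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of CnC-OCR: copy bytes
 * [offset, offset + len) of an item into out, which must have room
 * for at least len bytes.
 * Returns 0 on success, or -1 if the range does not fit inside the item. */
int copy_item_range(const void *item, size_t item_size,
                    size_t offset, size_t len, void *out) {
    /* Written to avoid overflow in offset + len. */
    if (offset > item_size || len > item_size - offset) return -1;
    memcpy(out, (const unsigned char *)item + offset, len);
    return 0;
}
```

For an 8-byte item, requesting offset 2 and length 3 copies only three bytes, while a request that runs past the end of the item is rejected rather than truncated.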
