Towards an Efficient Fault-Tolerance Scheme for GLB

Claudia Fohry, Marco Bungart and Jonas Posner. Programming Languages / Methodologies. June 14, 2015.


  1. Towards an Efficient Fault-Tolerance Scheme for GLB. Claudia Fohry, Marco Bungart and Jonas Posner, Programming Languages / Methodologies, June 14, 2015.

  2. Global Load Balancing. Outline: (1) Global Load Balancing, (2) Fault Tolerance Scheme, (3) Experimental Results.

  3. Worker-local Pools. Examples:
     - UTS (Unbalanced Tree Search): counting the nodes in an unbalanced tree
     - BC (Betweenness Centrality): calculating a property of each node in a graph

  4. GLB
     - Task pool framework for inter-place load balancing
     - Utilizes cooperative work stealing
     - Tasks are free of side effects and can spawn new tasks at execution time
     - Final result is computed by reduction
     - Only one worker per place
     - Worker-private pool

  5. GLB’s main processing loop:

         do {
             while (process(n)) {
                 Runtime.probe();
                 distribute();
                 reject();
             }
         } while (steal());
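The control flow of this loop can be tried out in a small single-process Python sketch. It is not the GLB implementation: the `Worker` class, the binary-tree counting task, the "steal about half from the bottom of the victim's pool" policy, and the empty `probe`, `distribute` and `reject` placeholders are all assumptions made for illustration. Only the loop structure and the names `process`, `steal`, `distribute` and `reject` come from the slide.

```python
from collections import deque


class Worker:
    """Hypothetical single-process stand-in for one GLB worker."""

    def __init__(self, tasks=(), victims=()):
        self.pool = deque(tasks)      # worker-private task pool (top = right end)
        self.victims = list(victims)  # workers this one may steal from
        self.result = 0               # partial result, combined by reduction later

    def process(self, n):
        """Execute up to n tasks; return True if local tasks remain."""
        for _ in range(n):
            if not self.pool:
                break
            depth = self.pool.pop()
            self.result += 1          # count this tree node
            if depth > 0:             # a task may spawn new tasks
                self.pool.extend([depth - 1, depth - 1])
        return bool(self.pool)

    # Placeholders: this sketch has no messaging, so these do nothing.
    def probe(self):
        pass

    def distribute(self):
        pass

    def reject(self):
        pass

    def steal(self):
        """Take about half of the first non-empty victim pool, from its bottom."""
        for victim in self.victims:
            if victim.pool:
                for _ in range((len(victim.pool) + 1) // 2):
                    self.pool.append(victim.pool.popleft())
                return True
        return False

    def run(self, n=4):
        while True:                   # do { ... } while (steal());
            while self.process(n):
                self.probe()
                self.distribute()
                self.reject()
            if not self.steal():
                break


a = Worker(tasks=[3])                 # root of a binary tree of depth 3
b = Worker(tasks=[2], victims=[a])    # root of a binary tree of depth 2
b.run()                               # b finishes its own tree, then steals a's task
a.run()                               # a finds nothing left to do
total = a.result + b.result           # final result by reduction over all workers
```

Because the tasks are free of side effects, it does not matter which worker ends up executing a given task; the reduced total is the same either way.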

  6. Fault Tolerance Scheme (outline section 2)

  7. Conceptual Ideas
     - One backup place per place (cyclic)
     - Write backups periodically and when necessary (stealing)
     - Exploit stealing-induced redundancy
     - Write incremental backups whenever possible
     - Each piece of information is kept at exactly two places
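The cyclic backup assignment can be illustrated with a few lines of Python. This is a minimal sketch under assumptions: the slides name the mappings `Back` and `Forth` but do not state the direction of the ring, so "next place, cyclically" is assumed here, and the helper `holders` is hypothetical.

```python
def back(p, num_places):
    """Backup place of p (assumed here: the next place, cyclically)."""
    return (p + 1) % num_places


def forth(p, num_places):
    """The place whose backup is stored at p (inverse of back)."""
    return (p - 1) % num_places


def holders(p, num_places):
    """Places that hold place p's information: p itself and its backup place."""
    return {p, back(p, num_places)}


num_places = 4
ring = [(p, back(p, num_places)) for p in range(num_places)]
```

With at least two places, `holders` always returns two distinct places, which matches the goal of keeping each piece of information at exactly two places.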

  8. Incremental Backup of stable Tasks. [Figure: task pool contents at times t-2, t-1 and t; recoverable labels: R, A, s, snap, min, send, backup.]

  9. Actor Scheme
     - No blocking constructs (except one outer finish)
     - split and merge have to operate on the bottom of the task pool
     - Actor scheme:
       - Worker is a passive entity (only processing tasks)
       - Worker becomes active when a message is received
       - Two kinds of messages: executed directly, or stored and processed later
       - → Worker stays responsive
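The two kinds of messages can be sketched as follows. This is only an illustration of the idea, not the authors' code: the class `ActorWorker`, the kind tags `"direct"` and `"deferred"`, and the message payloads are hypothetical names chosen for this example.

```python
from collections import deque


class ActorWorker:
    """Hypothetical worker that never blocks while handling a message."""

    def __init__(self):
        self.deferred = deque()   # messages stored for later processing
        self.handled = []         # order in which messages were actually handled

    def receive(self, kind, payload):
        """Called when a message arrives; returns immediately in both cases."""
        if kind == "direct":
            self.handle(payload)            # executed right away
        else:
            self.deferred.append(payload)   # stored, processed later

    def handle(self, payload):
        self.handled.append(payload)

    def drain(self):
        """Process stored messages, e.g. between two batches of tasks."""
        while self.deferred:
            self.handle(self.deferred.popleft())


worker = ActorWorker()
worker.receive("deferred", "request-1")
worker.receive("direct", "ack-1")
before_drain = list(worker.handled)   # only the direct message so far
worker.drain()
after_drain = list(worker.handled)    # the stored message follows
```

Since `receive` never waits for anything, a worker built this way can accept further messages at any time, which is the responsiveness property named on the slide.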

  10. Stealing Protocol. [Figure: message sequence between places F and V and their backup places Back(F) and Back(V), with steps V1, V2 and V3. Recoverable labels: trySteal, steal-backup, give, link to V, insert, save link + process, valid = false, valid = true, continue processing non-stolen tasks, record stolen tasks in Open(F), update backup, delOpen. At the next backup of F (non-incremental): update backup, delete link to V, delete Open(F).]

  11. Asynchronism

  12. Asynchronism with Fault-Tolerance

  13. Detection of dead Places
     - Cannot use DeadPlaceExceptions
     - Check relevant places regularly via isDead(), as well as the own backup place
     - What if a place P is inactive?
       - It does not check its backup place for liveness
       - But its predecessor Forth(P) does check P
       - If P is active, it checks the liveness of Back(P)
       - Recursive process
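The polling approach (checking liveness explicitly instead of relying on exceptions) might look roughly like the Python sketch below. The function `check_liveness`, its parameters, and the ring direction in `back` are assumptions for illustration; `is_dead` is a plain callable standing in for the `isDead()` call named on the slide.

```python
def back(p, num_places):
    """Backup place of p (assumed here: the next place, cyclically)."""
    return (p + 1) % num_places


def check_liveness(me, relevant, is_dead, num_places):
    """Poll the relevant places plus the own backup place; return the dead ones."""
    targets = set(relevant) | {back(me, num_places)}
    return sorted(p for p in targets if is_dead(p))


dead_places = {2}                     # hypothetical failure of place 2
is_dead = dead_places.__contains__    # stand-in for a real liveness query

found_by_1 = check_liveness(1, relevant=[3], is_dead=is_dead, num_places=4)
found_by_0 = check_liveness(0, relevant=[3], is_dead=is_dead, num_places=4)
```

In this example place 1 detects the failure because place 2 is its backup place, while place 0 polls only places 1 and 3 and sees nothing wrong.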

  14. Experimental Results (outline section 3)

  15. Setup
     - Experiments were conducted on an InfiniBand-connected cluster
     - One place per node
     - Up to 128 nodes
     - Configuration: small UTS: -d=13; large UTS: -d=17

  16. UTS, small. [Figure: running time in seconds (axis 0 to 60) over the number of places (axis 0 to 60) for GLB, FTGLB and FTGLB-Incremental.]

  17. UTS, small. [Figure: running time in seconds (axis 0 to 2000) over the number of places (axis 0 to 60) for GLB, FTGLB and FTGLB-Incremental.]

  18. Thank you for your attention! Please feel free to ask questions.
