Introduction | PastryGrid | Fault Tolerance in PastryGrid | Conclusion

Fault-Tolerance for PastryGrid Middleware
Christophe Cérin 1, Heithem Abbes 1,2, Mohamed Jemni 2, Yazid Missaoui 2
1 LIPN, Université de Paris XIII, CNRS UMR 7030, France
2 UTIC, ESSTT, Université de Tunis, Tunisia
HPGC'10 - IPDPS

Outline
  1 Introduction
  2 PastryGrid
  3 Fault Tolerance in PastryGrid
  4 Conclusion

Desktop Grid Architectures
Desktop Grid Key Points
  Federation of thousands of nodes;
  Internet as the communication layer: no trust!
  Volatility; local IP; firewall
[Figure: desktop grid architecture diagram; labels not recoverable]

Desktop Grid Architectures
Desktop Grid Future Generation (in 2006)
  Distributed architecture;
  Architecture with modularity: every component is "configurable": scheduler, storage, transport protocol;
  Direct communications between peers;
  Security;
  Applications coming from any science (e-Science applications)
[Figure: desktop grid architecture diagram; labels not recoverable]

In search of a distributed architecture
PastryGrid
  An approach based on a structured overlay network to discover (on the fly) the next node that will execute the next task;
  Decentralizes the execution of a distributed application with precedence constraints between tasks

PastryGrid's overview
Main objectives
  Fully distributed execution of a task graph;
  Distributed resource management;
  Distributed coordination;
  Dynamic creation of an execution environment;
  No central element;

PastryGrid's Terminology
Task terminology
  Friend tasks: T2, T3 share the same successor (T6)
  Shared task: T6 has n > 1 ancestors (T2, T3)
  Isolated tasks: T4, T5 have a single ancestor
Example
[Figure: example task graph; not recoverable]
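
The terminology above can be illustrated with a short sketch. It assumes the task graph is given as a mapping from each task to its successors. The function name `classify` and the edges into T4 and T5 are hypothetical, since the slide only states the relations between T2, T3 and T6.

```python
from collections import defaultdict


def classify(successors):
    """Classify the tasks of a DAG given as {task: [successor, ...]}."""
    # Invert the graph: collect the ancestors (direct predecessors) of each task.
    ancestors = defaultdict(set)
    for task, succs in successors.items():
        for succ in succs:
            ancestors[succ].add(task)

    # Shared task: more than one ancestor. Isolated task: exactly one ancestor.
    shared = {t for t, a in ancestors.items() if len(a) > 1}
    isolated = {t for t, a in ancestors.items() if len(a) == 1}
    # Friend tasks: the ancestors of a shared task, grouped by that successor.
    friends = {t: sorted(ancestors[t]) for t in shared}
    return friends, shared, isolated


# Hypothetical graph: only T2 -> T6 and T3 -> T6 come from the slide.
graph = {
    "T2": ["T4", "T6"],
    "T3": ["T5", "T6"],
    "T4": [],
    "T5": [],
    "T6": [],
}

friends, shared, isolated = classify(graph)
print(friends)           # {'T6': ['T2', 'T3']}
print(sorted(shared))    # ['T6']
print(sorted(isolated))  # ['T4', 'T5']
```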

PastryGrid components
  Addressing scheme to identify applications and users (based on hashing application name + submission date + user name; DHT (Pastry))
  Resource discovery protocol;
    No dedicated nodes for the search of the next node to use → on the fly!
    Optimization: the machine that terminates last starts the search.
  Rendez-vous concept (RDV);
    Objectives: localisation of a node without IP; task coordination; data recovery;
  Coordination protocol between machines participating in the application.
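
As a rough sketch of the addressing scheme and of locating a node without its IP address, the code below hashes the three fields into an identifier and picks the node whose identifier is numerically closest to it, which is how Pastry delivers a message for a key. The hash function (SHA-1), the field separator, the 128-bit width, the node names and all function names are assumptions made for illustration. The slides do not specify them, nor do they state that the RDV is chosen exactly this way.

```python
import hashlib

RING = 1 << 128  # assumed identifier space: 128-bit circular ids


def application_id(app_name, submission_date, user_name):
    """Hash the three fields into a 128-bit identifier (assumed encoding)."""
    text = f"{app_name}|{submission_date}|{user_name}"
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:16], "big")  # keep the first 128 bits


def ring_distance(a, b):
    """Distance between two identifiers on the circular id space."""
    d = abs(a - b)
    return min(d, RING - d)


def rendezvous_node(key, node_ids):
    """Return the node id numerically closest to the key."""
    return min(node_ids, key=lambda n: ring_distance(n, key))


# Hypothetical node ids, derived here by hashing made-up host names.
nodes = [application_id(f"host-{i}", "", "") for i in range(8)]
key = application_id("my-app", "2010-01-01", "alice")
rdv = rendezvous_node(key, nodes)
print(hex(key), "->", hex(rdv))
```

Any machine that knows the application name, submission date and user name can recompute the key and reach the same node, without a directory of IP addresses.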

RDV Concept
Coordinator
  Known at the beginning;
  Central element in a dedicated place;
  Failure: the system crashes;
  Centralized resource management;
  Management of all applications (overload)
RDV
  Unknown;
  Variable;
  Failure: may still run;
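
The contrast between a fixed coordinator and an RDV can be sketched with a toy model. It assumes the RDV for a key is whichever live node is currently closest to that key, so another node takes over when it fails. The class names, the 16-bit identifier space and the node ids are hypothetical, and the sketch ignores how state held by the failed node would be recovered.

```python
RING = 2 ** 16  # toy identifier space, far smaller than a real DHT's


def closest(key, live_nodes):
    """Return the live node id numerically closest to key on the ring."""
    def dist(n):
        d = abs(n - key)
        return min(d, RING - d)
    return min(live_nodes, key=dist)


class CoordinatorSystem:
    """Fixed coordinator, known at the beginning."""

    def __init__(self, nodes, coordinator):
        self.live = set(nodes)
        self.coordinator = coordinator

    def fail(self, node):
        self.live.discard(node)

    def contact(self, key):
        # Every request goes to the same node, whatever the key.
        if self.coordinator not in self.live:
            raise RuntimeError("coordinator is down")
        return self.coordinator


class RendezvousSystem:
    """RDV resolved on demand among the nodes that are still alive."""

    def __init__(self, nodes):
        self.live = set(nodes)

    def fail(self, node):
        self.live.discard(node)

    def contact(self, key):
        return closest(key, self.live)


nodes = [100, 20000, 41000, 60000]
key = 40500

central = CoordinatorSystem(nodes, coordinator=41000)
rdv = RendezvousSystem(nodes)
print(rdv.contact(key))      # 41000

central.fail(41000)
rdv.fail(41000)
print(rdv.contact(key))      # 60000: another node takes over the key
try:
    central.contact(key)
except RuntimeError as err:
    print("coordinator system:", err)
```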