Introduction Active Data Discussion Conclusion
Active Data
A Data-Centric Approach to Data Life-Cycle Management
Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)
(1) Inria, ENS Lyon, University of Lyon
(2) University of British Columbia
November 18th, 2013
A. Simonet (Inria) Active Data (PDSW’13) November 18th, 2013 1/20
Outline
◮ Introduction
  ◮ Data Life Cycle Management
  ◮ Use-case
  ◮ Requirements
◮ Active Data
  ◮ Active Data: principles & features
  ◮ Example: Globus Online and iRODS
◮ Discussion
  ◮ Advantages
  ◮ Limitations
◮ Conclusion
  ◮ Related work
  ◮ Conclusion
Big Data
◮ Science and industry have become data-intensive
◮ The volume of data produced by science and industry grows exponentially
  ◮ How to store this deluge of data?
  ◮ How to extract knowledge and sense?
  ◮ How to make data valuable?
◮ Some examples
  ◮ CERN’s Large Hadron Collider: 1.5 PB/week
  ◮ Large Synoptic Survey Telescope, Chile: 30 TB/night
  ◮ Billion-edge social network graphs
  ◮ Searching and mining the Web
Data Life Cycle
◮ Creation/Acquisition
◮ Transfer
◮ Replication
◮ Disposal/Archiving
Definition: The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
Data Life Cycle Management
Complicated scenarios
◮ Execution of workflows
◮ Complex interactions between software
◮ Need to quickly react to operational events
Ad-hoc task-centric approaches
◮ Hard to program, maintain and debug
◮ No formal specification
◮ Complicates interactions between systems
Data Life Cycle Use-case
Example: the Advanced Photon Source at Argonne National Lab
◮ 100 TB of raw data per day
◮ Raw data are preprocessed and registered in a Globus dataset catalog
◮ Data are analyzed by various applications
◮ Results are stored in the dataset catalog and shared
[Figure: data flow of the use-case. Node and edge labels: Instrument (Beamline), Local Storage, Transfer, Remote Academic Data Center Cluster, Analysis, More analysis, Upload result, Extract & Register Metadata, Register result metadata, Metadata Catalog.]
Use-case: Task Centric vs. Data Centric
Task Centric
◮ Independent scripts
◮ Hard to program, maintain, verify
◮ Coarse granularity
Data Centric
◮ Express data dependencies
◮ Cross data-center coordination
◮ User-level fault-tolerance
◮ Incremental processing
Requirements
Challenges: a perfect system should...
◮ Simply represent the life cycle of data distributed across different data centers and systems
◮ Simplify DLM modeling and reasoning
◮ Hide the complexity resulting from using different infrastructures and systems
◮ Be easy to integrate with existing systems
Active Data principles
System programmers expose their system’s internal data life cycle with a model based on Petri Nets.
A life cycle model is made of Places and Transitions.
[Figure: Petri net with the places Created, Written, Read and Terminated, the transitions t1 to t4, and one token (•).]
◮ Each token has a unique identifier that matches the identifier of the actual data item.
◮ A transition is fired whenever the state of a data item changes.
◮ Clients may plug code into transitions. The code is executed whenever the transition is fired. Handler shown on the slide: public void handler() { computeMD5(); }
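To make the model concrete, the following is a minimal sketch of a life cycle with places, transitions, tokens and pluggable handlers. It is not the Active Data API: every class and method name is hypothetical, and the wiring of t1 to t4 between the four places is an assumption, since the arcs of the figure are not recoverable from the slide text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these names are hypothetical, not the Active Data API.
class LifeCycleSketch {
    enum Place { CREATED, WRITTEN, READ, TERMINATED }

    // A transition moves a token from one place to another.
    static final class Transition {
        final String name;
        final Place from;
        final Place to;

        Transition(String name, Place from, Place to) {
            this.name = name;
            this.from = from;
            this.to = to;
        }
    }

    // Client code plugged into a transition.
    interface Handler {
        void handle(String tokenId, Transition fired);
    }

    // Assumed wiring of t1..t4; the arcs of the original figure are not known.
    static final Transition T1 = new Transition("t1", Place.CREATED, Place.WRITTEN);
    static final Transition T2 = new Transition("t2", Place.WRITTEN, Place.READ);
    static final Transition T3 = new Transition("t3", Place.READ, Place.READ);
    static final Transition T4 = new Transition("t4", Place.READ, Place.TERMINATED);

    // Token id (same as the data item id) -> place currently holding the token.
    private final Map<String, Place> tokens = new HashMap<>();
    // Transition name -> handlers plugged into it.
    private final Map<String, List<Handler>> handlers = new HashMap<>();

    void create(String tokenId) {
        tokens.put(tokenId, Place.CREATED);
    }

    void subscribe(Transition t, Handler h) {
        handlers.computeIfAbsent(t.name, k -> new ArrayList<>()).add(h);
    }

    // Fires t for the token: checks that it is enabled, moves the token, runs handlers.
    void fire(String tokenId, Transition t) {
        if (tokens.get(tokenId) != t.from) {
            throw new IllegalStateException(t.name + " is not enabled for " + tokenId);
        }
        tokens.put(tokenId, t.to);
        for (Handler h : handlers.getOrDefault(t.name, Collections.<Handler>emptyList())) {
            h.handle(tokenId, t);
        }
    }

    Place placeOf(String tokenId) {
        return tokens.get(tokenId);
    }

    public static void main(String[] args) {
        LifeCycleSketch net = new LifeCycleSketch();
        net.subscribe(T1, (id, t) -> System.out.println("handler ran for " + id + " on " + t.name));
        net.create("item-1");
        net.fire("item-1", T1);
        System.out.println(net.placeOf("item-1"));
    }
}
```

Keeping the token map keyed by the data item's identifier mirrors the statement that each token shares its identifier with the item it represents.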
Active Data features
The Active Data programming model and runtime environment:
◮ Allows clients to react to life cycle progression
◮ Transparently exposes distributed data sets
◮ Can be integrated with existing systems
◮ Has scalable performance and minimal overhead over existing systems
Implementation
◮ Prototype implemented in Java (≃ 2,800 LOC)
◮ Client/Service communication is Publish/Subscribe
◮ 2 types of subscription:
  ◮ Every transition for a given data item
  ◮ Every data item for a given transition
[Figure: clients subscribe to the Active Data Service.]
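The two subscription types can be sketched as two indexes, one keyed by data item and one keyed by transition. This is an in-process stand-in under stated assumptions: the class and its methods are hypothetical, handlers receive only the item identifier and the transition name, and the network layer of the real prototype is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical in-process stand-in for the service; the real prototype is networked.
class SubscriptionSketch {
    // Handlers receive (itemId, transitionName).
    private final Map<String, List<BiConsumer<String, String>>> byItem = new HashMap<>();
    private final Map<String, List<BiConsumer<String, String>>> byTransition = new HashMap<>();

    // Subscription type 1: every transition for a given data item.
    void subscribeToItem(String itemId, BiConsumer<String, String> handler) {
        byItem.computeIfAbsent(itemId, k -> new ArrayList<>()).add(handler);
    }

    // Subscription type 2: every data item for a given transition.
    void subscribeToTransition(String transition, BiConsumer<String, String> handler) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    // Notifies matching subscribers and returns how many handlers were called.
    int publish(String itemId, String transition) {
        List<BiConsumer<String, String>> targets = new ArrayList<>();
        targets.addAll(byItem.getOrDefault(itemId, Collections.emptyList()));
        targets.addAll(byTransition.getOrDefault(transition, Collections.emptyList()));
        for (BiConsumer<String, String> h : targets) {
            h.accept(itemId, transition);
        }
        return targets.size();
    }

    public static void main(String[] args) {
        SubscriptionSketch service = new SubscriptionSketch();
        service.subscribeToItem("file-1", (id, t) -> System.out.println("item watcher: " + id + " " + t));
        service.subscribeToTransition("t2", (id, t) -> System.out.println("transition watcher: " + id + " " + t));
        service.publish("file-1", "t1");
        service.publish("file-2", "t2");
    }
}
```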
Implementation (continued)
◮ Several ways to publish transitions
  ◮ Instrument the code
  ◮ Read the logs
  ◮ Rely on an existing notification system
◮ The service orders transitions by time of arrival
[Figure: clients publish transitions to the Active Data Service; other clients subscribe to it.]
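As a rough illustration of two of these points, the sketch below publishes transitions parsed from log lines and lets the service stamp each one with an arrival sequence number. Everything here is an assumption made for the example: the class names, the `<itemId> <transition>` log format and the use of a simple counter to represent arrival order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and log format are invented for illustration.
class ArrivalOrderSketch {
    // A transition as recorded by the service, stamped with its arrival rank.
    static final class Published {
        final long seq;
        final String itemId;
        final String transition;

        Published(long seq, String itemId, String transition) {
            this.seq = seq;
            this.itemId = itemId;
            this.transition = transition;
        }
    }

    private long nextSeq = 0;
    private final List<Published> history = new ArrayList<>();

    // Arrival order is decided here: whoever enters the monitor first gets the lower rank.
    synchronized Published publish(String itemId, String transition) {
        Published p = new Published(nextSeq++, itemId, transition);
        history.add(p);
        return p;
    }

    // Publishing by reading logs, assuming lines of the form "<itemId> <transition>".
    Published publishFromLogLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unparseable log line: " + line);
        }
        return publish(parts[0], parts[1]);
    }

    // Returns a copy of the transitions in arrival order.
    synchronized List<Published> history() {
        return new ArrayList<>(history);
    }

    public static void main(String[] args) {
        ArrivalOrderSketch service = new ArrivalOrderSketch();
        service.publishFromLogLine("file-1 t1");
        service.publish("file-2", "t1");
        for (Published p : service.history()) {
            System.out.println(p.seq + " " + p.itemId + " " + p.transition);
        }
    }
}
```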
Implementation (continued)
◮ Clients run transition handler code locally
◮ Transition handlers are executed
  ◮ Serially
  ◮ In a blocking way
  ◮ In the order transitions were published
[Figure: clients publish transitions to the Active Data Service, which notifies the subscribed clients.]
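One way to obtain serial, blocking, in-order handler execution on the client side is a single worker thread fed by a FIFO queue. The sketch below uses the standard `Executors.newSingleThreadExecutor()` for that purpose; the class and its methods are hypothetical, and the slides do not say how the prototype actually implements this guarantee.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical client-side sketch; not taken from the prototype.
class SerialHandlerSketch {
    // A single worker thread: handlers run one at a time, in submission order,
    // and a slow handler delays every handler queued behind it.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Called once per notification, in the order notifications arrive.
    void onNotify(Runnable handler) {
        worker.execute(handler);
    }

    // Stops accepting handlers and waits for the queued ones to finish.
    boolean drain() {
        worker.shutdown();
        try {
            return worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        SerialHandlerSketch client = new SerialHandlerSketch();
        for (int i = 1; i <= 3; i++) {
            final int n = i;
            client.onNotify(() -> System.out.println("handling transition " + n));
        }
        client.drain();
    }
}
```

The trade-off of this design is that ordering is simple to reason about, but a long-running handler holds up all later notifications for that client.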
Performance evaluation: Throughput
[Figure: Average number of transitions per second handled by the Active Data Service. x-axis: number of clients (10 to 550); y-axis: transitions per second (5,000 to 35,000).]
Clients publish 10,000 transitions in a row without pausing.
The prototype scales up to 30,000 transitions per second.
Example: Data Provenance
Definition: The complete history of data life cycle derivations and operations.
◮ Assess the quality of data
◮ Keep track of the origin of data over time
◮ Specialized Provenance Aware Storage Systems
→ What about heterogeneous systems?
Example with Globus Online and iRODS
◮ Globus Online: file transfer service
◮ iRODS: data store and metadata catalog
Example: Globus Online and iRODS
Data events coming from Globus Online and iRODS
[Figure: the Globus Online and iRODS life cycles drawn as two connected Petri nets. Place labels: Created, Put, Get, Failed, Succeeded, Terminated. Transitions: t1 to t10. The token carries the identifier { GO: 7b9e02c4-925d-11e2 }.]
Handler shown on the slide, plugged into a transition: public void handler() { iput(...); }
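The idea of following one data item across both systems can be sketched as a token that carries one identifier per system and accumulates a provenance history. This is a sketch under several assumptions: all names are hypothetical, the event labels "Succeeded" and "Created" are assumed to belong to Globus Online and iRODS respectively, and the `iput` call is replaced by a stub that returns a placeholder identifier rather than contacting iRODS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; no real Globus Online or iRODS call is made.
class ProvenanceSketch {
    // One logical data item, known under a different identifier in each system.
    static final class Token {
        final Map<String, String> ids = new LinkedHashMap<>(); // system name -> id in that system
        final List<String> history = new ArrayList<>();        // provenance, oldest first
    }

    // Appends one provenance entry of the form "<system>:<event>".
    static void record(Token token, String system, String event) {
        token.history.add(system + ":" + event);
    }

    // Stand-in for the iput(...) call on the slide; returns a placeholder id.
    static String fakeIput(Token token) {
        return "irods-placeholder-id";
    }

    // Handler for a successful transfer (assumed to be a Globus Online event):
    // stores the file in iRODS and links the new iRODS id to the same token.
    static void onTransferSucceeded(Token token) {
        record(token, "GO", "Succeeded");
        String irodsId = fakeIput(token);
        token.ids.put("iRODS", irodsId);
        record(token, "iRODS", "Created");
    }

    public static void main(String[] args) {
        Token token = new Token();
        token.ids.put("GO", "7b9e02c4-925d-11e2"); // identifier shown on the slide
        onTransferSucceeded(token);
        System.out.println(token.ids);
        System.out.println(token.history);
    }
}
```

Because both identifiers live on the same token, the history answers the question raised for heterogeneous systems: the provenance of the item spans the transfer service and the data store.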