  1. Using PROV-O to represent lineage in statistical processes: a record linkage example
     - Flavio Rizzolo (Statistics Canada)
     - Guillaume Duffes (Institut National de la Statistique et des Études Économiques)
     - Franck Cotton (Institut National de la Statistique et des Études Économiques)

  2. Contents
     - Context and objectives
     - Record linkage and lineage metadata
     - What is PROV-O?
     - PROV-O representations for record linkage lineage
     - Conclusions and future work
     (Slide footer: SemStats 2019, Lineage metadata for record linkage with PROV-O)

  3. Context and objectives
     - Statistical offices need to provide trusted data
     - Information on how the data was produced helps to do that
     - Provenance and lineage metadata are information on:
       - Processes and methods used
       - Actors involved (data providers, owners, publishers, etc.)
       - Relations between data outputs and data sources
     - That metadata should:
       - Use a standard model in order to be easily understandable
       - Be accessible and (machine-)usable

  4. Context and objectives
     - Main goal of the paper: a proof of concept on using the PROV model to represent lineage information on statistical processes
     - Record linkage chosen as the example process:
       - Sufficiently complex, but not too much
       - Widely used in statistical production
       - Formal descriptions already available
       - Lineage metadata can be defined at various levels of detail
       - Various software packages exist

  5. Context and objectives
     - This is very practical work, not groundbreaking research

  6. Record linkage and lineage metadata: record linkage
     - Matching of data about real-world entities (people, businesses, products…) coming from different data sources
     - Typical process (diagram): Source A and Source B go through automatic matching, which yields a match, a possible match, or a non-match; possible matches go to expert review, which yields a match or a non-match
     - Widely used (e.g. data integration), lots of methodological work
     - Even a dedicated record linkage process model (Statistics Canada)
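The typical process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the string similarity from `difflib` stands in for a real matching method, and the thresholds `UPPER` and `LOWER` as well as the function names are hypothetical, not taken from the presentation.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
UPPER = 0.9  # at or above: automatic match
LOWER = 0.6  # below: automatic non-match; in between: sent to expert review


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (a stand-in comparison)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(a: str, b: str) -> str:
    """Automatic matching step: return 'match', 'possible' or 'non-match'."""
    score = similarity(a, b)
    if score >= UPPER:
        return "match"
    if score >= LOWER:
        return "possible"
    return "non-match"


def link(source_a, source_b):
    """Compare every pair of values from two sources and group pairs by outcome."""
    outcome = {"match": [], "possible": [], "non-match": []}
    for a in source_a:
        for b in source_b:
            outcome[classify(a, b)].append((a, b))
    return outcome
```

Pairs that land in the `"possible"` group are the ones the slide routes to expert review; recording which step decided each pair is exactly the kind of lineage metadata the rest of the deck is about.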

  7. Record linkage and lineage metadata: lineage model

  8. Record linkage and lineage metadata: types of lineage metadata
     - Dataset lineage: a dataset is derived from others by record linkage; keep track of the sources and the transformations applied
     - Record lineage: track where the record comes from, or which records are its contributors, and what integration was applied
     - Variable lineage: track how a variable (e.g. linkage key) is derived from variables in source datasets
     - Data point lineage: not used for record linkage, but heavily used in upstream tasks like data cleansing

  9. What is PROV-O?
     - W3C recommendation, part of the PROV family (provenance metadata)
     - OWL2 expression of the PROV data model
     - Simple “Starting point” model, expanded terms and qualification mechanism
     - Starting point terms (diagram):
       - Classes: Entity, Activity, Agent
       - Entity to Entity: wasDerivedFrom
       - Entity to Activity: wasGeneratedBy
       - Entity to Agent: wasAttributedTo
       - Activity to Entity: used
       - Activity to Activity: wasInformedBy
       - Activity to Agent: wasAssociatedWith
       - Agent to Agent: actedOnBehalfOf
       - Activity to xsd:dateTime: startedAtTime, endedAtTime
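To make the starting point terms concrete, here is a minimal sketch that writes a record linkage run as subject-predicate-object triples and checks each triple against the domain and range of its property. Everything under the `ex:` prefix is a hypothetical identifier invented for this illustration, and the check is a plain consistency test in Python, not OWL reasoning.

```python
# Domain and range of the PROV-O starting-point object properties.
STARTING_POINT = {
    "prov:wasDerivedFrom":    ("prov:Entity",   "prov:Entity"),
    "prov:wasGeneratedBy":    ("prov:Entity",   "prov:Activity"),
    "prov:wasAttributedTo":   ("prov:Entity",   "prov:Agent"),
    "prov:used":              ("prov:Activity", "prov:Entity"),
    "prov:wasInformedBy":     ("prov:Activity", "prov:Activity"),
    "prov:wasAssociatedWith": ("prov:Activity", "prov:Agent"),
    "prov:actedOnBehalfOf":   ("prov:Agent",    "prov:Agent"),
}

# Hypothetical resources and their PROV classes (illustration only).
TYPES = {
    "ex:sourceA": "prov:Entity",
    "ex:sourceB": "prov:Entity",
    "ex:linkedDataset": "prov:Entity",
    "ex:recordLinkage": "prov:Activity",
    "ex:linkageTool": "prov:Agent",
    "ex:statOffice": "prov:Agent",
}

# High-level lineage of one hypothetical record linkage run.
TRIPLES = [
    ("ex:recordLinkage", "prov:used", "ex:sourceA"),
    ("ex:recordLinkage", "prov:used", "ex:sourceB"),
    ("ex:linkedDataset", "prov:wasGeneratedBy", "ex:recordLinkage"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:sourceA"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:sourceB"),
    ("ex:recordLinkage", "prov:wasAssociatedWith", "ex:linkageTool"),
    ("ex:linkageTool", "prov:actedOnBehalfOf", "ex:statOffice"),
    ("ex:linkedDataset", "prov:wasAttributedTo", "ex:statOffice"),
]


def violations(triples, types):
    """Return the triples whose subject or object class does not fit the property."""
    bad = []
    for s, p, o in triples:
        domain, range_ = STARTING_POINT[p]
        if types.get(s) != domain or types.get(o) != range_:
            bad.append((s, p, o))
    return bad
```

Calling `violations(TRIPLES, TYPES)` returns an empty list for the triples above, while a reversed statement such as `("ex:sourceA", "prov:used", "ex:recordLinkage")` is reported, because `prov:used` goes from an Activity to an Entity.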

  10. What is PROV-O? Expanded terms (diagram)
      - Agent subclasses: Person, Organization, SoftwareAgent
      - Entity subclasses: Collection, Bundle, Plan
      - Other class: Location
      - Entity to Entity: wasQuotedFrom / wasRevisionOf / hadPrimarySource, alternateOf / specializationOf
      - Entity to Activity: wasInvalidatedBy
      - Entity to xsd:dateTime: generatedAtTime, invalidatedAtTime
      - Entity to literal: value
      - Collection to Entity: hadMember
      - Activity to Entity: wasStartedBy / wasEndedBy
      - Other properties: wasInfluencedBy, atLocation

  11. What is PROV-O? Qualification mechanism

  12. PROV-O representations for record linkage lineage. Simple example: the high-level view

  13. PROV-O representations for record linkage lineage. Simple example: the high-level view

  14. PROV-O representations for record linkage lineage. Simple example: the high-level view

  15. PROV-O representations for record linkage lineage. Simple example: the high-level view

  16. PROV-O representations for record linkage lineage. The record linkage process (simplified)

  17. PROV-O representations for record linkage lineage. Produce linkage-ready datasets: process

  18. PROV-O representations for record linkage lineage. Produce linkage-ready datasets: PROV-O representation

  19. PROV-O representations for record linkage lineage. Produce linkage keys: process

  20. PROV-O representations for record linkage lineage. Produce linkage keys: PROV-O representation (blocking)

  21. PROV-O representations for record linkage lineage. Produce linkage keys: PROV-O representation (linking)

  22. Conclusions and future work
      - Proof of concept conclusive: PROV-O can be used to represent the process
      - Using PROV-O makes it possible to represent the different levels of lineage metadata coherently
      - The “Russian dolls” nature of the PROV-O model implies that metadata can be produced at different levels
      - Examples of queries that can be made:
        - List the output datasets produced from a given data source
        - Which dataset(s) does this record come from?
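The two example queries can be sketched over a small in-memory set of triples. This is an illustration under assumptions, not the authors' implementation: all `ex:` identifiers are hypothetical, datasets are treated as collections whose records are attached with `prov:hadMember`, and plain Python set operations stand in for a query language.

```python
# Hypothetical lineage triples covering both dataset and record level.
TRIPLES = {
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:readyA"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:readyB"),
    ("ex:readyA", "prov:wasDerivedFrom", "ex:sourceA"),
    ("ex:readyB", "prov:wasDerivedFrom", "ex:sourceB"),
    ("ex:sourceA", "prov:hadMember", "ex:recA1"),
    ("ex:sourceB", "prov:hadMember", "ex:recB7"),
    ("ex:linkedDataset", "prov:hadMember", "ex:linked1"),
    ("ex:linked1", "prov:wasDerivedFrom", "ex:recA1"),
    ("ex:linked1", "prov:wasDerivedFrom", "ex:recB7"),
}


def derived_from(source):
    """Query 1: everything derived, directly or transitively, from `source`."""
    found, frontier = set(), {source}
    while frontier:
        step = {s for s, p, o in TRIPLES
                if p == "prov:wasDerivedFrom" and o in frontier}
        frontier = step - found  # only follow entities not seen before
        found |= step
    return found


def source_datasets(record):
    """Query 2: datasets holding the records that `record` was derived from."""
    contributors = {o for s, p, o in TRIPLES
                    if s == record and p == "prov:wasDerivedFrom"}
    return {s for s, p, o in TRIPLES
            if p == "prov:hadMember" and o in contributors}
```

Here `derived_from("ex:sourceA")` yields the linkage-ready dataset and the linked dataset, and `source_datasets("ex:linked1")` yields both source datasets. In a triple store, the first query would typically be written as a SPARQL property path such as `prov:wasDerivedFrom+`.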

  23. Conclusions and future work: future work
      - Continue work on record linkage, in particular on the representation of methodology
      - Test how to automate the production of metadata in commonly used software
      - Study the possibility of activating metadata (i.e. using it as a specification)
      - Adapt to other statistical operations (e.g. data editing, variable derivation...)
      - Promote the work in the Official Statistics community

  24. Thank you for your attention. Any questions?
