DaQuinCIS
Methodologies and Tools for Data Quality in Cooperative Information Systems
http://www.dis.uniroma1.it/~dq/
Research Programme co-funded by MIUR (Fiscal Year 2001)

Description of the general reference scenario and presentation of metadata

Cinzia Cappiello, Chiara Francalanci, Paolo Missier, Barbara Pernici, Pierluigi Plebani, Monica Scannapieco

Abstract

This report aims to define process-based metrics of data quality and the metadata on which a certificate can be produced to certify the quality of data. In particular, data quality dimensions are classified into four categories: subject dimensions, object dimensions, process dimensions, and architectural dimensions. For each dimension, the metadata involved in the measurement process are described. A selection of relevant data quality dimensions in a CIS context is proposed in order to focus the project activity on this set. Finally, the structure of a quality certificate that can be associated with data flowing across different organizations is defined.

Date: 20 December 2002
Product type: Technical report
Number of pages: 38
Responsible unit: MIP
Units involved: MIB, MIP, RM
Contact author:
1 Introduction

Data quality is an increasingly critical aspect of the quality of service in the majority of information-intensive businesses. In business contexts where each organization can exclusively access internal data, the primary goal of data quality assurance is the continuous control of data values and, possibly, their improvement. In cooperative information systems (CIS), which involve multiple organizations that must share data to reach a common goal, quality assurance additionally requires objective measures and evaluations of data quality that can be exchanged along with the corresponding data. Moreover, in a context in which interacting organizations may not be familiar with each other, approaches to certifying the quality of exchanged data are important for evaluating incoming information.

This report focuses on two fundamental aspects of data quality management in CISs. First, it classifies data quality dimensions and surveys their definitions and measures in the literature. Second, it proposes a set of relevant dimensions that have to be analyzed in a CIS context. Section 2 presents a model for exporting data and quality data; Sections 3-8 describe a whole set of data quality dimensions, together with the metadata that enable data quality evaluation. Section 9 proposes a basic set of frequently used dimensions as relevant for CISs, and Section 10 describes the data quality certificate.

2 The D²Q Model

All cooperating organizations export data quality dimension values evaluated for the application data according to a specific data model. The model for exporting data and quality data is referred to as the Data and Data Quality (D²Q) model. In defining the model, for simplicity we consider a set of only four dimensions, namely accuracy, completeness, internal consistency and currency. However, such a set includes all the considered dimensions (i.e.,
in the object, process and architectural categories).

2.1 Data Model

The D²Q model is inspired by the data model underlying XML-QL [4]. A database view of XML is adopted: an XML document is a set of data items, and a Document Type Definition (DTD) is the schema of such data items, consisting of data and quality classes. In particular, a D²Q XML document contains both application data, in the form of a D²Q data graph, and the related data quality values, in the form of four D²Q quality graphs, one for each considered quality dimension. Specifically, nodes of the D²Q data graph are linked to the corresponding nodes of the D²Q quality graphs, as shown in Figure 1.

As a running example, consider the document citizens.xml, shown in Figure 2, which contains entries about citizens with the associated quality data. Such a document corresponds to a set of conceptual data items, which are instances of conceptual schema elements; schema elements are data and quality classes, and instances are data and quality objects. Specifically, an instance of Citizen and the related Accuracy values are depicted.

Data classes and objects are straightforwardly represented as D²Q data graphs, as detailed in the remainder of this section, and quality classes and objects are represented as D²Q quality graphs, as detailed in Section 2.2.

In order to clarify our definition of data class in XML, we first recall a typical definition of data class from ODMG [5]. A data class δ(π₁, …, πₙ) consists of:
• a name δ;
• a set of property tuples πᵢ = ⟨nameᵢ : typeᵢ⟩, i = 1…n, n ≥ 1, where nameᵢ is the name of the property πᵢ and typeᵢ can be:
  – either a basic type¹;
  – or a data class;
  – or a type set-of<X>, where <X> can be either a basic type or a data class.

A D²Q data graph G is a graph with the following features:
• a set of nodes N; each node (i) is identified by an object identifier and (ii) is the source of four different links to quality objects, one for each quality dimension. A link is an attribute-value pair, in which the attribute represents the specific quality dimension for the element tag and the value is an IDREF link²;
• a set of edges E ⊂ N × N; each edge is labeled by a string, which represents an element tag of an XML document;
• a single root node R;
• a set of leaves; leaves are nodes that (i) are not identified and (ii) are labeled by strings, which represent element tag values, i.e., the values of the element tags labeling the edges that lead to them.

Data class instances can be represented as D²Q data graphs according to the following rules. Let δ(π₁, …, πₙ) be a data class with n properties, and let O be a data object, i.e., an instance of the data class. Such an instance is represented by a D²Q data graph G as follows:
• The root R of G is labeled with the object identifier of the instance O.
• For each πᵢ = ⟨nameᵢ : typeᵢ⟩ the following rules hold:
  – if typeᵢ is a basic type, then R is connected to a leaf lvᵢ by the edge ⟨R, lvᵢ⟩; the edge is labeled with nameᵢ and the leaf lvᵢ is labeled with the property value O.nameᵢ;
  – if typeᵢ is a data class, then R is connected to the D²Q data graph that represents the property value O′ = O.nameᵢ by an edge labeled with nameᵢ;
  – if typeᵢ is a set-of<X>, then:
    ∗ let C be the cardinality of O.nameᵢ; R is connected to C elements as follows: (i) if <X> is a basic type, then the elements are leaves (each labeled with a property value of the set); (ii) otherwise, if <X> is a data class, then the elements are D²Q data graphs, each representing a data object of the set;
    ∗ the edges connecting the root to the elements are all labeled with nameᵢ.

In Figure 3, the D²Q data graph of the running example is shown: an object instance Maria Rossi of the data class Citizen is considered. The data class has Name and Surname as properties of basic types, a property of type set-of<TelephoneNumber>, and another property of data class type ResidenceAddress; the data class ResidenceAddress has only properties of basic types.

¹ Basic types are those provided by the most common programming languages and SQL, that is, Integer, Real, Boolean, String, Date, Time, Interval, Currency, Any.
² The use of links will be further explained in Section 2.2, where quality graphs are introduced.
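The rules for representing a data class instance as a D²Q data graph can be illustrated with a minimal Python sketch. It is not part of the D²Q specification and rests on several assumptions made only for illustration: a data object is encoded as a Python dict, a set-of property as a list, and a basic-type value as a scalar; object identifiers are generated as o1, o2, ...; the telephone numbers are modeled as plain strings; and the telephone and address values, as well as the Street and City property names, are placeholders, since the content of Figure 3 is not reproduced here. The links to quality objects are omitted, because quality graphs are introduced in Section 2.2.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Optional, Tuple, Union


@dataclass
class Leaf:
    """Unidentified node, labeled with an element tag value."""
    value: str


@dataclass
class Node:
    """Identified node; edges are (element tag, target) pairs."""
    oid: str
    edges: List[Tuple[str, Union["Node", Leaf]]] = field(default_factory=list)


def build_data_graph(obj: dict, ids: Optional[Iterator[int]] = None) -> Node:
    """Build a data graph whose root identifies the data object `obj`."""
    ids = count(1) if ids is None else ids
    root = Node(oid=f"o{next(ids)}")  # hypothetical identifier scheme
    for name, value in obj.items():
        # A set-of<X> property (a list here) yields one edge per element,
        # all labeled `name`; any other property is a single element.
        elements = value if isinstance(value, list) else [value]
        for element in elements:
            if isinstance(element, dict):
                # Data class: connect the root to the sub-object's own graph.
                target: Union[Node, Leaf] = build_data_graph(element, ids)
            else:
                # Basic type: connect the root to a leaf labeled with the value.
                target = Leaf(str(element))
            root.edges.append((name, target))
    return root


# Running example: the Citizen instance Maria Rossi (placeholder values).
maria = {
    "Name": "Maria",
    "Surname": "Rossi",
    "TelephoneNumber": ["+39 000 0000001", "+39 000 0000002"],
    "ResidenceAddress": {"Street": "Via Example 1", "City": "Roma"},
}
graph = build_data_graph(maria)
```

In this sketch the root carries five outgoing edges: one each for Name and Surname, two labeled TelephoneNumber (one per element of the set), and one labeled ResidenceAddress that leads to a nested graph with its own identified root.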