Data Quality Blueprint for Pentaho: Better Data Leads to Better Results
Charles Gaddy, Director, Global Sales & Alliances, Melissa Data
Agenda
• What is Data Quality, and What Role Does it Play?
• 6 Concepts of Data Quality
• Full Data Quality Lifecycle
What is Data Quality?
• Data quality is an assessment of data’s fitness to serve its stated purpose. Think “fitness for use.”
• Data quality applies to any kind of data: scientific, transactional, customer, product, asset, location, financial.
• Data cleansing is the activity used to achieve data quality.
Information Industry
• The data governance, data quality, and Master Data Management (MDM) disciplines are closely related.
• Data quality improvement is important within both data governance and MDM. Furthermore, you seldom see an MDM implementation without a (master) data governance work stream.
Information Industry
What is data used for?
• Revenues: 63%
• Service: 54%
• Marketing: 38%
• Risk Reduction: 37%
• Channel Pipeline: 36%
• New Projects: 34%
• Regulatory: 32%
Survey conducted by Melissa Global Intelligence
Atomic Domains of Data Quality
• Basic data domains represent data that is common to many businesses, such as age, date of birth, and sales amount.
• Advanced data domains span the range of data classifications to cover more specific cases for your use.
• The rule conditions for advanced data rule definitions can also be more complex.
Atomic Domains of Data Quality
Atomic or entity domains that need special handling and have available domain-based knowledge
• Domain categories: Orders and Sales, Location, Personal Identity, Asset Identity, Financial
• Example fields: Address, Name, Order Amount, Age, IP Address, Policy Information, Zip Code, Sales Amount, Date of Birth, Portfolio, Latitude, Order ID, Longitude, US SSN, Phone Number, Bank Account Number, State, CA SIN, Email Address, City, Passport Number, VIN Number, Country
Advanced Atomic Domains of Data Quality
Atomic or entity domains that need special handling and require custom domain-based knowledge
• Parts
• SKUs
• Product Items
• Assemblies
6 Concepts of Data Quality
• Duplication
• Integrity
• Accuracy
• Consistency
• Conformity
• Completeness
1. Completeness
• Is all the requisite information available?
• Are data values missing, or in an unusable state?
In some cases, missing data is irrelevant, but when the missing information is critical to a specific business process, completeness becomes an issue.
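As a rough illustration of a completeness check, the sketch below computes, for each required field, the fraction of records that carry a usable value. The field names, the sample records, and the list of "missing" markers are assumptions made for this example, not part of any Pentaho or Melissa Data API.

```python
from typing import Iterable, Mapping

# Values treated as "missing"; this set is an assumption for illustration.
MISSING_MARKERS = {"", "n/a", "null", "none"}


def is_missing(value: object) -> bool:
    """Return True when a value is absent or an unusable placeholder."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_MARKERS


def completeness(records: Iterable[Mapping[str, object]],
                 required: list[str]) -> dict[str, float]:
    """Return, per required field, the fraction of records with a usable value."""
    rows = list(records)
    if not rows:
        return {field: 0.0 for field in required}
    return {
        field: sum(not is_missing(row.get(field)) for row in rows) / len(rows)
        for field in required
    }


# Hypothetical customer records.
customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "Bob Ray", "email": "", "phone": None},
    {"name": "  ", "email": "N/A", "phone": "555-0101"},
    {"name": "Dee Fox", "email": "dee@example.com", "phone": "555-0102"},
]
scores = completeness(customers, ["name", "email", "phone"])
print(scores)
```

Which fields count as required depends on the business process that consumes the data, so the `required` list would normally come from that process rather than being hard-coded.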
2. Conformity
• Are there expectations that data values conform to specified formats?
• If so, do all the values conform to those formats?
Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships.
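One common way to test conformity is to match each value against a format rule. The sketch below uses regular expressions for a US ZIP code and an ISO-style date; both rules and the field names are hypothetical, and a real rule set would follow the organization's own standards.

```python
import re

# Hypothetical format rules; real rules depend on the business's standards.
FORMAT_RULES = {
    "zip": re.compile(r"\d{5}(-\d{4})?"),          # 12345 or 12345-6789
    "order_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # YYYY-MM-DD shape only
}


def nonconforming(records, rules=FORMAT_RULES):
    """Return (row_index, field, value) for every value that breaks its rule."""
    problems = []
    for index, row in enumerate(records):
        for field, pattern in rules.items():
            value = row.get(field)
            # Missing values are a completeness issue, so they are skipped here.
            if value is None:
                continue
            if not pattern.fullmatch(str(value)):
                problems.append((index, field, value))
    return problems


orders = [
    {"zip": "92688", "order_date": "2015-03-01"},
    {"zip": "92688-2112", "order_date": "03/01/2015"},
    {"zip": "9268", "order_date": "2015-03-02"},
]
violations = nonconforming(orders)
print(violations)
```

Note that the date rule only checks the shape of the value; it would accept an impossible date such as 2015-13-45, which is an accuracy question rather than a conformity one.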
3. Consistency
• Do distinct data instances provide conflicting information about the same underlying data object?
• Are values consistent across data sets?
• Do interdependent attributes always appropriately reflect their expected consistency?
Inconsistency between data values plagues companies attempting to reconcile between systems and applications.
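A minimal sketch of a cross-system consistency check follows. It assumes two hypothetical systems (`crm` and `billing`) that key the same customers by a shared ID, and reports fields whose values disagree.

```python
def find_conflicts(system_a, system_b, fields):
    """Return {key: {field: (value_a, value_b)}} for shared keys that disagree."""
    conflicts = {}
    # Only keys present in both systems can conflict.
    for key in sorted(system_a.keys() & system_b.keys()):
        diffs = {
            field: (system_a[key].get(field), system_b[key].get(field))
            for field in fields
            if system_a[key].get(field) != system_b[key].get(field)
        }
        if diffs:
            conflicts[key] = diffs
    return conflicts


# Hypothetical extracts from two systems holding the same customers.
crm = {
    "C001": {"state": "CA", "zip": "92688"},
    "C002": {"state": "NY", "zip": "10001"},
}
billing = {
    "C001": {"state": "CA", "zip": "92688"},
    "C002": {"state": "NJ", "zip": "10001"},
    "C003": {"state": "TX", "zip": "73301"},
}
conflicts = find_conflicts(crm, billing, ["state", "zip"])
print(conflicts)
```

Records that exist in only one system (such as `C003` here) are not conflicts; they point to a linkage gap, which is closer to the integrity concept.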
4. Accuracy
• Do data objects accurately represent the “real-world” values they are expected to model?
Incorrect spellings of product or person names, wrong addresses, and even stale or out-of-date data can impact operational and analytical applications.
5. Duplication
• Are there multiple, unnecessary representations of the same data objects within your data set?
The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks.
6. Integrity
• Which data is missing important relationship linkages?
The inability to link related records together may actually introduce duplication across your systems. Moreover, as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.
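Missing linkages can be surfaced with a simple referential check. The sketch below, using hypothetical customer and order records, lists child rows whose foreign key does not point to any known parent.

```python
def orphaned(children, parent_keys, foreign_key):
    """Return child rows whose foreign key matches no known parent key."""
    parents = set(parent_keys)
    return [row for row in children if row.get(foreign_key) not in parents]


# Hypothetical data: known customer IDs and the orders that reference them.
customer_ids = {"C001", "C002"}
order_rows = [
    {"order_id": 1, "customer_id": "C001"},
    {"order_id": 2, "customer_id": "C009"},  # points to an unknown customer
    {"order_id": 3, "customer_id": None},    # no link at all
]
orphans = orphaned(order_rows, customer_ids, "customer_id")
print([row["order_id"] for row in orphans])
```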
Why Data is Always in Flux
• 40 million Americans (1 in 6) move annually
• More than 100,000 changes (adds, deletes, or modifications) every month
• Quality of stored U.S. addresses declines 17% per year
• Phone Area Code Splits
• Email Domain Changes
• Disconnected Phone Numbers
The Full Life Cycle of Data Quality
Profiling
• Gathering Metadata for Analysis
  – Data about your data
• Identify the Problems
  – NULLs/Blanks, Unnecessary Spaces, Incorrect Patterns, Unstandardized Data, etc.
• Overall Status of the Quality of Data
  – Statistical Analysis
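To make the profiling step concrete, here is a small sketch that gathers metadata about a single column: blank counts, distinct values, values with unnecessary spaces, and the character patterns in use. The metric names and the sample phone numbers are assumptions for illustration; a profiling tool would report many more statistics.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Map letters to 'A' and digits to '9', leaving other characters as-is."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def profile_column(values):
    """Gather simple metadata about one column of raw values."""
    values = list(values)
    present = [str(v) for v in values if v is not None and str(v).strip() != ""]
    return {
        "count": len(values),
        "blank": len(values) - len(present),            # NULLs and blanks
        "distinct": len(set(present)),
        "padded": sum(v != v.strip() for v in present),  # unnecessary spaces
        "patterns": Counter(pattern_of(v.strip()) for v in present),
    }


# Hypothetical phone column with mixed formats and gaps.
phones = ["949-555-0100", "9495550101", None, " 949-555-0102", "", "949-555-0100"]
report = profile_column(phones)
print(report)
```

A pattern that appears only rarely in the `patterns` counter is often the quickest pointer to unstandardized or incorrect values.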
Hygiene
• Data Standardization/Normalization
  – Proper Casing
  – Proper Formatting
  – Removal of Unnecessary Characters
• Data Cleansing
  – Misspellings
  – Parsing
  – Abbreviations
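The hygiene steps above can be sketched with a few string operations. The example below is deliberately naive: the abbreviation table is hypothetical, the casing rule is a simple capitalize, and the phone format assumes 10-digit North American numbers.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger and
# context-aware (for example, "St" can also mean "Saint").
STREET_ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def standardize_street(raw: str) -> str:
    """Strip stray characters, fix casing, and expand known abbreviations."""
    # Replace anything that is not a letter, digit, or space with a space.
    cleaned = re.sub(r"[^A-Za-z0-9 ]", " ", raw)
    words = []
    for word in cleaned.split():  # split() also collapses repeated spaces
        expanded = STREET_ABBREVIATIONS.get(word.lower())
        words.append(expanded if expanded else word.capitalize())
    return " ".join(words)


def standardize_phone(raw: str):
    """Format a 10-digit number as NNN-NNN-NNNN, or return None if it is not one."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(standardize_street("  123   main st. "))
print(standardize_phone("(949) 555.0100"))
```

Returning `None` for numbers that cannot be formatted keeps bad values visible for later review instead of silently passing them through.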
Data Verification
• Verifying the actual content of data
  – Do the Addresses actually exist?
  – Are the Phone Numbers callable?
  – Are Email Addresses deliverable?
  – Are the names actually people’s names?
  – Do the Address, Name, Phone and Email correspond to each other?
Enrich and Update
• Missing Information
  – Appending: fill in missing data pieces, such as a missing phone number or email address
• Enrichment of Data
  – Property Information, Geographic Information, Firmographics, Demographics
• Retrieve the latest information (e.g., move address and latest phone number)
  – Data becomes outdated over time
Matching
• De-Duplication
  – Duplicate data is bad data
• Fuzzy Matching
  – Application of fuzzy logic algorithms for inexact matches
• Deep Domain Knowledge
  – Handles matching problems in international data and in multiple domains
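As a minimal sketch of fuzzy matching, the example below compares normalized names with `difflib.SequenceMatcher` from the Python standard library. This is a generic similarity measure standing in for the domain-aware algorithms the slide refers to, and the 0.85 threshold is an assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so formatting does not affect scores."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_matches(names, threshold=0.85):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(names), 2)
        if similarity(a, b) >= threshold
    ]


names = ["Jon Smith", "John Smith", "Mary Jones", "JOHN  SMITH"]
pairs = find_matches(names)
print(pairs)
```

Comparing every pair grows quadratically with the number of records, so larger data sets usually group candidates first (for example by ZIP code) and only compare within each group.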
Merging
• Golden Record Selection
  – Selection of the best record
• Consolidation and Survivorship
  – Merging the best pieces of data according to intelligent rules
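One simple survivorship rule is "newest non-empty value wins". The sketch below applies it to a group of records already identified as duplicates; the rule, the `updated` field, and the sample records are assumptions for illustration, and production rules are typically richer (source trust, verification status, and so on).

```python
from datetime import date


def merge_duplicates(records, fields):
    """Build one surviving record: per field, keep the newest non-empty value."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        # Walk from newest to oldest and take the first usable value.
        golden[field] = next(
            (r[field] for r in newest_first if r.get(field)), None
        )
    return golden


# Hypothetical duplicate group for one person.
dupes = [
    {"name": "John Smith", "phone": "949-555-0100", "email": "",
     "updated": date(2014, 1, 5)},
    {"name": "Jon Smith", "phone": "", "email": "jsmith@example.com",
     "updated": date(2015, 6, 1)},
]
golden = merge_duplicates(dupes, ["name", "phone", "email"])
print(golden)
```

The result keeps the newer spelling "Jon Smith" even though the older one may be correct, which shows why recency alone is rarely a sufficient rule.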
Monitoring
• Profiling Over Time
  – Continuously gather metadata
  – Allows for maintenance of Data Quality
  – Data profiling with a good tool can also be employed as an active monitoring solution
  – Active monitoring is something that can be employed to safeguard collected data
  – By using the same profiling techniques, it is possible to reassess the current state of the quality of data
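Monitoring can reuse a profiling metric and compare it against an earlier baseline. The sketch below tracks the blank rate per field and flags fields that worsened by more than a tolerance; the baseline figures, the 0.05 tolerance, and the snapshot data are all hypothetical.

```python
def blank_rate(values):
    """Return the fraction of values that are None or blank."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is None or str(v).strip() == "" for v in values) / len(values)


def check_drift(baseline, current, tolerance=0.05):
    """Return {field: (old_rate, new_rate)} where the blank rate rose too much."""
    alerts = {}
    for field, old_rate in baseline.items():
        new_rate = current.get(field, old_rate)
        if new_rate - old_rate > tolerance:
            alerts[field] = (old_rate, new_rate)
    return alerts


# Hypothetical baseline from an earlier profiling run.
baseline = {"email": 0.10, "phone": 0.25}

# Hypothetical snapshot profiled today with the same technique.
snapshot = {
    "email": ["a@example.com", "", None, "b@example.com"],
    "phone": ["949-555-0100", "949-555-0101", "", "949-555-0102"],
}
current = {field: blank_rate(values) for field, values in snapshot.items()}
alerts = check_drift(baseline, current)
print(alerts)
```

Running the same check on a schedule turns a one-off profile into the kind of active monitoring the slide describes.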
Summary
• What is Data Quality, and what role does it play?
• 6 Concepts of Data Quality
• Full Data Quality Lifecycle