Data Integrity Verification IIA Orange County Chapter November 13, 2015 �������������� ���� 1 Data Integrity Verification
Excel Transformed My Data! BEFORE AFTER ��������������������� �������������� 2 Data Integrity Verification
Michael Kano Senior Manager, Data Analytics Sunera LLC Michael is a Senior Manager with Sunera’s national data analytics practice. Michael has 20 years of experience in data analytics and internal audit with organizations in the USA, Canada, and Kuwait. He has 20 years of experience with ACL software, including 8 years as the leader of ACL Services Ltd.’s global training team. During his tenure at ACL Services, Michael helped drive the training business to new levels of revenues and profits by actively supporting the Sales team in pre-sales discussions. Michael’s most recent experience consists of four years with eBay, Inc.’s internal audit team as Manager, Audit Analysis. He was tasked with integrating data analytics into the audit workflow on strategic and tactical levels. This included developing quality and documentation standards, training users, and providing analytics support on numerous audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such as the Business Ethics Office and Enterprise Risk Management teams. During his years at eBay, Michael supported audits throughout the organization in the IT, compliance, operations, vendor management, revenue assurance, T&E, and human resources areas. Michael also has 7 years of experience with Arbutus Software, and has managed the transition to Arbutus from other data analysis tools. He is a proficient user of Tableau, Microsoft Access, and Teradata SQL Assistant. 3
AGENDA � Defining data integrity verification (DIV) � Sources of integrity erosion � File-level testing � Field-level testing 4 Data Integrity Verification
Defining Data Integrity Verification 5 Data Integrity Verification
Data Integrity Verification (DIV) � The process by which the data analyst tests the data to determine whether it is acceptable for analysis � Tests should be carried out at both the file level and the field level before conducting any analytics. 6 Data Integrity Verification
The Risks of Integrity Erosion � Lost time � Incorrect conclusions � Revenue/cost � Security � Professional standing 7 Data Integrity Verification
Evidence of data integrity erosion � Blank/invalid entries � Missing records in key fields � Excess records � Incorrect/invalid � Duplicates formatting � Shifted fields � Invalid characters in � Skewed records data 8 Data Integrity Verification
Shifted Fields 9 Data Integrity Verification
Skewed Records 10 Data Integrity Verification
Sources of Integrity Erosion 11 Data Integrity Verification
Processing… 12 Data Integrity Verification
The Process 13 Data Integrity Verification
Sources of data integrity errors � Miscommunication of requirements � Extraction � Conversion � Transmission � Import � Manual edits � Data definition 14 Data Integrity Verification
Miscommunication � "All AP transactions between April and June, including all important fields." � "All AP payments and reversals between 4/1/2015 and 6/30/2015 (inclusive) including the following fields: <field list>. The output should be in a tab-delimited text file, and at no point should it pass through a spreadsheet or be opened in a spreadsheet application." 15 Data Integrity Verification
Conversion � Dropping leading zeros (ID numbers) � Converting date to numeric � Removing alphas from alphanumeric field � Use of delimiter that is included within a text field � Insertion of blank lines in Excel 16 Data Integrity Verification
Date Conversion 17 Data Integrity Verification
Manual Edits � Inadvertent/deliberate editing � How does that happen? � Sorting � Formatting � Copy/pasting 18 Data Integrity Verification
Data Definition � Record length � Field position � Formatting (date fields) 19 Data Integrity Verification
File-Level Testing 20 Data Integrity Verification
File-Level Testing � Structure � Content 21 Data Integrity Verification
Structure � Review metadata � Send table layout to a table in Arbutus/ACL � Compare field type/length/format to metadata 22 Data Integrity Verification
Content � Completeness � Run COUNT to document number of records � Run TOTAL on numeric fields for control totals � Uniqueness: Run DUPLICATES command selecting all fields to identify duplicate records � Validity: Run VERIFY against numeric and date fields 23 Data Integrity Verification
Field-Level Testing: Numerics 24 Data Integrity Verification
Numeric Fields: What to look for � Field total � Lowest value � Highest value � Average � Second-highest value � Range � Ratio of 2nd highest to highest � Absolute value � Median � Number of zeros � Number of positives � Number of negatives � Number of corrupt entries 25 Data Integrity Verification
Testing Numeric Fields � Run STATISTICS against all numeric fields � Look for zeros, negatives, bounds, highest/second-highest � Recalculate computed value with computed fields (e.g, Total_Amount = Price * Quantity) 26 Data Integrity Verification
Scripted Solution � Shows table/field names, and test date-time in a table � Provides comprehensive, standard test results � Faster and less error-prone than manual execution � 2 million records, 4 numeric fields in ~45 seconds � Also saves table layout for file with _TL suffix 27 Data Integrity Verification
Script Results: Numerics 28 Data Integrity Verification
Field-Level Testing: Dates 29 Data Integrity Verification
Date Fields: What to look for � Oldest � Weekends � Most recent � Blanks � Span of valid dates � Invalid non-blank dates 30 Data Integrity Verification
Testing Date Fields � Run STATISTICS against all date fields � Blanks/invalids/weekends � Bounds � Test related fields, e.g., PO_Date <= Invoice_Date � Test for completeness (24/7 data) with GAPS command 31 Data Integrity Verification
Blank Dates & Formatting � Entire date column is blank = Incorrect format in field definition. � Edit >> Table Layout to review and correct format 32 Data Integrity Verification
Formatting Date Fields 33 Data Integrity Verification
Dates: Scripted Solution Data Integrity Verification 34
Field-Level Testing: Characters 35 Data Integrity Verification
Character Fields: What to look for Item Functionality Blanks ISBLANK(<key>) Invalid entries CLASSIFY ON <key> CLASSIFY ON FORMAT(<key>) Duplicates DUPLICATES ON <key> 36 Data Integrity Verification
Character Fields: Formats � Verify that format is valid � May need to scrub � PO numbers, customer IDs, phone numbers, zip codes � Use FORMAT() function in CLASSIFY to display list of unique formats CLASSIFY ON FORMAT(<field name>) TO "<output file>" OPEN 37 Data Integrity Verification
Output of CLASSIFY + Format() � 1 record per format � Shows frequency x= lower-case alpha X = upper-case alpha 9 = numeric Blanks/special characters 38 Data Integrity Verification
Mitigating Integrity Risk 39 Data Integrity Verification
Key Items � Know your data � Obtain data independently (SQL?) � Short chain from extraction to analysis � Automated DIV 40 Data Integrity Verification
The Process 41 Data Integrity Verification
The New Process 42 Data Integrity Verification
Benefits � Independence � Confidence � Shorter time � Comprehensive DIV 43 Data Integrity Verification
Any questions? Michael Kano, ACDA mkano@sunera.com 44 Data Integrity Verification
Recommend
More recommend