DART: a Data Acquisition and Repairing Tool Bettina Fazzinga, Sergio Flesca, Filippo Furfaro and Francesco Parisi D.E.I.S. Università della Calabria {bfazzinga, flesca, furfaro, fparisi}@deis.unical.it International Workshop on Inconsistency and Incompleteness in Databases March 26, 2006 - Munich (Germany)
Motivation • Error-free acquisition of data is mandatory in several application scenarios – balance sheet analysis Balance sheet analysis tool analysis report electronic doc – generally balance sheets are available as paper documents, thus they cannot be processed by balance analysis tools, since these work only on electronic data
Motivation • Error-free acquisition of data is mandatory in several application scenarios – balance sheet analysis yes analysis acquisition consistent? acquired validation input tool phase data document no electronic doc paper doc correction constraints Current approach a massive human intervention is required • currently, integrity constraints defined on the input data are exploited only for validating acquired data • if data are inconsistent all the document portions involved into unsatisfied constraint must be checked for locating and correcting errors
Motivation • For instance OCR tool acquired document source document cash sales 100 100 + cash sales 100 120 = receivables 120 receivables 120 220 total cash receipts 250 220 - total cash receipts 220 120 + payment of accounts 120 payment of accounts 120 40 = long-term financing 40 long-term financing 40 total disbursement 160 total disbursements 160 160 = 160 net cash inflow 60 60 net cash inflow 60 a massive human intervention is required for correcting errors • constraints like those defined in the context of balance-sheet data can be express by aggregate constraints
Key Idea exploit integrity constraints for suggesting corrections yes acquisition consistent? acquired validation input phase data document no electronic doc paper doc compute a repair constraints correction the human intervention will be limited to verify only located suggestions
Key Idea exploit integrity constraints for suggesting corrections • For instance acquired document cash sales 100 receivables 120 DART suggests decreasing total cash receipts 250 the value down to 220 payment of accounts 120 long-term financing 40 total disbursement 160 net cash inflow 60 • in this case the operator will have to verify a single value instead of all the values in the table
Outline • Repairing strategies • DART architecture • Aggregate constraints • Steady aggregate constraints (SAC) • Computing a card-minimal repair
Repairing strategy • What is a reasonable strategy for repairing the acquired data? Tuple deletion / insertion The repaired cash budget The inconsistent cash budget Receipts Receipts cash sales 100 cash sales 100 100 + 120 + receivables 120 receivables 120 30 = 250 XXXXX 30 total cash 250 total cash 250 100 + 120 ≠ 250 Adding a new tuple means that the OCR tool skipped a whole row when acquiring ... It’s rather unrealistic !!!
Repairing strategy • What is a reasonable strategy for repairing the acquired data? • The most natural approach is updating directly the numerical data – Work at attribute-level, rather than tuple-level The repaired cash budget The inconsistent cash budget Receipts Receipts cash sales 100 cash sales 100 100 + 120 = receivables 120 receivables 120 220 total cash 250 total cash 220 100 + 120 ≠ 250 • In our context, we can reasonably assume that inconsistencies are due to symbol recognition errors • Thus, trying to re-construct the actual data values (without changing the number of tuples) is well founded
Card-minimal semantics The most probable case is that the acquiring system made the minimum number of errors Card-minimal semantics It means assuming that the minimum number of errors occurred A repair R is card -minimal for D iff there is no repair R’ for D consisting of fewer updates than R R Only two updates do not suffice to repair D!
Outline • Repairing strategies • DART architecture • Aggregate constraints • Steady aggregate constraints (SAC) • Computing a card-minimal repair
DART architecture Extraction Constraint Metadata Metadata Repairing Module Acquisition and Extraction electronic doc Module paper doc output data tabular input data
DART architecture - Acquisition and Extraction Module Extraction Constraint Metadata Metadata Extraction DB generator Repairing Module Wrapper Acquisition electronic doc Converter output data paper doc OCR tool
DART architecture - Repairing Module Extraction Constraint Metadata Metadata Extraction MILP DB generator transformer Wrapper MILP solver Acquisition validation interface electronic doc Converter paper doc OCR tool output data
Outline • Repairing strategies • DART architecture • Aggregate constraints • Steady aggregate constraints (SAC) • Computing a card-minimal repair
Aggregate constraints: the application context • A cash budget for a firm: Year 2004 Sections Receipts aggregate items are obtained by aggregating beginning cash 20 Subsections cash sales 100 detail items of the same receivables 120 section total cash receipts 220 Disbursements payment of accounts 120 capital expenditure 0 long-term financing 40 total disbursements 160 Balance net cash inflow 60 ending cash balance 80
Aggregate constraints: the application context • A cash budget for a firm: Year 2004 Sections Receipts derived items beginning cash 20 Subsections are obtained using the cash sales 100 value of other item of receivables 120 any type and belonging total cash receipts 220 to any section Disbursements payment of accounts 120 capital expenditure 0 long-term financing 40 total disbursements 160 Balance net cash inflow 60 ending cash balance 80
Aggregate constraints: the application context • A cash budget satisfy some integrity constraints: 1) Year 2004 for each section, the sum Receipts of all detail items must be beginning cash 20 equal to the value of the cash sales 100 100 + aggregate item 120 = receivables 120 220 total cash receipts 220 Disbursements payment of accounts 120 120 + capital expenditure 0 0 + 40 = long-term financing 40 160 total disbursements 160 Balance net cash inflow 60 ending cash balance 80
Aggregate constraints: the application context • A cash budget satisfy some integrity constraints: 2) Year 2004 the net cash inflow must be Receipts equal to the difference beginning cash 20 between total cash receipts cash sales 100 and total disbursements receivables 120 total cash receipts 220 220 - Disbursements 160 = payment of accounts 120 60 capital expenditure 0 long-term financing 40 total disbursements 160 Balance net cash inflow 60 ending cash balance 80
From the paper document to its digitized version CashBudget Year 2004 Receipts Section Subsection Type Value beginning cash 20 Receipts beginning cash drv 20 cash sales 100 Receipts cash sales det 100 receivables 120 Receipts receivables det 120 total cash receipts 220 Receipts total cash receipts aggr 250 Disbursements Disbursements payment of accounts det 120 Disbursements capital expenditure det 0 payment of accounts 120 Disbursements long-term financing det 40 capital expenditure 0 Disbursements total disbursements aggr 160 long-term financing 40 Balance net cash inflow drv 60 total disbursements 160 Balance ending cash balance drv 80 Balance net cash inflow 60 ending cash balance 80 Acquisition and Extraction Module
Aggregate constraints • can express constraints like those defined in the context of balance-sheet data where: 1. is a conjunction of atoms 2. is a constant 3. The aggregation formula is the linear combination of aggregation functions with
Aggregation function • Relational scheme R(A 1 ,A 2 ,…A n ) – Measure attributes: numerical attributes representing measures • Such as weight, length, price, etc. Linear combination of attributes • Aggregation function Boolean formula on constants and attributes of R
Aggregate constraints • CashBudget(Section,Subsection,Type,Value) 1) Section Subsection Type Value for each section, the sum Receipts beginning cash drv 20 of all detail items must be Receipts cash sales det 100 equal to the value of the Receipts receivables det 120 aggregate item Receipts total cash receipts aggr 250 Disbursements payment of accounts det 120 Disbursements capital expenditure det 0 Aggregation function: Disbursements long-term financing det 40 Disbursements total disbursements aggr 160 Balance net cash inflow drv 60 Balance ending cash balance drv 80 Aggregate constraint:
Recommend
More recommend