A Decision Model for Cost Optimal Record Matching


  1. A Decision Model for Cost Optimal Record Matching. Presenter: Vassilios S. Verykios, IST College / Drexel University. Affiliates Workshop on Data Quality, NISS/Telcordia, December 1st, 2000.

  2. Comparison Vector • Given a pair of database records with partially overlapping schemata, decide whether it is a match or not. • Compare the pairs of values stored in each common attribute/field (assume n common fields). • The n comparison measurements form a comparison vector X.

  3. Record Comparison • Diagram: record 1 has fields A B C D E F and record 2 has fields A B C D F; each common field comparison is coded 1 = Agreement, 2 = Disagreement, 3 = Missing, and the example on the slide yields the comparison codes 1 1 3 2 2 1.
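A minimal Python sketch of how such a comparison vector might be assembled. The field names, record values, and the exact 1/2/3 coding below are illustrative assumptions, not taken from the original system.

```python
# Illustrative sketch: build a comparison vector for a pair of records.
# Codes follow the slide: 1 = agreement, 2 = disagreement, 3 = missing.
AGREE, DISAGREE, MISSING = 1, 2, 3

def compare_fields(a, b):
    """Compare one pair of field values."""
    if a is None or b is None:
        return MISSING
    return AGREE if a == b else DISAGREE

def comparison_vector(rec_a, rec_b, common_fields):
    """Return the n-dimensional comparison vector X for two records."""
    return [compare_fields(rec_a.get(f), rec_b.get(f)) for f in common_fields]

# Hypothetical records with common fields A..F (record 2 has no value for C or E).
rec1 = {"A": "john", "B": "smith", "C": "1970", "D": "main st",  "E": "apt 4", "F": "nj"}
rec2 = {"A": "john", "B": "smith", "C": None,   "D": "maine st", "E": None,    "F": "ny"}

print(comparison_vector(rec1, rec2, ["A", "B", "C", "D", "E", "F"]))
# -> [1, 1, 3, 2, 3, 2]
```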

  4. Random Vector • Even if a pair of records match, the observed value for each field comparison is different each time the observation is made. • Therefore, each field comparison variable is a random variable. • Likewise, the comparison vector X is a random vector.

  5. Distribution of Vectors • Each pair of records is expressed by a comparison vector (or a sample) in an n-dimensional space. • Many comparison vectors form a distribution of X in the n-dimensional space. • Figure 1 shows a simple two-dimensional example of two distributions corresponding to matched and unmatched pairs of records.

  6. Figure 1 • Scatter plot in the (x1, x2) plane showing the distributions of samples from matched (x) and unmatched (o) record pairs.

  7. Classifiers • If we know these two distributions of X from past experience, we can set up a boundary between them, g(x1, x2) = 0, which divides the two-dimensional space into two regions. • Once the boundary is selected, we can classify an unlabeled sample as matched or unmatched, depending on the sign of g(x1, x2). • We call g(x1, x2) a discriminant function, and a system that detects the sign of g(x1, x2) a classifier.
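As a toy illustration of deciding by the sign of a discriminant function, here is a sketch with a linear g(x1, x2); the coefficients are made up rather than learned from data.

```python
# Toy linear discriminant g(x1, x2) = w1*x1 + w2*x2 + b.
# The sign of g decides the class: g >= 0 -> matched, g < 0 -> unmatched.
# The coefficients are arbitrary placeholders, not learned values.
W1, W2, B = 1.0, -0.8, 0.3

def g(x1, x2):
    return W1 * x1 + W2 * x2 + B

def classify(x1, x2):
    return "matched" if g(x1, x2) >= 0 else "unmatched"

print(classify(0.9, 0.2))   # a point deep in the "x" cloud -> matched
print(classify(0.1, 0.8))   # a point deep in the "o" cloud -> unmatched
```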

  8. Figure 2 • Scatter plot in the (x1, x2) plane as in Figure 1, with the decision boundary g(x1, x2) = 0 drawn between the matched (x) and unmatched (o) distributions.

  9. Learning • In order to design a classifier, we must study the characteristics of the distribution of X for each category and find a proper discriminant function. • This process is called learning. • Samples used to design a classifier are called learning or training samples.

  10. Statistical Hypothesis Testing • What is the best classifier, assuming that the distributions of the random vectors are given? • The Bayes classifier minimizes the probability of classification error.

  11. Distribution and Density Functions • Random vector X, with distribution function P(X) and density function p(X). • Class i density, or conditional density of class i: p(X | c_i), also written p_i(X). • A posteriori probability of class i: P(c_i | X), also written q_i(X). • Unconditional density function, or mixture density function: $p(X) = \sum_{i=1}^{L} P_i\, p_i(X)$, where P_i is the a priori probability of class i. • Bayes rule: $q_i(X) = \dfrac{P_i\, p_i(X)}{p(X)}$.

  12. Bayes Rule for Minimum Error • Let X be a comparison vector. • Determine whether X belongs to M or U. • If the a posteriori probability of M given X is larger than that of U, X is classified to M, and vice versa.
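A small sketch of this minimum-error rule, assuming the class-conditional densities p_M(X), p_U(X) and the prior π_0 = P(c = M) are already known; all numeric values are invented for illustration.

```python
# Minimum-error Bayes rule for a comparison vector X, assuming the
# class-conditional densities p_M(X), p_U(X) and the prior pi0 = P(c = M)
# are known.  All numeric values here are invented for illustration.
def posterior_match(p_m_x, p_u_x, pi0):
    """A posteriori probability P(M | X) via the Bayes rule."""
    mixture = pi0 * p_m_x + (1 - pi0) * p_u_x   # unconditional density p(X)
    return pi0 * p_m_x / mixture

def bayes_decision(p_m_x, p_u_x, pi0):
    """Classify X to M if P(M | X) >= P(U | X), else to U."""
    return "M" if posterior_match(p_m_x, p_u_x, pi0) >= 0.5 else "U"

print(bayes_decision(p_m_x=0.40, p_u_x=0.05, pi0=0.2))  # -> 'M'
print(bayes_decision(p_m_x=0.10, p_u_x=0.30, pi0=0.2))  # -> 'U'
```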

  13. Fellegi-Sunter Model • Order the X's based on their likelihood ratio
$$ l(X) = \frac{p_M(X)}{p_U(X)} $$
• For a pair of error levels (μ, λ), choose index values n and n' such that:
$$ \sum_{i=1}^{n-1} p_U(X_i) \;<\; \mu \;\le\; \sum_{i=1}^{n} p_U(X_i), \qquad \sum_{i=n'}^{N} p_M(X_i) \;\ge\; \lambda \;>\; \sum_{i=n'+1}^{N} p_M(X_i) $$
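A sketch of this thresholding step, assuming the N comparison vectors are already sorted by decreasing likelihood ratio and that their conditional probabilities are given; the probabilities and error levels below are illustrative only.

```python
# Sketch of choosing the Fellegi-Sunter index values n and n' for error
# levels (mu, lam).  p_m[i] = p_M(X_i), p_u[i] = p_U(X_i), with the vectors
# already sorted by decreasing likelihood ratio p_m[i] / p_u[i].
def fellegi_sunter_indices(p_m, p_u, mu, lam):
    N = len(p_m)
    # n: smallest index with  sum_{i<=n} p_U(X_i) >= mu
    acc_u, n = 0.0, N
    for i in range(N):
        acc_u += p_u[i]
        if acc_u >= mu:
            n = i + 1          # 1-based index, as on the slide
            break
    # n': largest index with  sum_{i>=n'} p_M(X_i) >= lam
    acc_m, n_prime = 0.0, 1
    for i in range(N - 1, -1, -1):
        acc_m += p_m[i]
        if acc_m >= lam:
            n_prime = i + 1    # 1-based index
            break
    return n, n_prime

# Illustrative numbers only (each list sums to 1 over all vectors).
p_m = [0.45, 0.25, 0.15, 0.10, 0.05]
p_u = [0.02, 0.03, 0.10, 0.35, 0.50]
print(fellegi_sunter_indices(p_m, p_u, mu=0.05, lam=0.20))  # -> (2, 3)
```

With these numbers the rule returns n = 2 and n' = 3: the accumulated U-probability first reaches μ at the second vector, and the M-probability accumulated from the tail first reaches λ at the third.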

  14. Minimum Cost Model • Minimizing the probability of error is not the best criterion for designing a decision rule, because the misclassifications of M and U samples may have different consequences. • Misclassifying a cancer patient as normal may have a more damaging effect than misclassifying a normal patient as having cancer. • Therefore, it is appropriate to assign a cost to each situation.

  15. Decision Costs

      Cost     Decision   Class
      c_1M     A_1        M
      c_1U     A_1        U
      c_2M     A_2        M
      c_2U     A_2        U
      c_3M     A_3        M
      c_3U     A_3        U

  16. Mean Cost (I)
$$ \begin{aligned} \bar{c} ={}& c_{1M}\,P(d=A_1,\, c=M) + c_{1U}\,P(d=A_1,\, c=U) \\ {}+{}& c_{2M}\,P(d=A_2,\, c=M) + c_{2U}\,P(d=A_2,\, c=U) \\ {}+{}& c_{3M}\,P(d=A_3,\, c=M) + c_{3U}\,P(d=A_3,\, c=U) \end{aligned} $$

  17. Bayes Theorem
$$ P(d=A_i,\, c=j) = P(d=A_i \mid c=j)\; P(c=j), \quad \text{where } i = 1, 2, 3 \text{ and } j = M, U $$

  18. Conditional Probability
$$ P(d=A_i \mid c=j) = \sum_{X \in A_i} p_j(X), \quad \text{where } i = 1, 2, 3 \text{ and } j = M, U $$
$$ P(c=M) = \pi_0 \quad \text{and} \quad P(c=U) = 1 - \pi_0 $$

  19. Mean Cost (II) • Using the Bayes theorem:
$$ \begin{aligned} \bar{c} ={}& c_{1M}\,P(d=A_1 \mid c=M)\,P(c=M) + c_{1U}\,P(d=A_1 \mid c=U)\,P(c=U) \\ {}+{}& c_{2M}\,P(d=A_2 \mid c=M)\,P(c=M) + c_{2U}\,P(d=A_2 \mid c=U)\,P(c=U) \\ {}+{}& c_{3M}\,P(d=A_3 \mid c=M)\,P(c=M) + c_{3U}\,P(d=A_3 \mid c=U)\,P(c=U) \end{aligned} $$
• Using the definition of the conditional probability:
$$ \begin{aligned} \bar{c} ={}& c_{1M}\,\pi_0 \sum_{X \in A_1} p_M(X) + c_{1U}\,(1-\pi_0) \sum_{X \in A_1} p_U(X) \\ {}+{}& c_{2M}\,\pi_0 \sum_{X \in A_2} p_M(X) + c_{2U}\,(1-\pi_0) \sum_{X \in A_2} p_U(X) \\ {}+{}& c_{3M}\,\pi_0 \sum_{X \in A_3} p_M(X) + c_{3U}\,(1-\pi_0) \sum_{X \in A_3} p_U(X) \end{aligned} $$

  20. Mean Cost (III)
$$ \bar{c} = \sum_{X \in A_1} \big[ c_{1M}\,\pi_0\, p_M(X) + c_{1U}\,(1-\pi_0)\, p_U(X) \big] + \sum_{X \in A_2} \big[ c_{2M}\,\pi_0\, p_M(X) + c_{2U}\,(1-\pi_0)\, p_U(X) \big] + \sum_{X \in A_3} \big[ c_{3M}\,\pi_0\, p_M(X) + c_{3U}\,(1-\pi_0)\, p_U(X) \big] $$
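The expression above translates directly into code. The following sketch evaluates the mean cost for a given partition into A_1, A_2, A_3, assuming the two conditional distributions, the costs, and the prior π_0 are known; the dictionaries used are hypothetical placeholders.

```python
# Mean cost:
#   c_bar = sum over k of sum over X in A_k of
#           [ c_kM * pi0 * p_M(X) + c_kU * (1 - pi0) * p_U(X) ]
def mean_cost(areas, p_m, p_u, costs, pi0):
    """
    areas: dict k -> list of comparison vectors assigned to A_k (k = 1, 2, 3)
    p_m, p_u: dicts mapping a comparison vector to p_M(X) and p_U(X)
    costs: dict (k, 'M') / (k, 'U') -> cost c_kM / c_kU
    """
    total = 0.0
    for k, vectors in areas.items():
        for x in vectors:
            total += costs[(k, 'M')] * pi0 * p_m[x]
            total += costs[(k, 'U')] * (1 - pi0) * p_u[x]
    return total

# Hypothetical two-vector example (each density sums to 1 over the two vectors).
p_m = {(1, 1): 0.8, (2, 2): 0.2}
p_u = {(1, 1): 0.1, (2, 2): 0.9}
costs = {(1, 'M'): 0, (1, 'U'): 10, (2, 'M'): 2, (2, 'U'): 2, (3, 'M'): 8, (3, 'U'): 0}
areas = {1: [(1, 1)], 2: [], 3: [(2, 2)]}
print(mean_cost(areas, p_m, p_u, costs, pi0=0.3))  # -> 1.18 (up to float rounding)
```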

  21. Decision Areas • Every sample X in the decision space A should be assigned to exactly one decision class: A_1, A_2, or A_3. • We should thus assign each sample to a class in such a way that its contribution to the mean cost is minimum. • This leads to the optimal selection for the three sets, which we denote by A_1^0, A_2^0, A_3^0.

  22. Decision Making • A sample is assigned to the optimal areas as follows:

To A_1^0 if:
$$ c_{1M}\,\pi_0\,p_M(X) + c_{1U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{2M}\,\pi_0\,p_M(X) + c_{2U}\,(1-\pi_0)\,p_U(X) $$
$$ c_{1M}\,\pi_0\,p_M(X) + c_{1U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{3M}\,\pi_0\,p_M(X) + c_{3U}\,(1-\pi_0)\,p_U(X) $$

To A_2^0 if:
$$ c_{2M}\,\pi_0\,p_M(X) + c_{2U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{1M}\,\pi_0\,p_M(X) + c_{1U}\,(1-\pi_0)\,p_U(X) $$
$$ c_{2M}\,\pi_0\,p_M(X) + c_{2U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{3M}\,\pi_0\,p_M(X) + c_{3U}\,(1-\pi_0)\,p_U(X) $$

To A_3^0 if:
$$ c_{3M}\,\pi_0\,p_M(X) + c_{3U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{1M}\,\pi_0\,p_M(X) + c_{1U}\,(1-\pi_0)\,p_U(X) $$
$$ c_{3M}\,\pi_0\,p_M(X) + c_{3U}\,(1-\pi_0)\,p_U(X) \;\le\; c_{2M}\,\pi_0\,p_M(X) + c_{2U}\,(1-\pi_0)\,p_U(X) $$
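Equivalently, each sample can be sent to whichever area gives the smallest expected cost contribution, which is what the three pairs of inequalities express. A sketch with the same hypothetical costs as in the previous example (reading A_1 as link, A_2 as possible link, A_3 as non-link, an interpretation the slides do not state explicitly):

```python
# Assign a single comparison vector X to the area A_k whose expected cost
# contribution  c_kM * pi0 * p_M(X) + c_kU * (1 - pi0) * p_U(X)  is smallest.
def assign_area(p_m_x, p_u_x, costs, pi0):
    def contribution(k):
        return costs[(k, 'M')] * pi0 * p_m_x + costs[(k, 'U')] * (1 - pi0) * p_u_x
    return min((1, 2, 3), key=contribution)

# Same hypothetical costs as before.
costs = {(1, 'M'): 0, (1, 'U'): 10, (2, 'M'): 2, (2, 'U'): 2, (3, 'M'): 8, (3, 'U'): 0}
print(assign_area(p_m_x=0.80, p_u_x=0.02, costs=costs, pi0=0.3))  # -> 1
print(assign_area(p_m_x=0.30, p_u_x=0.30, costs=costs, pi0=0.3))  # -> 2
print(assign_area(p_m_x=0.05, p_u_x=0.90, costs=costs, pi0=0.3))  # -> 3
```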

  23. Optimal Decision Areas • We thus conclude from the previous slide:
$$ A_1^0 = \left\{ X : \frac{p_U(X)}{p_M(X)} \le \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{1M}}{c_{1U}-c_{3U}} \;\text{ and }\; \frac{p_U(X)}{p_M(X)} \le \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{2M}-c_{1M}}{c_{1U}-c_{2U}} \right\} $$
$$ A_2^0 = \left\{ X : \frac{p_U(X)}{p_M(X)} \ge \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{2M}-c_{1M}}{c_{1U}-c_{2U}} \;\text{ and }\; \frac{p_U(X)}{p_M(X)} \le \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{2M}}{c_{2U}-c_{3U}} \right\} $$
$$ A_3^0 = \left\{ X : \frac{p_U(X)}{p_M(X)} \ge \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{1M}}{c_{1U}-c_{3U}} \;\text{ and }\; \frac{p_U(X)}{p_M(X)} \ge \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{2M}}{c_{2U}-c_{3U}} \right\} $$

  24. Threshold Values • Assume the cost ordering $c_{1M} \le c_{2M} \le c_{3M}$ and $c_{1U} \ge c_{2U} \ge c_{3U}$.
$$ \kappa = \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{1M}}{c_{1U}-c_{3U}}, \qquad \lambda = \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{2M}-c_{1M}}{c_{1U}-c_{2U}}, \qquad \mu = \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{2M}}{c_{2U}-c_{3U}} $$
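One possible reading of slides 23 and 24 in code: compute the three thresholds and classify by the ratio p_U(X)/p_M(X), using λ and μ when A_2^0 is non-empty and κ otherwise. The costs and prior are the same hypothetical values as before.

```python
# Thresholds from slide 24, assuming c_1M <= c_2M <= c_3M and c_1U >= c_2U >= c_3U.
def thresholds(costs, pi0):
    prior = pi0 / (1 - pi0)
    kappa = prior * (costs[(3, 'M')] - costs[(1, 'M')]) / (costs[(1, 'U')] - costs[(3, 'U')])
    lam   = prior * (costs[(2, 'M')] - costs[(1, 'M')]) / (costs[(1, 'U')] - costs[(2, 'U')])
    mu    = prior * (costs[(3, 'M')] - costs[(2, 'M')]) / (costs[(2, 'U')] - costs[(3, 'U')])
    return kappa, lam, mu

def classify_by_ratio(p_m_x, p_u_x, costs, pi0):
    kappa, lam, mu = thresholds(costs, pi0)
    r = p_u_x / p_m_x                   # ratio p_U(X) / p_M(X)
    if lam <= mu:                       # A_2^0 is non-empty
        if r <= lam:
            return 1                    # A_1^0
        if r >= mu:
            return 3                    # A_3^0
        return 2                        # A_2^0
    return 1 if r <= kappa else 3       # otherwise kappa is the single threshold

costs = {(1, 'M'): 0, (1, 'U'): 10, (2, 'M'): 2, (2, 'U'): 2, (3, 'M'): 8, (3, 'U'): 0}
print(thresholds(costs, pi0=0.3))                                       # (kappa, lam, mu)
print(classify_by_ratio(p_m_x=0.80, p_u_x=0.02, costs=costs, pi0=0.3))  # -> 1
print(classify_by_ratio(p_m_x=0.30, p_u_x=0.30, costs=costs, pi0=0.3))  # -> 2
```

On the same inputs this agrees with the cost-comparison rule sketched earlier: the first sample lands in A_1 and the second in A_2.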

  25. Threshold Values • In order for A_2^0 to exist:
$$ \lambda = \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{2M}-c_{1M}}{c_{1U}-c_{2U}} \;\le\; \frac{\pi_0}{1-\pi_0} \cdot \frac{c_{3M}-c_{2M}}{c_{2U}-c_{3U}} = \mu $$
• We can now easily prove that the threshold κ lies between λ and μ.
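A short sketch of that proof, under the cost ordering of slide 24 (so all differences below are non-negative, with the denominators assumed strictly positive):

```latex
% Sketch: kappa lies between lambda and mu (mediant argument).
% Write a = c_{2M} - c_{1M},  b = c_{3M} - c_{2M},
%       d = c_{1U} - c_{2U},  e = c_{2U} - c_{3U},  K = \pi_0 / (1 - \pi_0),
% so that a, b >= 0 and d, e > 0 under the cost ordering of slide 24.
\[
  \lambda = K\,\frac{a}{d}, \qquad
  \mu     = K\,\frac{b}{e}, \qquad
  \kappa  = K\,\frac{a + b}{d + e}.
\]
% The mediant (a+b)/(d+e) always lies between a/d and b/e, hence
\[
  \min(\lambda,\mu) \;\le\; \kappa \;\le\; \max(\lambda,\mu),
\]
% and in particular, when A_2^0 exists (\lambda \le \mu), \lambda \le \kappa \le \mu.
```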
