Subgroup Discovery: Exploratory Data Analysis - PowerPoint PPT Presentation


  1. Subgroup Discovery

  2. Exploratory Data Analysis

  6. Exploratory Data Analysis § Classification: model the dependence of the target on the remaining attributes. § Problem: a classifier sometimes uses only some of the available dependencies, or it is a black box. § For example: in decision trees, some attributes may not appear because of overshadowing. § Exploratory Data Analysis: understanding the effects of all attributes on the target. § Q: How can we use ideas from C4.5 to approach this task? § A: Why not list the information gain of every attribute and rank the attributes accordingly?
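
The ranking idea in the answer above can be sketched in a few lines. This is only an illustration: the toy records, the attribute names (`outlook`, `windy`, `play`) and the helper names are assumptions made for the example, not part of the presentation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the target minus its expected entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == value]
        remainder += len(part) / len(rows) * entropy(part)
    return base - remainder

def rank_attributes(rows, target):
    """Return (attribute, gain) pairs, highest gain first."""
    attrs = [a for a in rows[0] if a != target]
    gains = [(a, info_gain(rows, a, target)) for a in attrs]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 'outlook' determines the target, 'windy' is noise.
toy = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
ranking = rank_attributes(toy, "play")
```

On this toy data `outlook` ranks first with a gain of one bit and `windy` ranks last with a gain of zero.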

  7. Interactions between Attributes § Single-attribute effects are not enough. § The XOR problem is an extreme example: two attributes that each have no information gain together form a good model. § Apart from A=a, B=b, C=c, … § consider also A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …, A=a ∧ B=b ∧ C=c, …
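
A small check of the XOR claim, assuming a four-record data set in which the target T is true exactly when A and B differ. The data and helper names are made up for the illustration.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attrs, target="T"):
    """Information gain of splitting on the joint value of one or more attributes."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in attrs), []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# XOR data: T is true exactly when A and B differ.
xor_rows = [{"A": a, "B": b, "T": a != b} for a, b in product([0, 1], repeat=2)]

gain_a = info_gain(xor_rows, ["A"])        # single attribute
gain_b = info_gain(xor_rows, ["B"])        # single attribute
gain_ab = info_gain(xor_rows, ["A", "B"])  # both attributes together

# Attribute sets of size 1 and 2, mirroring the enumeration on the slide.
candidates = [c for k in (1, 2) for c in combinations(["A", "B"], k)]
```

Each attribute alone has zero gain, while the pair recovers the full bit of target entropy, which is why conjunctions have to be considered.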

  8. Subgroup Discovery Task “Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute” § Inductive constraints: § Minimum support § (Maximum support) § Minimum quality (Information gain, χ², WRAcc) § Maximum complexity § …

  9. Confusion Matrix § A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target: § within subgroup, positive § within subgroup, negative § outside subgroup, positive § outside subgroup, negative

                      target T   target F   total
          subgroup T     .42        .13       .55
          subgroup F     .12        .33       .45
          total          .54        .46       1.0

  10. Confusion Matrix § High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target. § High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target. § The target distribution on the database is fixed.

                      target T   target F   total
          subgroup T     .42        .13       .55
          subgroup F     .12        .33       .45
          total          .54        .46       1.0
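
As an illustration of how such a matrix is obtained from data, the sketch below counts the four combinations for two boolean columns. The 100-record data set is hypothetical and was chosen only so that the relative frequencies match the numbers on the slide.

```python
def confusion_matrix(in_subgroup, is_positive):
    """Relative frequencies of the four subgroup/target combinations.

    Both arguments are equal-length sequences of booleans, one entry per record.
    Keys are (subgroup, target) pairs such as ("T", "F").
    """
    n = len(in_subgroup)
    cells = {("T", "T"): 0, ("T", "F"): 0, ("F", "T"): 0, ("F", "F"): 0}
    for s, t in zip(in_subgroup, is_positive):
        cells["T" if s else "F", "T" if t else "F"] += 1
    return {key: count / n for key, count in cells.items()}

# Hypothetical data: 42 records in the subgroup and positive, 13 in the
# subgroup and negative, 12 outside and positive, 33 outside and negative.
subgroup = [True] * 55 + [False] * 45
target = [True] * 42 + [False] * 13 + [True] * 12 + [False] * 33

cm = confusion_matrix(subgroup, target)
p_subgroup = cm["T", "T"] + cm["T", "F"]  # row total for subgroup = T
p_target = cm["T", "T"] + cm["F", "T"]    # column total for target = T
```

The column total `p_target` does not depend on the subgroup, which is the sense in which the target distribution is fixed.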

  15. Quality Measures § A quality measure for subgroups summarizes the interestingness of a subgroup's confusion matrix in a single number. § WRAcc, weighted relative accuracy: WRAcc(S,T) = p(ST) − p(S) ⋅ p(T) § between −.25 and .25; 0 means uninteresting § Balance between coverage and unexpectedness § Example: WRAcc(S,T) = p(ST) − p(S) ⋅ p(T) = .42 − .55 ⋅ .54 = .42 − .297 = .123

                      target T   target F   total
          subgroup T     .42        .13       .55
          subgroup F     .12        .33       .45
          total          .54        .46       1.0
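
The worked example can be reproduced directly. The function name `wracc` and the extra test points are assumptions added for illustration; the formula is the one given on the slide.

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t

# Values from the confusion matrix on the slide: .42 - .55 * .54
example = wracc(p_st=0.42, p_s=0.55, p_t=0.54)

# The extremes are reached when the subgroup coincides with a target
# (or with its complement) that covers exactly half of the data.
best = wracc(p_st=0.5, p_s=0.5, p_t=0.5)
worst = wracc(p_st=0.0, p_s=0.5, p_t=0.5)

# A subgroup that is statistically independent of the target scores zero.
independent = wracc(p_st=0.55 * 0.54, p_s=0.55, p_t=0.54)
```

The four values illustrate the stated range: roughly .123 for the slide's subgroup, .25 and −.25 at the extremes, and 0 for an independent subgroup.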

  16. Quality Measures § WRAcc: Weighted Relative Accuracy § Information gain § χ² § Correlation Coefficient § Laplace § Jaccard § Specificity

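  19. Subgroup Discovery as Search [Figure: search tree with root T and first-level refinements A=a1, A=a2, B=b1, B=b2, C=c1, …, shown next to a confusion matrix]

                      target T   target F   total
          subgroup T     .42        .13       .55
          subgroup F     .12        .33       .45
          total          .54        .46       1.0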

  23. Subgroup Discovery as Search [Figure: search tree. Root: T. Level 1: A=a1, A=a2, B=b1, B=b2, C=c1, … Level 2: A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, … Level 3: A=a1 ∧ B=b1 ∧ C=c1, … Annotation: minimum support level reached]
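
One way to picture the search tree is an exhaustive depth-first enumeration of conjunctions that stops refining once the minimum support is no longer met. The sketch below is a simplification under stated assumptions: the slides do not say which search strategy is used, and the function names, the `max_depth` parameter and the toy data are all invented for the example.

```python
from itertools import product

def wracc(rows, cover, target):
    """WRAcc of the subgroup whose members are the records in cover."""
    n = len(rows)
    p_s = len(cover) / n
    p_t = sum(r[target] for r in rows) / n
    p_st = sum(r[target] for r in cover) / n
    return p_st - p_s * p_t

def search(rows, target, min_support, max_depth):
    """Enumerate conjunctions of attribute=value conditions depth-first.

    A branch is abandoned as soon as its cover drops below min_support,
    because further refinements can only shrink the cover.
    """
    conditions = sorted({(a, r[a]) for r in rows for a in r if a != target})
    results = []

    def refine(description, cover, start):
        for i in range(start, len(conditions)):
            attr, value = conditions[i]
            if any(attr == a for a, _ in description):
                continue  # at most one condition per attribute
            new_cover = [r for r in cover if r[attr] == value]
            if len(new_cover) < min_support:
                continue  # minimum support level reached: prune
            new_description = description + [(attr, value)]
            results.append((new_description, wracc(rows, new_cover, target)))
            if len(new_description) < max_depth:
                refine(new_description, new_cover, i + 1)

    refine([], rows, 0)
    return sorted(results, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: T is true exactly when A and B differ (each row twice).
data = [{"A": a, "B": b, "T": a != b} for a, b in product([0, 1], repeat=2)] * 2
found = search(data, "T", min_support=2, max_depth=2)
```

With `min_support=2` the two-condition subgroups are still reachable. Raising it to 3 cuts the tree off after the first level, since every two-condition subgroup covers only two records.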

  31. Refinements are (anti-)monotonic § Refinements are (anti-)monotonic in their support… § …but not in interestingness. This may go up or down. [Figure: the entire database, with regions for the target concept, subgroup S1, S2 (refinement of S1), and S3 (refinement of S2)]

  32. Subgroup Discovery and ROC space

  33. ROC Space § ROC = Receiver Operating Characteristics § Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate. § TPR = TP/Pos = TP/(TP+FN) (fraction of the positive cases that fall in the subgroup) § FPR = FP/Neg = FP/(FP+TN) (fraction of the negative cases that fall in the subgroup)
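
For a concrete point in ROC space, the two rates can be computed from the confusion-matrix cells used earlier in the slides. The function name `roc_point` is an assumption for this sketch.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup given its confusion-matrix cells."""
    tpr = tp / (tp + fn)  # share of all positives that the subgroup covers
    fpr = fp / (fp + tn)  # share of all negatives that the subgroup covers
    return fpr, tpr

# Cells from the slides' confusion matrix:
# subgroup T / target T = .42, subgroup T / target F = .13,
# subgroup F / target T = .12, subgroup F / target F = .33.
fpr, tpr = roc_point(tp=0.42, fp=0.13, fn=0.12, tn=0.33)
```

The example subgroup lands at roughly (FPR, TPR) = (0.28, 0.78), above the diagonal.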

  39. ROC Space Properties [Figure: ROC space with labels: ‘ROC heaven’ (perfect subgroup), ‘ROC hell’ (random subgroup), perfect negative subgroup, entire database, empty subgroup, minimum support threshold]

  41. Measures in ROC Space [Figure: isometrics of WRAcc and Information Gain in ROC space, with regions marked positive, 0, and negative; source: Flach & Fürnkranz]
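
The shape of the WRAcc isometrics can be checked numerically. Substituting p(ST) = TPR ⋅ p(T) and p(S) = TPR ⋅ p(T) + FPR ⋅ (1 − p(T)) into the WRAcc definition gives WRAcc = p(T) ⋅ (1 − p(T)) ⋅ (TPR − FPR), so points with equal TPR − FPR score the same. This rewriting is a derivation added here, not a formula from the slides, and the function names are assumptions.

```python
def wracc(p_st, p_s, p_t):
    """WRAcc as defined on the slides: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t

def wracc_from_roc(fpr, tpr, p_t):
    """WRAcc expressed in ROC coordinates: p(T) * (1 - p(T)) * (TPR - FPR)."""
    return p_t * (1 - p_t) * (tpr - fpr)

p_t = 0.54
tpr, fpr = 0.42 / 0.54, 0.13 / 0.46

via_roc = wracc_from_roc(fpr, tpr, p_t)
direct = wracc(p_st=0.42, p_s=0.55, p_t=p_t)

# Two different points on the line TPR - FPR = 0.3 get the same WRAcc.
same_line = (wracc_from_roc(0.1, 0.4, p_t), wracc_from_roc(0.5, 0.8, p_t))
```

Both routes give about .123 for the slides' example, and the two points on one line parallel to the diagonal share a score.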

  42. Other Measures § Precision § Gini index § Correlation coefficient § Foil gain

  47. Refinements in ROC Space § Refinements of S can only reduce the FPR and TPR, so they appear to the left of and below S. § The blue polygon represents the possible refinements of S. § With a convex measure f, the quality of any refinement is bounded by the value of f at the corners of the polygon. § If no corner is above the minimum quality or the current best (top k?), prune the search space below S.
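
A minimal sketch of this pruning rule, under a simplifying assumption: the region of possible refinements is taken to be the full rectangle [0, FPR] × [0, TPR] below and to the left of S, ignoring the minimum-support cut that turns it into the polygon on the slide. The rectangle contains the polygon, so the bound stays valid but may be looser. WRAcc in ROC coordinates is used as the example measure. It is linear and therefore convex. All names are hypothetical.

```python
def wracc_from_roc(fpr, tpr, p_t):
    """WRAcc in ROC coordinates: p(T) * (1 - p(T)) * (TPR - FPR)."""
    return p_t * (1 - p_t) * (tpr - fpr)

def optimistic_estimate(fpr, tpr, measure):
    """Upper bound on the measure over all refinements of a subgroup at (fpr, tpr).

    Assumes the measure is convex, so its maximum over the rectangle
    [0, fpr] x [0, tpr] is attained at one of the four corners.
    """
    corners = [(0.0, 0.0), (fpr, 0.0), (0.0, tpr), (fpr, tpr)]
    return max(measure(f, t) for f, t in corners)

def can_prune(fpr, tpr, measure, threshold):
    """True if no refinement can reach the threshold (minimum quality or current k-th best)."""
    return optimistic_estimate(fpr, tpr, measure) < threshold

p_t = 0.54

def measure(f, t):
    return wracc_from_roc(f, t, p_t)

# Subgroup from the slides' confusion matrix: FPR = .13/.46, TPR = .42/.54.
bound = optimistic_estimate(0.13 / 0.46, 0.42 / 0.54, measure)
prune_at_high_bar = can_prune(0.13 / 0.46, 0.42 / 0.54, measure, threshold=0.20)
prune_at_low_bar = can_prune(0.13 / 0.46, 0.42 / 0.54, measure, threshold=0.15)
```

For WRAcc the best corner is (0, TPR), the refinement that keeps all covered positives and drops all covered negatives. Here that bound is about .193, so the branch is pruned for a threshold of .20 and kept for .15.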

  48. Combining Two Subgroups

