

  1. Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Unsupervised Learning Anomaly Detection

  2. Overview: frequent itemset and association rule mining, other itemset extensions, clustering, anomaly detection

  3. The analytics process

  4. Recall
     Predictive analytics (supervised learning): predict the future based on patterns learnt from past data. Classification (categorical target) versus regression (continuous target). You have a labelled data set at your disposal.
     Descriptive analytics (unsupervised learning): describe patterns in data. Clustering, association rules, sequence rules. No labelling required: for unsupervised learning, we don't assume a label or target.

  5. Frequent itemset and association rule mining

  6. Introduction
     Association rule learning is a method for discovering interesting relations between variables. Interesting? Frequent, rare, costly, strange? It is intended to identify strong rules discovered in databases using some measures of interestingness.
     For example, the rule {onions, tomatoes, ketchup} → {burger} found in the sales data of a supermarket would indicate that if a customer buys onions, tomatoes and ketchup together, they are likely to also buy hamburger meat, which can be used e.g. for promotional pricing or product placements.
     Application areas include market basket analysis, web usage mining, intrusion detection, and production and manufacturing.
     Association rule learning typically does not consider the order of items, either within a transaction or across transactions (sequence mining does).
     Pioneering technique: the apriori algorithm (Rakesh Agrawal, 1993).

  7. {beer, diapers}?
     https://www.itbusiness.ca/news/behind-the-beer-and-diapers-data-mining-legend/136
     1992: Thomas Blischok, manager of a retail consulting group at Teradata, prepared an analysis of 1.2 million market baskets from 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m., consumers bought beer and diapers". Osco managers did not exploit the beer and diapers relationship by moving the products closer together.

  8. {lotion, calcium, zinc}?
     http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
     Before long, some useful patterns emerged. Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. When someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.
     This led to 25 products that, when analyzed together, allowed the analyst to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
     “My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

  9. Transactional database
     Every instance now represents a transaction. Features correspond to columns: binary categoricals (item bought or not).

     Tr. ID  milk  bread  beer  cheese  wine  spaghetti
     101     1     1      1     0       0     0
     102     0     1      1     1       1     0
     103     1     1      0     1       0     1
     104     0     0      0     1       1     1
     105     1     1      0     1       1     1
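
As a running illustration (not part of the slides), the same transactional database can be written down as a small Python structure; the variable name below is just for this sketch:

    # Transactional database from the table above, one set of items per transaction
    transactions = {
        101: {"milk", "bread", "beer"},
        102: {"bread", "beer", "cheese", "wine"},
        103: {"milk", "bread", "cheese", "spaghetti"},
        104: {"cheese", "wine", "spaghetti"},
        105: {"milk", "bread", "cheese", "wine", "spaghetti"},
    }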

  10. Mining interesting rules
     What constitutes a "good" rule? To select rules, constraints on various measures of "interest" are used. The best-known measures/constraints are minimum thresholds on support and confidence.

     Support(X ⊆ I, T) = |{t ∈ T : X ⊆ t}| / |T|

     Using the transactional database above: Support({milk, bread, cheese}) = 2/5 = 0.4
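
A minimal sketch of the support computation, reusing the transactions dict defined above (the helper name support is mine, not from the slides):

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of the itemset."""
        hits = sum(1 for items in transactions.values() if itemset <= items)
        return hits / len(transactions)

    print(support({"milk", "bread", "cheese"}, transactions))  # 0.4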

  11. Mining interesting rules
     What constitutes a "good" rule? To select rules, constraints on various measures of "interest" are used. The best-known measures/constraints are minimum thresholds on support and confidence.

     Confidence(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / Support(X, T)

     Using the transactional database above: Confidence({cheese, wine} → {spaghetti}) = 0.4 / 0.6 ≈ 0.67
     Confidence can be interpreted as an estimate of the conditional probability P(Y | X).
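
A matching sketch for confidence, building on the support helper above (again an illustration, not slide code):

    def confidence(antecedent, consequent, transactions):
        """Support of the full rule divided by support of the antecedent: an estimate of P(Y | X)."""
        return (support(antecedent | consequent, transactions)
                / support(antecedent, transactions))

    print(confidence({"cheese", "wine"}, {"spaghetti"}, transactions))  # 0.666...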

  12. Mining interesting rules
     Other measures exist as well:

     Lift(X ⊆ I ⇒ Y ⊆ I, T) = Support(X ∪ Y, T) / (Support(X, T) × Support(Y, T))

     If lift = 1, the occurrence of the antecedent and that of the consequent are independent of each other; if lift > 1, the value signifies the degree to which they depend on each other. Lift considers both the confidence of the rule and the overall data set.

     Conviction(X ⊆ I ⇒ Y ⊆ I, T) = (1 − Support(Y, T)) / (1 − Confidence(X ⇒ Y, T))

     Conviction is interpreted as the ratio of the expected frequency that X occurs without Y (if they were independent) to the observed frequency of incorrect predictions, i.e. a measure of how often the rule makes an incorrect prediction: a value of 1.2 would indicate that the rule would be incorrect 1.2 times as often if the association between X and Y was purely by chance.

     Cost-sensitive measures exist here as well ("profit" or "utility" based rule mining), e.g. Kitts et al., 2000:

     ExpectedProfit(X ⊆ I ⇒ Y ⊆ I, T) = Confidence(X ⇒ Y, T) × Σ_i Profit(Y_i)
     IncrementalProfit(X ⊆ I ⇒ Y ⊆ I, T) = (Confidence(X ⇒ Y, T) − P(Y)) × Σ_i Profit(Y_i)
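
Sketches of lift and conviction on top of the helpers above (illustrative only; the function names are mine, not from the slides):

    def lift(antecedent, consequent, transactions):
        """Confidence relative to what we'd expect if antecedent and consequent were independent."""
        return (support(antecedent | consequent, transactions)
                / (support(antecedent, transactions) * support(consequent, transactions)))

    def conviction(antecedent, consequent, transactions):
        """How much more often the rule would be wrong if X and Y were only related by chance."""
        conf = confidence(antecedent, consequent, transactions)
        if conf == 1.0:
            return float("inf")  # the rule is never wrong on this data
        return (1 - support(consequent, transactions)) / (1 - conf)

    print(lift({"cheese", "wine"}, {"spaghetti"}, transactions))        # ~1.11
    print(conviction({"cheese", "wine"}, {"spaghetti"}, transactions))  # 1.2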

  13. The apriori algorithm
     Algorithm:
     1. A minimum support threshold is applied to find all frequent itemsets
     2. A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules
     Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets. The set of possible itemsets is the power set over I and has size 2^|I| − 1 (excluding the empty set, which is not a valid itemset).
     Although the size of the power set grows exponentially in the number of items |I|, efficient search is possible using the downward-closure property of support (or: anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and thus that for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms can find all frequent itemsets.
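
The two steps map directly onto off-the-shelf tooling. A minimal sketch assuming the mlxtend Python library (an assumption; the slides do not name a library) and the toy database from above:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoding of the transactional database shown earlier
    df = pd.DataFrame(
        [[1, 1, 1, 0, 0, 0],
         [0, 1, 1, 1, 1, 0],
         [1, 1, 0, 1, 0, 1],
         [0, 0, 0, 1, 1, 1],
         [1, 1, 0, 1, 1, 1]],
        columns=["milk", "bread", "beer", "cheese", "wine", "spaghetti"],
    ).astype(bool)

    # Step 1: frequent itemsets with support >= 50%
    frequent = apriori(df, min_support=0.5, use_colnames=True)
    # Step 2: association rules with confidence >= 60%
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])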

  14. The apriori algorithm
     A minimum support threshold is applied to find all frequent itemsets, e.g. all itemsets with support >= 50% in the transactional database above:

     Itemset ('milk',) has a support of: 0.6
     Itemset ('bread',) has a support of: 0.8
     Itemset ('cheese',) has a support of: 0.8
     Itemset ('wine',) has a support of: 0.6
     Itemset ('spaghetti',) has a support of: 0.6
     Itemset ('milk', 'bread') has a support of: 0.6
     Itemset ('bread', 'cheese') has a support of: 0.6
     Itemset ('cheese', 'wine') has a support of: 0.6
     Itemset ('cheese', 'spaghetti') has a support of: 0.6

     ... but brute force enumeration leads to 2^|I| = 2^6 = 64 possibilities. For 100 items, just the itemsets with at most six items would already give 1 271 427 896 possibilities.
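
The brute-force enumeration behind this listing can be sketched as follows (reusing the transactions and support helpers defined earlier):

    from itertools import combinations

    items = sorted({item for t in transactions.values() for item in t})
    for size in range(1, len(items) + 1):
        # Enumerate all 2**6 - 1 = 63 non-empty itemsets and keep the frequent ones
        for candidate in combinations(items, size):
            s = support(set(candidate), transactions)
            if s >= 0.5:
                print(f"Itemset {candidate} has a support of: {s}")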

  15. The apriori algorithm
     A minimum support threshold is applied to find all frequent itemsets.
     Speeding this up with a "step by step" expansion: we only need to continue expanding itemsets that are above the threshold, and only with items that are themselves above the threshold.
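
A naive level-wise sketch of that idea (illustrative; it reuses the transactions and support helpers defined earlier):

    def levelwise_frequent(transactions, minsup=0.5):
        """Grow itemsets level by level, extending frequent itemsets only with frequent single items."""
        all_items = {item for t in transactions.values() for item in t}
        frequent_items = {i for i in all_items if support({i}, transactions) >= minsup}
        current = {frozenset([i]) for i in frequent_items}
        frequent = set(current)
        while current:
            larger = set()
            for itemset in current:
                for item in frequent_items - itemset:
                    candidate = itemset | {item}
                    if support(candidate, transactions) >= minsup:
                        larger.add(candidate)
            frequent |= larger
            current = larger
        return frequent

    print(levelwise_frequent(transactions))  # the five frequent items plus the four frequent pairs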

  16. The apriori algorithm
     A minimum support threshold is applied to find all frequent itemsets. "Join and prune" is an even better way (proposed by apriori). Say we want to generate candidate 3-itemsets (sets with three items):
     Look at the previous level: we only need the (3-1)-itemsets to do so, and only the ones which had enough support: {milk, bread}, {bread, cheese}, {cheese, wine}, {cheese, spaghetti}.
     Join on self: join this set of itemsets with itself to generate the list of candidates of length 3, e.g. {milk, bread} x {bread, cheese} = {milk, bread, cheese}; this yields {milk, bread, cheese}, {bread, cheese, wine}, {bread, cheese, spaghetti}, {cheese, wine, spaghetti}.
     Prune the result: prune the candidates containing a (3-1)-subset that did not have enough support (all four candidates can be pruned in this case).
     This is repeated for every step.
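
A compact sketch of this candidate-generation step (the function name apriori_gen is mine; the logic follows the join and prune steps above):

    from itertools import combinations

    def apriori_gen(frequent_prev, k):
        """Join frequent (k-1)-itemsets, then prune candidates with an infrequent (k-1)-subset."""
        # Join step: unions of two frequent (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in frequent_prev for b in frequent_prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a surviving candidate must itself be frequent
        return {c for c in candidates
                if all(frozenset(sub) in frequent_prev for sub in combinations(c, k - 1))}

    frequent_2 = {frozenset(s) for s in ({"milk", "bread"}, {"bread", "cheese"},
                                         {"cheese", "wine"}, {"cheese", "spaghetti"})}
    print(apriori_gen(frequent_2, 3))  # set(): all four 3-item candidates are pruned, as on the slide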

  17. The apriori algorithm

  18. The apriori algorithm
     A minimum confidence threshold is applied to these frequent itemsets in order to form frequent association rules. Once the frequent itemsets are obtained, association rules are generated as follows:
     For each frequent itemset I, generate all non-empty proper subsets of I. For every such subset I_s, check the confidence of the rule I_s ⇒ I ∖ I_s and retain those above a threshold:

     ∀ I_s ∈ P(I) : I_s ≠ ∅ ∧ I_s ≠ I ∧ Confidence(I_s ⇒ I ∖ I_s) > minconf → retain I_s ⇒ I ∖ I_s

     E.g. for the frequent itemset {cheese, wine, spaghetti}, we'd check:
     {cheese, wine} → {spaghetti}
     {cheese, spaghetti} → {wine}
     {wine, spaghetti} → {cheese}
     {cheese} → {wine, spaghetti}
     {spaghetti} → {cheese, wine}
     {wine} → {cheese, spaghetti}
     ... and keep those with sufficient confidence.
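
A sketch of this rule-generation loop on the toy data (reusing the confidence helper above; rules_from_itemset is an illustrative name, not from the slides):

    from itertools import combinations

    def rules_from_itemset(itemset, transactions, minconf=0.6):
        """Try every non-empty proper subset of a frequent itemset as a rule antecedent."""
        rules = []
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                antecedent = set(antecedent)
                consequent = set(itemset) - antecedent
                if confidence(antecedent, consequent, transactions) >= minconf:
                    rules.append((antecedent, consequent))
        return rules

    print(rules_from_itemset({"cheese", "wine", "spaghetti"}, transactions))
    # e.g. ({'cheese', 'wine'}, {'spaghetti'}) is kept with confidence ~0.67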

  19. Extensions
