Why data mining?


  1. Why data mining?
     • The world is awash with digital data: trillions of gigabytes and growing.
     • A trillion gigabytes is a zettabyte, or 1 000 000 000 000 000 000 000 bytes.

  2. Why data mining? More and more, businesses and institutions are using data mining to make decisions, classifications, diagnoses, and recommendations that affect our lives.

  3. An example data mining quote, from the NY Times article:
     “We have the capacity to send every customer an ad booklet, specifically designed for them, that says, ‘Here’s everything you bought last week and a coupon for it,’ ” one Target executive told me. “We do that for grocery products all the time.” But for pregnant women, Target’s goal was selling them baby items they didn’t even know they needed yet.

  4. As we discussed, cookies reveal information about you. But how can the pages you’ve visited predict the future?

  5. Data Mining
     • Data mining is the process of looking for patterns in large data sets.
     • There are many different kinds, for many different purposes.
     • We’ll do an in-depth exploration of two of them.

  6. Data mining for classification. Recall our loan application example. [Figure: loan applicants plotted by credit rating.] Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide whether or not to grant a loan.

  7. Data mining for classification
     • In the loan strategy example, we focused on the fairness of different classifiers, but not on how to build one.
     • Today you’ll learn how to build decision tree classifiers for simple data mining scenarios.

  8. A rooted tree in computer science. Before we get to decision trees, we need to define a tree.

  9. A rooted tree in computer science. A tree is a collection of nodes such that:
     • one node is the designated root
     • a node can have zero or more children; a node with zero children is a leaf
     • all non-root nodes have a single parent
     • edges denote parent-child relationships
     • nodes and/or edges may be labeled by data
     [Figure: an example tree with root A and nodes labeled A through O.]
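Since the slides define a rooted tree abstractly, here is a minimal Python sketch of that definition (the class name and the example tree’s shape are illustrative assumptions; the figure’s exact A–O structure isn’t recoverable from this transcript):

```python
# A minimal sketch of a rooted tree: each node has a label and
# a list of children. (Illustrative; not code from the slides.)
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

    def is_leaf(self):
        # A node with zero children is a leaf.
        return not self.children


# An example tree with root A, reusing the slide's labels
# (the parent-child structure shown here is assumed).
root = Node("A", [
    Node("B", [Node("E"), Node("F")]),
    Node("C", [Node("G"), Node("H")]),
    Node("D", [Node("I"), Node("J", [Node("M"), Node("O")]), Node("K"), Node("L")]),
])
```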

  10. A rooted tree in computer science. Trees are often, but not always, drawn with the root on top. [Figure: a phylogenetic tree of ratites and related birds, drawn with the root at the side; its leaves include Casuarius, Dromaius, Apteryx owenii, Apteryx haastii, Apteryx mantelli, Apteryx australis, Aepyornithidae, Struthio, Rhea, Pterocnemia, Megalapteryx, Dinornis, Pachyornis, Emeus, and Anomalopteryx.]

  11. Decision trees: trees whose node labels are attributes and whose edge labels are conditions.

  12. Decision trees: trees whose node labels are attributes and whose edge labels are conditions.
      [Figure: a decision tree for Lyme disease diagnosis. Root: Enzyme Immunoassay. “No” leads to Consider Alternative Diagnosis. “Yes” leads to a Symptom Length node: ≤ 30 days leads to IgM and IgG Western Blot; > 30 days leads to IgG Western Blot only.]

  13. Decision trees: trees whose node labels are attributes and whose edge labels are conditions.
      [Figure: Gerber’s decision tree for a strategic decision. Source: https://gbr.pepperdine.edu/2010/08/how-gerber-used-a-decision-tree-in-strategic-decision-making/]

  14. Back to our example: we may want to make a tree saying when to approve or deny a loan. [Figure: loan applicants plotted by credit rating.] Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide whether or not to grant a loan.

  15. Decision trees: trees whose node labels are attributes and whose edge labels are conditions.
      [Figure: a decision tree for the max-profit loan strategy. The root splits on colour; orange applicants are then split on credit rating (≥ 61: approve; < 61: deny), and blue applicants on credit rating (≥ 50: approve; < 50: deny).]
      (Note that some worthy applicants are denied loans, while other unworthy ones get loans.)

  16. Exercise: construct the decision tree for the “Group Unaware” loan strategy. [Figure: loan applicants plotted by credit rating.] Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide whether or not to grant a loan.

  17. Sample decision tree for the “Group Unaware” strategy
      [Figure: a decision tree with a single split on credit rating, the same threshold for both groups: ≥ 55 approve; < 55 deny.]
      (Note that some worthy applicants are denied loans, while other unworthy ones get loans.)

  18. Building decision trees from training data
      • Should you get an ice cream? You might start out with the following data (Weather and Wallet are the attributes; Ice Cream? is the decision):

        Weather | Wallet | Ice Cream?
        --------|--------|-----------
        Great   | Empty  | No
        Nasty   | Empty  | No
        Great   | Full   | Yes
        Okay    | Full   | Yes
        Nasty   | Full   | No

      • You might build a decision tree that looks like this: the root splits on Wallet. Empty leads to No; Full leads to a Weather node, whose conditions Great, Okay, and Nasty lead to Yes, Yes, and No respectively.
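To make the mapping from table to tree concrete, here is a small Python sketch (not code from the slides; names are illustrative) of the training data and of the slide’s tree written as a function:

```python
# The ice cream training data as a list of rows.
training_data = [
    # (Weather, Wallet, Ice Cream?)
    ("Great", "Empty", "No"),
    ("Nasty", "Empty", "No"),
    ("Great", "Full",  "Yes"),
    ("Okay",  "Full",  "Yes"),
    ("Nasty", "Full",  "No"),
]

def ice_cream(weather, wallet):
    # Root node: split on Wallet.
    if wallet == "Empty":
        return "No"
    # Wallet is Full: split on Weather.
    if weather == "Nasty":
        return "No"
    return "Yes"  # Great or Okay

# The tree classifies every training row correctly:
assert all(ice_cream(w, p) == label for w, p, label in training_data)
```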

  19. Deciding which nodes go where: a decision tree construction algorithm
      • Top-down tree construction.
      • At the start, all examples are at the root.
      • Partition the examples recursively by choosing one attribute each time.
      • In deciding which attribute to split on, one common method is to try to reduce entropy; i.e., each time you split, you should make the resulting groups more homogeneous. The more you reduce entropy, the higher the information gain. (A sketch of this computation follows below.)
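The slides don’t give code for this step; as an illustration, here is a minimal Python sketch of entropy and information gain using the standard Shannon definitions (function names are assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Entropy reduction from splitting rows on one attribute."""
    labels = [row[label_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[label_index])
    split_entropy = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - split_entropy

# Example: compare candidate splits on the ice cream data above.
# gain_weather = information_gain(training_data, 0)
# gain_wallet  = information_gain(training_data, 1)
```

A greedy tree builder computes the gain for each remaining attribute and splits on whichever gains the most.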

  20. This was, of course, a simple example
      • In this example, the algorithm found the tree with the smallest number of nodes.
      • We were given the attributes and conditions.
      • A simplistic notion of entropy worked (a more sophisticated notion of entropy is typically used to determine which attribute to split on).

  21. This was, of course, a simple example
      In more complex examples, like the loan application example:
      • We may not know which conditions or attributes are best to use.
      • The final decision may not be correct in every case (e.g., given two loan applicants with the same colour and credit rating, one may be creditworthy while the other is not).
      • Even if the final decision is always correct, the tree may not be of minimum size.

  22. Coding up a decision tree classifier
      [Figure: a decision tree for the classic weather data. Root: Outlook. “sunny” leads to Humidity (high: No; normal: Yes); “overcast” leads directly to Yes; “rainy” leads to Windy (true: No; false: Yes).]

  23. Coding up a decision tree classifier
      [Figure: the “sunny” and “overcast” branches of the tree, shown alongside the corresponding code.]
      Can you see the relationship between the hierarchical tree structure and the hierarchical nesting of “if” statements?
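The slide’s code itself isn’t preserved in this transcript; the following Python sketch shows what the nesting plausibly looked like for the tree above (variable and value names are assumptions):

```python
# A sketch of the sunny/overcast branches as nested "if" statements
# (the slide's original code isn't preserved; names are assumed).
def classify(outlook, humidity):
    if outlook == "sunny":
        if humidity == "high":
            return "No"
        else:  # humidity == "normal"
            return "Yes"
    elif outlook == "overcast":
        return "Yes"
    # The "rainy" case is handled on the next slide.
```

Each internal node of the tree becomes an `if` on that node’s attribute, and each level of depth in the tree becomes one more level of nesting.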

  24. Coding up a decision tree classifier
      [Figure: the “rainy” branch of the tree: Windy (true: No; false: Yes).]
      Can you extend the code to handle the “rainy” case?
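One possible extension (again a sketch, not the slides’ own answer): add a parameter for Windy and a branch for the rainy case:

```python
# Extending the sketch to cover the rainy branch (assumed names).
def classify(outlook, humidity, windy):
    if outlook == "sunny":
        if humidity == "high":
            return "No"
        else:  # humidity == "normal"
            return "Yes"
    elif outlook == "overcast":
        return "Yes"
    else:  # outlook == "rainy": split on Windy
        return "No" if windy else "Yes"
```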

  25. Greed, for lack of a better word, is good
      • The algorithm that we used to create the decision tree is a greedy algorithm.
      • In a greedy algorithm, you make the choice that’s optimal now and hope that it’s also optimal in the long run. Sometimes it’s best in the long run; sometimes it’s not.
      • In building a decision tree, greedy will not always be optimal, but it’s pretty good, and it’s much faster than an optimal approach.
      • For some problems you can prove that greedy finds the best solution! (A classic example follows below.)
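As an illustration (not from the slides), here is a sketch of a classic greedy algorithm: making change with coin values like 25, 10, 5, and 1, for which always taking the largest coin that still fits provably yields the fewest coins:

```python
# A classic greedy algorithm (illustrative; not from the slides):
# make change using the fewest coins. For denominations like
# (25, 10, 5, 1), the locally best choice is also globally optimal.
def make_change(amount, coins=(25, 10, 5, 1)):
    result = []
    for coin in coins:  # coins listed largest first
        while amount >= coin:
            result.append(coin)  # greedy choice: largest coin that fits
            amount -= coin
    return result

print(make_change(68))  # [25, 25, 10, 5, 1, 1, 1]
```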

  26. Popping back up a level… The second type of data mining that we will look at in detail involves putting similar items together in groups.

  27. What is clustering? Clustering is partitioning a set of items into subgroups so as to ensure certain measures of quality (e.g., “similar” items are grouped together).
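The slides don’t name a specific algorithm at this point; as an illustration, here is a minimal Python sketch of one widely used clustering method, k-means, for points on a number line (all names and the 1-D setting are assumptions):

```python
import random

# A minimal sketch of k-means clustering for 1-D points
# (illustrative; the slides don't specify this algorithm here).
def k_means(points, k, iterations=20):
    centers = random.sample(points, k)  # start from k random points
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

print(k_means([1, 2, 3, 10, 11, 12], k=2))  # e.g. [[1, 2, 3], [10, 11, 12]]
```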

  28. Why cluster? Netflix movie recommendations. The movies recommended to you are based on those that others in your clusters watch or recommend.
      “We used to be more naive. We used to overexploit individual signals,” says Yellin. “If you watched a romantic comedy, years ago we would have overexploited that. The whole top of your screen would be more romantic comedies. Not a lot of variety. And that gets you into a quick cul-de-sac of too much content around one area.”
      https://www.wired.com/2016/03/netflixs-grand-maybe-crazy-plan-conquer-world/

  29. Why cluster? Netflix movie recommendations. A related problem: how do you predict how users will rate a new movie? Netflix ran a competition with a $1 million prize for algorithms that do this well. It provided training data: 100 million ratings generated by over 480 thousand users on over 17 thousand movies. Competitors used clustering (among other techniques) in their solutions.

  30. Why cluster? Breast cancer treatment.

  31. First, let’s define Gene Expression. http://learn.genetics.utah.edu/content/science/expression/
