have you ever
play

Have You Ever Wondered how products are placed in supermarket - PDF document

Have You Ever Wondered how products are placed in supermarket aisles? CS 235: Had your application for a no-interest-for- Introduction to Databases 6-months Titanium credit card rejected? Puzzled over the two-hour phone call to


  1. Have You Ever … � Wondered how products are placed in supermarket aisles? CS 235: � Had your application for a no-interest-for- Introduction to Databases 6-months Titanium credit card rejected? � Puzzled over the two-hour phone call to Svetlozar Nestorov Belize on your phone bill? Lecture Notes #15 � Gazed at the sky and wondered if that bright star is a white dwarf? � Data mining has the answers!!! CS 235: Introduction to Databases 2 What is Data Mining? Data Mining Needs Databases � Finding “interesting” patterns in large � Machine learning and statistics often make amounts of data. the following assumptions: � Data mining encompasses several areas: • small amount of data (or sample) • data fits in main memory • Machine learning (AI) • CPU time is crucial • Statistics � The reality: • Databases • huge amounts data • data on secondary storage • data management (disk I/O) is crucial CS 235: Introduction to Databases 3 CS 235: Introduction to Databases 4 Data Mining Techniques Association-Rule Mining � Classification (supervised learning) � Flagship of data mining with database flavor. • Build and train classifiers (decision trees, neural nets, etc.) � Find correlations among data without � Clustering (unsupervised learning) building a complete predictive or descriptive model. • Partition the data into groups with similar characteristics. � Data-centric approach. � Sequence and stream analysis � Association rule-mining CS 235: Introduction to Databases 5 CS 235: Introduction to Databases 6 1

  2. Market Basket Data Association Rules � Consider supermarket customers. � Find rules of the form: � At the checkout each customer has a • People who buy X tend to buy Y. basket of items. � Find correlation among the contents of baskets. � The model works for many domains: + • Online/offline shopping • Web surfing • Text analysis CS 235: Introduction to Databases 7 CS 235: Introduction to Databases 8 Mythical Association Rule A Lesson in Marketing � Suppose we know that people buy bread and milk frequently. So what? • Stock them together. • Stock them apart. • Run sales on one and up the price of the other. � Amazon’s recommendations are based on association rules. • Order size went up 20% in the first week after recommendations were introduced. CS 235: Introduction to Databases 9 CS 235: Introduction to Databases 10 Schema of Market Basket Data Market Basket Example basketID item � Several models possible depending on the 11111 beer application. � Simplest, most general schema: 11111 chips Baskets(basketID, item) 11111 salsa � Applicable to many different scenarios, 22222 vodka online and offline. 22222 caviar CS 235: Introduction to Databases 11 CS 235: Introduction to Databases 12 2

  3. Support and Confidence Thresholds � Formally, we associate two numbers with � Find association rules with high support every rule: and high confidence. • support � Typically, high support means > 0.1% and • confidence high confidence means > 50%. � Example: Diapers → Beers • Thresholds depend on the application. • Support is the fraction of all baskets that contain both beer and diapers. • Confidence is the fraction of baskets which contain diapers that also contain beers. CS 235: Introduction to Databases 13 CS 235: Introduction to Databases 14 Main Challenge Two-Phase Approach � Too many item combinations: � Phase 1: Find all itemsets with high support. • 100s of thousands of items • millions of transactions • These itemsets are called frequent. � Direct approach too slow: � Phase 2: Construct rules with high confidence. • 100 million baskets, 20 items/basket � The computational cost of phase 1 • 19 billion pairs, 100+ billion triples,… dominates the total cost. � Focus on finding frequent itemsets. CS 235: Introduction to Databases 15 CS 235: Introduction to Databases 16 Find All Frequent Pairs The A-Priori Technique � Write query in SQL: � Key observation: a pair of items is frequent only if each item is frequent. • If {bread, cheese} is frequent then {bread} and {cheese} must be frequent. � Levelwise pruning: • Consider {bread, milk, cheese} only if {bread, milk}, {bread, cheese}, {milk, cheese} are frequent CS 235: Introduction to Databases 17 CS 235: Introduction to Databases 18 3

  4. A-Priori in SQL Extending Association Rules INSERT INTO Baskets1(bid, item) � Causality vs. association SELECT * FROM Baskets • much trickier WHERE item IN ( • hidden variables outside the domain SELECT item � More detailed associations: FROM Baskets • Find items that are bought together GROUP BY item frequently, in a particular region, in a HAVING COUNT(*) >= s particular month. ); • Additional information is already available at � Rewrite join using Basket1 instead of Basket. the data warehouse. CS 235: Introduction to Databases 19 CS 235: Introduction to Databases 20 Need for Data Warehousing Example Data Warehouse Product � Integrated, company-wide view of high-quality Calendar ProductKey (PK) SKU CalendarKey (PK) information. Brand FullDate ... Month � Separation of operational and analytical systems Sales fact table ... and data. CalendarKey (FK) ProductKey (FK) CustomerKey (FK) LocationKey (FK) TransactionID Location Customer LocationKey (PK) CustomerKey (PK) StoreID CustomerID Region Gender City Zip ... ... CS 235: Introduction to Databases 21 CS 235: Introduction to Databases 22 Operational vs. Analytical Data Application vs. Subject Oriented Data Differences Application: Typical Time-Horizon: Days/Months Typical Time-Horizon: Years Health Club Members-Visit Database Subject : Detailed Summarized (and/or Detailed) Health Club Revenue HEALTHCLUBMEMEBRS MembId Name MembLevel DatePayed Current Values over time (Snapshots) REVENUE 111 Joe A 01/01/2000 Rid Date GeneratedBy Amount 222 Sue B 01/01/2000 7235 01/01/2000 NonMember $15 Technical Differences 333 Pat A 01/01/2000 7236 01/01/2000 Member $100 … … … … Can be Updated Read (and Append) Only 7237 01/01/2000 Member $50 Control of Update: Major Issue Control of Update: No Issue 7238 01/01/2000 Member $100 DAILYVISITSFROMNONMEMBERS 7239 02/01/2000 NonMember $10 Trid VisitType VisitDate Small Amounts used in a Process Large Amounts used in a Process 7240 02/01/2000 NonMember $15 11xx22 YP 01/01/2000 Non-Redundant Redundancy not an Issue … … … … 11xx23 NP 02/01/2000 11xx24 YP 02/01/2000 High frequency of Access Low/Modest frequency of Access … … … Purpose Differences MEMBRSHPLEVELS ID Type Fee For “Clerical Community” For “Managerial Community” A Gold $100 Supports Day-to-Day Operations Supports Managerial Needs B Basic $50 Application Oriented Subject Oriented VISITLEVELS ID Type Fee YP With Pool Usage $15 CS 235: Introduction to Databases 23 NP Without Pool Usage CS 235: Introduction to Databases $10 24 4

Recommend


More recommend