CSEP 546: Data Mining
Instructor: Pedro Domingos

Rule Induction

Program for Today
• Rule induction
  – Propositional
  – First-order
• First project
First Project: Clickstream Mining

Overview
• The Gazelle site
• Data collection
• Data pre-processing
• KDD Cup
• Hints and findings

The Gazelle Site
• Gazelle.com was a legwear and legcare web retailer.
• Soft-launch: Jan 30, 2000
• Hard-launch: Feb 29, 2000, with an Ally McBeal TV ad on the 28th and a strong $10 off promotion
• Training set: 2 months
• Test sets: one month (split into two test sets)
Data Collection
• Site was running Blue Martini’s Customer Interaction System version 2.0
• Data collected includes:
  – Clickstreams
    • Session: date/time, cookie, browser, visit count, referrer
    • Page views: URL, processing time, product, assortment (an assortment is a collection of products, such as back to school)
  – Order information
    • Order header: customer, date/time, discount, tax, shipping
    • Order line: quantity, price, assortment
  – Registration form: questionnaire responses

Data Pre-Processing
• Acxiom enhancements: age, gender, marital status, vehicle type, own/rent home, etc.
• Keynote records (about 250,000) removed. They hit the home page 3 times a minute, 24 hours a day.
• Personal information removed, including: names, addresses, login, credit card, phones, host name/IP, verification question/answer. Cookie and e-mail obfuscated.
• Test users removed based on multiple criteria (e.g., credit card) not available to participants
• Original data and aggregated data (to session level) were provided

KDD Cup Questions
1. Will visitor leave after this page?
2. Which brands will visitor view?
3. Who are the heavy spenders?
4. Insights on Question 1
5. Insights on Question 2

KDD Cup Statistics
• 170 requests for data
• 31 submissions
• 200 person-hours per submission (max 900)
• Teams of 1-13 people (typically 2-3)

Evaluation Criteria
• Accuracy (or score) was measured for the two questions with test sets
• Insight questions judged with help of retail experts from Gazelle and Blue Martini
• Created a list of insights from all participants
  – Each insight was given a weight
  – Each participant was scored on all insights
  – Additional factors: presentation quality, correctness

Algorithms Tried vs Submitted
[Figure: bar chart; y-axis: Entries (0 to 20); series: Tried, Submitted; x-axis: Algorithm (Naïve Bayes, Decision Trees, Nearest Neighbor, Association Rules, Decision Rules, Boosting, Sequence Analysis, Neural Network, Logistic Regression, SVM, Linear Regression, Genetic Programming, Clustering, Bayesian Belief Net, Bagging, Decision Table, Markov Models)]
Decision trees most widely tried and by far the most commonly submitted
Note: statistics from final submitters only
Question: Who Will Leave
• Given set of page views, will visitor view another page on site or leave?
  Hard prediction task because most sessions are of length 1.
  Gains chart for sessions longer than 5 is excellent.
[Figure: Cumulative Gains Chart for Sessions >= 5 Clicks; x-axis: X (0% to 100%); y-axis: % continue (0% to 100%); curves: 1st, 2nd, Random, Optimal. Annotation: The 10% highest scored sessions account for 43% of target. Lift=4.2]

Insight: Who Leaves
• Crawlers, bots, and Gazelle testers
  – Crawlers hitting single pages were 16% of sessions
  – Gazelle testers: distinct patterns, referrer file://c:\...
• Referring sites: mycoupons have long sessions, shopnow.com are prone to exit quickly
• Returning visitors' probability of continuing is double
• View of specific products (Oroblue, Levante) causes abandonment - Actionable
• Replenishment pages discourage customers. 32% leave the site after viewing them - Actionable

Insight: Who Leaves (II)
• Probability of leaving decreases with page views
  Many “discoveries” are simply explained by this. E.g.: “viewing 3 different products implies low abandonment”
• Aggregated training set contains clipped sessions
  Many competitors computed incorrect statistics
[Figure: Abandonment ratio; x-axis: Session length (1 to 49); y-axis: Percent abandonment (0% to 100%); series: Unclipped, Training Set]

Insight: Who Leaves (III)
• People who register see 22.2 pages on average compared to 3.3 (3.7 without crawlers)
• Free Gift and Welcome templates on first three pages encouraged visitors to stay at site
• Long processing time (> 12 seconds) implies high abandonment - Actionable
• Users who spend less time on the first few pages (session time) tend to have longer session lengths

Question: “Heavy” Spenders
• Characterize visitors who spend more than $12 on an average order at the site
• Small dataset of 3,465 purchases / 1,831 customers
• Insight question - no test set
• Submission requirement:
  – Report of up to 1,000 words and 10 graphs
  – Business users should be able to understand report
  – Observations should be correct and interesting
    "average order tax > $2 implies heavy spender" is neither interesting nor actionable

Time is a major factor
[Figure: Total Sales, Discounts, and "Heavy Spenders"; x-axis: Order date (1/27/00 to 3/30/00); left y-axis: $ (0 to 5000); right y-axis: 0% to 100%; series: Percent heavy, Discount, Order amount. Annotations: 1. Soft Launch; 2. Ally McBeal ad & $10 off promotion; 3. Steady state; Discounts greater than order amount (after discount); No data]
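The heavy-spender definition from the question slide (more than $12 on an average order) can be turned into a per-customer label roughly as follows. This is a minimal sketch: the `(customer_id, order_amount)` input layout, the function name, and the toy orders are assumptions made for illustration, not the KDD Cup file format.

```python
from collections import defaultdict

HEAVY_THRESHOLD = 12.0  # dollars; the cutoff stated on the slide


def label_heavy_spenders(orders, threshold=HEAVY_THRESHOLD):
    """Return {customer_id: True/False}, True when the customer's
    average order amount is strictly greater than `threshold`.

    `orders` is an iterable of (customer_id, order_amount) pairs
    (an assumed layout, chosen only for this example).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, amount in orders:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return {c: totals[c] / counts[c] > threshold for c in totals}


# Toy orders, invented for the example.
orders = [("a", 20.0), ("a", 10.0), ("b", 8.0), ("c", 12.0)]
labels = label_heavy_spenders(orders)
print(labels)
```

Customer "c" averages exactly $12 and is therefore not labeled heavy, since the slide says "more than $12".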
Insights (II)
• Factors correlating with heavy purchasers:
  – Not an AOL user (defined by browser) (browser window too small for layout - poor site design)
  – Came to site from print-ad or news, not friends & family (broadcast ads vs. viral marketing)
  – Very high and very low income
  – Older customers (Acxiom)
  – High home market value, owners of luxury vehicles (Acxiom)
  – Geographic: Northeast U.S. states
  – Repeat visitors (four or more times) - loyalty, replenishment
  – Visits to areas of site - personalize differently (lifestyle assortments, leg-care vs. leg-wear)

Insights (III)
Referring site traffic changed dramatically over time.
Graph of relative percentages of top 5 sites
[Figure: Top Referrers; x-axis: Session date (2/2/00 to 3/31/00); left y-axis: Percent of top referrers (0% to 100%); right y-axis: 0 to 6000; series: Fashion Mall, Yahoo, ShopNow, MyCoupons, Winnie-Cooper, Total from top referrers. Annotations: Note spike in traffic; Yahoo searches for THONGS and Companies/Apparel/Lingerie segment]

Insights (IV)
• Referrers - establish ad policy based on conversion rates, not clickthroughs
  – Overall conversion rate: 0.8% (relatively low)
  – MyCoupons had 8.2% conversion rate, but low spenders
  – FashionMall and ShopNow brought 35,000 visitors
    Only 23 purchased (0.07% conversion rate!)
  – What about Winnie-Cooper?
    Winnie Cooper is a 31-year-old guy who wears pantyhose and has a pantyhose site.
    8,700 visitors came from his site (!).
    Actions:
    • Make him a celebrity, interview him about how hard it is for men to buy in stores
    • Personalize for XL sizes

Common Mistakes
• Insights need support
  Rules with high confidence are meaningless when they apply to 4 people
• Dig deeper
  Many “interesting” insights with interesting explanations were simply identifying periods of the site. For example:
  – “93% of people who responded that they are purchasing for others are heavy purchasers.”
    True, but simply identifying people who registered prior to 2/28, before the form was changed.
  – Similarly, "presence of children" (registration form) implies heavy spender.

Example
• Agreeing to get e-mail in registration was claimed to be predictive of heavy spender
• It was mostly an indirect predictor of time (Gazelle changed the default on 2/28 and back on 3/16)
[Figure: Send-email versus heavy-spender; x-axis: date; y-axis: 0% to 100%; series: Percent heavy, Percent e-mail]

Question: Brand View
• Given set of page views, which product brand will visitor view in remainder of the session? (Hanes, Donna Karan, American Essentials, or none)
• Good gains curves for long sessions (lift of 3.9, 3.4, and 1.3 for three brands at 10% of data).
• Referrer URL is great predictor
  – FashionMall, Winnie-Cooper are referrers for Hanes, Donna Karan - different population segments reach these sites
  – MyCoupons, Tripod, DealFinder are referrers for American Essentials - AE contains socks, excellent for coupon users
• Previous views of a product imply later views
• Few realized Donna Karan only available > Feb 26
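One common way to read lift figures such as those quoted for the brand gains curves: rank the cases by predicted score, take the top 10%, and divide the share of all positives captured in that slice by the slice's share of the data. The sketch below follows that reading. The function name `lift_at` and the toy scores are assumptions for illustration; the slides do not say exactly how the Cup computed lift.

```python
def lift_at(scores, targets, fraction=0.10):
    """Lift of the top `fraction` of cases, ranked by descending score.

    lift = (share of all positives found in the top slice)
           / (share of all cases that the slice represents)
    `targets` holds 1 for a positive case and 0 otherwise.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_pos = sum(targets)
    if total_pos == 0:
        raise ValueError("need at least one positive target")
    ranked = sorted(zip(scores, targets), key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of the top slice
    captured = sum(t for _, t in ranked[:k])
    return (captured / total_pos) / (k / len(ranked))


# Toy example, invented: 10 cases, 2 positives, one of them ranked first.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
targets = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
top_decile_lift = lift_at(scores, targets, fraction=0.10)
print(top_decile_lift)
```

A lift of 1 means the model does no better than random ordering at that cutoff, which is why a lift of 1.3 is a much weaker result than 3.9.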
Project
• Implement decision tree learner
• Apply to first question (Who leaves?)
• Improve accuracy by refining data
• Report insights
• Good luck and have fun!
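As a starting point for the decision tree learner, the core step is choosing which attribute to split on. The sketch below uses information gain (ID3-style). That choice is an assumption, since the slides do not prescribe a split criterion, and the attribute names and toy sessions are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in label entropy from splitting `rows` on `attr`.

    `rows` is a list of dicts mapping attribute name to a discrete value.
    """
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attrs):
    """Attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


# Toy sessions (hypothetical attributes and values).
rows = [
    {"returning": "yes", "referrer": "mycoupons"},
    {"returning": "yes", "referrer": "shopnow"},
    {"returning": "no", "referrer": "mycoupons"},
    {"returning": "no", "referrer": "shopnow"},
]
labels = ["stay", "stay", "leave", "leave"]
best = best_attribute(rows, labels, ["returning", "referrer"])
print(best)
```

A full learner would apply `best_attribute` recursively to each branch and stop when a node is pure or too small, which ties back to the earlier warning that rules covering only a handful of people are not trustworthy.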