Data Mining
Bob Stine
Department of Statistics
www-stat.wharton.upenn.edu/~bob

Overview
- Applications
  - Marketing: direct mail advertising (Zahavi example)
  - Biomedical: finding predictive risk factors
  - Financial: predicting returns and bankruptcy
- Role of management
  - Setting goals
  - Coordinating players
- Critical stages of the modeling process
  - Picking the model  <-- my research interest
  - Validation

Predicting Health Risk
- Who is at risk for a disease?
  - Costs
    • False positive: treat a healthy person
    • False negative: miss a person with the disease
  - Example: detect osteoporosis without need for an x-ray
- What sort of predictors, at what cost?
  - Very expensive: laboratory measurements, "genetic"
  - Expensive: doctor-reported clinical observations
  - Cheap: self-reported behavior
- Missing data
  - Always present
  - Are records with missing data like those that are not missing?

Predicting Stock Market Returns
- Predicting returns on the S&P 500 index
  - Extrapolate recent history
  - Exogenous factors
- What would distinguish a good model?
  - Highly statistically significant predictors
  - Reproduces the pattern in observed history
  - Extrapolates better than guessing or hunches
- Validation
  - A test of the model yields sobering insight
Predicting the Market
- Build a regression model
  - Response is the return on the value-weighted S&P
  - Use standard forward/backward stepwise selection
  - Battery of 12 predictors
- Train the model during 1992-1996
  - Model captures most of the variation in 5 years of returns
  - Retain only the most significant features (Bonferroni; sketched in code after these slides)
- Predict what happens in 1997
- A related version appears in Foster, Stine & Waterman

Historical patterns?
[Figure: value-weighted S&P return (vwReturn) by year, 1992-1998, with a "?" over the forecast period]

Fitted model predicts...
[Figure: fitted and predicted returns by year, 1992-1998, with the 1992-1996 training period marked]

What happened?
[Figure: prediction error (Pred Error) by year, 1992-1998; annotation: "Exceptional Feb return?"]
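The modeling recipe above (forward stepwise regression over a small battery of predictors, retaining only terms that survive a Bonferroni-adjusted t-test) can be sketched in a few lines. This is a minimal illustration of my own on simulated pure-noise "returns", not the talk's data or code; the sample sizes and the 5% significance level are assumptions. With noise predictors the Bonferroni cutoff should usually stop the search almost immediately.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n_train, p = 60, 12                              # 60 training months, 12 candidate predictors
X = rng.normal(size=(n_train, p))                # simulated predictor battery
y = 0.01 + 0.04 * rng.normal(size=n_train)       # simulated monthly "returns" (pure noise)

def added_term_t(selected, j):
    """|t|-statistic and residual df for predictor j added to the current model."""
    cols = np.column_stack([np.ones(n_train)] + [X[:, k] for k in selected + [j]])
    beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
    resid = y - cols @ beta
    df = n_train - cols.shape[1]
    se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(cols.T @ cols)))
    return abs(beta[-1] / se[-1]), df

selected = []
while len(selected) < p:
    remaining = [j for j in range(p) if j not in selected]
    stats = [added_term_t(selected, j) for j in remaining]
    best = int(np.argmax([t for t, _ in stats]))
    best_t, df = stats[best]
    cutoff = t_dist.ppf(1 - 0.05 / (2 * p), df)  # Bonferroni: split alpha over all p candidates
    if best_t < cutoff:
        break                                    # nothing left clears the adjusted threshold
    selected.append(remaining[best])

print("predictors surviving the Bonferroni cutoff:", selected)
```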
Claimed versus Actual Error
[Figure: squared prediction error versus complexity of model; the "Actual" error sits far above the "Claimed" error as the model grows]

Over-confidence?
- Over-fitting
  - The DM model fits the training data too well, better than it can predict when extrapolated to the future (a small simulation follows these slides)
  - Greedy model-fitting procedure: "Optimization capitalizes on chance"
- Some intuition for the phenomenon
  - Coincidences
    • Cancer clusters, the "birthday problem"
  - Illustration with an auction
    • What is the value of the coins in this jar?

Auctions and Over-fitting
- Auction a jar of coins to a class of students
- Histogram shows the bids of 30 students
- Some were suspicious, but a few were not!
- Actual value is $3.85
- Known as the "Winner's Curse"
- Similar to over-fitting: the best model is like the high bidder.
  Given time, you can always find a good fit.
[Figure: histogram of student bids]

Roles of Management
Management determines whether a project succeeds...
- Whose data is it?
  - Ownership and shared obligations/rewards
- Irrational expectations
  - Budgeting credit: "How could you miss?"
- Moving targets
  - Energy policy: "You've got the old model."
- Lack of honest verification
  - Stock example...
  - Rx marketing: "They did well on this question."
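To make the "Claimed versus Actual Error" chart concrete, here is a tiny simulation of my own (not the talk's data): even without a greedy search, fitting more and more pure-noise predictors drives the in-sample ("claimed") error down while the error on fresh data ("actual") climbs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 90
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)     # pure-noise response

for k in (1, 10, 30, 60, 90):
    cols_train = np.column_stack([np.ones(n), X_train[:, :k]])
    cols_test = np.column_stack([np.ones(n), X_test[:, :k]])
    beta, *_ = np.linalg.lstsq(cols_train, y_train, rcond=None)
    claimed = np.mean((y_train - cols_train @ beta) ** 2)     # error on the training data
    actual = np.mean((y_test - cols_test @ beta) ** 2)        # error on new data
    print(f"{k:2d} predictors   claimed {claimed:5.2f}   actual {actual:5.2f}")
```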
What are the costs?
- Symmetry of mistakes?
  - Is over-predicting as costly as under-predicting?
  - Managing inventories and sales
  - Visible costs versus hidden costs
- Does a false positive = a false negative?
  - Classification
    • Credit modeling, flagging "risky" customers
  - Differential costs for different types of errors (see the cost sketch after these slides)
    • False positive: call a good customer "bad"
    • False negative: fail to identify a "bad"

Back to a real application...
How can we avoid some of these problems? I'll focus on
  * the statistical modeling aspects (my research interest), and also
  * reinforcing the business environment.

Predicting Bankruptcy
- "Needle in a haystack"
  - 3,000,000 months of credit-card activity
  - 2,244 bankruptcies
  - The best customers resemble the worst customers
- What factors anticipate bankruptcy?
  - Spending patterns? Payment history?
  - Demographics? Missing data?
  - Combinations of factors?
    • Cash advance + Las Vegas = problem
- We consider more than 100,000 predictors!

Stages in Modeling
Having framed the problem and gotten relevant data...
- Build the model
  - Identify patterns that predict future observations.
- Evaluate the model
  - When can you tell whether it's going to succeed?
    • During the model construction phase: only incorporate meaningful features
    • After the model is built: validate by predicting new observations
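Tying the "What are the costs?" slide to the bankruptcy application: a hedged sketch of how asymmetric error costs move the best classification cutoff. The dollar costs, base rate, and toy risk score below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
bad = rng.random(n) < 0.05                                   # 5% truly "bad" customers
score = np.clip(bad * 0.3 + rng.random(n) * 0.5, 0, 1)       # toy risk score, higher for "bad"

cost_fp, cost_fn = 50.0, 1000.0                              # assumed cost of each kind of mistake
for cutoff in (0.1, 0.3, 0.5, 0.7):
    flagged = score >= cutoff
    fp = int(np.sum(flagged & ~bad))                         # good customers called "bad"
    fn = int(np.sum(~flagged & bad))                         # "bad" customers missed
    total = fp * cost_fp + fn * cost_fn
    print(f"cutoff {cutoff:.1f}:  {fp:5d} false positives, {fn:4d} false negatives, cost ${total:,.0f}")
```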
Building a Predictive Model
So many choices...
- Structure: what type of model?
  • Neural net (projection pursuit)
  • CART, classification tree
  • Additive model or regression spline (MARS)
- Identification: which features to use?
  • Time lags, "natural" transformations
  • Combinations of other features
- Search: how does one find these features?
  • Brute force has become cheap.
  • A good choice affects where you search next.

My Choices
- Simple structure
  - Linear regression, with nonlinearity via interactions
  - All 2-way and many 3-way and 4-way interactions
- Rigorous identification
  - Conservative standard error
  - Comparison of a conservative t-ratio to an adaptive threshold
- Greedy search
  - Forward stepwise regression
  - Coming: dynamically changing list of features

Bankruptcy Model: Construction
- Context
  - Identify current customers who might declare bankruptcy
- Split the data to allow validation and comparison
  - Training data: 600,000 months with 450 bankruptcies
  - Validation data: 2,400,000 months with 1,786 bankruptcies
- Selection via adaptive thresholding
  - Analogy: compare the sequence of t-statistics to sqrt(2 log(p/q)) (sketched in code after these slides)
  - Dynamic expansion of the feature space

Bankruptcy Model: Fitting
- Where should the fitting process be stopped?
[Figure: residual sum of squares (SS) versus number of predictors]
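A minimal sketch of the adaptive-thresholding rule quoted on the construction slide: accept the q-th feature entering the model only if its |t|-statistic exceeds sqrt(2 log(p/q)), where p is the size of the candidate feature space. The t-values below are invented, and the real procedure also expands the feature space dynamically, which this sketch ignores.

```python
import numpy as np

p = 100_000                                      # size of the candidate feature space
abs_t = [9.2, 7.5, 6.1, 4.8, 3.9, 3.2]           # made-up |t|-statistics, in entry order

kept = 0
for q, t in enumerate(abs_t, start=1):
    threshold = np.sqrt(2 * np.log(p / q))       # adaptive cutoff for the q-th accepted term
    if t > threshold:
        kept = q
        print(f"term {q}: |t| = {t:.1f} >  {threshold:.2f}  -> keep")
    else:
        print(f"term {q}: |t| = {t:.1f} <= {threshold:.2f}  -> stop")
        break

print(f"adaptive threshold keeps {kept} predictors")
```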
Bankruptcy Model: Fitting
- Our adaptive selection procedure stops at a model with 39 predictors.
[Figure: residual sum of squares (SS) versus number of predictors]

Bankruptcy Model: Validation
- The validation indicates that the fit gets better while the model expands; it avoids over-fitting.
[Figure: validation sum of squares (SS) versus number of predictors]

Lift Chart
- Measures how well the model classifies the sought-for group

    Lift = (% bankrupt in DM selection) / (% bankrupt in all data)

- Depends on the rule used to label customers (a lift computation is sketched after these slides)
  - Very high probability of bankruptcy: lots of lift, but few bankrupt customers are found.
  - Lower rule: lift drops, but finds more bankrupt customers.
- Tie to the economics of the problem
  - The slope gives you the trade-off point.

Example: Lift Chart
[Figure: % responders versus % chosen, comparing the model to random selection]
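A hedged sketch of the lift calculation defined above: rank customers by the model's score, take the top slice, and compare the bankruptcy rate in that slice to the overall rate. Scores and outcomes are simulated here; in the application they would come from the fitted model and the validation data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
bankrupt = rng.random(n) < 0.02                     # ~2% base rate of bankruptcy
score = bankrupt * rng.random(n) + rng.random(n)    # noisy score, higher for bankrupts

order = np.argsort(-score)                          # riskiest customers first
base_rate = bankrupt.mean()
for pct in (1, 5, 10, 25, 50):
    top = order[: n * pct // 100]
    lift = bankrupt[top].mean() / base_rate
    found = bankrupt[top].sum() / bankrupt.sum()    # share of all bankrupts captured
    print(f"top {pct:2d}% chosen:  lift {lift:4.1f},  {found:5.1%} of bankrupts found")
```

The output reproduces the slide's trade-off: a tight rule gives high lift but finds few bankrupt customers, and loosening it finds more of them at lower lift.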
Bankruptcy Model: Lift
- Much better than the diagonal!
[Figure: % of bankruptcies found versus % of customers contacted]

Calibration
- The classifier assigns a Prob("BR") rating to each customer.
- Weather-forecast analogy: a "2/10 chance of ..." forecast
- Among those classified as "BR", how many are BR?
- Closer to the diagonal is better.

Bankruptcy Model: Calibration
- Over-predicts risk near a claimed probability of 0.3 (a calibration check is sketched after these slides).
[Figure: calibration chart of actual bankruptcy rate versus claimed probability]

Modeling Bankruptcy
- Automatic, adaptive selection
  - Finds patterns that predict new observations
  - Predictive, but not easy to explain
- Dynamic feature set
  - Current research
  - Information theory allows changing the search space
  - Finds more structure than a direct search could find
- Validation
  - Remains essential: only for judging fit, reserve more for modeling
  - Comparison to rival technology (we compared to C4.5)
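A minimal sketch of the calibration check from the slides above: bin customers by the claimed Prob("BR") and compare the claimed probability with the actual bankruptcy rate in each bin; a calibrated model sits on the diagonal. The claimed probabilities and outcomes are simulated so that, as on the slide, the model over-predicts risk.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
claimed = rng.random(n) ** 3                                  # claimed Prob("BR"), skewed low
actual = rng.random(n) < 0.8 * claimed                        # true risk is lower than claimed

edges = np.linspace(0.0, 1.0, 11)
bin_of = np.digitize(claimed, edges) - 1                      # 10 equal-width probability bins
print("claimed   actual       n")
for b in range(10):
    in_bin = bin_of == b
    if in_bin.any():
        print(f"  {claimed[in_bin].mean():.2f}     {actual[in_bin].mean():.2f}   {in_bin.sum():6d}")
```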
Data Mining Wrap-Up
- Data, data, data
  - Often the most time-consuming steps
    • Cleaning and merging data
  - Without relevant, timely data, there is no chance for success.
- Clear objective
  - Identified in advance
  - Checked along the way, with "honest" methods
- Rewards
  - Who benefits from success?
  - Who suffers if it fails?