
Wharton

Department of Statistics

Data Mining

Bob Stine, Department of Statistics, www-stat.wharton.upenn.edu/~bob


Overview

Applications
  • Marketing: direct mail advertising (Zahavi example)
  • Biomedical: finding predictive risk factors
  • Financial: predicting returns and bankruptcy

Role of management
  • Setting goals
  • Coordinating players

Critical stages of the modeling process
  • Picking the model  <-- my research interest
  • Validation


Predicting Health Risk

Who is at risk for a disease?
  • Costs
  • False positive: treat a healthy person
  • False negative: miss a person with the disease
  • Example: detect osteoporosis without the need for an x-ray

What sort of predictors, at what cost?
  • Very expensive: laboratory measurements, "genetic"
  • Expensive: doctor-reported clinical observations
  • Cheap: self-reported behavior

Missing data
  • Always present
  • Are records with missing data like those that are complete?


Predicting Stock Market Returns

Predicting returns on the S&P 500 index
  • Extrapolate recent history
  • Exogenous factors

What would distinguish a good model?
  • Highly statistically significant predictors
  • Reproduces patterns in the observed history
  • Extrapolates better than guessing or hunches

Validation
  • Testing the model yields a sobering insight

Predicting the Market

Build a regression model
  • Response is the return on the value-weighted S&P
  • Use standard forward/backward stepwise selection
  • Battery of 12 predictors

Train the model during 1992-1996
  • Model captures most of the variation in 5 years of returns
  • Retain only the most significant features (Bonferroni)

Predict what happens in 1997
  • Another version appears in Foster, Stine & Waterman
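A minimal sketch of this kind of training setup: forward stepwise regression in which a candidate predictor enters only if its t-ratio clears a Bonferroni-style cutoff. The data are simulated (a "battery of 12 predictors" with one real signal), and the function names and settings are mine, not the talk's.

```python
import numpy as np
from statistics import NormalDist

def t_stats(X, y):
    # OLS t-statistics for every column of X (X includes the intercept)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

def forward_stepwise(X, y, alpha=0.05):
    """Greedy forward selection: the best remaining predictor enters
    only if its |t| exceeds the Bonferroni cutoff z_{alpha/(2p)}, so
    that under the null the chance of any noise predictor entering
    stays near alpha."""
    n, p = X.shape
    cutoff = NormalDist().inv_cdf(1 - alpha / (2 * p))
    chosen = []
    while True:
        best, best_t = None, 0.0
        for j in range(p):
            if j in chosen:
                continue
            cols = np.column_stack([np.ones(n)] + [X[:, i] for i in chosen + [j]])
            t = t_stats(cols, y)[-1]       # t-ratio of the candidate
            if abs(t) > abs(best_t):
                best, best_t = j, t
        if best is None or abs(best_t) < cutoff:
            return chosen
        chosen.append(best)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 12))          # battery of 12 candidate predictors
y = 2 * X[:, 0] + rng.standard_normal(300)  # only the first one matters
print(forward_stepwise(X, y))               # the true predictor (index 0) should enter
```

The Bonferroni cutoff is what keeps a greedy search from stuffing the model with noise, which is exactly the failure mode the next slides illustrate.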


Historical patterns?

[Chart: vwReturn (monthly value-weighted S&P return) plotted over 1992-1998, ending with a "?" for the forecast period]


Fitted model predicts...

[Chart: fitted values and predictions over 1992-1998; annotation: "Exceptional Feb return?"]


What happened?

[Chart: prediction error over 1992-1998, with the training period 1992-1996 marked]


Claimed versus Actual Error

[Chart: claimed versus actual squared prediction error as a function of model complexity]


Over-confidence?

Over-fitting
  • The DM model fits the training data too well: better than it can predict when extrapolated to the future.
  • Greedy model-fitting procedure: "Optimization capitalizes on chance"

Some intuition for the phenomenon
  • Coincidences: cancer clusters, the "birthday problem"
  • Illustration with an auction: what is the value of the coins in this jar?
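"Optimization capitalizes on chance" is easy to demonstrate: regress a pure-noise response on many pure-noise predictors and the in-sample fit looks far better than the model's actual predictive ability. A small simulation (the data and sizes are entirely made up, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 25
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)                  # pure noise: there is nothing to predict

# Fit an over-large regression: 25 "predictors" of nothing
Xi = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
train_mse = np.mean((y - Xi @ beta) ** 2)   # flattering in-sample error

# Apply the same coefficients to fresh noise
m = 200
X2 = rng.standard_normal((m, k))
y2 = rng.standard_normal(m)
X2i = np.column_stack([np.ones(m), X2])
new_mse = np.mean((y2 - X2i @ beta) ** 2)

print(f"training MSE {train_mse:.2f}, out-of-sample MSE {new_mse:.2f}")
```

The training error is small by construction (the fit absorbs most of the noise), while the out-of-sample error is larger than even a constant prediction would give: the in-sample fit is the high bid in the auction.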


Auctions and Over-fitting

  • Auction a jar of coins to a class of students
  • Histogram shows the bids of 30 students
  • Some were suspicious, but a few were not!
  • Actual value is $3.85
  • Known as the "Winner's Curse"
  • Similar to over-fitting: the best model is like the high bidder

[Histogram: bid amounts, $1 to $9]


Roles of Management

Management determines whether a project succeeds...

Whose data is it?
  • Ownership and shared obligations/rewards

Irrational expectations
  • Budgeting credit: "How could you miss?"

Moving targets
  • Energy policy: "You've got the old model."

Lack of honest verification
  • Stock example... Given time, one can always find a good fit.
  • Rx marketing: "They did well on this question."

What are the costs?

Symmetry of mistakes?
  • Is over-predicting as costly as under-predicting?
  • Managing inventories and sales
  • Visible costs versus hidden costs

Does a false positive = a false negative?
  • Classification: credit modeling, flagging "risky" customers
  • Differential costs for different types of errors
  • False positive: call a good customer "bad"
  • False negative: fail to identify a "bad"
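Differential costs translate directly into a decision threshold: flag a customer when the expected loss of not flagging (missing a "bad") exceeds the expected loss of flagging (insulting a "good"), i.e. when Prob("bad") > c_FP / (c_FP + c_FN). A sketch of this standard decision rule; the cost numbers are invented for illustration:

```python
def flag_customer(p_bad, cost_fp, cost_fn):
    """Flag when the expected cost of missing a 'bad' customer,
    p * cost_fn, exceeds the expected cost of a false alarm,
    (1 - p) * cost_fp; this simplifies to the threshold below."""
    return p_bad > cost_fp / (cost_fp + cost_fn)

# Equal costs: flag only above a 50% chance.
print(flag_customer(0.30, cost_fp=1, cost_fn=1))   # False
# Missing a bankruptcy costs 9x a false alarm: flag above 10%.
print(flag_customer(0.30, cost_fp=1, cost_fn=9))   # True
```

With symmetric costs a 30%-risk customer is left alone; once a false negative is far more expensive, the same customer gets flagged.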


Back to a real application…

How can we avoid some of these problems? I'll focus on
  • the statistical modeling aspects (my research interest), and
  • reinforcing the business environment.


Predicting Bankruptcy

"Needle in a haystack"
  • 3,000,000 months of credit-card activity
  • 2,244 bankruptcies
  • Best customers resemble worst customers

What factors anticipate bankruptcy?
  • Spending patterns? Payment history?
  • Demographics? Missing data?
  • Combinations of factors? Cash advance + Las Vegas = problem

We consider more than 100,000 predictors!


Stages in Modeling

Having framed the problem and gotten relevant data...

Build the model
  • Identify patterns that predict future observations.

Evaluate the model
  • When can you tell if it's going to succeed?
  • During the model-construction phase: only incorporate meaningful features
  • After the model is built: validate by predicting new observations

Building a Predictive Model

So many choices…

Structure: What type of model?
  • Neural net (projection pursuit)
  • CART, classification tree
  • Additive model or regression spline (MARS)

Identification: Which features to use?
  • Time lags, "natural" transformations
  • Combinations of other features

Search: How does one find these features?
  • Brute force has become cheap.


My Choices

Simple structure
  • Linear regression, with nonlinearity via interactions
  • All 2-way and many 3-way, 4-way interactions

Rigorous identification
  • Conservative standard error
  • Comparison of a conservative t-ratio to an adaptive threshold

Greedy search
  • Forward stepwise regression
  • Coming: dynamically changing list of features
  • A good choice affects where you search next.


Bankruptcy Model: Construction

Context
  • Identify current customers who might declare bankruptcy

Split data to allow validation and comparison
  • Training data: 600,000 months with 450 bankruptcies
  • Validation data: 2,400,000 months with 1,786 bankruptcies

Selection via adaptive thresholding
  • Analogy: compare the sequence of t-statistics to sqrt(2 log(p/q))
  • Dynamic expansion of the feature space
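The adaptive threshold sqrt(2 log(p/q)) is easy to compute; on one common reading of the slide's notation, p is the number of candidate features and q indexes how many have already been admitted, so the bar drops as features earn their way in. The function below is my sketch of that rule, not code from the talk:

```python
import math

def adaptive_threshold(p, q):
    """Cutoff for the q-th feature admitted out of p candidates:
    sqrt(2 log(p/q)). The first feature faces the highest bar;
    each admission lowers the bar for the next."""
    return math.sqrt(2 * math.log(p / q))

# With 100,000 candidate features, the first entrant must be very
# significant; the hundredth faces a noticeably lower bar.
print(adaptive_threshold(100_000, 1))    # about 4.8
print(adaptive_threshold(100_000, 100))  # about 3.7
```

This is why the procedure can sift 100,000+ predictors without drowning in false discoveries: the threshold adapts to both the size of the search and the evidence found so far.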


Bankruptcy Model: Fitting

Where should the fitting process be stopped?

[Chart: residual sum of squares versus number of predictors]


Bankruptcy Model: Fitting

Our adaptive selection procedure stops at a model with 39 predictors.

[Chart: residual sum of squares versus number of predictors]


Bankruptcy Model: Validation

The validation indicates that the fit gets better as the model expands. Avoids over-fitting.

[Chart: validation sum of squares versus number of predictors]


Lift Chart

Measures how well the model classifies the sought-for group

Depends on the rule used to label customers
  • Very high probability of bankruptcy: lots of lift, but few bankrupt customers are found.
  • Lower rule: lift drops, but finds more bankrupt customers.

Tie to the economics of the problem
  • Slope gives you the trade-off point

Lift = (% bankrupt in DM selection) / (% bankrupt in all data)
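Lift at a given selection fraction is the bankruptcy rate among the customers the model ranks highest, divided by the overall base rate. A small sketch; the scores and labels below are made up for illustration:

```python
def lift(scores, labels, frac):
    """Lift of the model at selection fraction `frac`: the event rate
    among the top-scored frac of cases, divided by the overall rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, int(len(ranked) * frac))
    top_rate = sum(lab for _, lab in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Toy scores: higher score = model thinks "more likely bankrupt"
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]   # 3 bankrupt of 10
print(lift(scores, labels, 0.2))   # top 20% catches 2 of the 3 bankrupts
print(lift(scores, labels, 1.0))   # selecting everyone gives lift 1
```

A random ranking gives lift near 1 at every fraction, which is the diagonal line the lift charts that follow compare against.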


Example: Lift Chart

[Lift chart: % responders versus % chosen, model versus random selection]


Bankruptcy Model: Lift

Much better than the diagonal!

[Lift chart: % of bankruptcies found versus % of customers contacted]


Calibration

The classifier assigns a Prob("BR") rating to each customer, like a weather forecast.

Among those classified as having a 2/10 chance of "BR", how many are BR?

Closer to the diagonal is better.

[Chart: actual versus claimed bankruptcy rates]
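Calibration in this sense can be checked by binning cases on their claimed probability and comparing each bin's actual event rate to the claim; a calibrated model sits on the diagonal. A sketch with made-up forecasts and outcomes:

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=10):
    """Group cases by claimed probability and report the actual
    event rate in each bin (keyed by the bin's midpoint)."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append(y)
    return {(b + 0.5) / n_bins: sum(ys) / len(ys)
            for b, ys in sorted(bins.items())}

# A forecaster who claims "25%" should be right about 1 time in 4.
probs = [0.25] * 10 + [0.75] * 10
outcomes = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(calibration_table(probs, outcomes))
```

Here the "25%" group actually came true 2 times in 10 and the "75%" group 7 times in 10, so this toy forecaster is nearly on the diagonal; a real model, like the bankruptcy model on the next slide, can drift off it in places.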


Bankruptcy Model: Calibration

The model over-predicts risk near a claimed probability of 0.3.

[Calibration chart: actual versus claimed probability]


Modeling Bankruptcy

Automatic, adaptive selection
  • Finds patterns that predict new observations
  • Predictive, but not easy to explain

Dynamic feature set
  • Current research
  • Information theory allows a changing search space
  • Finds more structure than a direct search could find

Validation
  • Remains essential: use it only for judging the fit, and reserve more data for modeling
  • Comparison to rival technology (we compared to C4.5)

Wrap-Up: Data Mining

Data, data, data
  • Often the most time-consuming steps: cleaning and merging data
  • Without relevant, timely data, there is no chance of success.

Clear objective
  • Identified in advance
  • Checked along the way, with "honest" methods

Rewards
  • Who benefits from success?
  • Who suffers if it fails?