Learning Treed Generalized Linear Models

Hugh Chipman, University of Waterloo
Joint work with Ed George (U. Pennsylvania) and Rob McCulloch (U. Chicago)

Papers and software available online: http://www.stats.uwaterloo.ca/~hachipman/

Relevant online papers:
• "Bayesian Treed Generalized Linear Models", by Chipman, George, and McCulloch
• "A Bayesian Treed Model of Online Purchasing Behavior Using In-Store Navigational Clickstream", by Moe, Chipman, George, and McCulloch (for the marketing example)
• Chipman, George, and McCulloch (2002), "Bayesian Treed Models", Machine Learning, 48, 299-320.
Marketing example: On-line retailing (Joint work with Moe, U Texas @ Austin)

• Potential customers visit an online store (website).
• Each person's navigation path generates various customer characteristics (e.g. number of product pages, total number of pages, average time per page, etc.).
• (Binary) response variable: Does the customer buy? ⇒ Classification...
• ... But it's important to understand the factors that lead to buying.
• 23 variables, 34,585 observations (from a 6 month period).
• Only 604 out of 34,585 observations are "buyers" (under 2%).
Basic Tree (e.g. CART, Classification and Regression Trees, Breiman et al. 1984)

[Figure: a small classification tree splitting on total.pages, time.page, and num.past.visits; terminal-node purchase probabilities are 0, 0.003, 0.022, 0.024, 0.090, and 0.147.]

• Small tree for illustration.
• Greedy forward stepwise search, then backwards pruning.
• Usually choose the tree by cross-validation.
• Constant prediction in each terminal node.
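A tree like the one above could be fit with any standard CART implementation. A minimal sketch using scikit-learn (not the software from the talk); the file and column names are hypothetical stand-ins for the clickstream variables described earlier:

```python
# Minimal CART sketch (scikit-learn), not the talk's software.
# "clickstream.csv" and its column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

clicks = pd.read_csv("clickstream.csv")
X = clicks[["total.pages", "time.page", "num.past.visits"]]
y = clicks["buy"]                       # 0/1 purchase indicator

# Greedy growth; the pruning constant would normally be chosen by
# cross-validation, as the slide notes.
tree = DecisionTreeClassifier(min_samples_leaf=200, ccp_alpha=1e-4)
tree.fit(X, y)

print(export_text(tree, feature_names=list(X.columns)))
# Each terminal node reports a constant purchase probability p.
```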
Models in terminal nodes (Chipman, George and McCulloch 2002; Chaudhuri et al. 1994; Grimshaw & Alexander 1998; Jordan and Jacobs 1994)

Linear/generalized linear models in the terminal nodes: in terminal node i,

   E(Y) = g(β_{0i} + β_{1i} X_1 + ... + β_{pi} X_p),   Var(Y) = σ_i²

[Figure: a tree with three terminal nodes, each with its own coefficient vector β_i and variance σ_i², i = 1, 2, 3.]
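To make the structure concrete, here is a sketch of prediction from a two-leaf treed logistic GLM: the tree routes each observation to a terminal node, and that node's own coefficients produce the prediction. The split point and coefficients are made up for illustration, not taken from the fitted model in the talk.

```python
# Illustrative two-leaf treed logit; all numbers below are invented.
import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# beta[node] = (intercept, coefficient for x1, coefficient for x2)
beta = {
    "left":  np.array([-4.0, 0.10, 0.50]),
    "right": np.array([-2.0, 0.30, 0.05]),
}

def predict(x1, x2, total_pages):
    node = "left" if total_pages < 5.5 else "right"   # a single split
    b = beta[node]
    return inv_logit(b[0] + b[1] * x1 + b[2] * x2)

print(predict(x1=3.0, x2=7.0, total_pages=4))   # routed to the left leaf's GLM
```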
Most of the talk is about a (Bayesian) method for fitting treed GLMs. But first, "Why?":

• Flexibility: the adaptive nature of trees + a piecewise linear model.
• Simplicity:
  – It gives smaller trees.
  – The conventional linear model is a special case (single-node tree).
• Easier to rank individuals (no ties in predictions).
• Other enhancements of the Bayesian approach:
  – A probability distribution on the space of trees.
  – Stochastic search for good trees.
Another example: What gets your articles cited?

• McGinnis, Allison, and Long (1982) examined the careers of 557 biochemists.
• Question: what influences later research productivity?
• Response: number of citations in years 8, 9 & 10 after Ph.D.
• Predictors (seven):
  – number of articles in the 3 years before the Ph.D. was awarded
  – Married @ Ph.D.?
  – Age @ Ph.D.
  – Postdoc?
  – Agricultural college?
  – Quality measures of grad/undergrad schools
• A Poisson model seems natural since citations are counts.
The rest of the talk: To do a Bayesian approach, we need

1. Priors: specification can be difficult.
2. Posteriors: calculation involves:
   • Approximations for the posterior in the GLM case.
   • An algorithm to search for high-posterior trees.

The Bayesian approach to CART appears originally in Chipman, George, & McCulloch (1998) and Denison, Mallick, & Smith (1998).

After covering the Bayesian approach, we'll return to the examples. After that I'll mention some recent work on boosting.
Priors

The prior for (Θ, T) is π(Θ, T) = π(Θ | T) π(T).

• The prior on T is specified in terms of a process for growing trees.
• Θ is specified conditional on T, with one parameter vector θ_i per terminal node:
  – θ_i = (μ_i, σ_i) for a regression tree (identifies shifts in μ and σ)
  – θ_i = P(Y = class j) for a classification tree
  – θ_i = (β_i, σ_i) for a generalized linear model (regression coefficients and dispersion)

[Figure: a tree T with terminal-node parameters θ_1, θ_2, θ_3.]
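One common form for the tree-growing prior (the one used in Chipman, George & McCulloch 1998) lets a node at depth d split with probability α(1 + d)^(-β). A short sketch simulating tree sizes from that process; the α and β values below are illustrative, not the talk's settings:

```python
# Sketch: simulate the number of terminal nodes under a tree-growing prior
# in which a node at depth d splits with probability alpha * (1 + d)**(-beta_).
# alpha and beta_ below are illustrative choices.
import random

def grow(depth=0, alpha=0.95, beta_=1.0):
    """Return the number of terminal nodes of one tree drawn from the prior."""
    if random.random() < alpha * (1 + depth) ** (-beta_):
        return grow(depth + 1, alpha, beta_) + grow(depth + 1, alpha, beta_)
    return 1

sizes = [grow() for _ in range(10_000)]
print(sum(sizes) / len(sizes))   # approximate prior mean number of terminal nodes
```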
Priors for θ_i = (β_i, σ_i), conditional on the tree T:

   β_i | σ_i ~ N(0, σ_i² c² I)
   σ_i ~ Inverse-Gamma(ν, λ)

independently across terminal nodes i = 1, ..., b.

A different mean/variance for each component of β is possible, but this form is reasonable if the X's are scaled.

The choice of c is quite important:
• If π(β_i | σ_i) is too informative (c small), you'll shrink the β's too much.
• If π(β_i | σ_i) is too vague (c large), you'll favour the simple tree too much.
• Experiments in the regression case suggest 1 ≤ c ≤ 5 is reasonable. Less clear in the GLM case.
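A sketch of drawing terminal-node parameters from this prior, to see how the scale constant c controls how spread out the coefficients are. The ν, λ, c, and p values, and the particular Inverse-Gamma parametrization, are illustrative assumptions rather than the talk's settings:

```python
# Sketch: draw (beta_i, sigma_i) from the prior above.
# nu, lam, c, p and the Inverse-Gamma parametrization are illustrative.
import numpy as np

rng = np.random.default_rng(0)
nu, lam, c, p = 3.0, 1.0, 2.0, 5

def draw_node_params():
    # sigma^2 ~ Inverse-Gamma(nu, lam): draw a Gamma(nu, rate=lam) and invert it.
    sigma2 = 1.0 / rng.gamma(shape=nu, scale=1.0 / lam)
    beta = rng.normal(loc=0.0, scale=np.sqrt(sigma2) * c, size=p)
    return beta, np.sqrt(sigma2)

for _ in range(3):
    beta, sigma = draw_node_params()
    print(np.round(beta, 2), round(sigma, 2))
```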
Posterior distributions

   P(T, Θ | Y) = L(T, Θ) π(Θ | T) π(T) / P(Y) ∝ L(T, Θ) π(Θ | T) π(T)

where L = likelihood = P(Y | Θ, T).

We would like to integrate out the terminal-node parameters Θ:

   P(T | Y) ∝ π(T) ∫ L(T, Θ) π(Θ | T) dΘ

This tells us which trees are most probable given the data.

Note that since we assume independent observations Y_1, ..., Y_n, these calculations can be done separately in each terminal node.

An analytic solution is available for linear regression with Gaussian errors; an approximation (the Laplace approximation) is necessary for GLMs.
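A minimal sketch of that Laplace step for a single terminal node with a logistic GLM and prior β ~ N(0, c² I) (the binary logit has no dispersion parameter, so only β is integrated out here). The data and the value of c are simulated placeholders; this is not the talk's code:

```python
# Laplace approximation to m = ∫ L(beta) pi(beta) d(beta) in one terminal node.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, c = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.integers(0, 2, size=n)

def neg_log_post(beta):
    eta = X @ beta
    loglik = y @ eta - np.sum(np.log1p(np.exp(eta)))      # Bernoulli log-likelihood
    logprior = -0.5 * beta @ beta / c**2 - 0.5 * p * np.log(2 * np.pi * c**2)
    return -(loglik + logprior)

fit = minimize(neg_log_post, np.zeros(p), method="BFGS")
beta_hat = fit.x

# Hessian of the negative log posterior at the mode: X' W X + I / c^2
mu = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
H = X.T @ (X * (mu * (1 - mu))[:, None]) + np.eye(p) / c**2

# log m ≈ log L(beta_hat) + log pi(beta_hat) + (p/2) log(2 pi) - (1/2) log|H|
log_m = -fit.fun + 0.5 * p * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(H)[1]
print(log_m)
```

Multiplying these per-node integrated likelihoods across terminal nodes and by π(T) gives P(T | Y) up to the normalizing constant.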
Finding T with large posterior probability

• The space of trees is enormous. ⇒ We need to find T's that have large posterior probability without evaluating P(T | Y) for all possible T.
• Greedy search?
• Instead, use the Metropolis-Hastings algorithm to sample from the posterior on trees. ⇒ Stochastic search guided by the posterior.
• P(T | Y) (up to a normalizing constant) can be used both in the Metropolis-Hastings algorithm and to rank all trees sampled so far.
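A skeleton of the Metropolis-Hastings step behind this search. The real sampler proposes grow/prune/change/swap moves on a tree; to keep the sketch short and runnable, propose() and log_post() below act on a toy integer "tree size" state, and everything is illustrative rather than the talk's algorithm:

```python
# Toy Metropolis-Hastings skeleton; log_post stands in for log P(T | Y)
# up to its normalizing constant, and propose() for a symmetric tree move.
import math
import random

def log_post(state):
    return -abs(state - 4)          # toy target favouring moderate "tree sizes"

def propose(state):
    return max(1, state + random.choice([-1, 1]))   # symmetric grow/prune stand-in

def mh_run(n_steps=2000, start=1):
    state, best = start, (log_post(start), start)
    for _ in range(n_steps):
        cand = propose(state)
        if math.log(random.random()) < log_post(cand) - log_post(state):
            state = cand
        best = max(best, (log_post(state), state))
    return best

# Restarting the chain several times helps visit different posterior modes.
print([mh_run(start=s) for s in (1, 1, 1, 1)])
```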
A posterior distribution on trees... This sounds simple, but there are problems.

• Many local maxima - even MH tends to gravitate toward one mode.
  Solution: restart the MH algorithm repeatedly to find different modes.
• The posterior on individual trees is diluted by the prior.
  Example: we can split at X_1 = 1, 2 or at X_2 = 1, 2, ..., 100. The prior mass for any one split on X_2 is 1/50 of the mass for a split on X_1.
  Solution: don't use the posterior to rank individual trees. Either look at the likelihood or sum the posterior over groups of trees.
• A forest of trees: many different trees can fit the same dataset well.
  Solution: techniques to sort through the forest and identify groups of similar and different trees.
Web marketing example: Treed logit

[Figure: the fitted treed logit tree for the clickstream data.]
[Figure: coefficient plot of t(beta) (roughly -20 to 30) for terminal nodes 2, 3, and 5, across the predictors int, past.visits, past.purchases, search.pgs, info.rltd.pgs, uniq.cat.pgs, repeat.cat.pgs, unique.prod.pgs, and repeat.prod.pgs.]