Random Forest
Applied Multivariate Statistics – Spring 2012
Overview
• Intuition of Random Forest
• The Random Forest algorithm
• De-correlation gives better accuracy
• Out-of-bag error (OOB error)
• Variable importance
Intuition of Random Forest
[Figure: three example classification trees, splitting on age (young/old), on sex and height (male/female, tall/short), and on work status and height (retired/working, tall/short); leaves are labeled "healthy" or "diseased".]
New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
The Random Forest Algorithm
Differences to a standard tree
• Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
• For each split, consider only m randomly selected variables
• Don't prune
• Fit B trees in this way and aggregate the results by averaging (regression) or majority vote (classification); see the sketch after this list
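A minimal from-scratch sketch of this recipe in R, built on single rpart trees, is shown below. It is for illustration only and simplifies one point: it draws the m variables once per tree, whereas a real Random Forest draws a fresh subset of m variables at every split (use the randomForest package in practice).

```r
# Minimal Random-Forest-style sketch on iris (illustration only)
library(rpart)
set.seed(1)

B <- 100                               # number of trees
m <- 2                                 # variables per tree (simplified: per tree, not per split)
vars <- setdiff(names(iris), "Species")
n <- nrow(iris)

trees <- vector("list", B)
for (b in 1:B) {
  boot  <- sample(n, n, replace = TRUE)   # bootstrap resample of the data
  mvars <- sample(vars, m)                # random subset of m variables
  f <- reformulate(mvars, response = "Species")
  # cp = 0, minsplit = 2: grow a deep tree and don't prune
  trees[[b]] <- rpart(f, data = iris[boot, ],
                      control = rpart.control(cp = 0, minsplit = 2))
}

# Aggregate by majority vote over the B trees
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)              # training accuracy (optimistic)
```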
Why Random Forest works 1/2
Mean squared error = variance + bias²
If trees are sufficiently deep, they have very small bias. How could we improve the variance over that of a single tree?
Why Random Forest works 2/2
Consider the average of B identically distributed trees T_b, each with variance σ² and pairwise correlation ρ:

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]

• The first term decreases if ρ decreases, i.e., if m decreases: de-correlation gives better accuracy
• The second term decreases if the number of trees B increases (irrespective of ρ)
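As a quick numeric illustration of this formula (the values of ρ and B below are chosen arbitrarily, with σ² = 1):

```r
# Variance of the average of B correlated trees: rho*sigma2 + (1-rho)/B * sigma2
var_avg <- function(rho, B, sigma2 = 1) rho * sigma2 + (1 - rho) / B * sigma2

var_avg(rho = 0.5, B = 1)     # 1.000: a single tree
var_avg(rho = 0.5, B = 500)   # 0.501: more trees only remove the second term
var_avg(rho = 0.1, B = 500)   # 0.102: de-correlating (smaller m) lowers the floor
```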
Estimating generalization error: out-of-bag (OOB) error
• Similar to leave-one-out cross-validation, but almost without any additional computational burden
• Each tree is evaluated on its out-of-bag samples: the samples that were not drawn into its bootstrap resample
• The OOB error is a random number, since it is based on random resamples of the data
[Figure: a data set of seven samples (age, height → healthy/diseased) and one bootstrap resample of it; a tree grown on the resample misclassifies one of the four out-of-bag samples, giving an OOB error rate of 1/4 = 0.25]
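With the randomForest package the OOB error comes for free; a minimal example on the built-in iris data:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                     # printed summary includes the OOB estimate of the error rate
rf$err.rate[rf$ntree, "OOB"]  # OOB error after all 500 trees
plot(rf)                      # OOB and per-class errors as a function of the number of trees
```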
Variable importance for variable i using permutations
• Fit m trees, each on its own bootstrap resample; tree j has OOB error e_j on its OOB data
• Permute the values of variable i in the OOB data of tree j and recompute the error, giving p_j
• d_j = p_j − e_j is the increase in OOB error caused by permuting variable i

\[
\bar d = \frac{1}{m}\sum_{j=1}^{m} d_j, \qquad
s_d^2 = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(d_j - \bar d\bigr)^2, \qquad
v_i = \frac{\bar d}{s_d}
\]
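A hand-rolled sketch of this computation, using the per-tree predictions and in-bag records that the randomForest package exposes. perm_importance is a hypothetical helper written only to mirror the formula above; in practice use importance(rf, type = 1), which computes the same quantity efficiently.

```r
library(randomForest)
set.seed(42)

# keep.inbag = TRUE records which samples each tree saw, so we can find its OOB rows
rf <- randomForest(Species ~ ., data = iris, ntree = 200, keep.inbag = TRUE)

# Sketch of permutation importance for a single variable (slow, illustration only)
perm_importance <- function(rf, x, y, var) {
  d <- numeric(rf$ntree)
  # per-tree predictions on the original data: an n x ntree matrix of class labels
  plain <- predict(rf, x, predict.all = TRUE)$individual
  for (j in seq_len(rf$ntree)) {
    oob  <- rf$inbag[, j] == 0                # rows that are OOB for tree j
    e_j  <- mean(plain[oob, j] != y[oob])     # OOB error of tree j
    xp   <- x
    xp[oob, var] <- sample(xp[oob, var])      # permute variable i among the OOB rows
    perm <- predict(rf, xp, predict.all = TRUE)$individual
    p_j  <- mean(perm[oob, j] != y[oob])      # OOB error after permutation
    d[j] <- p_j - e_j                         # increase in error
  }
  mean(d) / sd(d)                             # v_i = d-bar / s_d
}

perm_importance(rf, iris[, 1:4], iris$Species, "Petal.Width")
```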
Trees vs. Random Forest
Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:
+ Smaller prediction variance and therefore usually better generalization performance
+ Easy to tune parameters
- Rather slow
- "Black box": rather hard to get insight into the decision rules
Comparing runtime (just for illustration)
• RF copes with up to "thousands" of variables
• Problematic if there are categorical predictors with many levels (randomForest allows at most 32 levels)
[Figure: runtime of RF vs. a single tree when the first predictor is cut into 15 levels]
RF vs. LDA
Random Forest:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only works on a categorical response
- Needs CV for estimating the prediction error
Concepts to know
• The idea of Random Forest and how it reduces the prediction variance of trees
• OOB error
• Variable importance based on permutation
R functions to know
• randomForest and varImpPlot from the package randomForest
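A minimal usage example (importance = TRUE is needed so that varImpPlot can show the permutation-based measure):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
print(rf)                  # shows the OOB error estimate and the confusion matrix
importance(rf, type = 1)   # permutation importance (mean decrease in accuracy)
varImpPlot(rf)             # plots the variable importance measures
```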