The (Random) Forest for the (Decision) Trees
William Warfel, Office of Institutional Research, Washington State University
RMAIR 2016. An admission and enrollment predictive tool.
Overview
Goal: To assist administrative planning.
– Decision Tree
– Random Forest
– Data Specification and Needs
– RStudio
– Random Forest Model
– Prediction Results
– Cautions
– Next Steps
This model is used to predict new first-time freshmen enrollment. Other uses can include projection of graduation, retention, yield, etc.
Research Proposal: The Goal
• To accurately predict freshmen enrollment within 2.0%.
• Then, efficiently utilize Admissions efforts to capture the most successful students and retain them.
(Chart: First-Time Freshmen counts, 2013-14 through 2017-18, by stage: Applied, Admitted, Confirmed, Enrolled. For example, Fall 2015: 18428 applied, 14935 admitted, 5062 confirmed, 4220 enrolled; Fall 2016: 21302 applied, 15572 admitted, 4643 confirmed, 3991 enrolled.)
Admission Analysis: The 6 Stages
Stage 1: WSU Prospect
Stage 2: Inquiry
Stage 3: Application
Stage 4: Admission Offer
Stage 5: Confirmation
Stage 6: Enroll
What is a Decision Tree?
• A tree-like graphical modeling tool that maps possible decisions to their possible outcomes.
– Outcomes include chance, utility, cost, enrollment, etc.
• Most commonly used in decision analysis to identify the most likely strategy or end goal.
– Think elections!
• Difficulty:
– Imperfect information.
– Changes to the underlying data.
• E.g., cost of attendance or changes to admission criteria.
• Solution:
– Conditional probability.
• Work backwards!
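A single classification tree like the one described above can be sketched in R with the rpart package (an assumption; the slides do not name the tree-fitting package). The data here are simulated stand-ins for admit records, not the actual WSU extract:

```r
# A single decision tree predicting a hypothetical Enrolled flag from
# Housing and Confirmed status -- simulated data, illustrative only.
library(rpart)
set.seed(1)
n <- 300
housing   <- rbinom(n, 1, 0.4)
confirmed <- rbinom(n, 1, 0.5)
# Outcome loosely tied to confirmation so the tree has something to split on
enrolled  <- rbinom(n, 1, ifelse(confirmed == 1, 0.7, 0.15))
df <- data.frame(Enrolled  = factor(enrolled, labels = c("No", "Yes")),
                 Housing   = factor(housing),
                 Confirmed = factor(confirmed))
fit <- rpart(Enrolled ~ Housing + Confirmed, data = df, method = "class")
print(fit)              # text listing of the splits
# plot(fit); text(fit)  # draw the tree graphically
```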
Decision Tree Examples!
The Decision Tree
(Diagram of a single tree. Root: WSU; path: Apply → Admission Alive → Offer. The Offer node splits to Accept Offer and Reject Offer; Accept Offer leads through Housing to Enroll or Not Enroll; Reject Offer leads to Enroll or Nothing.)
What is a Random Forest?
• An ensemble method for classification and regression that constructs multiple decision trees.
• In essence, bootstrapping (combining) multiple decision trees.
• Random forests correct for a single decision tree's habit of overfitting to the training set.
• Creates more trees:
– Tree bagging (the true technical term).
– Bagging: averaging noisy but unbiased models to create a model with low variance.
• Think Amazon.com!
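The bagging idea above can be shown in a few lines of base R: fit many simple models, each on a bootstrap sample, then majority-vote their predictions. The "model" here is a toy one-rule stump on simulated data, purely to illustrate the averaging:

```r
# Minimal bagging (bootstrap aggregating) sketch: the mechanism behind a
# random forest, with a one-rule stump standing in for a full tree.
set.seed(42)
n <- 200
confirmed <- rbinom(n, 1, 0.5)                        # hypothetical predictor
enrolled  <- rbinom(n, 1, ifelse(confirmed == 1, 0.8, 0.2))

stump_predict <- function(idx, newx) {
  # "Train" on a bootstrap sample: majority outcome within each group
  p1 <- mean(enrolled[idx][confirmed[idx] == 1])
  p0 <- mean(enrolled[idx][confirmed[idx] == 0])
  ifelse(newx == 1, p1 > 0.5, p0 > 0.5)
}

# 25 stumps, each fit on its own bootstrap sample of the rows
votes  <- replicate(25, stump_predict(sample(n, replace = TRUE), confirmed))
bagged <- rowMeans(votes) > 0.5                       # majority vote
mean(bagged == enrolled)                              # bagged accuracy
```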
Random Forest: Amazon!
(Amazon screen shots: recommendations for two different users, me and Becky.)
The Random Forest
(Diagram: the single decision tree from the previous slide, repeated many times across the slide to form a forest.)
Model: Needed Data
The ENTIRE KITCHEN SINK!
• Bring in all the data believed to predict enrollment:
– Ethnicity
– Sex
– Housing
– Freshmen / Transfer Orientation (completed and future attendees)
– Financial Aid (FAFSA & financial aid interest), including scholarships awarded
– Confirmations
– Admission communication
Model: Data Cautions
Look out for:
• Institutional actions that create inconsistencies within the data.
• Administrative or legislative changes:
– An admission waitlist is imposed.
– Changes to the admission criteria.
– Tuition decreases, and at greater rates at other in-state universities.
• Students who cancel housing contracts but remain active applicants.
Why RStudio?
• Freeware
• Pre-loaded packages allow for specialized statistical techniques:
– Random Forest
– Bootstrapping
– LOTS and LOTS more!
• University-wide cross-collaboration
• Forums-on-Forums-on-Forums for help!
Working with Data in RStudio
Working with Data in RStudio
• Import files using CSV format.
• Must specify headers.
• When using variables, everything is case sensitive!
• Partition a data set.
• Warning messages will drive you mad!
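The import-and-partition steps above can be sketched as follows. A temporary CSV with simulated columns stands in for the real admissions extract so the example is self-contained; the file name and fields are placeholders:

```r
# Write a small stand-in CSV, then import it and split 70/30.
set.seed(2016)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Enrolled = rbinom(100, 1, 0.3),
                     Housing  = rbinom(100, 1, 0.4)),
          tmp, row.names = FALSE)

admits <- read.csv(tmp, header = TRUE)   # headers must be specified
# Case matters: admits$Enrolled works; admits$enrolled is NULL

idx   <- sample(nrow(admits), size = 0.7 * nrow(admits))
train <- admits[idx, ]                   # 70% training partition
test  <- admits[-idx, ]                  # 30% hold-out partition
```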
The Random Forest Model
• Three different data sets:
– Train: Fall 2014 admits
– Test: Fall 2015 admits
– Project: Fall 2016 admits
• All data is as of the 6th orientation session (late June / early July).
• Different Random Forest models:
– All Applicants
– All Admitted
– All Confirmed
Random Forest: Output
• Example: enrollment as a function of housing, confirmations, admitted status, Pell eligibility, etc.
• Number of trees: 500 is the default in R.
• The number of variables tried at each split is set by the algorithm to find the best fit within the training data set.
Random Forest: The Out-of-Bag (OOB) Error Rate & Confusion Matrix
• The OOB error (or estimate) is the error rate of the random forest predictor: a method to measure the Random Forest's error of prediction.
• The OOB confusion matrix is obtained from the RF predictor and consists of true positives, true negatives, false positives, and false negatives.
Random Forest: Variable Importance Plot
• Provides the mean decrease in accuracy for each variable.
• In the example provided, attendance code at orientation, along with housing and confirmation status, are the most important variables for prediction.
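The output, OOB error, confusion matrix, and importance plot described on the last three slides come from one model fit. This self-contained sketch uses the randomForest package with simulated stand-in data; variable names are placeholders for the real admit-record fields:

```r
# Fit a random forest and inspect the outputs discussed above.
library(randomForest)
set.seed(2016)
n <- 1000
train <- data.frame(
  Housing   = factor(rbinom(n, 1, 0.4)),
  Confirmed = factor(rbinom(n, 1, 0.5)),
  Pell      = factor(rbinom(n, 1, 0.3))
)
# Simulated outcome loosely tied to confirmation status
p <- ifelse(train$Confirmed == "1", 0.7, 0.1)
train$Enrolled <- factor(rbinom(n, 1, p), labels = c("No", "Yes"))

rf <- randomForest(Enrolled ~ Housing + Confirmed + Pell,
                   data = train,
                   ntree = 500,        # R's default number of trees
                   importance = TRUE)  # track per-variable accuracy loss

print(rf)          # OOB error rate and confusion matrix
rf$confusion       # rows = actual, columns = predicted, plus class error
importance(rf)     # mean decrease in accuracy / Gini per variable
varImpPlot(rf)     # the variable importance plot shown on the slide
```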
Moment of Truth: The Projection Prediction – All Applicants
This Random Forest model predicts that 28.0% of all applicants will enroll Fall 2016: 4225/15090 = 28.0%.
Compared to Fall 2015: 4201/(13247+4201) = 24.08%.
How accurate:
• Fall 2015: 18428 × 24.08% = 4437 vs. actual 4220
• Fall 2016: 21302 × 28.0% ≈ 5965 vs. actual 3991
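The rate arithmetic on this and the following slides can be checked directly in R: a predicted enrollment rate is taken from the model's prediction counts, then applied to a headcount pool. Counts below are from the slides:

```r
# Predicted enroll rate = predicted enrollees / pool at that stage.
rate_f16 <- 4225 / 15090               # Fall 2016 all-applicant rate
round(rate_f16, 3)                     # 0.28

rate_f15 <- 4201 / (13247 + 4201)      # Fall 2015 all-applicant rate
round(18428 * rate_f15)                # 4437, vs. actual 4220
```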
Moment of Truth: The Projection Prediction – Admitted
This Random Forest model predicts that 25.5% of admitted students will enroll Fall 2016: 4211/(12327+4211) = 25.5%.
Compared to Fall 2015: 4167/(11440+4167) = 26.7%.
How accurate:
• Fall 2015: 14935 × 26.7% = 3988 vs. actual 4220
• Fall 2016: 15572 × 25.5% = 3971 vs. actual 3991
Moment of Truth: The Projection Prediction – Confirmed
This Random Forest model predicts that 81.2% of confirmed students will enroll Fall 2016: 4200/(4200+972) = 81.2%.
Compared to Fall 2015: 4142/(4142+1251) = 76.8%.
How accurate:
• Fall 2015: 5062 × 76.8% = 3888 vs. actual 4220
• Fall 2016: 4643 × 81.2% = 3770 vs. actual 3991
Cautions
• Null values in R will cause problems.
– Financial aid variables become problematic: 0 is not the same as null.
• Dates vs. events:
– Snapshot dates across admission cycles must be consistent for prediction.
– Or, as I prefer, use data as of the same New Student Orientation; I used the 6th orientation session year-over-year in this analysis.
• Other models can assist in calibrating and ensuring model accuracy: logistic regression, Markov-chain models, etc.
• R does have a learning curve.
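The null-vs-zero caution above is concrete in R: a missing FAFSA amount is `NA`, not 0, and `NA` propagates through computations (and randomForest will stop on it) unless handled. A base-R illustration with hypothetical aid amounts:

```r
# Hypothetical Pell amounts: one record is missing, one is genuinely zero.
aid <- c(3500, NA, 0, 1200)
mean(aid)                # NA -- the missing value propagates
mean(aid, na.rm = TRUE)  # 1566.667 -- the zero still counts, the NA does not

# Typical options before modeling (df is a placeholder data frame):
# df <- na.omit(df)                    # drop incomplete records
# df <- randomForest::na.roughfix(df)  # impute with column median/mode
```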
Next Steps
• Adjust the model to focus on sub-populations of the admission pool: Honors students, STEM majors, URM, etc.
• Apply across the institution:
– WSU Tri-Cities
– WSU Vancouver
– WSU Spokane
– WSU North Puget Sound at Everett
– WSU Global
• Apply to other areas of student prediction, e.g., time-to-degree completion.
References
Headstrom, Ward. "Using a Random Forest Model to Predict Enrollment." Humboldt State University, CAIR 2013.
Herzog, Serge. "Estimating Student Retention and Degree-Completion Time: Decision Trees and Neural Networks vis-à-vis Regression." New Directions for Institutional Research, no. 131, Fall 2006.
Sampath, V., Flagel, A., & Figueroa, C. "A Logistic Regression Model to Predict Freshmen Enrollments."