The (Random) Forest for the (Decision) Trees
William Warfel, Office of Institutional Research, Washington State University
RMAIR 2016. An admission and enrollment predictive tool.
Overview
Goal: To assist administrative planning.
– Decision Tree
– Random Forest
– Data Specification and Needs
– RStudio
– Random Forest Model
– Prediction Results
– Cautions
– Next Steps
This model is used to predict new first-time freshmen enrollment. Other uses can include projection of graduation, retention, yield, etc.
Research Proposal: The Goal
• To accurately predict freshmen enrollment within 2.0%.
• Then, efficiently utilize Admissions efforts to capture the most successful students and retain them.
(Chart: First-Time Freshmen counts, 2013-14 through 2017-18, by stage: Applied, Admitted, Confirmed, Enrolled. For example, Fall 2015: 18428 applied, 14935 admitted, 5062 confirmed, 4220 enrolled; Fall 2016: 21302 applied, 15572 admitted, 4643 confirmed, 3991 enrolled.)
Admission Analysis: The 6 Stages
Stage 1: WSU Prospect
Stage 2: Inquiry
Stage 3: Application
Stage 4: Admission Offer
Stage 5: Confirmation
Stage 6: Enroll
What is a Decision Tree?
• A tree-like graphical modeling tool that maps possible decisions to their possible outcomes.
– Outcomes include chance, utility, cost, enrollment, etc.
• Most commonly used in decision analysis to identify the most likely strategy or end goal.
– Think elections!
• Difficulty:
– Imperfect information.
– Changes to the underlying data.
• E.g., cost of attendance or changes to admission criteria.
• Solution:
– Conditional probability.
• Work backwards!
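A single classification tree like the one described above can be sketched in R with the rpart package (an assumption; the slides do not name the tree-fitting package). The data here are simulated stand-ins for admit records, not the actual WSU extract:

```r
# A single decision tree predicting a hypothetical Enrolled flag from
# Housing and Confirmed status -- simulated data, illustrative only.
library(rpart)
set.seed(1)
n <- 300
housing   <- rbinom(n, 1, 0.4)
confirmed <- rbinom(n, 1, 0.5)
# Outcome loosely tied to confirmation so the tree has something to split on
enrolled  <- rbinom(n, 1, ifelse(confirmed == 1, 0.7, 0.15))
df <- data.frame(Enrolled  = factor(enrolled, labels = c("No", "Yes")),
                 Housing   = factor(housing),
                 Confirmed = factor(confirmed))
fit <- rpart(Enrolled ~ Housing + Confirmed, data = df, method = "class")
print(fit)              # text listing of the splits
# plot(fit); text(fit)  # draw the tree graphically
```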
Decision Tree Examples!
The Decision Tree
(Diagram of a single tree. Root: WSU; path: Apply → Admission Alive → Offer. The Offer node splits to Accept Offer and Reject Offer; Accept Offer leads through Housing to Enroll or Not Enroll; Reject Offer leads to Enroll or Nothing.)
What is a Random Forest?
• An ensemble method for classification and regression that constructs multiple decision trees.
• In essence, bootstrapping (combining) multiple decision trees.
• Random forests correct for a single decision tree's habit of overfitting to the training set.
• Creates more trees:
– Tree bagging (the true technical term).
– Bagging: averaging noisy but unbiased models to create a model with low variance.
• Think Amazon.com!
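The bagging idea above can be shown in a few lines of base R: fit many simple models, each on a bootstrap sample, then majority-vote their predictions. The "model" here is a toy one-rule stump on simulated data, purely to illustrate the averaging:

```r
# Minimal bagging (bootstrap aggregating) sketch: the mechanism behind a
# random forest, with a one-rule stump standing in for a full tree.
set.seed(42)
n <- 200
confirmed <- rbinom(n, 1, 0.5)                        # hypothetical predictor
enrolled  <- rbinom(n, 1, ifelse(confirmed == 1, 0.8, 0.2))

stump_predict <- function(idx, newx) {
  # "Train" on a bootstrap sample: majority outcome within each group
  p1 <- mean(enrolled[idx][confirmed[idx] == 1])
  p0 <- mean(enrolled[idx][confirmed[idx] == 0])
  ifelse(newx == 1, p1 > 0.5, p0 > 0.5)
}

# 25 stumps, each fit on its own bootstrap sample of the rows
votes  <- replicate(25, stump_predict(sample(n, replace = TRUE), confirmed))
bagged <- rowMeans(votes) > 0.5                       # majority vote
mean(bagged == enrolled)                              # bagged accuracy
```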
Random Forest: Amazon!
(Amazon screen shots: recommendations for two different users, me and Becky.)
The Random Forest
(Diagram: the single decision tree from the previous slide, repeated many times across the slide to form a forest.)
Model: Needed Data
The ENTIRE KITCHEN SINK!
• Bring in all the data believed to predict enrollment:
– Ethnicity
– Sex
– Housing
– Freshmen / Transfer Orientation (completed and future attendees)
– Financial Aid (FAFSA & financial aid interest), including scholarships awarded
– Confirmations
– Admission communication
Model: Data Cautions
Look out for:
• Institutional actions that create inconsistencies within the data.
• Administrative or legislative changes:
– An admission waitlist is imposed.
– Changes to the admission criteria.
– Tuition decreases, and at greater rates at other in-state universities.
• Students who cancel housing contracts but remain active applicants.
Why RStudio?
• Freeware
• Pre-loaded packages allow for specialized statistical techniques:
– Random Forest
– Bootstrapping
– LOTS and LOTS more!
• University-wide cross-collaboration
• Forums-on-Forums-on-Forums for help!
Working with Data in RStudio
Working with Data in RStudio
• Import files using CSV format.
• Must specify headers.
• When using variables, everything is case sensitive!
• Partition a data set.
• Warning messages will drive you mad!
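The import-and-partition steps above can be sketched as follows. A temporary CSV with simulated columns stands in for the real admissions extract so the example is self-contained; the file name and fields are placeholders:

```r
# Write a small stand-in CSV, then import it and split 70/30.
set.seed(2016)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Enrolled = rbinom(100, 1, 0.3),
                     Housing  = rbinom(100, 1, 0.4)),
          tmp, row.names = FALSE)

admits <- read.csv(tmp, header = TRUE)   # headers must be specified
# Case matters: admits$Enrolled works; admits$enrolled is NULL

idx   <- sample(nrow(admits), size = 0.7 * nrow(admits))
train <- admits[idx, ]                   # 70% training partition
test  <- admits[-idx, ]                  # 30% hold-out partition
```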
The Random Forest Model
• Three different data sets:
– Train: Fall 2014 admits
– Test: Fall 2015 admits
– Project: Fall 2016 admits
• All data is as of the 6th orientation session (late June / early July).
• Different Random Forest models:
– All Applicants
– All Admitted
– All Confirmed
Random Forest: Output
• Example: enrollment as a function of housing, confirmations, admitted status, Pell eligibility, etc.
• Number of trees: 500 is the default in R.
• The number of variables tried at each split is set by the algorithm to find the best fit within the training data set.
Random Forest: The Out-of-Bag (OOB) Error Rate & Confusion Matrix
• The OOB error (or estimate) is the error rate of the random forest predictor: a method to measure the Random Forest's error of prediction.
• The OOB confusion matrix is obtained from the RF predictor and consists of true positives, true negatives, false positives, and false negatives.
Random Forest: Variable Importance Plot
• Provides the mean decrease in accuracy for each variable.
• In the example provided, attendance code at orientation, along with housing and confirmation status, are the most important variables for prediction.
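The output, OOB error, confusion matrix, and importance plot described on the last three slides come from one model fit. This self-contained sketch uses the randomForest package with simulated stand-in data; variable names are placeholders for the real admit-record fields:

```r
# Fit a random forest and inspect the outputs discussed above.
library(randomForest)
set.seed(2016)
n <- 1000
train <- data.frame(
  Housing   = factor(rbinom(n, 1, 0.4)),
  Confirmed = factor(rbinom(n, 1, 0.5)),
  Pell      = factor(rbinom(n, 1, 0.3))
)
# Simulated outcome loosely tied to confirmation status
p <- ifelse(train$Confirmed == "1", 0.7, 0.1)
train$Enrolled <- factor(rbinom(n, 1, p), labels = c("No", "Yes"))

rf <- randomForest(Enrolled ~ Housing + Confirmed + Pell,
                   data = train,
                   ntree = 500,        # R's default number of trees
                   importance = TRUE)  # track per-variable accuracy loss

print(rf)          # OOB error rate and confusion matrix
rf$confusion       # rows = actual, columns = predicted, plus class error
importance(rf)     # mean decrease in accuracy / Gini per variable
varImpPlot(rf)     # the variable importance plot shown on the slide
```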
Moment of Truth: The Projection Prediction – All Applicants
This Random Forest model predicts that 28.0% of all applicants will enroll Fall 2016: 4225/15090 = 28.0%.
Compared to Fall 2015: 4201/(13247+4201) = 24.08%.
How accurate:
• Fall 2015: 18428 × 24.08% = 4437 vs. actual 4220
• Fall 2016: 21302 × 28.0% ≈ 5965 vs. actual 3991
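The rate arithmetic on this and the following slides can be checked directly in R: a predicted enrollment rate is taken from the model's prediction counts, then applied to a headcount pool. Counts below are from the slides:

```r
# Predicted enroll rate = predicted enrollees / pool at that stage.
rate_f16 <- 4225 / 15090               # Fall 2016 all-applicant rate
round(rate_f16, 3)                     # 0.28

rate_f15 <- 4201 / (13247 + 4201)      # Fall 2015 all-applicant rate
round(18428 * rate_f15)                # 4437, vs. actual 4220
```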
Moment of Truth: The Projection Prediction – Admitted
This Random Forest model predicts that 25.5% of admitted students will enroll Fall 2016: 4211/(12327+4211) = 25.5%.
Compared to Fall 2015: 4167/(11440+4167) = 26.7%.
How accurate:
• Fall 2015: 14935 × 26.7% = 3988 vs. actual 4220
• Fall 2016: 15572 × 25.5% = 3971 vs. actual 3991
Moment of Truth: The Projection Prediction – Confirmed
This Random Forest model predicts that 81.2% of confirmed students will enroll Fall 2016: 4200/(4200+972) = 81.2%.
Compared to Fall 2015: 4142/(4142+1251) = 76.8%.
How accurate:
• Fall 2015: 5062 × 76.8% = 3888 vs. actual 4220
• Fall 2016: 4643 × 81.2% = 3770 vs. actual 3991
Cautions
• Null values in R will cause problems.
– Financial aid variables become problematic: 0 is not the same as null.
• Dates vs. events:
– Snapshot dates across admission cycles must be consistent for prediction.
– Or, as I prefer, use data as of the same New Student Orientation; I used the 6th orientation session year-over-year in this analysis.
• Other models can assist in calibrating and ensuring model accuracy: logistic regression, Markov-chain models, etc.
• R does have a learning curve.
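The null-vs-zero caution above is concrete in R: a missing FAFSA amount is `NA`, not 0, and `NA` propagates through computations (and randomForest will stop on it) unless handled. A base-R illustration with hypothetical aid amounts:

```r
# Hypothetical Pell amounts: one record is missing, one is genuinely zero.
aid <- c(3500, NA, 0, 1200)
mean(aid)                # NA -- the missing value propagates
mean(aid, na.rm = TRUE)  # 1566.667 -- the zero still counts, the NA does not

# Typical options before modeling (df is a placeholder data frame):
# df <- na.omit(df)                    # drop incomplete records
# df <- randomForest::na.roughfix(df)  # impute with column median/mode
```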
Next Steps
• Adjust the model to focus on sub-populations of the admission pool: Honors students, STEM majors, URM, etc.
• Apply across the institution:
– WSU Tri-Cities
– WSU Vancouver
– WSU Spokane
– WSU North Puget Sound at Everett
– WSU Global
• Apply to other areas of student prediction, e.g., time-to-degree completion.
References
Headstrom, Ward. "Using a Random Forest Model to Predict Enrollment." Humboldt State University, CAIR 2013.
Herzog, Serge. "Estimating Student Retention and Degree-Completion Time: Decision Trees and Neural Networks vis-à-vis Regression." New Directions for Institutional Research, no. 131, Fall 2006.
Sampath, V., Flagel, A., & Figueroa, C. "A Logistic Regression Model to Predict Freshmen Enrollments."