Revenue Prediction of House Resale
Bairong Lei
University of Waterloo
November 6, 2012
Overview
• Motivation
• Previous Work
• Project Goals
• Dataset
• Plan for Analysis
Motivation
• People favor owning valuable property.
• Home investment is treated as a hedge against inflation.
• House resale is expected to make a profit.
[Figure: Census structure of private homes, percentage of private households by household type. Source: Statistics Canada]
Motivation Cont'd
• New home purchasing vs. resale home purchasing:

Issue of concern                | New homes                  | Resold homes
Registration for a home builder | Needed                     | Not needed
List prices                     | Unknown                    | Known
Renovation                      | Cost of upgrade            | Usually included in price
Appliances                      | May or may not be included | Usually included in price
Location                        | Unpredictable              | Fixed
Offer presentation              | Not needed                 | Needed
Previous Work
• Basu, S. and Thibodeau, T. Analysis of Spatial Auto-correlation in House Prices. Journal of Real Estate Finance and Economics, Vol. 17:1, 61-85 (1998).
  ◦ Structural characteristics increase the accuracy of hedonic house price prediction.
• Chopra, S., Thampy, T., Leahy, J., Caplin, A., and LeCun, Y. Machine Learning and the Spatial Structure of House Prices and Housing Returns (2008).
  ◦ Applies a linear regression model that accounts for geographic factors to reduce price prediction error over a long period.
• Question: How can the revenue from selling a house be predicted? Which factors affect that revenue?
Project Goals
• Predict the difference between the sold price and the listing price of resold houses (regression problem)
• Predict whether the sold price is greater than the asking price (classification problem)
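The two targets can be sketched for a single record as follows. This is a minimal illustration, assuming the difference is taken as sold price minus asking price (the slides do not fix the sign convention); the function name `resale_targets` is hypothetical, and the prices are those of the example record in the raw dataset.

```python
def resale_targets(ask_price, sold_price):
    """Return (price_difference, sold_above_ask) for one resale record.

    Assumption: the difference is sold price minus asking price.
    """
    difference = sold_price - ask_price      # regression target
    sold_above_ask = sold_price > ask_price  # classification target
    return difference, sold_above_ask


# Example record from the raw dataset: ask price 329777, sold price 320000.
difference, sold_above_ask = resale_targets(329777, 320000)
print(difference, sold_above_ask)  # -9777 False
```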
Raw Dataset - Source: realmarketwatch.com
Raw Dataset
• Source: realmarketwatch.com
• Description: Resold house records in the Greater Toronto Area from the most recent two weeks
• Fields include:
  ◦ MLS Number, City, Street Number, Street Name, Street Type, Area, House Type, House Style, Number of Bedrooms, Number of Bathrooms, Contract Date, Sold Date, Ask Price, Sold Price
Raw Dataset Cont'd
• Overview of the raw dataset
  ◦ Number of Records: 4194
  ◦ City: 54 distinct names
  ◦ Area: 340 districts
  ◦ Street Types: 38 distinct types
  ◦ House Types: 15
  ◦ House Styles: 10
  ◦ No. of Bedrooms: 0 ~ 9
  ◦ No. of Washrooms: 0 ~ 11
  ◦ Ask Price: 89900 ~ 7995000
  ◦ Sold Price: 2200 ~ 7025000
Raw Dataset Cont'd
• Example:
  ◦ City: Aurora
  ◦ St. No.: 51
  ◦ St. Name: Cashel
  ◦ St. Type: Crt
  ◦ Area: Aurora Hig
  ◦ Ask Price: 329777
  ◦ Contract Date: 10/09/2012
  ◦ Sold Price: 320000
  ◦ Sold Date: 24/09/2012
  ◦ House Type: Att/Row/Tw
  ◦ House Style: 2-Storey
  ◦ Bedroom: 3
  ◦ Washroom: 2
Challenges of Raw Data
• No house records for Halton region in the GTA
• Fields with invalid data
  ◦ 0 Bedrooms
  ◦ 0 Bathrooms
• Ambiguous data
  ◦ House style listed as "Vacant land"
• Any suggestions on imputation for these misleading data? (mean, hot-deck, or machine learning methods?)
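Of the imputation options raised above, mean imputation is the simplest to sketch. The example below assumes that a count of 0 marks an invalid value and replaces it with the rounded mean of the valid values; the function name, field name, and sample records are hypothetical and not taken from the dataset.

```python
from statistics import mean


def impute_zero_with_mean(records, field):
    """Return copies of the records with 0 in `field` replaced by the
    rounded mean of the non-zero values (assumption: 0 means invalid)."""
    valid = [r[field] for r in records if r[field] > 0]
    if not valid:
        # Nothing to estimate the mean from; return unchanged copies.
        return [dict(r) for r in records]
    fill = round(mean(valid))
    return [dict(r, **{field: r[field] if r[field] > 0 else fill})
            for r in records]


# Hypothetical records, for illustration only.
records = [{"bedrooms": 3}, {"bedrooms": 0}, {"bedrooms": 4}, {"bedrooms": 2}]
imputed = impute_zero_with_mean(records, "bedrooms")
print([r["bedrooms"] for r in imputed])  # [3, 3, 4, 2]
```

Mean imputation ignores the other fields of the record. Hot-deck or model-based imputation would instead borrow the value from similar houses, at the cost of defining what "similar" means.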
Plan for Analysis
• Overview
• Pre-processing raw data
• Potential machine learning methods
• Validation
Overview
• Feature reconstruction from raw data
  ◦ Goal: group categorical and qualitative data into levels that machine learning methods can use directly (feature encoding)
  ◦ Focus on the features city, house type, house style, number of bedrooms, and number of bathrooms
• Primarily focus on supervised learning methods
  ◦ Regression method
  ◦ Classification method
Pre-processing raw data
• Feature Encoding
  ◦ Use dummy variables to construct features from qualitative variables
  ◦ City names are categorized into four regions (City of Toronto, Peel, York, Durham), with City of Toronto as the baseline (Z_1 = Z_2 = Z_3 = 0):
    Z_1 = 1 if the house is in Peel, else Z_1 = 0;
    Z_2 = 1 if the house is in York, else Z_2 = 0;
    Z_3 = 1 if the house is in Durham, else Z_3 = 0.
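The region encoding above can be sketched as a small lookup. The city-to-region table lists only a few cities for illustration and is an assumption about how the mapping would be stored; the names `CITY_TO_REGION` and `region_dummies` are hypothetical.

```python
# Hypothetical lookup; the full dataset has 54 distinct city names.
CITY_TO_REGION = {
    "Toronto": "City of Toronto",
    "Mississauga": "Peel",
    "Aurora": "York",
    "Oshawa": "Durham",
}


def region_dummies(city):
    """Return (Z1, Z2, Z3) for a city.

    City of Toronto is the baseline and maps to (0, 0, 0).
    """
    region = CITY_TO_REGION[city]
    return (int(region == "Peel"),
            int(region == "York"),
            int(region == "Durham"))


print(region_dummies("Aurora"))   # (0, 1, 0)
print(region_dummies("Toronto"))  # (0, 0, 0)
```

Three dummies are enough for four regions: adding a fourth for City of Toronto would make the columns sum to one and duplicate the intercept of a linear model.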
Machine Learning Methods
• Regression problem
  ◦ Multivariate Linear Regression
• Classification problem
  ◦ Support Vector Machine
  ◦ Decision Tree
Multivariate Linear Regression
• Use the encoded qualitative variables to build models
• Recall: City names are categorized into four regions (City of Toronto, Peel, York, Durham):
    Z_1 = 1 if the house is in Peel, else Z_1 = 0;
    Z_2 = 1 if the house is in York, else Z_2 = 0;
    Z_3 = 1 if the house is in Durham, else Z_3 = 0.
• Y_i = α_0 + α_1 Z_1 + α_2 Z_2 + α_3 Z_3
• Train the models on the training samples
• Draw conclusions from the results on the test sets
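When the region dummies are the only predictors, the least-squares fit of the model above has a closed form: α_0 is the mean response in the baseline region (City of Toronto), and each α_k is the mean of region k minus the baseline mean. The sketch below uses that property instead of a general solver; the training values are hypothetical price differences and the function names are illustrative. A model with further features (house type, bedrooms, and so on) would need a general least-squares routine.

```python
from collections import defaultdict
from statistics import mean

# Baseline region first, then the regions encoded by Z1, Z2, Z3.
REGIONS = ("City of Toronto", "Peel", "York", "Durham")


def fit_region_model(samples):
    """Fit Y = a0 + a1*Z1 + a2*Z2 + a3*Z3 from (region, y) pairs.

    With only region dummies, least squares reduces to group means.
    """
    by_region = defaultdict(list)
    for region, y in samples:
        by_region[region].append(y)
    missing = [r for r in REGIONS if r not in by_region]
    if missing:
        raise ValueError(f"no training samples for: {missing}")
    base = mean(by_region[REGIONS[0]])
    return [base] + [mean(by_region[r]) - base for r in REGIONS[1:]]


def predict(alphas, region):
    """Evaluate the fitted model for a house in the given region."""
    z = [1 if region == r else 0 for r in REGIONS[1:]]
    return alphas[0] + sum(a * zi for a, zi in zip(alphas[1:], z))


# Hypothetical training data: (region, sold price minus ask price).
train = [
    ("City of Toronto", 10000), ("City of Toronto", 20000),
    ("Peel", -5000), ("Peel", -3000),
    ("York", -10000), ("York", -2000),
    ("Durham", 0), ("Durham", 4000),
]
alphas = fit_region_model(train)
print(alphas)                   # [15000, -19000, -21000, -13000]
print(predict(alphas, "York"))  # -6000
```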
Support Vector Machine
• Select features and generate feature subsets
• Build models with the various feature subsets and a Gaussian radial basis function kernel
• Apply K-fold cross-validation when training the models
• Compare the averaged misclassification rates for each feature subset
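The cross-validation step can be sketched independently of the classifier. The example below is a minimal illustration of K-fold splitting and the averaged misclassification rate; it is not an SVM. A majority-class predictor stands in for the model so that the sketch needs no external library, and an SVM with a Gaussian RBF kernel would be passed as `fit` instead. All names and sample values are hypothetical.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1.

    In practice the records would be shuffled before splitting.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_majority(train_x, train_y):
    """Stand-in model: always predict the most common training label."""
    winner = max(set(train_y), key=train_y.count)
    return lambda _features: winner


def cv_misclassification(features, labels, k, fit):
    """Average the held-out misclassification rate over k folds."""
    rates = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = fit(train_x, train_y)
        errors = sum(model(features[i]) != labels[i] for i in held_out)
        rates.append(errors / len(held_out))
    return sum(rates) / k


# Hypothetical feature subset (bedrooms, washrooms) and labels
# (True if the sold price exceeded the asking price).
features = [(3, 2), (2, 1), (4, 3), (3, 1), (2, 2), (5, 4), (3, 2), (1, 1), (4, 2)]
labels = [True, False, False, False, False, True, False, False, False]
rate = cv_misclassification(features, labels, 3, fit_majority)
```

Repeating the call once per feature subset and comparing the returned rates gives the comparison described in the last bullet.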
Decision Tree
• Build the tree with the C4.5 algorithm
• Handle training sets with unknown attribute values by evaluating the gain or gain ratio for that attribute using only the records where its value is known
• Pruning runs after the tree is created
• Pseudocode of the C4.5 algorithm:
  1. Check for base cases
  2. For each attribute a, find the normalized information gain from splitting on a
  3. Let a_best be the attribute with the highest normalized information gain
  4. Create a decision node that splits on a_best
  5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node
Source: C4.5 Algorithm http://en.wikipedia.org/wiki/C4.5_algorithm
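Steps 2 and 3 of the pseudocode rely on the normalized information gain (gain ratio). The sketch below computes it for categorical attributes and picks a_best; it covers only the attribute-selection step, not recursion, pruning, or missing values. The rows are hypothetical (the "Bungalow" style and the labels are invented for illustration), and the function names are not from any library.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """Information gain from splitting on attr, divided by the split information."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels, attrs):
    """Return the attribute with the highest gain ratio (a_best)."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))


# Hypothetical rows; label is True if the sold price exceeded the ask price.
rows = [
    {"region": "City of Toronto", "style": "2-Storey"},
    {"region": "City of Toronto", "style": "Bungalow"},
    {"region": "York", "style": "2-Storey"},
    {"region": "York", "style": "Bungalow"},
]
labels = [True, True, False, False]
print(best_attribute(rows, labels, ["region", "style"]))  # region
```

In this toy data the region separates the labels perfectly (gain ratio 1), while the style leaves each branch evenly mixed (gain ratio 0), so the region is chosen for the first split.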
Validation
• Test data set
  ◦ New real data released on the website will be used to test the prediction accuracy of the models built with these machine learning methods
References
• Statistics Canada. Distribution (in percentage) of private households by household type, 2001 to 2011. http://www12.statcan.gc.ca/census-recensement/2011/as-sa/98-312-x/2011003/fig/fig3_2-1-eng.cfm
• Basu, S. and Thibodeau, T. Analysis of Spatial Auto-correlation in House Prices. Journal of Real Estate Finance and Economics, Vol. 17:1, 61-85 (1998).
• Chopra, S., Thampy, T., Leahy, J., Caplin, A., and LeCun, Y. Machine Learning and the Spatial Structure of House Prices and Housing Returns (2008).
• Antipov, E. and Pokryshevskaya, E. Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Working Paper (2010).
• RealMarketWatch. http://realmarketwatch.com/
• C4.5 Algorithm. http://en.wikipedia.org/wiki/C4.5_algorithm
Thank You!