
Blue Book for Bulldozers: Predicting Auction Sale Price to Create a Blue Book for a Piece of Heavy Equipment



  1. Blue Book for Bulldozers
     Predicting Auction Sale Price to Create a “Blue Book” for a Piece of Heavy Equipment for Bulldozers
     Yoojong Bang, Joon Lim, Benedict Lim, Samuel Hills, Eun Hee Ko
     MASTER IN ANALYTICS | NORTHWESTERN UNIVERSITY

  2. Project Overview
     • Kaggle competition sponsored by FastIron
     • Predict the auction price of bulldozers
     • Training set
       – Over 400k observations
       – 52 predictor variables
     • Predictor variables consist of information on machine size, usage, and equipment configuration

  3. Forecast Goal
     The evaluation criterion for this competition is root mean squared log error (RMSLE):

     $$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(Y_i + 1) - \log(\hat{Y}_i + 1) \right)^2}$$

     The current top-ranked model has an RMSLE of 0.2209. Our cross-validated estimate of RMSLE beats this value, but our models do not reach it on the validation set provided by Kaggle.
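The metric above can be computed in a few lines. This is a minimal sketch in Python (the slides themselves appear to use R); the function name `rmsle`, the use of the natural logarithm, and the sample prices are assumptions made for illustration and are not taken from the slides.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared log error for two equal-length sequences of non-negative values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        # log1p(x) computes log(x + 1), matching the "+ 1" terms in the formula
        diff = math.log1p(y) - math.log1p(y_hat)
        total += diff * diff
    return math.sqrt(total / len(actual))

# Hypothetical sale prices, for illustration only
prices = [14500.0, 24000.0, 40000.0]
perfect = rmsle(prices, prices)                               # identical predictions
off_by_10_percent = rmsle(prices, [p * 1.1 for p in prices])  # every prediction 10% too high
```

Because the error is taken on the log scale, a prediction that is 10% too high costs roughly the same on a cheap machine as on an expensive one, which suits prices spanning 4,750 to 142,000.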

  4. Challenges
     • Variable Sparsity
       – The majority of predictor variables are very sparse
       – No observations contain values for all predictors
       – Even after subsetting predictors, most models do not accept null values
     • Multicollinearity
       – Many predictors are identical
     • Categorical Variables
       – Almost all predictors are categorical
       – Most local linear regression models do not accept categorical variables

  5. Data Description
     Response variable: Sale Price
       Min.    1st Qu.  Median   Mean     3rd Qu.  Max.
       4,750   14,500   24,000   31,100   40,000   142,000
     Data transformed: log10

     MachineHoursCurrentMeter
       Min.  1st Qu.  Median  Mean   3rd Qu.  Max.     NA's
       0     0        0       3,458  3,025    248,300  258,360

     YearMade
       Min.  1st Qu.  Median  Mean  3rd Qu.  Max.  NA's
       1919  1988     1996    1994  2001     2013  38,185

     Categorical variables
       Variable            Distinct values  NA's
       Enclosure           5                2
       State               53               0
       fiProductClassDesc  74               2801

  6. Linear Regression
     Model Setup Parameters
     • No parameters
     Data Description
     • Split data into 57 product classes
     • Linear regression on
       – Number of machine hours on the current meter
       – Year made
     RMSLE Results
     • Min: 0.220586345
     • Max: 0.730789603
     • Average: 0.400408126
     Additional Remarks
     • Average coefficients
       – Number of machine hours = -245.7014
       – Year made = 11585.12
     • These coefficients make sense: the longer a machine has been used, the lower its price, and the ‘younger’ the machine, the more valuable it is.

  7. Ridge Regression
     Model Setup Parameters
     • 20 values of lambda
       – X = 1:20
       – Lambda = 1/(1.5^(X-1))
     Data Description
     • Split data into 57 product classes
     • Linear regression on
       – Number of machine hours on the current meter
       – Year made
     RMSLE Results
     • Min: 0.511227413
     • Max: 1.818759302
     • Average: 1.01693174
     Additional Remarks
     • Little correlation between predictor variables
     • The lambda with the lowest RMSLE for all 57 product classes was the smallest one, 0.00045
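As a quick arithmetic check on the penalty grid, the formula Lambda = 1/(1.5^(X-1)) for X = 1 to 20 can be evaluated directly. The sketch below is in Python rather than the R the slides appear to use, and the variable names are illustrative.

```python
# Ridge penalty grid from the slide: Lambda = 1 / 1.5^(X - 1) for X = 1..20
lambdas = [1.0 / (1.5 ** (x - 1)) for x in range(1, 21)]

largest = lambdas[0]    # X = 1
smallest = lambdas[-1]  # X = 20

print(f"{len(lambdas)} values, from {largest:.5f} down to {smallest:.5f}")
# prints "20 values, from 1.00000 down to 0.00045"
```

The smallest grid value matches the 0.00045 reported as optimal. Since the best lambda sits at the edge of the grid, the true optimum may lie at an even smaller penalty, that is, closer to ordinary least squares.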

  8. K-Nearest Neighbor Classification (KNN)
     Model Setup Parameters
     • Number of nearest neighbors: 3 to 10
     Data Description
     • Split data into 57 product classes
     • KNN on
       – Number of machine hours on the current meter
       – Year made
     RMSLE Results
     • Min: 0.206888639
     • Max: 0.657082542
     • Average: 0.34215316
     Additional Remarks
       Number of nearest neighbors (K)               5   6   7   8   9   10
       Number of product classes with optimal K      1   1   4   6   12  33
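The K search described above can be sketched as follows. This is not the authors' code: it assumes KNN is used as a regressor (averaging the prices of the nearest neighbors), that both features are rescaled to comparable ranges, and that K is chosen on a single held-out split. The data are synthetic and the names `knn_predict` and `choose_k` are hypothetical.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared log error."""
    squared = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def choose_k(train_x, train_y, valid_x, valid_y, candidates=range(3, 11)):
    """Return the K in `candidates` with the lowest validation RMSLE, plus all scores."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train_x, train_y, q, k) for q in valid_x]
        scores[k] = rmsle(valid_y, preds)
    return min(scores, key=scores.get), scores

# Synthetic stand-in for one product class: (scaled hours, scaled year) -> price
rows = []
for i in range(40):
    hours = i * 250
    year = 1990 + i % 20
    price = 20000 + 1500 * (year - 1990) - 2 * hours
    rows.append(((hours / 1000.0, float(year - 1990)), float(price)))

train, valid = rows[0::2], rows[1::2]
best_k, scores = choose_k([x for x, _ in train], [y for _, y in train],
                          [x for x, _ in valid], [y for _, y in valid])
```

In the slides this search was run once per product class. With 33 of 57 classes selecting K = 10, the upper end of the range, a wider range of K might be worth exploring.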

  9. Support Vector Machines Classification (SVM)
     Model Setup Parameters
     • Four types of kernels (tuning parameters)
       – Polynomial (degree and gamma)
       – Sigmoid (gamma and coefficient)
       – Radial (gamma)
       – Linear (gamma)
     • Gamma range: 10^-6 to 0.1
     • Degree range: 2 to 6
     • Coefficient range: 0 to 3
     Data Description
     • Split data into 57 product classes
     • SVM on
       – State of sale
       – Type of enclosure
       – Number of machine hours on the current meter
       – Year made
     RMSLE Results
     • Min: 0.208401176
     • Max: 0.519523311
     • Average: 0.329685653

  10. Boosting (GBM)
      Model Setup Parameters
      • Interaction terms {1, 2, 3, 4}
      • Shrinkage parameter {0.1, 0.2, …, 1.0}
      Data Description
      • Single model run
      • Subset of 26 variables
      • Mix of quantitative and qualitative
      RMSLE Results
      • Min: 0.1447
      • Max: 0.1597
      • Average: 0.1483
      Additional Remarks
      • Even with variable selection, only a few predictor variables are dominant
      • Chose 100 trees to create the model
        – Chosen heuristically
        – Tried 10, 100, and 1000 on a few models

  11. Regression Trees (CART)
      Model Setup Parameters
      • Prune the tree based on Cp = 0.1
      Data Description
      • Used all 52 variables
      RMSLE Results
      • 0.3318572
      Additional Remarks
      • fiProductClassDesc is the most important predictor variable
      • Error is randomly distributed: E[error] = 0

  12. GAM
      Model Setup Parameters
      • No parameters
      Data Description
      • Split data into 57 product classes
      • Variables used to fit the GAM
        – Number of machine hours on the current meter
        – Year made
      RMSLE Results
      • Min: 0.214957079
      • Max: 0.683211532
      • Average: 0.35196

  13. MARS
      Model Setup Parameters
      • Degrees of interaction {1, 2, 3, 4}
      Data Description
      • Single model run
      • Subset of 5 variables
      • Mix of quantitative and qualitative
      RMSLE Results
      • Min: 0.1497
      • Max: 0.1512
      • Average: 0.1501
      Additional Remarks
      • Subset of variables chosen because the R package can only be run on observations without null values
      • Because of the small number of variables, the models with 2, 3, and 4 degrees of interaction were identical (due to the nature of the backward pass)

  14. Random Forest
      Model Setup Parameters
      • N.tree = 1000
      Data Description
      • Random forest on
        – MachineID
        – ProductGroup
        – YearMade
        – Saledate
      RMSLE Results
      • 0.4819
      Additional Remarks
      • Two R packages: randomForest vs. party (the difference lies in variable importance and base tree)
      • Very high computational power required, especially RAM (8 GB was not enough)
      • randomForest() requires non-null variables with fewer than 8 categories
      • cforest() can handle missing values and more categories, but takes far too long to run

  15. Stacked Generalization - Stacking
      We stacked three models with a squared-error loss function.
      1) Random Forest: 10-fold CV RMSLE = 0.4819
      2) Regression Tree: 10-fold CV RMSLE = 0.3365
      3) Gradient Boosted Model: 10-fold CV RMSLE = 0.1447
      Average RMSLE of the above three models = 0.321

      Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
      RandF  -0.046102  0.001138    -40.52   <2e-16 ***
      CART    0.391332  0.001374    284.86   <2e-16 ***
      GBM     0.655589  0.001504    435.97   <2e-16 ***

      The stacked generalization model has a 10-fold CV RMSLE of 0.2646284.
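One way to read the coefficient table above is as a least-squares regression of the observed response on the three base models' predictions. The sketch below illustrates that idea only, under stated assumptions: no intercept is fitted (the table lists only three terms), the regression is run on out-of-fold predictions, and all numbers are made up. The helper names `solve_linear_system` and `fit_stack_weights` are hypothetical, and the slides' own fit appears to have been done in R.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_stack_weights(base_preds, target):
    """Squared-error-optimal weights (no intercept) via the normal equations X'X w = X'y.

    base_preds holds one row per observation and one column per base model.
    """
    p = len(base_preds[0])
    xtx = [[sum(row[i] * row[j] for row in base_preds) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(base_preds, target)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up log-price predictions from three base models (columns: RandF, CART, GBM)
base_preds = [
    [9.0, 9.5, 9.4],
    [10.2, 10.0, 10.1],
    [9.8, 10.4, 10.3],
    [11.0, 10.6, 10.9],
    [10.5, 11.2, 11.0],
]
# Made-up target built as 0.4 * CART + 0.6 * GBM, so the fit should recover those weights
target = [0.4 * row[1] + 0.6 * row[2] for row in base_preds]

weights = fit_stack_weights(base_preds, target)
stacked = [sum(w * v for w, v in zip(weights, row)) for row in base_preds]
```

In the reported fit, GBM receives the largest weight and the random forest a small negative one, consistent with their individual cross-validated errors. The stacked RMSLE of 0.2646 is still higher than GBM alone at 0.1447, so on these numbers stacking did not improve on the best base model.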
