I found a very nice article online that explains how to develop a multivariate regression model. The article supplies a dataset of several hundred 2005 used General Motors (GM) cars as the basis for its step-by-step explanation, which I think is a good way to learn about regression models. So I decided to try to reproduce the article's results, intercepts, slopes and plots using R.
This article shows just that: how to reproduce the results of the aforementioned article in R. I have also added two other ways to find the best fit using the power of R.
The article uses four equations to develop the model towards the best fit.
- Equation 1: Based on Mileage alone
- Equation 2: Based on Mileage, Cylinder, Doors, Cruise, Sound, Leather
- Equation 3: Based on Mileage, Liter, Buick, Cadillac, Chevrolet, Pontiac, SAAB, predicting the log of price
- Equation 4: Based on Mileage, Liter, Doors, Cruise, Sound, Leather, Buick, Cadillac, Chevrolet, Pontiac, SAAB, Convertible, Hatchback, Sedan, predicting the log of price
Before modelling the regressions with R, we first need to read the data and load the required libraries.
library("tidyr") library("dplyr") kuiper <- read.csv("kuiper.csv", sep=";", dec=",")
Equation 1
The first equation is simple and serves mainly to explore the data:
model <- lm(Price ~ Mileage, data=kuiper)
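The intercepts, slopes and R-squared values shown below (here and for the later models) are the relevant lines taken from the summary() output:

summary(model)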
Result:
## (Intercept) 24764.5590
## Mileage -0.1725
## ---
## Multiple R-squared: 0.0205, Adjusted R-squared: 0.0192
With an R-squared of about 0.02, mileage alone explains hardly any of the variance in price. To plot the residuals of this model use this code:
plot(model$residuals, col="blue", type="l")
points(model$residuals, col="red", type="p", pch=20)
abline(h=0)
Equation 2
Still quite a simple model, just with more predictors.
model_acc <- lm(Price ~ Mileage + Cylinder + Doors + Cruise + Sound + Leather, data=kuiper)
## (Intercept) 7323.1643
## Mileage -0.1705
## Cylinder 3200.1246
## Doors -1463.3991
## Cruise 6205.5113
## Sound -2024.4007
## Leather 3327.1433
## ---
## Multiple R-squared: 0.446, Adjusted R-squared: 0.442
If we plot the model we get four diagnostic plots, each explaining a different aspect of the model: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.
par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model_acc)
par(mfrow=c(1,1))  # reset the plotting layout
For the histogram and the distribution of the residuals we need different plots.
hist(model_acc$residuals, breaks=30)
plot(model_acc$residuals, col="blue", type="l")
points(model_acc$residuals, col="red", type="p", pch=20)
abline(h=0)
Equation 3
For this equation we first need to alter the dataset: the factor variable Make is converted into boolean (dummy) variables, and we take the log of Price. (A shortcut that lets R do the dummy coding itself is shown after the results.)
# Give every row an id so the dummy columns can be merged back in
kuiper$id <- 1:nrow(kuiper)

# Turn the contingency table of id versus Make into one 0/1 column per make
make <- as.data.frame(table(kuiper$id, kuiper$Make))
make <- make %>% spread(Var2, Freq) %>% rename(id = Var1)
kuiper <- merge(kuiper, make, by="id")
kuiper <- select(kuiper, -Make)

# Model the log of the price
kuiper$tPrice <- log10(kuiper$Price)

model_make <- lm(tPrice ~ Mileage + Liter + Buick + Cadillac + Chevrolet + Pontiac + SAAB, data=kuiper)
## (Intercept) 3.979911512
## Mileage -0.000003476
## Liter 0.099724553
## Buick 0.039968616
## Cadillac 0.249302587
## Chevrolet -0.009372280
## Pontiac 0.013613268
## SAAB 0.345305253
## ---
## Multiple R-squared: 0.917, Adjusted R-squared: 0.916
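As promised, the shortcut: the manual dummy coding above is not strictly necessary, because lm() expands a factor into indicator variables automatically. A minimal sketch, assuming a copy of the data (kuiper_raw, a hypothetical name) taken before Make was replaced by dummy columns:

# kuiper_raw is assumed to still contain the original Make factor column
model_make2 <- lm(log10(Price) ~ Mileage + Liter + Make, data=kuiper_raw)
summary(model_make2)

The fit and R-squared are identical; only the parameterisation differs, because lm() drops the first factor level (Buick) as the baseline instead of Saturn, the make omitted in the manual version.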
Same plots as before.
par(mfrow=c(2,2))
plot(model_make)
par(mfrow=c(1,1))
Histogram and residuals.
hist(model_make$residuals, breaks=30)
plot(model_make$residuals, col="blue", type="l")
points(model_make$residuals, col="red", type="p", pch=20)
abline(h=0)
Equation 4
For equation 4 the dataset needs to be altered in the same way, this time for the Type variable.
# Same dummy-variable trick, now for the Type column
type <- as.data.frame(table(kuiper$id, kuiper$Type))
type <- type %>% spread(Var2, Freq) %>% rename(id = Var1)
kuiper <- merge(kuiper, type, by="id")
kuiper <- select(kuiper, -Type)

model_all <- lm(tPrice ~ Mileage + Liter + Doors + Cruise + Sound + Leather + Buick + Cadillac + Chevrolet + Pontiac + SAAB + Convertible + Hatchback + Sedan, data=kuiper)
## (Intercept) 3.91810880
## Mileage -0.00000358
## Liter 0.09576214
## Doors 0.03352658
## Cruise 0.00751699
## Sound 0.00522264
## Leather 0.00625998
## Buick 0.04165259
## Cadillac 0.23303437
## Chevrolet -0.01331520
## Pontiac -0.00042065
## SAAB 0.28109769
## Convertible 0.13781907
## Hatchback -0.08898857
## Sedan -0.07114898
## ---
## Multiple R-squared: 0.952, Adjusted R-squared: 0.951
Again the same plots.
par(mfrow=c(2,2))
plot(model_all)
par(mfrow=c(1,1))

hist(model_all$residuals, breaks=30)
plot(model_all$residuals, col="blue", type="l")
points(model_all$residuals, col="red", type="p", pch=20)
abline(h=0)
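Because equation 3 uses a subset of equation 4's predictors, the two models are nested and can be compared directly with an F-test:

# A significant F-test means the extra predictors of equation 4 improve the fit
anova(model_make, model_all)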
Let R find the best model
R has the step() function. You supply two models and R steps through the candidate models between them, at each step adding the predictor that improves the fit (measured by AIC) the most. I've removed the variables I don't want to use as predictors.
# Drop the columns that should not act as predictors
kuiper <- kuiper %>% select(-Price, -id, -Model, -Trim)

# The smallest and the largest model bound the search
model <- lm(tPrice ~ Mileage, data=kuiper)
model_all <- lm(tPrice ~ ., data=kuiper)

bestfit <- step(model, scope=list(lower=model, upper=model_all), direction="forward")
Result:
## (Intercept) 4.006511863
## Mileage -0.000003579
## Liter 0.109135503
## SAAB 0.276731601
## Cadillac 0.241903489
## Convertible 0.142481573
## Wagon 0.069803273
## Buick 0.041279545
## Hatchback -0.013685317
## Chevrolet -0.014330497
## Cylinder -0.012359839
## Cruise 0.008355646
## Leather 0.005336807
## Sound 0.004812547
## ---
## Multiple R-squared: 0.952, Adjusted R-squared: 0.952
This model is slightly different (it swaps Doors and Pontiac for Wagon and Cylinder) and produces almost the same R-squared.
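The order in which step() added the terms can be inspected afterwards, because the trace is stored on the returned object:

bestfit$anova     # one row per step of the forward selection
summary(bestfit)  # details of the selected model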
Machine learning
Using the caret package we can apply a machine-learning workflow to fit the model on the dataset.
library("caret") ml_model <- train(tPrice ~ ., data=kuiper, method="lm")
## (Intercept) 3.93932861
## Mileage -0.00000358
## Cylinder -0.01224029
## Liter 0.10880574
## Doors 0.03377855
## Cruise 0.00808055
## Sound 0.00462693
## Leather 0.00517194
## Buick 0.04375913
## Cadillac 0.24448881
## Chevrolet -0.01264695
## Pontiac 0.00261046
## SAAB 0.27901919
## Saturn NA
## Convertible 0.14078278
## Coupe NA
## Hatchback -0.08269889
## Sedan -0.06948186
## Wagon NA
## ---
## Multiple R-squared: 0.952, Adjusted R-squared: 0.951
Remarks
I would like to add that normally you would not use the complete dataset to fit a model for prediction. You would split the dataset into a training set and a test set, so you can cross-validate the model and guard against overfitting.
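A minimal sketch of such a workflow with caret, assuming the kuiper data frame as prepared above (the 80/20 split and the 10-fold cross-validation are arbitrary choices):

library("caret")

set.seed(42)  # make the random split reproducible (assumed seed)

# Hold out 20% of the rows as a test set
in_train <- createDataPartition(kuiper$tPrice, p=0.8, list=FALSE)
training <- kuiper[in_train, ]
testing  <- kuiper[-in_train, ]

# 10-fold cross-validation on the training set
ctrl <- trainControl(method="cv", number=10)
cv_model <- train(tPrice ~ ., data=training, method="lm", trControl=ctrl)

# Evaluate on the held-out test set
predictions <- predict(cv_model, newdata=testing)
RMSE(predictions, testing$tPrice)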