Housing Price Predictions
This post is a work in progress. I hope to start out with a bare-bones modeling workflow using the Tidymodels ecosystem for machine learning in R. Once we have a basic workflow, we can try and improve model performance by switching out the model, resampling and/or further custominzing the data engineering steps.
We are going to use the Ames housing dataset, available in the AmesHousing
package, to create a model for price prediction. The dataset contains 81 features on 3000 home sales around the University in Ames, Iowa between 2006-2010. We’re going to try and use these features to train a linear regression model. This version of the data is clean but we will need to do some basic data engineering before we can train our model and test it.
ames_df <- make_ames() %>%
janitor::clean_names() # extracting the data from the AmesHousing package and converting all column names to lower snace_case
# Here are a few functions that I used to get an initial feel for the dataset
#ames_df %>% count(neighborhood)
#summary(ames_df)
#View(ames_df)
#dim(ames_df)
Since sale price is our feature of interest, let’s consider the home sale price distribution. Even before we visualize sale price, we can imagine that the distribution might be skewed in the direction of higher price, owing to a few uber-expensive mansions. If that’s the case, we will want to apply a log transformation to the sale price before modeling.
Let’s also visualize how the house sales are distributed across neighborhoods.
ames_df %>%
count(neighborhood) %>%
mutate(neighborhood = str_replace_all(neighborhood, "_", " "),
neighborhood = fct_reorder(neighborhood, n)) %>%
ggplot(aes(n, neighborhood, fill = neighborhood)) +
geom_col() +
labs(x = '# of houses sold by neighborhood',
y = "") +
theme(legend.position = 'none')
How has the sale price changed with time? The boxplots below show the sale price for each year in the dataset. The median price, specified by the black line in each box, is relatively stable over the five-year span.
p <- ames_df %>% select(neighborhood, year_sold, sale_price) %>%
group_by(year_sold) %>%
mutate(avg_sale_price = mean(sale_price),
year_sold = as.factor(year_sold)) %>%
arrange(year_sold) %>%
ggplot(aes(year_sold, sale_price)) +
geom_boxplot() +
theme(legend.position = 'none') +
scale_y_continuous(labels = comma)
n <- ggplotly(p)
n
Geospatial analysis
The dataset includes longitude and latitude coordinates for each house. Using the leaflet package, we can plot the house locations by neighborhood. In addition, we can weight the circle size based on sale price. In order to better visualize the individual circles, dplyr::slice_sample()
can be used to randomly sample a limited number of houses. We will select 600 house locations to plot on the Leaflet map shown below.
library(RColorBrewer)
pal <- colorFactor(
palette = 'Dark2',
domain = ames_df$neighborhood
)
library(RColorBrewer)
pal <- colorFactor(
palette = 'Dark2',
domain = ames_df$neighborhood
)
ames_random_samp <- ames_df %>%
slice_sample(n = 600)
ames_random_samp %>% group_by(neighborhood) %>% leaflet(data = .,
options = leafletOptions(zoomControl = FALSE,
minZoom = 13, maxZoom = 13, dragging = FALSE)) %>%
addTiles() %>%
addCircles(~longitude, ~latitude, color = ~pal(neighborhood), radius = 1, weight = ~sale_price*2e-5, opacity = 1)
Splitting the data
Now that we have a better feel for the data, let’s use the rsample
package to split the data into training and test sets. For more robust predictions, we may also want to use vfold_cv()
to resample the training set later on.
library(rsample)
set.seed(518)
ames_split <- initial_split(ames_df, prop = 0.8, strata = "sale_price")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
Data preprocessing
Feature engineering with the recipes
package provides an expansive series of data manipulation steps that can be woven together to elegantly preprocess our data before machine learning.
library(recipes)
ames_rec <-
recipe(sale_price ~ ., data = ames_train) %>% # Specify the model--in our case, we are using all variables (.) to predict sale price.
step_log(sale_price, base = 10) %>% # take the log of sale price
step_other(neighborhood, threshold = 0.01) %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal())
Model specification
ames_lm <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
Creating a workflow, fitting the model and making predictions on the testset
library(workflows) #loaded with tidymodels
lm_wkfl <-
workflow() %>%
add_model(ames_lm) %>%
add_recipe(ames_rec)
ames_wkfl_fit <- lm_wkfl %>%
last_fit(split = ames_split)
ames_wkfl_fit %>%
collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.532 Preprocessor1_Model1
## 2 rsq standard 0.729 Preprocessor1_Model1
ames_wkfl_fit %>%
collect_predictions()
## # A tibble: 588 x 5
## id .pred .row sale_price .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 0.441 1 0.621 Preprocessor1_Model1
## 2 train/test split 1.06 4 0.929 Preprocessor1_Model1
## 3 train/test split 0.188 5 0.320 Preprocessor1_Model1
## 4 train/test split 0.417 8 0.340 Preprocessor1_Model1
## 5 train/test split 0.962 9 0.853 Preprocessor1_Model1
## 6 train/test split 0.638 15 0.587 Preprocessor1_Model1
## 7 train/test split 1.08 18 2.10 Preprocessor1_Model1
## 8 train/test split -1.04 19 -0.403 Preprocessor1_Model1
## 9 train/test split -0.105 25 -0.255 Preprocessor1_Model1
## 10 train/test split -0.800 27 -0.677 Preprocessor1_Model1
## # … with 578 more rows
Visualizing R-squared
ames_wkfl_fit %>%
collect_predictions() %>%
mutate(sale_price = 10^sale_price,
.pred = 10^.pred) %>%
ggplot(aes(x = sale_price, y = .pred)) +
geom_point() +
geom_abline(color = 'blue', linetype = 2) +
coord_obs_pred() +
labs(title = 'R-Squared Plot',
y = 'Predicted Sale Price',
x = 'Actual Sale Price')
ames_wkfl_fit %>%
collect_predictions() %>%
ggplot(aes(x = sale_price, y = .pred)) +
geom_point() +
geom_abline(color = 'blue', linetype = 2) +
coord_obs_pred() +
labs(title = 'R-Squared Plot',
y = 'Predicted Sale Price (log)',
x = 'Actual Sale Price (log)')