Hello Tidyverse and generating many linear regression models
Always a work in progress
Intro
This post uses data from the for Data Science online learning community. Each Tuesday, a dataset is posted and people share their insights and visualizations like fine wine. You can Twitter the hashtag #TidyTuesday to sample the vintage of the week. This week’s dataset (September 1st, 2020) is from Our World in Data. There are actually five datasets but we are going to focus on global crop yields.
After exploring the data, we will create a model that predict the yield per hectare (yield_hectare ~ population + year), using the year and population. The dataset is too big to use every country and crop we will choose the top 25 countries by population and select 4 common crops. Altother, that will produce 100 linear models. Since that’s too much to digest in a table, we will present the model results as a labeled data visualization where each point represents a model’s adjusted p-value (y-axis) and estimate or slope (x-axis) for a given country and crop.
Download and view the data
Let’s start by downloading the data from the R4DS Github account and take a look at it. The tidytuesdayR package (by Ellis Hughes, co-host of the screencast ‘TidyX’) is a handy package for checking out available datasets, exploring the data dictionary in the Rstudio viewer and loading the desired data (and more). Although simple to use, I’m going to stick with the readr::read_csv()
to import the data as a tibble (Tidyverse
lingo for a modern dataframe object).
# Load the data with a readr function
crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv") %>%
janitor::clean_names()
land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv") %>%
janitor::clean_names() %>%
select(entity, code, year, population = total_population_gapminder) %>%
mutate(year = as.numeric(year))
crop_yield <- crop_yields %>%
left_join(land_use, by = c('code', 'year', 'entity'))
skimr::skim(crop_yields)
Name | crop_yields |
Number of rows | 13075 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 12 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
entity | 0 | 1.00 | 4 | 39 | 0 | 249 | 0 |
code | 1919 | 0.85 | 3 | 8 | 0 | 214 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 1990.37 | 16.73 | 1961.00 | 1976.00 | 1991.00 | 2005.00 | 2018.00 | ▇▆▇▇▇ |
wheat_tonnes_per_hectare | 4974 | 0.62 | 2.43 | 1.69 | 0.00 | 1.23 | 1.99 | 3.12 | 10.67 | ▇▅▂▁▁ |
rice_tonnes_per_hectare | 4604 | 0.65 | 3.16 | 1.85 | 0.20 | 1.77 | 2.74 | 4.16 | 10.68 | ▇▇▃▁▁ |
maize_tonnes_per_hectare | 2301 | 0.82 | 3.02 | 3.13 | 0.03 | 1.14 | 1.83 | 3.92 | 36.76 | ▇▁▁▁▁ |
soybeans_tonnes_per_hectare | 7114 | 0.46 | 1.45 | 0.75 | 0.00 | 0.86 | 1.33 | 1.90 | 5.95 | ▇▇▂▁▁ |
potatoes_tonnes_per_hectare | 3059 | 0.77 | 15.40 | 9.29 | 0.84 | 8.64 | 13.41 | 20.05 | 75.30 | ▇▅▁▁▁ |
beans_tonnes_per_hectare | 5066 | 0.61 | 1.09 | 0.82 | 0.03 | 0.59 | 0.83 | 1.35 | 9.18 | ▇▁▁▁▁ |
peas_tonnes_per_hectare | 6840 | 0.48 | 1.48 | 1.01 | 0.04 | 0.72 | 1.15 | 1.99 | 7.16 | ▇▃▁▁▁ |
cassava_tonnes_per_hectare | 5887 | 0.55 | 9.34 | 5.11 | 1.00 | 5.55 | 8.67 | 11.99 | 38.58 | ▇▇▁▁▁ |
barley_tonnes_per_hectare | 6342 | 0.51 | 2.23 | 1.50 | 0.09 | 1.05 | 1.88 | 3.02 | 9.15 | ▇▆▂▁▁ |
cocoa_beans_tonnes_per_hectare | 8466 | 0.35 | 0.39 | 0.28 | 0.00 | 0.24 | 0.36 | 0.49 | 3.43 | ▇▁▁▁▁ |
bananas_tonnes_per_hectare | 4166 | 0.68 | 15.20 | 12.08 | 0.66 | 5.94 | 11.78 | 20.79 | 77.59 | ▇▃▁▁▁ |
# Here are a few other options for getting to know the data
# glimpse(crop_yields)
# summary(crop_yields)
# View(crop_yields)
Tidy data: cleaning and reshaping
Let’s start with a little data cleaning! skim()
from skimr
provides a convenient way to get summary statistics and more. By looking at the data we can see that there are missing values that need to be removed or imputed but other than that, the data is clean. One thing we should consider is whether the data is in a tidy format where each column is a feature and each row a sample. This format is important when we want to model the data but sometimes it’s helpful to pivot the data into a long format for creating visualizations across multiple categories with ggplot2::facet_wrap()
and related functions. We will use the handy function, pivot_longer()
, from the tidyr package to combine crops into a single column and present their respective values in a separate column in preparation for exploratory data analysis (EDA).
# Use pivot_longer from tidyr to tidy the data.
crop_yield_tidy <- crop_yield %>%
rename_all(str_remove, "_tonnes.*") %>%
rename(country = entity) %>%
pivot_longer(wheat:bananas, names_to = 'crop', values_to = 'yield_hectare') %>%
drop_na(yield_hectare)
# Create a vector of the 25 countries with the largest population
top_pops <- land_use %>%
filter(!is.na(code), entity != "World") %>%
group_by(entity) %>%
filter(year == max(year)) %>%
ungroup() %>%
slice_max(population, n = 25) %>%
pull(entity)
kable(head(crop_yield_tidy, n = 10))
country | code | year | population | crop | yield_hectare |
---|---|---|---|---|---|
Afghanistan | AFG | 1961 | 9169000 | wheat | 1.0220 |
Afghanistan | AFG | 1961 | 9169000 | rice | 1.5190 |
Afghanistan | AFG | 1961 | 9169000 | maize | 1.4000 |
Afghanistan | AFG | 1961 | 9169000 | potatoes | 8.6667 |
Afghanistan | AFG | 1961 | 9169000 | barley | 1.0800 |
Afghanistan | AFG | 1962 | 9351000 | wheat | 0.9735 |
Afghanistan | AFG | 1962 | 9351000 | rice | 1.5190 |
Afghanistan | AFG | 1962 | 9351000 | maize | 1.4000 |
Afghanistan | AFG | 1962 | 9351000 | potatoes | 7.6667 |
Afghanistan | AFG | 1962 | 9351000 | barley | 1.0800 |
What can we learn from this dataset?
A great place to start any analysis is by counting the representation of certain categories in the dataset using the count()
or add_count()
function. For instance, we might want to know the count of each country across years which will tell us the number of crops recorded for each country. It often makes sense to combine count()
with a bar plot to visualize meaningful distributions.
EDA is often where I have the most fun in my workflow..plus it’s arguably the most important step in a data analysis pipeline. Among other things, it provides insights that will help us have an intuitive sense about what to expect when we generate machine learning models and evaluate their performance.
crop_yield_tidy %>%
add_count(year, country, sort = T) %>%
filter(country %in% c("Albania", "Africa", "United States")) # Africa has 11 crops, US has 9 and Albania 6.
## # A tibble: 1,522 × 7
## country code year population crop yield_hectare n
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <int>
## 1 Africa <NA> 1961 290214016 wheat 0.693 11
## 2 Africa <NA> 1961 290214016 rice 1.55 11
## 3 Africa <NA> 1961 290214016 maize 1.04 11
## 4 Africa <NA> 1961 290214016 soybeans 0.376 11
## 5 Africa <NA> 1961 290214016 potatoes 8.18 11
## 6 Africa <NA> 1961 290214016 beans 0.587 11
## 7 Africa <NA> 1961 290214016 peas 0.679 11
## 8 Africa <NA> 1961 290214016 cassava 5.66 11
## 9 Africa <NA> 1961 290214016 barley 0.446 11
## 10 Africa <NA> 1961 290214016 cocoa_beans 0.254 11
## # … with 1,512 more rows
crop_yield_tidy %>%
count(country, sort = TRUE) %>%
filter(n >=500) # this gives a list of the countries with the most years + crops 11 (crops) * 51 (year) = 561
## # A tibble: 53 × 2
## country n
## <chr> <int>
## 1 Africa 638
## 2 Americas 638
## 3 Asia 638
## 4 Central America 638
## 5 Democratic Republic of Congo 638
## 6 Eastern Africa 638
## 7 Ecuador 638
## 8 Land Locked Developing Countries 638
## 9 Least Developed Countries 638
## 10 Low Income Food Deficit Countries 638
## # … with 43 more rows
crop_yield_tidy %>%
summarize(max(year)- min(year)) #57 years of recorded data
## # A tibble: 1 × 1
## `max(year) - min(year)`
## <dbl>
## 1 57
crop_yield_tidy %>% # 249 unique countries
distinct(country) %>%
nrow()
## [1] 249
Visual analysis
Let’s start by comparing crop yield by country for the United States to see how crop yields have changed over time. While we’re at it, let’s look at the most abundant crop yields overall.
# Potatoes!
us_yields <- crop_yield_tidy %>%
filter(country == 'United States') %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line() +
facet_wrap(~country, 'free_x') +
theme(plot.title.position = 'plot') +
labs(
title = 'Crop Yield USA',
y = 'Year',
x = 'tonnes per hectare per year'
)
us_yields
# plot of USA data only
crop_yield_tidy %>%
filter(country == 'United States') %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(yield_hectare, crop, fill = crop)) +
geom_col() +
facet_wrap(~country, scales = 'free_y') +
theme(legend.position = 'none',
plot.title.position = 'plot') +
labs(
title = 'Crop Yield USA',
y = 'Crop',
x = 'tonnes per hectare'
)
There are too many countries to add to a single plot, so we will likely need to decide how we want to subset the data. For this analysis, let’s use dplyr::slice_sample()
to randomly select a subset of the total countries. Unless a seed is set, the selection will be different each time we run the code chunk. To start, let’s choose 9 countries and visualize the change in crop yields over time.
# set.seed(1014) if you want to select the same 9 countries each time
# Randomly select 9 country names
crops <- crop_yield_tidy %>% distinct(crop) %>% pull()
country_random <- crop_yield_tidy %>%
select(country) %>%
distinct() %>%
slice_sample(n = 9) %>%
pull()
# Plot crop yield per hectare per year.
crop_yield_tidy %>%
filter(country %in% country_random,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line(size = 1, alpha = 0.5) +
facet_wrap(~country, scales = 'free_y') +
theme(plot.title.position = 'plot') +
labs(
y = 'yield per hectare',
title = 'Crop Yield'
)
crop_yield_tidy %>%
filter(country %in% top_pops,
crop == c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = reorder_within(crop, yield_hectare, country)) %>%
ggplot(aes(yield_hectare, crop, fill = crop)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~country, scales = 'free') +
theme(legend.position = 'none',
plot.title.position = 'plot') +
labs(
title = 'Crop Yield',
y = 'Crop',
x = 'yield per hectare'
)
crop_yield_tidy %>%
filter(country %in% top_pops,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line(size = 1, alpha = 0.5) +
theme(plot.title.position = 'plot') +
facet_wrap(~country, scales = 'free_y') +
labs(
y = 'yield per hectare',
title = 'Crop Yield'
)
Next, let’s make a regression model for each of the top 25 countries.
library(broom)
library(ggrepel)
many_models <- crop_yield_tidy %>%
drop_na() %>%
filter(country %in% top_pops,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
nest(mod_data = c(year, yield_hectare, population)) %>%
mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .))) %>%
mutate(coefs = map(model, tidy)) %>%
unnest(coefs)
Here’s what the data looks like after unnesting the coefficients.
many_models %>%
select(country, code, crop, term, estimate, std.error, statistic, p.value, everything()) %>%
head()
## # A tibble: 6 × 10
## country code crop term estimate std.error statistic p.value
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bangladesh BGD wheat (Inter… -4.02e+2 4.99e+1 -8.06 6.96e-11
## 2 Bangladesh BGD wheat popula… -8.05e-8 1.23e-8 -6.56 1.95e- 8
## 3 Bangladesh BGD wheat year 2.07e-1 2.57e-2 8.06 7.01e-11
## 4 Bangladesh BGD maize (Inter… 5.31e+2 2.63e+2 2.02 4.85e- 2
## 5 Bangladesh BGD maize popula… 1.90e-7 6.47e-8 2.94 4.84e- 3
## 6 Bangladesh BGD maize year -2.76e-1 1.36e-1 -2.03 4.71e- 2
## # … with 2 more variables: mod_data <list>, model <list>
Finally, let’s make the results more meaningful by creating a visualization to summarize the model results.
many_models %>%
filter(term == "year") %>%
mutate(p.value = p.adjust(p.value)) %>%
ggplot(aes(estimate, p.value, label = country)) +
geom_vline(xintercept = 0, lty = 2,
size = 1.5, alpha = 0.7, color = "gray50") +
geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
scale_y_log10() +
facet_wrap(~crop) +
geom_text_repel(size = 2.5, family = "Times New Roman")