Hello Tidyverse and generating many linear regression models

Always a work in progress

Photo by Jesse Gardner on Unsplash

Intro

This post uses data from the R for Data Science (R4DS) online learning community. Each Tuesday, a dataset is posted and people share their insights and visualizations like fine wine. You can search Twitter for the hashtag #TidyTuesday to sample the vintage of the week. This week’s dataset (September 1st, 2020) comes from Our World in Data. There are actually five datasets, but we are going to focus on global crop yields.

After exploring the data, we will create a model that predicts yield per hectare from year and population (yield_hectare ~ population + year). The dataset is too big to model every country and crop, so we will choose the top 25 countries by population and select 4 common crops. Altogether, that produces 100 linear models (25 countries x 4 crops). Since that’s too much to digest in a table, we will present the model results as a labeled data visualization where each point represents a model’s adjusted p-value (y-axis) and estimate, or slope (x-axis), for a given country and crop.

Download and view the data

Let’s start by downloading the data from the R4DS Github account and taking a look at it. The tidytuesdayR package (by Ellis Hughes, co-host of the screencast ‘TidyX’) is a handy way to check out available datasets, explore the data dictionary in the RStudio viewer, and load the desired data (and more). Although tidytuesdayR is simple to use, I’m going to stick with readr::read_csv() to import the data as a tibble (Tidyverse lingo for a modern data frame object).
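For reference, loading the same release through tidytuesdayR would look roughly like this (a sketch, assuming the crop yields file keeps its key_crop_yields name in the release; the tt_data and crop_yields_tt names are just for illustration):

# Alternative: load the 2020-09-01 release with tidytuesdayR
library(tidytuesdayR)
tt_data <- tt_load("2020-09-01")          # downloads all five datasets for that week
crop_yields_tt <- tt_data$key_crop_yields # pull out the crop yields table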

# Load the packages used throughout this post
library(tidyverse)   # readr, dplyr, tidyr, ggplot2, purrr, forcats, stringr
library(knitr)       # kable() for printing tables
library(tidytext)    # reorder_within() and scale_y_reordered()

# Load the data with a readr function
crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv") %>% 
  janitor::clean_names() 

land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv") %>% 
  janitor::clean_names() %>% 
  select(entity, code, year, population = total_population_gapminder) %>% 
  mutate(year = as.numeric(year))

crop_yield <- crop_yields %>% 
  left_join(land_use, by = c('code',  'year', 'entity'))

skimr::skim(crop_yields)
Table 1: Data summary
Name crop_yields
Number of rows 13075
Number of columns 14
_______________________
Column type frequency:
character 2
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
entity 0 1.00 4 39 0 249 0
code 1919 0.85 3 8 0 214 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1990.37 16.73 1961.00 1976.00 1991.00 2005.00 2018.00 ▇▆▇▇▇
wheat_tonnes_per_hectare 4974 0.62 2.43 1.69 0.00 1.23 1.99 3.12 10.67 ▇▅▂▁▁
rice_tonnes_per_hectare 4604 0.65 3.16 1.85 0.20 1.77 2.74 4.16 10.68 ▇▇▃▁▁
maize_tonnes_per_hectare 2301 0.82 3.02 3.13 0.03 1.14 1.83 3.92 36.76 ▇▁▁▁▁
soybeans_tonnes_per_hectare 7114 0.46 1.45 0.75 0.00 0.86 1.33 1.90 5.95 ▇▇▂▁▁
potatoes_tonnes_per_hectare 3059 0.77 15.40 9.29 0.84 8.64 13.41 20.05 75.30 ▇▅▁▁▁
beans_tonnes_per_hectare 5066 0.61 1.09 0.82 0.03 0.59 0.83 1.35 9.18 ▇▁▁▁▁
peas_tonnes_per_hectare 6840 0.48 1.48 1.01 0.04 0.72 1.15 1.99 7.16 ▇▃▁▁▁
cassava_tonnes_per_hectare 5887 0.55 9.34 5.11 1.00 5.55 8.67 11.99 38.58 ▇▇▁▁▁
barley_tonnes_per_hectare 6342 0.51 2.23 1.50 0.09 1.05 1.88 3.02 9.15 ▇▆▂▁▁
cocoa_beans_tonnes_per_hectare 8466 0.35 0.39 0.28 0.00 0.24 0.36 0.49 3.43 ▇▁▁▁▁
bananas_tonnes_per_hectare 4166 0.68 15.20 12.08 0.66 5.94 11.78 20.79 77.59 ▇▃▁▁▁
# Here are a few other options for getting to know the data
# glimpse(crop_yields) 
# summary(crop_yields)
# View(crop_yields)

Tidy data: cleaning and reshaping

Let’s start with a little data cleaning! skim() from skimr provides a convenient way to get summary statistics and more. Looking at the output, we can see that there are missing values that need to be removed or imputed, but otherwise the data is clean. One thing we should consider is whether the data is in a tidy format, where each column is a variable and each row an observation. This format is important when we want to model the data, but sometimes it’s helpful to pivot the data into a long format for creating visualizations across multiple categories with ggplot2::facet_wrap() and related functions. We will use the handy tidyr function pivot_longer() to combine the crop columns into a single crop column, with their respective values in a yield column, in preparation for exploratory data analysis (EDA).
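To see what pivot_longer() does before we apply it to the real data, here is a minimal sketch on a made-up two-crop tibble:

# Toy example of pivot_longer(): crop columns become a name/value pair
toy <- tibble(
  country = c("A", "B"),
  wheat   = c(1.0, 2.0),
  maize   = c(3.0, 4.0)
)

toy %>% 
  pivot_longer(wheat:maize, names_to = "crop", values_to = "yield_hectare")
# Each row is now one country-crop observation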

# Use pivot_longer from tidyr to tidy the data.
crop_yield_tidy <- crop_yield %>% 
  rename_all(str_remove, "_tonnes.*") %>% 
  rename(country = entity) %>% 
  pivot_longer(wheat:bananas, names_to = 'crop', values_to = 'yield_hectare') %>% 
  drop_na(yield_hectare) 

# Create a vector of the 25 countries with the largest population
top_pops <- land_use  %>%
  filter(!is.na(code), entity != "World") %>% 
  group_by(entity) %>% 
  filter(year == max(year)) %>% 
  ungroup() %>% 
  slice_max(population, n = 25) %>% 
  pull(entity)


kable(head(crop_yield_tidy, n = 10))
country code year population crop yield_hectare
Afghanistan AFG 1961 9169000 wheat 1.0220
Afghanistan AFG 1961 9169000 rice 1.5190
Afghanistan AFG 1961 9169000 maize 1.4000
Afghanistan AFG 1961 9169000 potatoes 8.6667
Afghanistan AFG 1961 9169000 barley 1.0800
Afghanistan AFG 1962 9351000 wheat 0.9735
Afghanistan AFG 1962 9351000 rice 1.5190
Afghanistan AFG 1962 9351000 maize 1.4000
Afghanistan AFG 1962 9351000 potatoes 7.6667
Afghanistan AFG 1962 9351000 barley 1.0800

What can we learn from this dataset?

A great place to start any analysis is by counting the representation of different categories in the dataset with count() or add_count(). For instance, we might want to know the number of rows per country and year, which tells us how many crops were recorded for each country. It often makes sense to combine count() with a bar plot to visualize meaningful distributions.

EDA is often where I have the most fun in my workflow... plus, it’s arguably the most important step in a data analysis pipeline. Among other things, it builds an intuitive sense of what to expect when we generate machine learning models and evaluate their performance.
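As a quick illustration of the count-plus-bar-plot idea, here is one way to plot the number of recorded crop-year observations for a handful of countries (a sketch; the country list is just an example):

# count() feeding directly into a bar plot
crop_yield_tidy %>% 
  filter(country %in% c("United States", "India", "Brazil", "Albania")) %>% 
  count(country, crop) %>% 
  ggplot(aes(n, fct_reorder(country, n), fill = crop)) +
  geom_col() +
  labs(x = "Number of recorded crop-year observations", y = NULL)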

crop_yield_tidy %>%
  add_count(year, country, sort = TRUE) %>% 
  filter(country %in% c("Albania", "Africa", "United States")) # Africa has 11 crops, US has 9 and Albania 6. 
## # A tibble: 1,522 × 7
##    country code   year population crop        yield_hectare     n
##    <chr>   <chr> <dbl>      <dbl> <chr>               <dbl> <int>
##  1 Africa  <NA>   1961  290214016 wheat               0.693    11
##  2 Africa  <NA>   1961  290214016 rice                1.55     11
##  3 Africa  <NA>   1961  290214016 maize               1.04     11
##  4 Africa  <NA>   1961  290214016 soybeans            0.376    11
##  5 Africa  <NA>   1961  290214016 potatoes            8.18     11
##  6 Africa  <NA>   1961  290214016 beans               0.587    11
##  7 Africa  <NA>   1961  290214016 peas                0.679    11
##  8 Africa  <NA>   1961  290214016 cassava             5.66     11
##  9 Africa  <NA>   1961  290214016 barley              0.446    11
## 10 Africa  <NA>   1961  290214016 cocoa_beans         0.254    11
## # … with 1,512 more rows
crop_yield_tidy %>%
  count(country, sort = TRUE) %>%
  filter(n >= 500) # countries with the most recorded crop-years: 11 (crops) * 58 (years) = 638
## # A tibble: 53 × 2
##    country                               n
##    <chr>                             <int>
##  1 Africa                              638
##  2 Americas                            638
##  3 Asia                                638
##  4 Central America                     638
##  5 Democratic Republic of Congo        638
##  6 Eastern Africa                      638
##  7 Ecuador                             638
##  8 Land Locked Developing Countries    638
##  9 Least Developed Countries           638
## 10 Low Income Food Deficit Countries   638
## # … with 43 more rows
crop_yield_tidy %>%
  summarize(max(year)- min(year)) #57 years of recorded data
## # A tibble: 1 × 1
##   `max(year) - min(year)`
##                     <dbl>
## 1                      57
crop_yield_tidy %>% # 249 unique countries
  distinct(country) %>% 
  nrow()
## [1] 249

Visual analysis

Let’s start by plotting crop yields for the United States to see how they have changed over time. While we’re at it, let’s also see which US crops have the highest yields overall.

# Potatoes!
us_yields <- crop_yield_tidy %>% 
  filter(country == 'United States') %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line() +
  facet_wrap(~country, scales = 'free_x') +
  theme(plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield USA',
    x = 'Year',
    y = 'Tonnes per hectare'
  )

us_yields

# plot of USA data only
crop_yield_tidy %>% 
  filter(country == 'United States') %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>%
  ggplot(aes(yield_hectare, crop, fill = crop)) +
  geom_col()  +
  facet_wrap(~country, scales = 'free_y') +
  theme(legend.position =  'none',
        plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield USA',
    y = 'Crop',
    x = 'tonnes per hectare'
  )

There are too many countries to show in a single plot, so we need to decide how to subset the data. For this analysis, let’s use dplyr::slice_sample() to randomly select a subset of countries. Unless a seed is set, the selection will be different each time we run the code chunk. To start, let’s choose 9 random countries and visualize the change in crop yields over time.

# set.seed(1014) if you want to select the same 9 countries each time

# Randomly select 9 country names
crops <- crop_yield_tidy %>% distinct(crop) %>%  pull()

country_random <- crop_yield_tidy %>% 
  select(country) %>% 
  distinct() %>% 
  slice_sample(n = 9) %>% 
  pull()

# Plot crop yield per hectare per year.
crop_yield_tidy %>% 
  filter(country %in% country_random,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line(size = 1, alpha = 0.5) +
  facet_wrap(~country, scales = 'free_y') +
  theme(plot.title.position = 'plot') +
  labs(
     y = 'yield per hectare',
    title = 'Crop Yield'
  )

crop_yield_tidy %>% 
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = reorder_within(crop, yield_hectare, country)) %>%
  ggplot(aes(yield_hectare, crop, fill = crop)) +
  geom_col()  +
  scale_y_reordered() +
  facet_wrap(~country, scales = 'free') +
  theme(legend.position = 'none',
        plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield',
    y = 'Crop',
    x = 'yield per hectare'
  )

crop_yield_tidy %>% 
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line(size = 1, alpha = 0.5) +
  theme(plot.title.position = 'plot') +
  facet_wrap(~country, scales = 'free_y') +
  labs(
     y = 'yield per hectare',
    title = 'Crop Yield'
  )

Next, let’s fit a regression model for each combination of the top 25 countries and our 4 chosen crops, 100 models in total.
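Before scaling up, it can help to see what a single model and its tidied coefficients look like. Here is a sketch using the tidy data frame from above, with United States wheat chosen arbitrarily (us_wheat and us_wheat_mod are just illustrative names):

# One model as a preview of what we'll replicate for every country-crop pair
us_wheat <- crop_yield_tidy %>% 
  filter(country == "United States", crop == "wheat")

us_wheat_mod <- lm(yield_hectare ~ population + year, data = us_wheat)

broom::tidy(us_wheat_mod)   # one row per term: estimate, std.error, statistic, p.value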

library(broom)
library(ggrepel)

many_models <- crop_yield_tidy %>%
  drop_na() %>%
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
  # One nested data frame per country-crop combination
  nest(mod_data = c(year, yield_hectare, population)) %>%
  # Fit the same linear model to each nested data frame
  mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .))) %>%
  # Tidy the coefficients and return one row per model term
  mutate(coefs = map(model, tidy)) %>%
  unnest(coefs)

Here’s what the data looks like after unnesting the coefficients.

many_models %>% 
  select(country, code, crop, term, estimate, std.error, statistic, p.value, everything()) %>% 
  head()
## # A tibble: 6 × 10
##   country    code  crop  term    estimate std.error statistic  p.value
##   <chr>      <chr> <chr> <chr>      <dbl>     <dbl>     <dbl>    <dbl>
## 1 Bangladesh BGD   wheat (Inter… -4.02e+2   4.99e+1     -8.06 6.96e-11
## 2 Bangladesh BGD   wheat popula… -8.05e-8   1.23e-8     -6.56 1.95e- 8
## 3 Bangladesh BGD   wheat year     2.07e-1   2.57e-2      8.06 7.01e-11
## 4 Bangladesh BGD   maize (Inter…  5.31e+2   2.63e+2      2.02 4.85e- 2
## 5 Bangladesh BGD   maize popula…  1.90e-7   6.47e-8      2.94 4.84e- 3
## 6 Bangladesh BGD   maize year    -2.76e-1   1.36e-1     -2.03 4.71e- 2
## # … with 2 more variables: mod_data <list>, model <list>
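The same nest-and-map pattern also works for model-level summaries. If you want adjusted R-squared or an overall model p-value for each country-crop pair, one option (a sketch, swapping broom::tidy() for broom::glance()) looks like this:

# Optional: model-level summaries (R-squared, adj. R-squared, etc.) via glance()
model_fits <- crop_yield_tidy %>%
  drop_na() %>%
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
  nest(mod_data = c(year, yield_hectare, population)) %>%
  mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .)),
         fit_stats = map(model, glance)) %>%
  unnest(fit_stats)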

Finally, let’s make the results more meaningful by summarizing the model results in a visualization. Since we are testing many models at once, we will adjust the p-values for multiple comparisons before plotting.
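The adjustment is done with base R’s p.adjust(), which defaults to the Holm correction. Here is a tiny standalone illustration with made-up p-values:

# p.adjust() raises p-values to control for multiple testing (Holm by default)
p_raw <- c(0.001, 0.01, 0.04, 0.20)
p.adjust(p_raw)                  # Holm-adjusted
p.adjust(p_raw, method = "BH")   # Benjamini-Hochberg, if you prefer FDR control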

many_models %>%
  filter(term == "year") %>%
  mutate(p.value = p.adjust(p.value)) %>% 
  ggplot(aes(estimate, p.value, label = country)) +
  geom_vline(xintercept = 0, lty = 2, 
               size = 1.5, alpha = 0.7, color = "gray50") +
  geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
  scale_y_log10() +
  facet_wrap(~crop) +
  geom_text_repel(size = 2.5, family = "Times New Roman") 

Gabriel Mednick, PhD

Data scientist, biochemist, computational biologist, educator, life enthusiast