Hello Tidyverse and generating many linear regression models

Always a work in progress

Photo by Jesse Gardner on Unsplash

Intro

This post uses data from the R for Data Science (R4DS) online learning community. Each Tuesday, a dataset is posted and people share their insights and visualizations like fine wine. You can search Twitter for the hashtag #TidyTuesday to sample the vintage of the week. This week’s dataset (September 1st, 2020) comes from Our World in Data. There are actually five datasets, but we are going to focus on global crop yields.

After exploring the data, we will create a model that predicts yield per hectare from year and population (yield_hectare ~ population + year). The dataset is too big to model every country and crop, so we will choose the top 25 countries by population and select 4 common crops. Altogether, that produces 100 linear models (25 countries x 4 crops). Since that’s too much to digest in a table, we will present the model results as a labeled data visualization where each point represents a model’s adjusted p-value (y-axis) and estimate, or slope (x-axis), for a given country and crop.

Download and view the data

Let’s start by downloading the data from the R4DS Github account and taking a look at it. The tidytuesdayR package (by Ellis Hughes, co-host of the screencast ‘TidyX’) is a handy way to check out available datasets, explore the data dictionary in the RStudio viewer, and load the desired data (and more). Although tidytuesdayR is simple to use, I’m going to stick with readr::read_csv() to import the data as a tibble (Tidyverse lingo for a modern data frame object).
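For reference, loading the same release through tidytuesdayR would look roughly like this (a sketch, assuming the crop yields file keeps its key_crop_yields name in the release; the tt_data and crop_yields_tt names are just for illustration):

# Alternative: load the 2020-09-01 release with tidytuesdayR
library(tidytuesdayR)
tt_data <- tt_load("2020-09-01")          # downloads all five datasets for that week
crop_yields_tt <- tt_data$key_crop_yields # pull out the crop yields table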

# Load the packages used throughout this post
library(tidyverse)   # readr, dplyr, tidyr, ggplot2, purrr, forcats, stringr
library(knitr)       # kable() for printing tables
library(tidytext)    # reorder_within() and scale_y_reordered()

# Load the data with a readr function
crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv") %>% 
  janitor::clean_names() 

land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv") %>% 
  janitor::clean_names() %>% 
  select(entity, code, year, population = total_population_gapminder) %>% 
  mutate(year = as.numeric(year))

crop_yield <- crop_yields %>% 
  left_join(land_use, by = c('code',  'year', 'entity'))

skimr::skim(crop_yields)
Table 1: Data summary
Name crop_yields
Number of rows 13075
Number of columns 14
_______________________
Column type frequency:
character 2
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
entity 0 1.00 4 39 0 249 0
code 1919 0.85 3 8 0 214 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1990.37 16.73 1961.00 1976.00 1991.00 2005.00 2018.00 ▇▆▇▇▇
wheat_tonnes_per_hectare 4974 0.62 2.43 1.69 0.00 1.23 1.99 3.12 10.67 ▇▅▂▁▁
rice_tonnes_per_hectare 4604 0.65 3.16 1.85 0.20 1.77 2.74 4.16 10.68 ▇▇▃▁▁
maize_tonnes_per_hectare 2301 0.82 3.02 3.13 0.03 1.14 1.83 3.92 36.76 ▇▁▁▁▁
soybeans_tonnes_per_hectare 7114 0.46 1.45 0.75 0.00 0.86 1.33 1.90 5.95 ▇▇▂▁▁
potatoes_tonnes_per_hectare 3059 0.77 15.40 9.29 0.84 8.64 13.41 20.05 75.30 ▇▅▁▁▁
beans_tonnes_per_hectare 5066 0.61 1.09 0.82 0.03 0.59 0.83 1.35 9.18 ▇▁▁▁▁
peas_tonnes_per_hectare 6840 0.48 1.48 1.01 0.04 0.72 1.15 1.99 7.16 ▇▃▁▁▁
cassava_tonnes_per_hectare 5887 0.55 9.34 5.11 1.00 5.55 8.67 11.99 38.58 ▇▇▁▁▁
barley_tonnes_per_hectare 6342 0.51 2.23 1.50 0.09 1.05 1.88 3.02 9.15 ▇▆▂▁▁
cocoa_beans_tonnes_per_hectare 8466 0.35 0.39 0.28 0.00 0.24 0.36 0.49 3.43 ▇▁▁▁▁
bananas_tonnes_per_hectare 4166 0.68 15.20 12.08 0.66 5.94 11.78 20.79 77.59 ▇▃▁▁▁
# Here are a few other options for getting to know the data
# glimpse(crop_yields) 
# summary(crop_yields)
# View(crop_yields)

Tidy data: cleaning and reshaping

Let’s start with a little data cleaning! skim() from skimr provides a convenient way to get summary statistics and more. Looking at the output, we can see that there are missing values that need to be removed or imputed, but otherwise the data is clean. One thing we should consider is whether the data is in a tidy format, where each column is a variable and each row an observation. This format is important when we want to model the data, but sometimes it’s helpful to pivot the data into a long format for creating visualizations across multiple categories with ggplot2::facet_wrap() and related functions. We will use the handy tidyr function pivot_longer() to combine the crop columns into a single crop column, with their respective values in a yield column, in preparation for exploratory data analysis (EDA).
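To see what pivot_longer() does before we apply it to the real data, here is a minimal sketch on a made-up two-crop tibble:

# Toy example of pivot_longer(): crop columns become a name/value pair
toy <- tibble(
  country = c("A", "B"),
  wheat   = c(1.0, 2.0),
  maize   = c(3.0, 4.0)
)

toy %>% 
  pivot_longer(wheat:maize, names_to = "crop", values_to = "yield_hectare")
# Each row is now one country-crop observation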

# Use pivot_longer from tidyr to tidy the data.
crop_yield_tidy <- crop_yield %>% 
  rename_all(str_remove, "_tonnes.*") %>% 
  rename(country = entity) %>% 
  pivot_longer(wheat:bananas, names_to = 'crop', values_to = 'yield_hectare') %>% 
  drop_na(yield_hectare) 

# Create a vector of the 25 countries with the largest population
top_pops <- land_use  %>%
  filter(!is.na(code), entity != "World") %>% 
  group_by(entity) %>% 
  filter(year == max(year)) %>% 
  ungroup() %>% 
  slice_max(population, n = 25) %>% 
  pull(entity)


kable(head(crop_yield_tidy, n = 10))
country code year population crop yield_hectare
Afghanistan AFG 1961 9169000 wheat 1.0220
Afghanistan AFG 1961 9169000 rice 1.5190
Afghanistan AFG 1961 9169000 maize 1.4000
Afghanistan AFG 1961 9169000 potatoes 8.6667
Afghanistan AFG 1961 9169000 barley 1.0800
Afghanistan AFG 1962 9351000 wheat 0.9735
Afghanistan AFG 1962 9351000 rice 1.5190
Afghanistan AFG 1962 9351000 maize 1.4000
Afghanistan AFG 1962 9351000 potatoes 7.6667
Afghanistan AFG 1962 9351000 barley 1.0800

What can we learn from this dataset?

A great place to start any analysis is by counting the representation of different categories in the dataset with count() or add_count(). For instance, we might want to know the number of rows per country and year, which tells us how many crops were recorded for each country. It often makes sense to combine count() with a bar plot to visualize meaningful distributions.

EDA is often where I have the most fun in my workflow... plus, it’s arguably the most important step in a data analysis pipeline. Among other things, it builds an intuitive sense of what to expect when we generate machine learning models and evaluate their performance.
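As a quick illustration of the count-plus-bar-plot idea, here is one way to plot the number of recorded crop-year observations for a handful of countries (a sketch; the country list is just an example):

# count() feeding directly into a bar plot
crop_yield_tidy %>% 
  filter(country %in% c("United States", "India", "Brazil", "Albania")) %>% 
  count(country, crop) %>% 
  ggplot(aes(n, fct_reorder(country, n), fill = crop)) +
  geom_col() +
  labs(x = "Number of recorded crop-year observations", y = NULL)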

crop_yield_tidy %>%
  add_count(year, country, sort = TRUE) %>% 
  filter(country %in% c("Albania", "Africa", "United States")) # Africa has 11 crops, US has 9 and Albania 6. 
## # A tibble: 1,522 × 7
##    country code   year population crop        yield_hectare     n
##    <chr>   <chr> <dbl>      <dbl> <chr>               <dbl> <int>
##  1 Africa  <NA>   1961  290214016 wheat               0.693    11
##  2 Africa  <NA>   1961  290214016 rice                1.55     11
##  3 Africa  <NA>   1961  290214016 maize               1.04     11
##  4 Africa  <NA>   1961  290214016 soybeans            0.376    11
##  5 Africa  <NA>   1961  290214016 potatoes            8.18     11
##  6 Africa  <NA>   1961  290214016 beans               0.587    11
##  7 Africa  <NA>   1961  290214016 peas                0.679    11
##  8 Africa  <NA>   1961  290214016 cassava             5.66     11
##  9 Africa  <NA>   1961  290214016 barley              0.446    11
## 10 Africa  <NA>   1961  290214016 cocoa_beans         0.254    11
## # … with 1,512 more rows
crop_yield_tidy %>%
  count(country, sort = TRUE) %>%
  filter(n >= 500) # countries with the most recorded crop-years: 11 (crops) * 58 (years) = 638
## # A tibble: 53 × 2
##    country                               n
##    <chr>                             <int>
##  1 Africa                              638
##  2 Americas                            638
##  3 Asia                                638
##  4 Central America                     638
##  5 Democratic Republic of Congo        638
##  6 Eastern Africa                      638
##  7 Ecuador                             638
##  8 Land Locked Developing Countries    638
##  9 Least Developed Countries           638
## 10 Low Income Food Deficit Countries   638
## # … with 43 more rows
crop_yield_tidy %>%
  summarize(max(year)- min(year)) #57 years of recorded data
## # A tibble: 1 × 1
##   `max(year) - min(year)`
##                     <dbl>
## 1                      57
crop_yield_tidy %>% # 249 unique countries
  distinct(country) %>% 
  nrow()
## [1] 249

Visual analysis

Let’s start by plotting crop yields for the United States to see how they have changed over time. While we’re at it, let’s also see which US crops have the highest yields overall.

# Potatoes!
us_yields <- crop_yield_tidy %>% 
  filter(country == 'United States') %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line() +
  facet_wrap(~country, scales = 'free_x') +
  theme(plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield USA',
    x = 'Year',
    y = 'Tonnes per hectare'
  )

us_yields

# plot of USA data only
crop_yield_tidy %>% 
  filter(country == 'United States') %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>%
  ggplot(aes(yield_hectare, crop, fill = crop)) +
  geom_col()  +
  facet_wrap(~country, scales = 'free_y') +
  theme(legend.position =  'none',
        plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield USA',
    y = 'Crop',
    x = 'tonnes per hectare'
  )

There are too many countries to show in a single plot, so we need to decide how to subset the data. For this analysis, let’s use dplyr::slice_sample() to randomly select a subset of countries. Unless a seed is set, the selection will be different each time we run the code chunk. To start, let’s choose 9 random countries and visualize the change in crop yields over time.

# set.seed(1014) if you want to select the same 9 countries each time

# Randomly select 9 country names
crops <- crop_yield_tidy %>% distinct(crop) %>%  pull()

country_random <- crop_yield_tidy %>% 
  select(country) %>% 
  distinct() %>% 
  slice_sample(n = 9) %>% 
  pull()

# Plot crop yield per hectare per year.
crop_yield_tidy %>% 
  filter(country %in% country_random,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line(size = 1, alpha = 0.5) +
  facet_wrap(~country, scales = 'free_y') +
  theme(plot.title.position = 'plot') +
  labs(
     y = 'yield per hectare',
    title = 'Crop Yield'
  )

crop_yield_tidy %>% 
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = reorder_within(crop, yield_hectare, country)) %>%
  ggplot(aes(yield_hectare, crop, fill = crop)) +
  geom_col()  +
  scale_y_reordered() +
  facet_wrap(~country, scales = 'free') +
  theme(legend.position = 'none',
        plot.title.position = 'plot') +
  labs(
    title = 'Crop Yield',
    y = 'Crop',
    x = 'yield per hectare'
  )

crop_yield_tidy %>% 
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>% 
  mutate(crop = fct_reorder(crop, yield_hectare)) %>% 
  ggplot(aes(year, yield_hectare, color = crop)) +
  geom_line(size = 1, alpha = 0.5) +
  theme(plot.title.position = 'plot') +
  facet_wrap(~country, scales = 'free_y') +
  labs(
     y = 'yield per hectare',
    title = 'Crop Yield'
  )

Next, let’s fit a regression model for each combination of the top 25 countries and our 4 chosen crops, 100 models in total.
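Before scaling up, it can help to see what a single model and its tidied coefficients look like. Here is a sketch using the tidy data frame from above, with United States wheat chosen arbitrarily (us_wheat and us_wheat_mod are just illustrative names):

# One model as a preview of what we'll replicate for every country-crop pair
us_wheat <- crop_yield_tidy %>% 
  filter(country == "United States", crop == "wheat")

us_wheat_mod <- lm(yield_hectare ~ population + year, data = us_wheat)

broom::tidy(us_wheat_mod)   # one row per term: estimate, std.error, statistic, p.value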

library(broom)
library(ggrepel)

many_models <- crop_yield_tidy %>%
  drop_na() %>%
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
  # One nested data frame per country-crop combination
  nest(mod_data = c(year, yield_hectare, population)) %>%
  # Fit the same linear model to each nested data frame
  mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .))) %>%
  # Tidy the coefficients and return one row per model term
  mutate(coefs = map(model, tidy)) %>%
  unnest(coefs)

Here’s what the data looks like after unnesting the coefficients.

many_models %>% 
  select(country, code, crop, term, estimate, std.error, statistic, p.value, everything()) %>% 
  head()
## # A tibble: 6 × 10
##   country    code  crop  term    estimate std.error statistic  p.value
##   <chr>      <chr> <chr> <chr>      <dbl>     <dbl>     <dbl>    <dbl>
## 1 Bangladesh BGD   wheat (Inter… -4.02e+2   4.99e+1     -8.06 6.96e-11
## 2 Bangladesh BGD   wheat popula… -8.05e-8   1.23e-8     -6.56 1.95e- 8
## 3 Bangladesh BGD   wheat year     2.07e-1   2.57e-2      8.06 7.01e-11
## 4 Bangladesh BGD   maize (Inter…  5.31e+2   2.63e+2      2.02 4.85e- 2
## 5 Bangladesh BGD   maize popula…  1.90e-7   6.47e-8      2.94 4.84e- 3
## 6 Bangladesh BGD   maize year    -2.76e-1   1.36e-1     -2.03 4.71e- 2
## # … with 2 more variables: mod_data <list>, model <list>
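The same nest-and-map pattern also works for model-level summaries. If you want adjusted R-squared or an overall model p-value for each country-crop pair, one option (a sketch, swapping broom::tidy() for broom::glance()) looks like this:

# Optional: model-level summaries (R-squared, adj. R-squared, etc.) via glance()
model_fits <- crop_yield_tidy %>%
  drop_na() %>%
  filter(country %in% top_pops,
         crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
  nest(mod_data = c(year, yield_hectare, population)) %>%
  mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .)),
         fit_stats = map(model, glance)) %>%
  unnest(fit_stats)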

Finally, let’s make the results more meaningful by summarizing the model results in a visualization. Since we are testing many models at once, we will adjust the p-values for multiple comparisons before plotting.
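The adjustment is done with base R’s p.adjust(), which defaults to the Holm correction. Here is a tiny standalone illustration with made-up p-values:

# p.adjust() raises p-values to control for multiple testing (Holm by default)
p_raw <- c(0.001, 0.01, 0.04, 0.20)
p.adjust(p_raw)                  # Holm-adjusted
p.adjust(p_raw, method = "BH")   # Benjamini-Hochberg, if you prefer FDR control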

many_models %>%
  filter(term == "year") %>%
  mutate(p.value = p.adjust(p.value)) %>% 
  ggplot(aes(estimate, p.value, label = country)) +
  geom_vline(xintercept = 0, lty = 2, 
               size = 1.5, alpha = 0.7, color = "gray50") +
  geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
  scale_y_log10() +
  facet_wrap(~crop) +
  geom_text_repel(size = 2.5, family = "Times New Roman") 

Gabriel Mednick, PhD

Data scientist, biochemist, computational biologist, educator, life enthusiast