Text Mining with transcripts of The Office
library(tidyverse)
library(tidytext)
theme_set(theme_light())
In this post we are going to explore the transcripts of The Office using tidytext, the text-analysis package by Julia Silge and David Robinson. Let’s start by loading the data from the schrute package and taking a quick look at it.
#install.packages('schrute')
library(schrute)
office_df <- theoffice
glimpse(office_df)
## Rows: 55,130
## Columns: 12
## $ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
## $ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
## $ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
## $ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
## $ text <chr> "All right Jim. Your quarterlies look very good. How …
## $ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
## $ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
## $ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
## $ air_date <fct> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005-…
office_df %>%
slice(1:3) %>%
knitr::kable()
index | season | episode | episode_name | director | writer | character | text | text_w_direction | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 |
2 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Oh, I told you. I couldn’t close it. So… | Oh, I told you. I couldn’t close it. So… | 7.6 | 3706 | 2005-03-24 |
3 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | 7.6 | 3706 | 2005-03-24 |
The data is in a tidy format where each column is a feature and each row is an observation. The text column contains each character’s lines across all nine seasons, and the data also includes the IMDB rating for each episode. Let’s start by visualizing the ratings for each season.
office_df %>%
ggplot(aes(season, imdb_rating, group = season, fill = factor(season))) +
geom_boxplot() +
theme(legend.position = 'none') +
labs(title = "IMDB rating by season",
x = "Season",
y = "IMDB rating") +
scale_x_continuous(breaks = 1:9)
I have only watched a few episodes of The Office, but I do know that Michael left in season 7. From the boxplots, we can see that ratings peak in seasons 3 and 4 and drop off in the last three seasons. Next, let’s consider the IMDB rating by episode.
office <- office_df %>%
select(season, episode, episode_name, imdb_rating, total_votes, air_date) %>%
mutate(air_date = as.Date(air_date)) %>%
distinct(., air_date, .keep_all = TRUE)
p <- office %>%
mutate(episode_num = row_number()) %>%
ggplot(aes(episode_num, imdb_rating, group = 1, color = factor(season))) +
geom_line() +
geom_smooth(se = FALSE) +
geom_point(aes(size = total_votes)) +
#geom_text(aes(label = episode_name), check_overlap = TRUE, hjust = 1) +
theme(legend.position = 'none') +
labs(title = 'Complete IMDB ratings for The Office',
subtitle = 'point size ~ # of ratings',
x = 'Episode',
y = 'IMDB rating') +
ggplot2::aes(text = paste0('Episode name: ', episode_name,
'\nIMDB rating: ', imdb_rating,
'\ntotal votes: ', format(total_votes, big.mark = ','),
'\nseason: ', season)
)
plotly::ggplotly(p, tooltip = 'text')
In the interactive plot above, we can see a more nuanced picture of episode ratings. The drop-off in popularity after Michael left is much more obvious.
For the NLP analysis, we will keep only the top 20 characters based on their respective line counts.
top_characters <- office_df %>%
count(character, sort = T) %>%
slice(1:20) %>%
mutate(character = fct_reorder(character, n))
top_characters %>%
ggplot(aes(n, character, fill = n)) +
geom_col() +
theme(legend.position = 'none')
Now we are ready to explore the text. Let’s start with unnest_tokens(), the gateway function for NLP analysis with the tidytext package. unnest_tokens() converts lines of text into individual or multi-word tokens. We will start by converting the text so that every word is an observation, then use ggplot2 to visualize the 8 most frequent words for each character. The tidytext package contains a data frame of stop words that can be used to filter out words that do not convey much meaning on their own, such as ‘the’, ‘and’, etc. I will also filter out character names and a few other filler words that are not included in the standard stop words.
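As a quick illustration (on two made-up lines, not the actual transcript), here is roughly how unnest_tokens() reshapes text into one token per row:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two invented lines, just to show the reshaping
toy <- tibble(
  character = c("Michael", "Jim"),
  text = c("Well well well, how the turntables",
           "Bears eat beets")
)

# lowercases, strips punctuation, and emits one word per row,
# keeping the other columns (here, character)
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words
```

The six words of the first line plus the three of the second give nine rows, each still tagged with the speaking character.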
junk_words <- c('yeah', 'hey', 'uh', 'um')
# pull the character column out as a plain vector and lowercase it to
# match the lowercase tokens produced by unnest_tokens()
names_blacklist <- top_characters %>%
pull(character) %>%
as.character() %>%
tolower()
words_df <- office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 8)
words_df %>%
ungroup() %>%
slice_head(n=3) %>%
knitr::kable()
character | word | n |
---|---|---|
Andy | gonna | 130 |
Andy | guys | 118 |
Andy | time | 102 |
words_df %>%
mutate(word = reorder_within(word, n, character)) %>%
ggplot(aes(x = word, y = n, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none")
The bind_tf_idf() function scores how characteristic each token is for each character. According to Wikipedia, ‘term frequency–inverse document frequency (tf–idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus’. The description goes on to say that variations of tf–idf are commonly used to rank search engine query results.
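To make the statistic concrete, here is a hand-checkable sketch on invented counts for two characters. A word that appears in every document gets an idf of log(2/2) = 0, so only character-specific words receive a positive tf–idf:

```r
library(tibble)
library(tidytext)

# Invented counts: "paper" is shared, "beets" and "prank" are each
# unique to one character (document)
toy_counts <- tibble(
  character = c("Dwight", "Dwight", "Jim", "Jim"),
  word      = c("paper",  "beets",  "paper", "prank"),
  n         = c(10L,       5L,       10L,     5L)
)

toy_tfidf <- bind_tf_idf(toy_counts, word, character, n)

toy_tfidf
# "paper": idf = log(2/2) = 0, so tf_idf = 0 for both characters
# "beets": tf = 5/15, idf = log(2/1), so tf_idf = (1/3) * log(2) ≈ 0.23
```

This is why the plots below surface catchphrases rather than generally common words: anything everyone says is zeroed out.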
words_df %>%
bind_tf_idf(word, character, n) %>%
group_by(character) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
ggplot(aes(x = reorder_within(word, tf_idf, character), y = tf_idf, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none") +
labs(x = "",
y = "TF-IDF of character-word pairs")
Sometimes a sequence of words conveys more meaning than the individual words. A token of n consecutive words is referred to as an n-gram. In the example below we will use trigram (three-word) tokenization.
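A toy example (invented sentence) shows how the sliding window works, and also exposes a gotcha: a line with fewer than three words produces a single NA token, which needs to be filtered out before counting:

```r
library(tibble)
library(tidytext)

toy <- tibble(text = c("bears eat beets bears beets battlestar",
                       "identity theft"))

toy_trigrams <- unnest_tokens(toy, word, text, token = "ngrams", n = 3)

toy_trigrams
# The 6-word line yields 4 overlapping trigrams ("bears eat beets",
# "eat beets bears", ...); the 2-word line yields a single NA token
```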
trigram_tokens <- office_df %>%
unnest_tokens('word', 'text', token = "ngrams", n = 3) %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
# lines shorter than three words tokenize to NA; the single-word stop
# word and name filters never match a three-word token, so they are
# dropped here
filter(!is.na(word)) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 8)
trigram_tokens %>%
ungroup() %>%
slice_head(n = 3) %>%
knitr::kable()
character | word | n |
---|---|---|
Andy | i don’t know | 42 |
Andy | no no no | 39 |
Andy | you know what | 26 |
trigram_tokens %>%
bind_tf_idf(word, character, n) %>%
group_by(character) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
ggplot(aes(x = reorder_within(word, tf_idf, character), y = tf_idf, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none") +
labs(x = "",
y = "TF-IDF of character-word pairs")
The overall sentiment of The Office can be explored using the get_sentiments() function, which returns a sentiment lexicon that we can join against our tokens. In this example, we are using the Bing lexicon to compare positive and negative sentiment over all 9 seasons of the show.
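The lexicon itself is just a two-column data frame (word, sentiment), so scoring is a join. A minimal sketch on three invented tokens, assuming the neutral word "paper" is absent from the Bing lexicon (neutral words simply drop out of an inner join):

```r
library(dplyr)
library(tibble)
library(tidytext)

toy <- tibble(word = c("beautiful", "awful", "paper"))

toy_scored <- toy %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_scored
# "beautiful" is tagged positive and "awful" negative; "paper" has no
# entry in the lexicon, so the inner join drops it
```

Making `by = "word"` explicit also suppresses dplyr’s join message, which keeps knitted output clean.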
office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
select(season, word) %>%
inner_join(get_sentiments(lexicon = "bing"), by = 'word') %>%
group_by(season) %>%
count(sentiment) %>%
ungroup() %>%
ggplot(aes(x = season, y = n, fill = sentiment, group = sentiment)) +
geom_area() +
scale_x_continuous(breaks = 1:9) +
labs(title = 'Area plot of positive and negative sentiments over seasons')
One last quick visual we might want is the word cloud. Word clouds use font size to emphasize the frequency of each word in the document. In the examples below, I will build a general word cloud as well as clouds for individual characters.
library(RColorBrewer)
library(wordcloud)
pal <- brewer.pal(8,"Dark2")
words_expanded <- office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 50)
Most common words in The Office
words_expanded %>%
ungroup() %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))
Most common words for Jim
words_expanded %>%
ungroup() %>%
filter(character == "Jim") %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))
Most common words for Pam
words_expanded %>%
ungroup() %>%
filter(character == "Pam") %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))