Text Mining with transcripts of The Office
library(tidyverse)
library(tidytext)
theme_set(theme_light())
In this post we are going to explore the transcripts of The Office using tidytext, the text-analysis package by Julia Silge and David Robinson. Let’s start by loading the data from the schrute package and taking a quick look at it.
#install.packages('schrute')
library(schrute)
office_df <- theoffice
glimpse(office_df)
## Rows: 55,130
## Columns: 12
## $ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
## $ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
## $ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
## $ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
## $ text <chr> "All right Jim. Your quarterlies look very good. How …
## $ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
## $ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
## $ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
## $ air_date <fct> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005-…
office_df %>%
slice(1:3) %>%
knitr::kable()
index | season | episode | episode_name | director | writer | character | text | text_w_direction | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 |
2 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Oh, I told you. I couldn’t close it. So… | Oh, I told you. I couldn’t close it. So… | 7.6 | 3706 | 2005-03-24 |
3 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | 7.6 | 3706 | 2005-03-24 |
The data is in a tidy format where each column is a feature and each row is an observation. The text column contains each character’s lines across all nine seasons, and the data also includes the IMDB rating for each episode. Let’s start by visualizing the ratings for each season.
office_df %>%
ggplot(aes(season, imdb_rating, group = season, fill = factor(season))) +
geom_boxplot() +
theme(legend.position = 'none') +
labs(title = "IMDB rating by season",
x = "Season",
y = "IMDB rating") +
scale_x_continuous(breaks = 1:9)
I have only watched a few episodes of The Office, but I do know that Michael left in season 7. From the boxplots, we can see that ratings peak in seasons 3 and 4 and drop off in the last three seasons. Next, let’s consider the IMDB rating by episode.
office <- office_df %>%
select(season, episode, episode_name, imdb_rating, total_votes, air_date) %>%
mutate(air_date = as.Date(air_date)) %>%
distinct(., air_date, .keep_all = TRUE)
p <- office %>%
mutate(episode_num = row_number()) %>%
ggplot(aes(episode_num, imdb_rating, group = 1, color = factor(season))) +
geom_line() +
geom_smooth(se = FALSE) +
geom_point(aes(size = total_votes)) +
#geom_text(aes(label = episode_name), check_overlap = TRUE, hjust = 1) +
theme(legend.position = 'none') +
labs(title = 'Complete IMDB ratings for The Office',
subtitle = 'point size ~ # of ratings',
x = 'Episode',
y = 'IMDB rating') +
ggplot2::aes(text = paste0('Episode name: ', episode_name,
'\nIMDB rating: ', imdb_rating,
'\ntotal votes: ', format(total_votes, big.mark = ','),
'\nseason: ', season)
)
plotly::ggplotly(p, tooltip = 'text')
In the interactive plot above, we can see a more nuanced picture of episode ratings. The drop-off in popularity after Michael left is much more obvious.
For the NLP analysis, we will keep only the top 20 characters based on their respective line counts.
top_characters <- office_df %>%
count(character, sort = T) %>%
slice(1:20) %>%
mutate(character = fct_reorder(character, n))
top_characters %>%
ggplot(aes(n, character, fill = n)) +
geom_col() +
theme(legend.position = 'none')
Now we are ready to explore the text. Let’s start with unnest_tokens(), the gateway function for NLP analysis with the tidytext package. unnest_tokens() converts lines of text into individual or multi-word tokens. We will start by converting the text so that every word is an observation, then use ggplot2 to visualize the 8 most frequent words for each character. The tidytext package contains a data frame of stop words that can be used to filter out words that do not convey much meaning on their own, such as ‘the’, ‘and’, etc. I will also filter out character names and a few other filler words that are not included in the standard stop words.
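As a quick illustration (on two made-up lines, not the actual transcript), here is roughly how unnest_tokens() reshapes text into one token per row:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two invented lines, just to show the reshaping
toy <- tibble(
  character = c("Michael", "Jim"),
  text = c("Well well well, how the turntables",
           "Bears eat beets")
)

# lowercases, strips punctuation, and emits one word per row,
# keeping the other columns (here, character)
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words
```

The six words of the first line plus the three of the second give nine rows, each still tagged with the speaking character.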
junk_words <- c('yeah', 'hey', 'uh', 'um')
# pull the character column out as a plain vector and lowercase it to
# match the lowercase tokens produced by unnest_tokens()
names_blacklist <- top_characters %>%
pull(character) %>%
as.character() %>%
tolower()
words_df <- office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 8)
words_df %>%
ungroup() %>%
slice_head(n=3) %>%
knitr::kable()
character | word | n |
---|---|---|
Andy | gonna | 130 |
Andy | guys | 118 |
Andy | time | 102 |
words_df %>%
mutate(word = reorder_within(word, n, character)) %>%
ggplot(aes(x = word, y = n, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none")
The bind_tf_idf() function scores how characteristic each token is for each character. According to Wikipedia, ‘term frequency–inverse document frequency (tf–idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus’. The description goes on to say that variations of tf–idf are commonly used to rank search engine query results.
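To make the statistic concrete, here is a hand-checkable sketch on invented counts for two characters. A word that appears in every document gets an idf of log(2/2) = 0, so only character-specific words receive a positive tf–idf:

```r
library(tibble)
library(tidytext)

# Invented counts: "paper" is shared, "beets" and "prank" are each
# unique to one character (document)
toy_counts <- tibble(
  character = c("Dwight", "Dwight", "Jim", "Jim"),
  word      = c("paper",  "beets",  "paper", "prank"),
  n         = c(10L,       5L,       10L,     5L)
)

toy_tfidf <- bind_tf_idf(toy_counts, word, character, n)

toy_tfidf
# "paper": idf = log(2/2) = 0, so tf_idf = 0 for both characters
# "beets": tf = 5/15, idf = log(2/1), so tf_idf = (1/3) * log(2) ≈ 0.23
```

This is why the plots below surface catchphrases rather than generally common words: anything everyone says is zeroed out.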
words_df %>%
bind_tf_idf(word, character, n) %>%
group_by(character) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
ggplot(aes(x = reorder_within(word, tf_idf, character), y = tf_idf, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none") +
labs(x = "",
y = "TF-IDF of character-word pairs")
Sometimes a sequence of words conveys more meaning than the individual words. A token of n consecutive words is referred to as an n-gram. In the example below we will use trigram (three-word) tokenization.
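A toy example (invented sentence) shows how the sliding window works, and also exposes a gotcha: a line with fewer than three words produces a single NA token, which needs to be filtered out before counting:

```r
library(tibble)
library(tidytext)

toy <- tibble(text = c("bears eat beets bears beets battlestar",
                       "identity theft"))

toy_trigrams <- unnest_tokens(toy, word, text, token = "ngrams", n = 3)

toy_trigrams
# The 6-word line yields 4 overlapping trigrams ("bears eat beets",
# "eat beets bears", ...); the 2-word line yields a single NA token
```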
trigram_tokens <- office_df %>%
unnest_tokens('word', 'text', token = "ngrams", n = 3) %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
# lines shorter than three words tokenize to NA; the single-word stop
# word and name filters never match a three-word token, so they are
# dropped here
filter(!is.na(word)) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 8)
trigram_tokens %>%
ungroup() %>%
slice_head(n = 3) %>%
knitr::kable()
character | word | n |
---|---|---|
Andy | i don’t know | 42 |
Andy | no no no | 39 |
Andy | you know what | 26 |
trigram_tokens %>%
bind_tf_idf(word, character, n) %>%
group_by(character) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
ggplot(aes(x = reorder_within(word, tf_idf, character), y = tf_idf, fill = character)) +
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~character, scales = "free") +
theme(legend.position = "none") +
labs(x = "",
y = "TF-IDF of character-word pairs")
The overall sentiment of The Office can be explored using the get_sentiments() function, which returns a sentiment lexicon that we can join against our tokens. In this example, we are using the Bing lexicon to compare positive and negative sentiment over all 9 seasons of the show.
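The lexicon itself is just a two-column data frame (word, sentiment), so scoring is a join. A minimal sketch on three invented tokens, assuming the neutral word "paper" is absent from the Bing lexicon (neutral words simply drop out of an inner join):

```r
library(dplyr)
library(tibble)
library(tidytext)

toy <- tibble(word = c("beautiful", "awful", "paper"))

toy_scored <- toy %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_scored
# "beautiful" is tagged positive and "awful" negative; "paper" has no
# entry in the lexicon, so the inner join drops it
```

Making `by = "word"` explicit also suppresses dplyr’s join message, which keeps knitted output clean.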
office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
select(season, word) %>%
inner_join(get_sentiments(lexicon = "bing"), by = 'word') %>%
group_by(season) %>%
count(sentiment) %>%
ungroup() %>%
ggplot(aes(x = season, y = n, fill = sentiment, group = sentiment)) +
geom_area() +
scale_x_continuous(breaks = 1:9) +
labs(title = 'Area plot of positive and negative sentiments over seasons')
One last quick visual we might want is the word cloud. Word clouds use font size to emphasize the frequency of each word in the document. In the examples below, I will build a general word cloud as well as clouds for individual characters.
library(RColorBrewer)
library(wordcloud)
pal <- brewer.pal(8,"Dark2")
words_expanded <- office_df %>%
unnest_tokens('word', 'text') %>%
inner_join(top_characters %>%
select(character), by = 'character') %>%
anti_join(stop_words, by = 'word') %>%
filter(!word %in% junk_words,
!word %in% names_blacklist) %>%
count(character, word) %>%
group_by(character) %>%
slice_max(n, n = 50)
Most common words in The Office
words_expanded %>%
ungroup() %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))
Most common words for Jim
words_expanded %>%
ungroup() %>%
filter(character == "Jim") %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))
Most common words for Pam
words_expanded %>%
ungroup() %>%
filter(character == "Pam") %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))