Rick and Morty and Tidy Data Principles (Part 1)
Updated 2022-05-28: I moved the blog to Quarto, so I had to update the paths. I am also no longer using pacman; I now load the libraries in the classic way.
Motivation
After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty I thought about doing something similar but reproducible and focused on Rick and Morty.
In this post I’ll focus on the Tidy Data principles. However, here is the Github repo with the scripts to scrape the transcripts and subtitles of Rick and Morty.
Here I’m using the subtitles of the TV show, as some of the transcripts I could scrape were incomplete.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
Let’s scrape
The subtools package returns a data frame after reading srt files. In addition to that resulting data frame, I wanted to explicitly record the season and episode of each line of the subtitles. To do that I had to scrape the subtitles and then use str_replace_all. To follow the steps, clone the repo from Github:

git clone https://github.com/pachadotdev/rick_and_morty_tidy_text
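The scraping scripts themselves live in the repo above. As a purely hypothetical sketch of the idea (read_srt() is a stand-in for the subtools reading step, and the S01E01.srt file-name pattern is an assumption on my part), the season and episode columns can be derived from the file names with str_replace_all:

library(dplyr)
library(stringr)
library(purrr)

# list the downloaded srt files, e.g. subs/S01E01.srt, subs/S01E02.srt, ...
srt_files <- list.files("subs", pattern = "\\.srt$", full.names = TRUE)

subs_tagged <- map_df(srt_files, function(f) {
  read_srt(f) %>%  # placeholder for reading one srt file with subtools
    mutate(
      season  = str_replace_all(basename(f), "^(S\\d{2}).*$", "\\1"),
      episode = str_replace_all(basename(f), "^S\\d{2}(E\\d{2}).*$", "\\1")
    )
})
# the repo's scripts save a tidy table like this as rick_and_morty_subs.csv,
# which is what I read below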
Rick and Morty Can Be So Tidy
After reading the tidy file I created when scraping the subtitles, I use unnest_tokens to split the subtitles into words. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
library(data.table)
library(tidyr)
library(tidytext)
library(dplyr)
library(ggplot2)
library(viridis)
library(ggstance)
library(igraph)
library(ggraph)
library(widyr)
rick_and_morty_subs <- as_tibble(fread("rick_and_morty_subs.csv")) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

rick_and_morty_subs_tidy <- rick_and_morty_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
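As a quick aside (not in the original post), the same unnest_tokens call accepts other tokenizers, for example bigrams or sentences:

# split into bigrams (pairs of consecutive words) instead of single words
rick_and_morty_subs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# split into sentences
rick_and_morty_subs %>%
  unnest_tokens(sentence, text, token = "sentences")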
The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join to remove words such as “a”, “an” or “the”.
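For context (also not in the original post), stop_words is a data frame bundled with tidytext, so anti_join drops every subtitle word that appears in it:

# the three example words above are indeed listed as stop words
stop_words %>%
  filter(word %in% c("a", "an", "the"))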
Then we can use count to find the most common words across all of the Rick and Morty episodes as a whole.
rick_and_morty_subs_tidy %>%
  count(word, sort = TRUE)
# A tibble: 8,032 × 2
word n
<chr> <int>
1 morty 1890
2 rick 1669
3 jerry 645
4 yeah 475
5 gonna 418
6 summer 405
7 hey 386
8 uh 327
9 time 313
10 beth 301
# ℹ 8,022 more rows
Sentiment analysis can be done as an inner join. There is one sentiment lexicon in the tidytext package. Let’s examine how sentiment changes during each season by counting the number of positive and negative words in the episodes of each season. In the chunk below, index = linenumber %/% 50 groups the subtitle lines into chunks of 50, so each index covers roughly the same amount of dialogue.
rick_and_morty_sentiment <- rick_and_morty_subs_tidy %>%
  inner_join(sentiments) %>%
  count(episode_name, index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(rick_and_morty_subs_tidy[, c("episode_name", "season", "episode")] %>% distinct()) %>%
  arrange(season, episode) %>%
  mutate(episode_name = paste(season, episode, "-", episode_name),
         season = factor(season, labels = c("Season 1", "Season 2", "Season 3"))) %>%
  select(episode_name, season, everything(), -episode)
rick_and_morty_sentiment
# A tibble: 438 × 6
episode_name season index negative positive sentiment
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 S01 E01 - Pilot Season 1 0 6 3 -3
2 S01 E01 - Pilot Season 1 1 10 0 -10
3 S01 E01 - Pilot Season 1 2 3 1 -2
4 S01 E01 - Pilot Season 1 3 10 4 -6
5 S01 E01 - Pilot Season 1 4 2 5 3
6 S01 E01 - Pilot Season 1 5 8 4 -4
7 S01 E01 - Pilot Season 1 6 6 1 -5
8 S01 E01 - Pilot Season 1 7 7 4 -3
9 S01 E01 - Pilot Season 1 8 14 5 -9
10 S01 E01 - Pilot Season 1 9 3 2 -1
# ℹ 428 more rows
Now we can plot these sentiment scores across the plot trajectory of each season. In the second plot I’m only showing Dan Harmon’s favourite episodes, given that at the moment the show has 31 episodes in total.
ggplot(rick_and_morty_sentiment, aes(index, sentiment, fill = season)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~season, nrow = 3, scales = "free_x", dir = "v") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment in Rick and Morty",
       y = "Sentiment") +
  scale_fill_viridis(end = 0.75, discrete = TRUE) +
  scale_x_discrete(expand = c(0.02, 0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic")) +
  theme(axis.title.x = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.x = element_blank())
rick_and_morty_sentiment_favourites <- rick_and_morty_sentiment %>%
  filter(grepl("S03 E03|S03 E07|S01 E06|S02 E03|S02 E07", episode_name))
ggplot(rick_and_morty_sentiment_favourites, aes(index, sentiment, fill = season)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~episode_name, ncol = 3, scales = "free_x", dir = "h") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment in Rick and Morty\n(Creator's favourite episodes)",
       y = "Sentiment") +
  scale_fill_viridis(end = 0.75, discrete = TRUE) +
  scale_x_discrete(expand = c(0.02, 0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic")) +
  theme(axis.title.x = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.x = element_blank())
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that “I am not having a good day” is a negative sentence, not a positive one, because of negation.
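As a small illustration (not part of the original post) of why plain unigrams miss negation: joining that sentence word by word against the lexicon only matches “good”, which is tagged as positive.

# unigram sentiment join on a negated sentence: only "good" matches
tibble(text = "I am not having a good day") %>%
  unnest_tokens(word, text) %>%
  inner_join(sentiments, by = "word")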
rick_and_morty_sentences <- rick_and_morty_subs %>%
  group_by(season) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  ungroup()
Let’s look at just one.
rick_and_morty_sentences$sentence[99]
[1] "ooh!"
We can use tidy text analysis to ask questions such as: what are the most negative episodes in each of Rick and Morty’s seasons? First, let’s get the list of negative words from the lexicon. Second, let’s make a data frame of how many words are in each episode so we can normalize for the length of episodes. Then, let’s find the number of negative words in each episode and divide by the total words in each episode. Which episode has the highest proportion of negative words?
sentiment_negative <- sentiments %>%
  filter(sentiment == "negative")

wordcounts <- rick_and_morty_subs_tidy %>%
  group_by(season, episode) %>%
  summarize(words = n())
rick_and_morty_subs_tidy %>%
  semi_join(sentiment_negative) %>%
  group_by(season, episode) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("season", "episode")) %>%
  mutate(ratio = negativewords / words) %>%
  top_n(1)
# A tibble: 3 × 5
# Groups: season [3]
season episode negativewords words ratio
<chr> <chr> <int> <int> <dbl>
1 S01 E07 131 1220 0.107
2 S02 E01 184 1386 0.133
3 S03 E06 192 1435 0.134
Networks of Words
A useful function in the widyr package is pairwise_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of the first season.
rick_and_morty_words <- rick_and_morty_subs_tidy %>%
  filter(season == "S01")

word_cooccurences <- rick_and_morty_words %>%
  pairwise_count(word, linenumber, sort = TRUE)
word_cooccurences
# A tibble: 216,186 × 3
item1 item2 n
<chr> <chr> <dbl>
1 morty rick 476
2 rick morty 476
3 jerry rick 245
4 rick jerry 245
5 jerry morty 241
6 morty jerry 241
7 yeah rick 137
8 rick yeah 137
9 yeah morty 134
10 morty yeah 134
# ℹ 216,176 more rows
This can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.
set.seed(1717)
word_cooccurences %>%
  filter(n >= 25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") +
  geom_node_point(color = "darkslategray4", size = 8) +
  geom_node_text(aes(label = name), vjust = 2.2) +
  ggtitle(expression(paste("Word Network in Rick and Morty's ",
                           italic("Season One")))) +
  theme_void()
It looks good! At least it contains the main characters and Rick’s swearing.