library(data.table)
library(tidyr)
library(tidytext)
library(dplyr)
library(ggplot2)
library(scales)
library(viridis)
library(ggstance)
library(stringr)
library(widyr)
subs <- list.files("../../../10/13/rick-and-morty-tidy-data-1", pattern = "subs", full.names = TRUE)

archer_subs <- as_tibble(fread(subs[[1]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

bojack_horseman_subs <- as_tibble(fread(subs[[2]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

gravity_falls_subs <- as_tibble(fread(subs[[3]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

rick_and_morty_subs <- as_tibble(fread(subs[[4]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

stranger_things_subs <- as_tibble(fread(subs[[5]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

archer_subs_tidy <- archer_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

bojack_horseman_subs_tidy <- bojack_horseman_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

gravity_falls_subs_tidy <- gravity_falls_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

rick_and_morty_subs_tidy <- rick_and_morty_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

stranger_things_subs_tidy <- stranger_things_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
Rick and Morty and Tidy Data Principles (Part 3)
Updated 2022-05-28: I moved the blog to Quarto, so I had to update the paths. I am also not using pacman and I am loading libraries in the classic way now.
Motivation
The first and second parts of this analysis left me feeling that I did a lot of scraping and processing, and that the resulting data deserves more analysis to be put to good use. In this third and final part I’m also taking a lot of ideas from Julia Silge’s blog.
In the GitHub repo of this project you will find processed subs not just for Rick and Morty, but also for Archer, Bojack Horseman, Gravity Falls and Stranger Things. Why? In this post I’m going to compare the different shows.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
Word Frequencies
Comparing word frequencies across different shows can tell us how similar Rick and Morty is to, for example, Gravity Falls. I’ll use the subtitles from the different shows that I scraped in the first part of this series.
With this processing we can compare frequencies across different shows. Here’s an example of the top ten words for each show:
bind_cols(
  rick_and_morty_subs_tidy %>%
    count(word, sort = TRUE) %>%
    filter(row_number() <= 10),
  archer_subs_tidy %>%
    count(word, sort = TRUE) %>%
    filter(row_number() <= 10),
  bojack_horseman_subs_tidy %>%
    count(word, sort = TRUE) %>%
    filter(row_number() <= 10),
  gravity_falls_subs_tidy %>%
    count(word, sort = TRUE) %>%
    filter(row_number() <= 10),
  stranger_things_subs_tidy %>%
    count(word, sort = TRUE) %>%
    filter(row_number() <= 10)
) %>%
  setNames(c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n"))
# A tibble: 10 × 10
   rm_word  rm_n a_word   a_n bh_word  bh_n gf_word  gf_n st_word  st_n
   <chr>   <int> <chr>  <int> <chr>   <int> <chr>   <int> <chr>   <int>
 1 morty    1890 archer  4526 bojack    807 mabel     456 yeah      482
 2 rick     1669 lana    2795 yeah      695 hey       453 hey       317
 3 jerry     645 yeah    1474 hey       567 ha        416 mike      271
 4 yeah      475 cyril   1471 gonna     480 stan      369 sighs     261
 5 gonna     418 malory  1460 time      446 dipper    347 uh        189
 6 summer    405 pam     1297 uh        380 gonna     341 dustin    179
 7 hey       386 god      873 diane     345 time      313 lucas     173
 8 uh        327 wait     844 todd      329 yeah      291 gonna     166
 9 time      313 uh       830 people    307 uh        264 joyce     161
10 beth      301 gonna    745 love      306 guys      244 mom       157
Besides character names, some words are common to every show: “yeah”, “uh” and “gonna” appear in all five top tens.
Now I’ll combine the frequencies of all the shows and plot the top 50 words for each one to look for similarities with Rick and Morty:
tidy_others <- bind_rows(
  mutate(archer_subs_tidy, show = "Archer"),
  mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"),
  mutate(gravity_falls_subs_tidy, show = "Gravity Falls"),
  mutate(stranger_things_subs_tidy, show = "Stranger Things")
)

frequency <- tidy_others %>%
  mutate(word = str_extract(word, "[a-z]+")) %>%
  count(show, word) %>%
  rename(other = n) %>%
  inner_join(count(rick_and_morty_subs_tidy, word)) %>%
  rename(rick_and_morty = n) %>%
  mutate(other = other / sum(other),
         rick_and_morty = rick_and_morty / sum(rick_and_morty)) %>%
  ungroup()

frequency_top_50 <- frequency %>%
  group_by(show) %>%
  arrange(-other, -rick_and_morty) %>%
  filter(row_number() <= 50)
ggplot(frequency_top_50, aes(x = other, y = rick_and_morty, color = abs(rick_and_morty - other))) +
geom_abline(color = "gray40") +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") +
facet_wrap(~show, ncol = 4) +
theme_minimal(base_size = 14) +
theme(legend.position="none") +
labs(title = "Comparing Word Frequencies",
subtitle = "Word frequencies in Rick and Morty episodes versus other shows'",
y = "Rick and Morty", x = NULL)
The patterns in this plot are only noticeable if you have seen the analysed shows, which suggests that we should also explore global measures of lexical variety such as mean word frequency and type-token ratios.
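To make those two measures concrete, here is a minimal sketch in base R. The helper `lexical_variety()` and the toy tokens are hypothetical and not part of the original analysis. It assumes a plain character vector of tokens, such as the `word` column of one of the tidy tables above.

```r
# Hypothetical helper (an assumption, not code from the original post).
# `words` is a character vector of tokens, e.g. rick_and_morty_subs_tidy$word.
lexical_variety <- function(words) {
  words  <- words[!is.na(words)]   # drop missing tokens before counting
  tokens <- length(words)          # total number of words (tokens)
  types  <- length(unique(words))  # number of distinct words (types)
  list(
    tokens = tokens,
    types = types,
    type_token_ratio = types / tokens,    # closer to 1 means more varied vocabulary
    mean_word_frequency = tokens / types  # average number of uses per distinct word
  )
}

# Made-up tokens, only to show the shape of the result
toy <- c("morty", "rick", "morty", "yeah", "rick", "morty")
lexical_variety(toy)
```

Because the tidy tables above already have stop words removed, ratios computed on them would describe the remaining vocabulary only. Type-token ratios also tend to fall as a text gets longer, so comparing shows with very different amounts of dialogue would probably need samples of equal size.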
Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Rick and Morty and the other shows?
cor.test(data = filter(frequency, show == "Archer"), ~ other + rick_and_morty)
Pearson's product-moment correlation
data: other and rick_and_morty
t = 62.686, df = 4603, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6627349 0.6939104
sample estimates:
cor
0.6786282
cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + rick_and_morty)
Pearson's product-moment correlation
data: other and rick_and_morty
t = 33.925, df = 4006, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4480085 0.4961205
sample estimates:
cor
0.4724163
cor.test(data = filter(frequency, show == "Gravity Falls"), ~ other + rick_and_morty)
Pearson's product-moment correlation
data: other and rick_and_morty
t = 60.553, df = 3361, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7057501 0.7380991
sample estimates:
cor
0.7223195
cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + rick_and_morty)
Pearson's product-moment correlation
data: other and rick_and_morty
t = 21.906, df = 2252, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3844795 0.4525693
sample estimates:
cor
0.4191135
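The correlation tests suggest that, within the sample considered, Gravity Falls is the show most similar to Rick and Morty (r ≈ 0.72), followed by Archer (r ≈ 0.68), Bojack Horseman (r ≈ 0.47) and Stranger Things (r ≈ 0.42).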
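The four calls above repeat the same test with a different filter each time. As a sketch of a more compact alternative, the point estimates can be computed in one pass with base R. The data frame `toy_frequency` below is made up and only mimics the columns of `frequency` (`show`, `other`, `rick_and_morty`). Note that `cor()` returns just the Pearson estimate, without the confidence interval and p-value that `cor.test()` reports.

```r
# Made-up data with the same columns as `frequency`; the values are NOT real results.
toy_frequency <- data.frame(
  show = rep(c("Show A", "Show B"), each = 4),
  other = c(0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1),
  rick_and_morty = c(0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4)
)

# Split the rows by show and compute one Pearson correlation per show
cor_by_show <- sapply(
  split(toy_frequency, toy_frequency$show),
  function(d) cor(d$other, d$rick_and_morty)
)

# Most similar show first
sort(cor_by_show, decreasing = TRUE)
```

Replacing `toy_frequency` with `frequency` should give the same estimates as the individual tests, assuming `frequency` has no missing values in the two numeric columns.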
The end
My analysis is now complete, but the GitHub repo is open to anyone interested in using it for their own analysis. I covered mostly microanalysis, that is, the analysis of words as isolated units, and provided only rough bits of analysis beyond single words, which would deserve more and longer posts.
Those who find this material useful may want to explore global measures next. One option is to read Text Analysis with R for Students of Literature, which I reviewed some time ago.
Interesting topics to explore are hapax richness and keywords in context, which correspond to mesoanalysis, or even moving on to macroanalysis with clustering, classification and topic modelling.