if (!require("pacman")) install.packages("pacman")
p_load(tidyr, tidytext, tibble, dplyr, ggplot2, viridis, purrr, forcats, igraph, ggraph)
p_load_gh("dgrtwo/widyr")
p_load_gh("pachamaltese/lp")

lp_albums <- list(
  lost_on_you = lost_on_you,
  forever_for_now = forever_for_now,
  heart_to_mouth = heart_to_mouth
)

lp_albums_tidy <- map_df(
  seq_along(lp_albums),
  function(x) {
    lp_albums[[x]] %>%
      enframe(name = "song") %>%
      unnest(cols = "value") %>%
      filter(!grepl("\\[", value)) %>%
      unnest_tokens(line, value, token = "lines") %>%
      group_by(song) %>%
      mutate(linenumber = row_number()) %>%
      ungroup() %>%
      unnest_tokens(word, line) %>%
      mutate(album = names(lp_albums[x])) %>%
      select(album, song, word, linenumber) %>%
      anti_join(stop_words)
  }
)
Motivation
The Life Changing Magic of Tidying Text is one of those posts I keep re-reading from time to time, and I wanted to try the same analysis with song lyrics.
I shall use the lp package, a small data package I had for experimental purposes.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
LP Can Be So Tidy
I use unnest_tokens to divide the lyrics into words. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join to remove stop words such as “a”, “an” or “the”.
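As a minimal sketch of these two steps on a smaller scale, the chunk below tokenizes two made-up lines (they are not taken from the lp package) and then drops the stop words. The names toy and toy_words are hypothetical and only serve this illustration.

```r
library(dplyr)
library(tidytext)

# Two made-up lines, not taken from the lp package
toy <- tibble(
  song = "toy_song",
  value = c("The light of a halo", "I lost the light")
)

toy_words <- toy %>%
  unnest_tokens(word, value) %>%      # one lowercased word per row
  anti_join(stop_words, by = "word")  # drop words listed in stop_words

toy_words
```

Passing by = "word" makes the join column explicit and silences the message that anti_join prints when it has to guess the key.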
Then we can use count to find the most common words across all of LP’s songs.
lp_albums_tidy %>%
  count(word, sort = TRUE)
# A tibble: 904 × 2
word n
<chr> <int>
1 love 116
2 ooh 75
3 halo 69
4 baby 68
5 yeah 62
6 living 55
7 eh 51
8 lost 47
9 die 38
10 light 37
# ℹ 894 more rows
Most LP songs are about love, and some are covers. For example, halo is the third most frequent word, and we can see why in the next song.
forever_for_now$halo_live
[1] "Remember those walls I built"
[2] "Well, baby they're tumbling down"
[3] "And they didn't even put up a fight"
[4] "They didn't even make up a sound"
[5] ""
[6] "I found a way to let you in"
[7] "But I never really had a doubt"
[8] "Standing in the light of your halo"
[9] "I got my angel now"
[10] ""
[11] "It's like I've been awakened"
[12] "Every rule I had you breakin'"
[13] "It's the risk that I'm takin'"
[14] "I ain't never gonna shut you out"
[15] ""
[16] "Everywhere I'm looking now"
[17] "I'm surrounded by your embrace"
[18] "Baby I can see your halo"
[19] "You know you're my saving grace"
[20] ""
[21] "You're everything I need and more"
[22] "It's written all over your face"
[23] "Baby I can feel your halo"
[24] "Pray it won't fade away"
[25] ""
[26] "I can feel your halo halo halo"
[27] "I can see your halo halo halo"
[28] "I can feel your halo halo halo"
[29] "I can see your halo halo halo"
[30] ""
[31] "Hit me like a ray of sun"
[32] "Burning through my darkest night"
[33] "You're the only one that I want"
[34] "Think I'm addicted to your light"
[35] ""
[36] "I swore I'd never fall again"
[37] "But this don't even feel like falling"
[38] "Gravity can't forget"
[39] "To pull me back to the ground again"
[40] ""
[41] "Feels like I've been awakened"
[42] "Every rule I had you breakin'"
[43] "The risk that I'm takin'"
[44] "I'm never gonna shut you out"
[45] ""
[46] "Everywhere I'm looking now"
[47] "I'm surrounded by your embrace"
[48] "Baby I can see your halo"
[49] "You know you're my saving grace"
[50] ""
[51] "You're everything I need and more"
[52] "It's written all over your face"
[53] "Baby I can feel your halo"
[54] "Pray it won't fade away"
[55] ""
[56] "I can feel your halo halo halo"
[57] "I can see your halo halo halo"
[58] "I can feel your halo halo halo"
[59] "I can see your halo halo halo"
[60] ""
[61] "I can feel your halo halo halo"
[62] "I can see your halo halo halo"
[63] "I can feel your halo halo halo"
[64] "I can see your halo halo halo"
[65] "Halo, halo"
[66] ""
[67] "Everywhere I'm looking now"
[68] "I'm surrounded by your embrace"
[69] "Baby I can see your halo"
[70] "You know you're my saving grace"
[71] ""
[72] "You're everything I need and more"
[73] "It's written all over your face"
[74] "Baby I can feel your halo"
[75] "Pray it won't fade away"
[76] ""
[77] "I can feel your halo halo halo"
[78] "I can see your halo halo halo"
[79] "I can feel your halo halo halo"
[80] "I can see your halo halo halo"
[81] ""
[82] "I can feel your halo halo halo"
[83] "I can see your halo halo halo"
[84] "I can feel your halo halo halo"
[85] "I can see your halo halo halo"
Sentiment analysis can be done as an inner join. The tidytext package provides one sentiment lexicon in its sentiments dataset. Let’s examine how sentiment changes across each album by counting the number of positive and negative words in each song.
lp_albums_sentiment <- lp_albums_tidy %>%
  inner_join(sentiments) %>%
  count(song, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(
    lp_albums_tidy %>%
      select(song, album) %>%
      distinct() %>%
      group_by(album) %>%
      mutate(song_number = row_number()) %>%
      ungroup()
  ) %>%
  mutate(
    album = as.factor(album),
    album = fct_relevel(album, "lost_on_you", "forever_for_now")
  ) %>%
  arrange(album, song_number) %>%
  select(album, song, song_number, everything())

lp_albums_sentiment
# A tibble: 49 × 6
album song song_number negative positive sentiment
<fct> <chr> <int> <dbl> <dbl> <dbl>
1 lost_on_you muddy_waters 1 24 3 -21
2 lost_on_you no_witness 2 13 7 -6
3 lost_on_you lost_on_you 3 33 5 -28
4 lost_on_you when_we_are_high 4 18 2 -16
5 lost_on_you switchblade 5 13 11 -2
6 lost_on_you up_against_me 6 1 3 2
7 lost_on_you suspicion 7 29 3 -26
8 lost_on_you tightrope 8 15 2 -13
9 lost_on_you other_people 9 12 11 -1
10 lost_on_you into_the_wild 10 36 8 -28
# ℹ 39 more rows
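The join, count and reshape pattern is easier to follow on a tiny example. The sketch below assumes a hand-made four-word lexicon and a made-up word table (toy_lexicon and toy_tidy are hypothetical, not part of tidytext or lp), and it uses pivot_wider, which the tidyr documentation recommends over the superseded spread.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-lexicon and word table, for illustration only
toy_lexicon <- tibble(
  word = c("love", "light", "lost", "die"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_tidy <- tibble(
  song = c("a", "a", "a", "b", "b", "b"),
  word = c("love", "lost", "die", "love", "light", "halo")
)

toy_sentiment <- toy_tidy %>%
  inner_join(toy_lexicon, by = "word") %>%  # "halo" has no match and is dropped
  count(song, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)   # net score per song

toy_sentiment
```

Song "a" has one positive and two negative words, so its net score is -1. Song "b" has two positive words and none negative, so its score is 2.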
Now we can plot the sentiment scores in lp_albums_sentiment across the track order of each album.
ggplot(lp_albums_sentiment, aes(song_number, sentiment, fill = album)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~album, nrow = 3, scales = "free_x", dir = "v") +
theme_minimal(base_size = 13) +
labs(title = "Sentiment in LP's Albums",
y = "Sentiment") +
scale_fill_viridis(end = 0.75, discrete = TRUE) +
scale_x_discrete(expand = c(0.02,0)) +
theme(strip.text = element_text(hjust = 0)) +
theme(strip.text = element_text(face = "italic")) +
theme(axis.title.x = element_blank()) +
theme(axis.ticks.x = element_blank()) +
theme(axis.text.x = element_blank())
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that “I am not having a good day” is a negative sentence, not a positive one, because of negation.
lp_albums_lines <- map_df(
  seq_along(lp_albums),
  function(x) {
    lp_albums[[x]] %>%
      enframe(name = "song") %>%
      unnest(cols = "value") %>%
      filter(!grepl("\\[", value)) %>%
      unnest_tokens(line, value, token = "lines") %>%
      ungroup() %>%
      mutate(album = names(lp_albums[x])) %>%
      select(album, song, line)
  }
)
Let’s look at just one.
lp_albums_lines$line[44]
[1] "oh no, oh no"
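Lines are one alternative unit; bigrams are another. As a rough sketch of the negation point, the chunk below tokenizes two made-up sentences into bigrams and keeps the words that directly follow “not”. The names toy_lines, toy_bigrams and negated are hypothetical, and this is only one simple way to spot negation, not what any particular sentiment algorithm does.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up sentences, for illustration only
toy_lines <- tibble(
  line = c("this is not good", "i am not having a good day")
)

toy_bigrams <- toy_lines %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%  # pairs of adjacent words
  separate(bigram, c("word1", "word2"), sep = " ")

# Words that directly follow a negation
negated <- toy_bigrams %>%
  filter(word1 == "not")

negated
```

The first sentence yields the pair “not good”, which a lexicon-based score could flip. The second yields “not having”, so a bigram window misses that “good” is negated there, which is why whole-sentence approaches exist.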
We can use tidy text analysis to ask questions such as: what is the most negative song in each of LP’s albums? First, let’s get the list of negative words from the lexicon. Second, let’s make a dataframe of how many words are in each song so we can normalize for the length of songs. Then, let’s find the number of negative words in each song and divide by the total words in each song. Which song has the highest proportion of negative words?
sentiment_negative <- sentiments %>%
  filter(sentiment == "negative")

wordcounts <- lp_albums_tidy %>%
  group_by(album, song) %>%
  summarize(words = n())

lp_albums_tidy %>%
  semi_join(sentiment_negative) %>%
  group_by(album, song) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("album", "song")) %>%
  mutate(ratio = negativewords/words) %>%
  top_n(1)
# A tibble: 3 × 5
# Groups: album [3]
album song negativewords words ratio
<chr> <chr> <int> <int> <dbl>
1 forever_for_now wasted_love_live 24 87 0.276
2 heart_to_mouth die_for_your_love 18 88 0.205
3 lost_on_you lost_on_you 33 68 0.485
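To see why normalizing by song length matters, here is a small sketch with made-up counts (toy_ratio is hypothetical and unrelated to the real lp figures). It also uses slice_max, which the dplyr documentation recommends over the superseded top_n and which names the ranking column explicitly instead of relying on the last column.

```r
library(dplyr)

# Made-up counts, not the real lp figures
toy_ratio <- tibble(
  album = c("x", "x", "y", "y"),
  song = c("s1", "s2", "s3", "s4"),
  negativewords = c(5L, 2L, 1L, 8L),
  words = c(50L, 10L, 20L, 40L)
) %>%
  mutate(ratio = negativewords / words)

most_negative <- toy_ratio %>%
  group_by(album) %>%
  slice_max(ratio, n = 1) %>%  # highest proportion of negative words per album
  ungroup()

most_negative
```

In album "x", song "s1" has more negative words in absolute terms (5 against 2), but "s2" is selected because a fifth of its words are negative.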
Networks of Words
The widyr package provides pairwise_count, which counts pairs of items that occur together within a group. Let’s count the words that share a line number in the songs of the first album.
word_cooccurences <- lp_albums_tidy %>%
  filter(album == "lost_on_you") %>%
  pairwise_count(word, linenumber, sort = TRUE)

word_cooccurences
# A tibble: 13,398 × 3
item1 item2 n
<chr> <chr> <dbl>
1 witness bear 11
2 bear witness 11
3 lost love 10
4 love lost 10
5 we’re muddy 9
6 muddy we’re 9
7 die lost 9
8 lost die 9
9 gonna die 9
10 die gonna 9
# ℹ 13,388 more rows
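Every pair appears twice in this output, once in each order. As a small sketch on made-up data (toy_lines_words is hypothetical), the chunk below shows the default behaviour next to upper = FALSE, which, as far as I understand the widyr documentation, keeps only one ordering of each pair.

```r
library(dplyr)
library(widyr)

# Made-up words and line numbers, for illustration only
toy_lines_words <- tibble(
  linenumber = c(1, 1, 2, 2, 3, 3),
  word = c("bear", "witness", "bear", "witness", "lost", "love")
)

# Default: each pair is listed in both orders
pairs_all <- toy_lines_words %>%
  pairwise_count(word, linenumber, sort = TRUE)

# upper = FALSE: each pair is listed once
pairs_unique <- toy_lines_words %>%
  pairwise_count(word, linenumber, sort = TRUE, upper = FALSE)

nrow(pairs_all)
nrow(pairs_unique)
```

For an undirected network plot both forms give the same picture, but the deduplicated table is half the size and avoids drawing every edge twice.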
These pairwise counts can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.
set.seed(1724)

word_cooccurences %>%
  filter(n >= 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "kk") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") +
geom_node_point(color = "darkslategray4", size = 8) +
geom_node_text(aes(label = name), vjust = 2.2) +
ggtitle("Word Network in LP's albums") +
theme_void()
It looks good!