The janeaustenr package provides a function, austen_books(), that returns a tidy data frame of Jane Austen’s six completed, published novels.
# add code here
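One possible way to fill in this step is sketched below. It assumes the janeaustenr and dplyr packages are installed; the object name `austen` is a hypothetical choice, not something the worksheet prescribes.

```r
library(janeaustenr)
library(dplyr)

# austen_books() returns one row per line of printed text,
# with a `text` column and a `book` column (a factor)
austen <- austen_books()

# Quick look at the structure (`austen` is a hypothetical name)
glimpse(austen)
```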
Demo: Which books are included in the dataset?
# add code here
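A minimal sketch of one way to answer this, assuming dplyr is available: list the distinct values of the `book` column. The name `books` is hypothetical.

```r
library(janeaustenr)
library(dplyr)

# One row per distinct book title
books <- austen_books() |>
  distinct(book)

books
```

Using `count(book)` instead of `distinct(book)` would also show how many lines of text each book contributes.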
Word frequencies
Question: What would you expect to be the most common word in Jane Austen novels? Would you expect it to be the same across all books?
Answers may vary.
Demo: Split the text column into word tokens.
# add code here
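Tokenizing could look roughly like the following, using `unnest_tokens()` from tidytext. The output name `tidy_books` is an assumption made for illustration.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

# Split each line of text into one-word-per-row tokens.
# By default unnest_tokens() lowercases words, strips punctuation,
# and drops the original `text` column.
tidy_books <- austen_books() |>
  unnest_tokens(output = word, input = text)

tidy_books
```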
Your turn: Discover the top 10 most commonly used words in each of Jane Austen’s books.
With stop words:
# add code here
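One way to approach this, sketched under the assumption that ties at rank 10 should be broken arbitrarily (`with_ties = FALSE`) so each book contributes exactly ten words. The name `top_words` is hypothetical.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

top_words <- austen_books() |>
  unnest_tokens(word, text) |>
  # Count how often each word appears in each book
  count(book, word) |>
  group_by(book) |>
  # Keep the 10 most frequent words per book
  slice_max(n, n = 10, with_ties = FALSE) |>
  ungroup()

top_words
```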
Demo: Let’s do better, without the “stop words”.
stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>
 1 a           SMART
 2 a's         SMART
 3 able        SMART
 4 about       SMART
 5 above       SMART
 6 according   SMART
 7 accordingly SMART
 8 across      SMART
 9 actually    SMART
10 after       SMART
# … with 1,139 more rows
Without stop words:
# add code here
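A sketch of the same count with stop words removed. It relies on the `stop_words` data frame that ships with tidytext and uses `anti_join()` to drop every token that appears in it; `top_words_no_stop` is a hypothetical name.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

top_words_no_stop <- austen_books() |>
  unnest_tokens(word, text) |>
  # Drop any token found in the stop_words lexicon
  anti_join(stop_words, by = "word") |>
  count(book, word) |>
  group_by(book) |>
  slice_max(n, n = 10, with_ties = FALSE) |>
  ungroup()

top_words_no_stop
```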
With better ordering:
# add code here
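One reading of "better ordering" is that bars should be sorted by frequency within each book's panel. The sketch below assumes that interpretation and uses tidytext's `reorder_within()` and `scale_y_reordered()` helpers together with ggplot2; the axis label and object names are illustrative choices.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)
library(ggplot2)

top_words_no_stop <- austen_books() |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(book, word) |>
  group_by(book) |>
  slice_max(n, n = 10, with_ties = FALSE) |>
  ungroup()

# reorder_within() orders words by n separately inside each book;
# scale_y_reordered() strips the suffix it adds to the axis labels
p <- top_words_no_stop |>
  ggplot(aes(x = n, y = reorder_within(word, n, book))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(vars(book), scales = "free_y") +
  labs(x = "Count", y = NULL)

p
```

Without `scales = "free_y"` every panel would share one word axis, which defeats the per-book ordering.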
Bigram frequencies
An n-gram is a contiguous series of \(n\) words from a text; e.g., a bigram is a pair of words, with \(n = 2\).
Demo: Split the text column into bigram tokens.
# add code here
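Bigram tokenization can be sketched with the same `unnest_tokens()` function by setting `token = "ngrams"` and `n = 2`. Lines with fewer than two words produce `NA`, which this sketch filters out; `austen_bigrams` is a hypothetical name.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

austen_bigrams <- austen_books() |>
  # Each row becomes a pair of consecutive words
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  # Lines too short to form a bigram yield NA
  filter(!is.na(bigram))

austen_bigrams
```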
Your turn: Visualize the frequencies of the top 10 bigrams in each of Jane Austen’s books.
# add code here
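A possible solution sketch, assuming stop words are left in (the prompt does not say to remove them) and that bars should be ordered within each book. Object names and the axis label are illustrative.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)
library(ggplot2)

top_bigrams <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  count(book, bigram) |>
  group_by(book) |>
  slice_max(n, n = 10, with_ties = FALSE) |>
  ungroup()

# One panel per book, bigrams sorted by frequency within each panel
p_bigrams <- top_bigrams |>
  ggplot(aes(x = n, y = reorder_within(bigram, n, book))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(vars(book), scales = "free_y") +
  labs(x = "Count", y = NULL)

p_bigrams
```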
Verbs that follow she or he
First, let’s define the pronouns of interest:
pronouns <- c("he", "she")
Demo: Filter the dataset for bigrams that start with either “she” or “he” and calculate the number of times these bigrams appeared.
# add code here
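One way to do this is to split each bigram into its two words with `tidyr::separate()` and then filter on the first word. The column names `word1` and `word2` and the object name `bigram_counts` are assumptions made for this sketch.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

pronouns <- c("he", "she")

bigram_counts <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  # Split "she said" into word1 = "she", word2 = "said"
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  # Keep only bigrams that start with a pronoun of interest
  filter(word1 %in% pronouns) |>
  count(word1, word2, sort = TRUE)

bigram_counts
```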
Discussion: What can we do next to see if there is a difference in the types of verbs that follow “he” vs. “she”?
Answers may vary.
Demo: Which words have about the same likelihood of following “he” or “she” in Jane Austen’s novels?
# add code here
# add code here
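A common way to compare likelihoods is a log ratio of relative frequencies. The sketch below is one such approach, not necessarily the one intended here: the minimum total count of 10 and the add-one smoothing are assumptions chosen for illustration, as are the object names.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

pronouns <- c("he", "she")

word_ratios <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(word1 %in% pronouns) |>
  count(word1, word2) |>
  # Assumption: ignore words seen 10 times or fewer in total
  group_by(word2) |>
  filter(sum(n) > 10) |>
  ungroup() |>
  # One column of counts per pronoun; missing pairs count as 0
  pivot_wider(names_from = word1, values_from = n, values_fill = 0) |>
  mutate(
    # Add-one smoothing, then convert counts to proportions
    he = (he + 1) / sum(he + 1),
    she = (she + 1) / sum(she + 1),
    # Positive values lean toward "she", negative toward "he"
    logratio = log2(she / he)
  )

# Words closest to a log ratio of 0 are about equally likely after either pronoun
word_ratios |>
  arrange(abs(logratio))
```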
Demo: Which words have different likelihoods of following “he” or “she” in Jane Austen’s novels?
# add code here
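Under the same log-ratio approach, the words with the most different likelihoods are those with the largest absolute log ratio. This sketch recomputes the ratios so it stands alone; the threshold of 10, the add-one smoothing, and the choice to show 20 words are all assumptions.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

pronouns <- c("he", "she")

top_diff <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(word1 %in% pronouns) |>
  count(word1, word2) |>
  group_by(word2) |>
  filter(sum(n) > 10) |>
  ungroup() |>
  pivot_wider(names_from = word1, values_from = n, values_fill = 0) |>
  mutate(
    he = (he + 1) / sum(he + 1),
    she = (she + 1) / sum(she + 1),
    logratio = log2(she / he)
  ) |>
  # Largest absolute log ratios = most lopsided words
  slice_max(abs(logratio), n = 20, with_ties = FALSE) |>
  arrange(logratio)

top_diff
```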
Sentiment analysis
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.1
sentiments <- get_sentiments("afinn")
sentiments
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
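To make the "sum of individual word scores" idea concrete, here is a small sketch that avoids downloading the full lexicon. `mini_lexicon` copies four entries from the printed AFINN output above, and `toy_text` is a hypothetical two-line text invented purely for illustration; with real data one would join against `get_sentiments("afinn")` instead.

```r
library(dplyr)
library(tidytext)

# Four AFINN entries copied from the printed output; a stand-in for the full lexicon
mini_lexicon <- tibble(
  word  = c("abandon", "abandoned", "abhor", "abhorred"),
  value = c(-2, -2, -3, -3)
)

# Hypothetical text, invented for illustration only
toy_text <- tibble(
  line = 1:2,
  text = c("They abandoned the house.", "I abhor the very idea.")
)

toy_scores <- toy_text |>
  unnest_tokens(word, text) |>
  # Keep only words that have a score in the lexicon
  inner_join(mini_lexicon, by = "word") |>
  group_by(line) |>
  # Sentiment of a line = sum of its word scores
  summarize(sentiment = sum(value))

toy_scores
```

Words that are missing from the lexicon contribute nothing to the total, which is one limitation of this word-by-word approach.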