He replied / she cried: Text mining and gender roles

Application exercise

Introduction

Which verbs follow the pronouns “she” and “he” in Jane Austen’s novels? Are they similar or different?

Goal: Use text mining methods to explore whether the verbs that follow the pronouns “she” and “he” are similar or different.

Inspirations:

Packages

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidytext)
library(knitr)
library(janeaustenr) # install.packages("janeaustenr")
library(textdata)    # install.packages("textdata")

Data

The janeaustenr package offers a function, austen_books(), that returns a tidy data frame of Jane Austen’s six completed, published novels.

# add code here
  • Demo: Which books are included in the dataset?
# add code here
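
One possible way to approach the two prompts above, offered as a minimal sketch rather than the intended solution: load the data with austen_books() and list the distinct values of its book column. The object name books is an assumption made for illustration.

```r
library(dplyr)
library(janeaustenr)

# austen_books() returns one row per line of text,
# with a `text` column and a `book` column (a factor)
books <- austen_books()

# One row per novel in the dataset
books %>%
  distinct(book)
```

Because book is a factor, levels(books$book) gives the same titles as a plain character vector.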

Word frequencies

  • Question: What would you expect to be the most common word in Jane Austen novels? Would you expect it to be the same across all books?

Answers may vary.

  • Demo: Split the text column into word tokens.
# add code here
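
A hedged sketch of the tokenization step, assuming tidytext's unnest_tokens() with its default word tokenizer. The object name words is hypothetical.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Split each line of `text` into one lowercase word per row;
# the `text` column is replaced by the new `word` column
words <- austen_books() %>%
  unnest_tokens(output = word, input = text)

words
```

By default unnest_tokens() also strips punctuation and converts tokens to lowercase, which is why "The" and "the" end up counted together in later steps.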
  • Your turn: Discover the top 10 most commonly used words in each of Jane Austen’s books.

With stop words:

# add code here
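
One way this could be done, as a sketch and not necessarily the expected answer: count words within each book, then keep the ten largest counts per book. The name top_words is an assumption, and with_ties = FALSE is chosen here so that each book contributes exactly ten rows.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

top_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  # Number of occurrences of each word within each book
  count(book, word, sort = TRUE) %>%
  group_by(book) %>%
  # Keep the 10 most frequent words per book, breaking ties arbitrarily
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

top_words
```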
  • Demo: Let’s do better, without the “stop words”.
stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# … with 1,139 more rows

Without stop words:

# add code here
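
A possible sketch of the same count with stop words removed, assuming an anti_join() against the stop_words data frame shown above. The name top_words_nostop is hypothetical.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

top_words_nostop <- austen_books() %>%
  unnest_tokens(word, text) %>%
  # Drop every token that appears in any of the stop word lexicons
  anti_join(stop_words, by = "word") %>%
  count(book, word, sort = TRUE) %>%
  group_by(book) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

top_words_nostop
```

anti_join() keeps only the rows of the left table that have no match in the right table, so it acts as a filter here and adds no columns.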

With better ordering:

# add code here
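
For ordering bars within each facet, one option is tidytext's reorder_within() together with scale_y_reordered(). The following is a sketch under that assumption; the object names and axis labels are illustrative choices.

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(janeaustenr)

top_words_nostop <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  group_by(book) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

p <- top_words_nostop %>%
  # Order words by count separately within each book
  mutate(word = reorder_within(word, n, book)) %>%
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  # Remove the suffix that reorder_within() appends to each label
  scale_y_reordered() +
  facet_wrap(~book, scales = "free_y") +
  labs(x = "Count", y = NULL)

p
```

A plain reorder() would sort words by their overall count, which leaves bars out of order inside individual facets; reorder_within() avoids that by sorting per book.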

Bigram frequencies

An n-gram is a contiguous series of \(n\) words from a text; e.g., a bigram is a pair of words, with \(n = 2\).

  • Demo: Split the text column into bigram tokens.
# add code here
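
A minimal sketch of bigram tokenization, assuming unnest_tokens() with token = "ngrams" and n = 2. The filter on missing values is an added precaution, since lines with fewer than two words can yield NA bigrams; the name bigrams is hypothetical.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

bigrams <- austen_books() %>%
  # Each row becomes a pair of consecutive words, e.g. "she was"
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Blank or one-word lines produce no bigram
  filter(!is.na(bigram))

bigrams
```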
  • Your turn: Visualize the frequencies of the top 10 bigrams in each of Jane Austen’s books.
# add code here
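
One way the bigram frequencies might be visualized, sketched with the same per-facet ordering helpers from tidytext. Stop words are left in here, which is an assumption; the object names are illustrative.

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(janeaustenr)

top_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(book, bigram) %>%
  group_by(book) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

bigram_plot <- top_bigrams %>%
  mutate(bigram = reorder_within(bigram, n, book)) %>%
  ggplot(aes(x = n, y = bigram)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~book, scales = "free") +
  labs(x = "Count", y = NULL)

bigram_plot
```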

Verbs that follow she or he

First, let’s define the pronouns of interest:

pronouns <- c("he", "she")
  • Demo: Filter the dataset for bigrams that start with either “she” or “he” and count how many times each of these bigrams appears.
# add code here
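
A sketch of one possible approach: split each bigram into its two words with tidyr's separate(), keep the rows whose first word is one of the pronouns, and count. The column names word1 and word2 and the object name bigram_counts are assumptions.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

pronouns <- c("he", "she")

bigram_counts <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  # Split "she was" into word1 = "she", word2 = "was"
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% pronouns) %>%
  count(word1, word2, sort = TRUE)

bigram_counts
```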
  • Discussion: What can we do next to see if there is a difference in the types of verbs that follow “he” vs. “she”?

Answers may vary.

  • Demo: Which words have about the same likelihood of following “he” or “she” in Jane Austen’s novels?
# add code here
# add code here
  • Demo: Which words have different likelihoods of following “he” or “she” in Jane Austen’s novels?
# add code here
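
The two demos above could be approached with a log ratio of how often each word follows “she” versus “he”. This is a sketch under several assumptions that are not part of the exercise text: words seen 10 times or fewer in total are dropped, 1 is added to every count to avoid division by zero, and the names word_ratios, similar, and different are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

pronouns <- c("he", "she")

bigram_counts <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% pronouns) %>%
  count(word1, word2)

word_ratios <- bigram_counts %>%
  group_by(word2) %>%
  filter(sum(n) > 10) %>%          # assumed minimum total count
  ungroup() %>%
  # One row per following word, with a count column per pronoun
  pivot_wider(names_from = word1, values_from = n, values_fill = 0) %>%
  mutate(
    she_share = (she + 1) / sum(she + 1),   # add-one smoothing (assumption)
    he_share  = (he + 1) / sum(he + 1),
    logratio  = log2(she_share / he_share)  # > 0: more likely after "she"
  )

# Words about equally likely after either pronoun (log ratio near 0)
similar <- word_ratios %>%
  slice_min(abs(logratio), n = 10, with_ties = FALSE)

# Words with the most lopsided usage (largest absolute log ratio)
different <- word_ratios %>%
  slice_max(abs(logratio), n = 10, with_ties = FALSE)

similar
different
```

Using base 2 means a log ratio of 1 corresponds to a word being twice as likely after “she” as after “he”, and -1 to the reverse.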

Sentiment analysis

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.1

sentiments <- get_sentiments("afinn")
sentiments
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
# add code here
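
To illustrate the "sum of individual word scores" idea described above, here is a small sketch on toy data. The lexicon mini_afinn is a stand-in built from a few rows of the AFINN output shown above, and the sentence in toy_text is invented for illustration; neither is part of the exercise.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Stand-in lexicon: a few word/value pairs copied from the AFINN output
mini_afinn <- tribble(
  ~word,       ~value,
  "abandon",   -2,
  "abandoned", -2,
  "abhor",     -3,
  "abhorrent", -3
)

# Hypothetical text to score
toy_text <- tibble(line = 1, text = "They abhor the abandoned house")

toy_sentiment <- toy_text %>%
  unnest_tokens(word, text) %>%
  # Keep only the words that have a score in the lexicon
  inner_join(mini_afinn, by = "word") %>%
  # Text-level sentiment is the sum of the word-level scores
  summarise(total = sum(value))

toy_sentiment
```

The same pipeline should carry over to the novels by replacing toy_text with austen_books() and mini_afinn with the sentiments object created above, adding group_by(book) before summarise() to get one score per book.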