Looking further: Text analysis

Lecture 26

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2022

12/6/22

Warm up

While you wait

  • Fill out course + TA evaluations

  • Clone ae-21

Announcements

  • Course + TA evaluations: we're only at ~40% so far

  • Any project questions? Any questions about remaining assessments?

Text analysis

Tidytext

  • Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
  • Learn more at tidytextmining.com

library(tidyverse)
library(tidytext)

What is tidy text?

text <- c("Oh! Get me away from here, I'm dying",
          "Play me a song to set me free",
          "Nobody writes them like they used to",
          "So it may as well be me",
          "Here on my own now after hours",
          "Here on my own now on a bus",
          "Think of it this way",
          "You could either be successful or be us",
          "With our winning smiles, and us",
          "With our catchy tunes or worse",
          "Now we're photogenic",
          "You know, we don't stand a chance")
text
 [1] "Oh! Get me away from here, I'm dying"   
 [2] "Play me a song to set me free"          
 [3] "Nobody writes them like they used to"   
 [4] "So it may as well be me"                
 [5] "Here on my own now after hours"         
 [6] "Here on my own now on a bus"            
 [7] "Think of it this way"                   
 [8] "You could either be successful or be us"
 [9] "With our winning smiles, and us"        
[10] "With our catchy tunes or worse"         
[11] "Now we're photogenic"                   
[12] "You know, we don't stand a chance"      

What is tidy text?

text_df <- tibble(line = 1:12, text = text)
text_df |> print(n = 12)
# A tibble: 12 × 2
    line text                                   
   <int> <chr>                                  
 1     1 Oh! Get me away from here, I'm dying   
 2     2 Play me a song to set me free          
 3     3 Nobody writes them like they used to   
 4     4 So it may as well be me                
 5     5 Here on my own now after hours         
 6     6 Here on my own now on a bus            
 7     7 Think of it this way                   
 8     8 You could either be successful or be us
 9     9 With our winning smiles, and us        
10    10 With our catchy tunes or worse         
11    11 Now we're photogenic                   
12    12 You know, we don't stand a chance      

What is tidy text?

text_df |>
  unnest_tokens(word, text)
# A tibble: 80 × 2
    line word 
   <int> <chr>
 1     1 oh   
 2     1 get  
 3     1 me   
 4     1 away 
 5     1 from 
 6     1 here 
 7     1 i'm  
 8     1 dying
 9     2 play 
10     2 me   
# … with 70 more rows
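
Note that the output above is lowercased and the punctuation is gone. Here is a minimal sketch of that default behaviour, using only the first two lines of the lyrics. The names mini_df, tokens_lower, and tokens_cased are made up for this illustration; to_lower is an argument of unnest_tokens().

```r
library(tibble)
library(tidytext)

# Two lines from the lyrics, kept short for illustration
mini_df <- tibble(
  line = 1:2,
  text = c("Oh! Get me away from here, I'm dying",
           "Play me a song to set me free")
)

# Default: one token (word) per row, lowercased, punctuation dropped
tokens_lower <- mini_df |>
  unnest_tokens(word, text)

# Same call, but keep the original capitalization
tokens_cased <- mini_df |>
  unnest_tokens(word, text, to_lower = FALSE)

tokens_lower
tokens_cased
```

The line column is carried along in both cases, so each word can still be traced back to the line it came from.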

Counting words

text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)
# A tibble: 58 × 2
   word      n
   <chr> <int>
 1 me        4
 2 a         3
 3 be        3
 4 here      3
 5 now       3
 6 on        3
 7 it        2
 8 my        2
 9 or        2
10 our       2
# … with 48 more rows
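
If count() feels like magic, it may help to see it spelled out. Below is a small sketch on a toy set of tokens (the tokens tibble is invented for this example, not taken from the lyrics) showing that count(word, sort = TRUE) gives the same result as grouping, summarizing, and arranging by hand.

```r
library(dplyr)
library(tibble)

# A few made-up tokens, shaped like the output of unnest_tokens()
tokens <- tibble(word = c("me", "a", "me", "be", "a", "me"))

# Shortcut used on the slide
counted <- tokens |>
  count(word, sort = TRUE)

# Longer spelling of the same idea
counted_long <- tokens |>
  group_by(word) |>
  summarize(n = n()) |>
  arrange(desc(n))

counted
counted_long
```

Both versions return one row per distinct word, with the most frequent word first.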

Application exercise

ae-21

  • Go to the course GitHub org and find your ae-21 (repo name will be suffixed with your GitHub name).
  • Clone the repo in your container, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – 3 days from today.