Lincoln's word frequency

In the same fashion as before, we'll look at the top 10 words Lincoln used. The filter for Lincoln's addresses covers 1861 through 1864:

> sotu_tidy %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    dplyr::count(word, sort = TRUE)
# A tibble: 3,562 x 2
   word           n
   <chr>      <int>
 1 congress      81
 2 united        81
 3 government    75
 4 people        70
 5 war           65
 6 country       62
 7 time          51
 8 union         50
 9 national      49
10 public        48
# ... with 3,552 more rows
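A table like this is often easier to scan as a plot. Here is a hedged sketch of a dplyr and ggplot2 bar chart of the counts; the small toy tibble stands in for the chapter's sotu_tidy data so the code runs on its own:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for sotu_tidy; in the chapter this comes from
# tokenizing the State of the Union addresses
toy_tidy <- tibble::tibble(
  year = c(1861, 1862, 1862, 1863, 1863, 1863, 1864),
  word = c("war", "war", "union", "war", "union", "congress", "congress")
)

# Same filter-and-count pipeline as above
top_words <- toy_tidy %>%
  dplyr::filter(year > 1860 & year < 1865) %>%
  dplyr::count(word, sort = TRUE)

# Horizontal bar chart, most frequent word on top
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")
```

With the real sotu_tidy data frame, you would typically add a dplyr::slice_max(n, n = 10) step before plotting to keep only the top 10 words.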

No surprise that war is high on the list, given the Civil War during that period. One way to visualize how the addresses changed and stayed the same is to produce a word cloud for each address. A convenient way to do that is with the qdap package. We first need to filter Lincoln's speeches out of the tokenized data frame. Then, we produce a separate word cloud for each year. Notice that I specify a minimum word frequency of seven per year and turn off stemming. This produces four different plots, one per address:

> sotu_cloud <- sotu_tidy %>%
    dplyr::filter(year > 1860 & year < 1865)

> qdap::trans_cloud(
    text.var = sotu_cloud$word,
    grouping.var = sotu_cloud$year,
    stem = FALSE,
    min.freq = 7
  )

The output of the preceding code is as follows:

[Four word clouds, one per address, 1861 through 1864]

Very similar themes throughout, but notice the clear focus on emancipation and slavery in 1862 and 1863. An interesting analytical method is to drill down on a term and put it in context, or what we can call keywords in context. However, to do that, we need to transform our data. The quanteda package has a keyword-in-context function, kwic(), but it requires the data to be in a corpus, which demands that the text be back in one cell per document, not one token per row. The implication for us is that we need to nest the tidy data frame back together. The following code accomplishes that and selects just the year 1862:

> nested_1862 <- sotu_tidy %>%
    dplyr::filter(year == 1862) %>%
    dplyr::select(year, word) %>%
    tidyr::nest(word) %>%  # with tidyr >= 1.0, write tidyr::nest(data = word)
    dplyr::mutate(
      text = purrr::map(data, unlist),
      text = purrr::map_chr(text, paste, collapse = " ")
    )
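The nest-and-collapse step is the crux of this transform, so here is a self-contained toy illustration of the same pattern (the three-word tibble is mine, not the chapter's data; I use the newer tidyr::nest(data = ...) syntax):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy one-token-per-row data frame, mimicking the tidy format
toy <- tibble::tibble(
  year = c(1862, 1862, 1862),
  word = c("compensated", "emancipation", "union")
)

# Nest the word column into a list-column, then paste each
# nested vector back into a single text cell per year
nested <- toy %>%
  tidyr::nest(data = word) %>%
  dplyr::mutate(
    text = purrr::map(data, unlist),
    text = purrr::map_chr(text, paste, collapse = " ")
  )

nested$text
# One cell per document: "compensated emancipation union"
```

The same pattern applied to sotu_tidy reconstructs each address as a single string, which is what the corpus constructors expect.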

This gives us the text with stop words removed and back in one cell. To put this in a corpus structure, the tm package is useful:

> myCorpus <- tm::Corpus(tm::VectorSource(nested_1862$text))

For this example of keywords in context, we'll look at where Lincoln discusses emancipation. An important argument to the function is how many words to the left and right of the keyword to show as the context of interest. Here is the abbreviated output:

> quanteda::kwic(x = myCorpus$content, pattern = "emancipation", window = 6)

[text1, 1462] paper respectfully recall attention called compensated | emancipation | nation consist territory people laws territory
[text1, 2076] plan mutual concessions plan adopted assumed | emancipation | follow article main emancipation length time
[text1, 2873] recommendation congress provide law compensating adopt | emancipation | plan acted earnestly renewed advance plan
[text1, 2939] slave concurrence obtained assurance severally adopting | emancipation | distant day constitutional terms assurance struggle

The output can be awkward to interpret at first. The first column is the document identifier within the corpus; with just one text cell, all output is text1. Next is the token position at which the keyword occurs (1462). What remains is the six words before the keyword and the six words after it. The first line of text would read like this: paper respectfully recall attention called compensated emancipation nation consist territory people laws territory. That might seem confusing, but the item of interest is the concept of compensating regions for emancipation. The full output, perhaps with more context words, can help convey Lincoln's problems and proposed solutions for emancipation. As historical background, Lincoln delivered the address on December 1, 1862, and the political opposition in the Union was in an uproar over the Emancipation Proclamation he had issued two and a half months before. Lincoln had to dance a political jig, in essence, moderating his stance by claiming that emancipation would be gradual and done with compensation. In short, looking at keywords in context can help you, and your customers, develop an informed interpretation of textual data.
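Widening the context is just a matter of increasing the window argument. A minimal sketch on a toy sentence follows; note that recent quanteda versions expect a tokens object rather than raw text, so I tokenize first (the toy string is mine, echoing the first match above):

```r
library(quanteda)

# Toy text standing in for the 1862 address
txt <- "paper respectfully recall attention called compensated emancipation nation consist territory"

# Newer quanteda requires tokenizing before kwic()
toks <- quanteda::tokens(txt)

# A wider window of ten words on each side of the keyword
hits <- quanteda::kwic(toks, pattern = "emancipation", window = 10)
hits
```

The returned object is a data frame-like structure with pre, keyword, and post columns, so you can also filter or sort the matches with ordinary dplyr verbs.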

We'll now take a look at implementing sentiment analysis in a tidyverse fashion.
