Lincoln's word frequency

In the same fashion as before, we'll look at the top 10 words Lincoln used. The filter for Lincoln's addresses covers 1861 through 1864:

> sotu_tidy %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    dplyr::count(word, sort = TRUE)
# A tibble: 3,562 x 2
   word           n
   <chr>      <int>
 1 congress      81
 2 united        81
 3 government    75
 4 people        70
 5 war           65
 6 country       62
 7 time          51
 8 union         50
 9 national      49
10 public        48
# ... with 3,552 more rows
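A table like this is often easier to scan as a plot. Here is a hedged sketch of a dplyr and ggplot2 bar chart of the counts; the small toy tibble stands in for the chapter's sotu_tidy data so the code runs on its own:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for sotu_tidy; in the chapter this comes from
# tokenizing the State of the Union addresses
toy_tidy <- tibble::tibble(
  year = c(1861, 1862, 1862, 1863, 1863, 1863, 1864),
  word = c("war", "war", "union", "war", "union", "congress", "congress")
)

# Same filter-and-count pipeline as above
top_words <- toy_tidy %>%
  dplyr::filter(year > 1860 & year < 1865) %>%
  dplyr::count(word, sort = TRUE)

# Horizontal bar chart, most frequent word on top
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")
```

With the real sotu_tidy data frame, you would typically add a dplyr::slice_max(n, n = 10) step before plotting to keep only the top 10 words.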

No surprise that war is high on the list, given the Civil War during that period. One way to visualize how the addresses changed and stayed the same is to produce a word cloud for each address. A convenient way to do that is with the qdap package. We first need to filter Lincoln's speeches out of the tokenized data frame. Then, we produce a separate word cloud for each year. Notice that I specify a minimum word frequency of seven per year and turn off stemming. This produces four different plots, one per address:

> sotu_cloud <- sotu_tidy %>%
    dplyr::filter(year > 1860 & year < 1865)

> qdap::trans_cloud(
    text.var = sotu_cloud$word,
    grouping.var = sotu_cloud$year,
    stem = FALSE,
    min.freq = 7
  )

The output of the preceding code is as follows:

[Four word clouds, one per address, 1861 through 1864]

Very similar themes throughout, but notice the clear focus on emancipation and slavery in 1862 and 1863. An interesting analytical method is to drill down on a term and put it in context, or what we can call keywords in context. However, to do that, we need to transform our data. The quanteda package has a keyword-in-context function, kwic(), but it requires the data to be in a corpus, which demands that the text be back in one cell per document, not one token per row. The implication for us is that we need to nest the tidy data frame back together. The following code accomplishes that and selects just the year 1862:

> nested_1862 <- sotu_tidy %>%
    dplyr::filter(year == 1862) %>%
    dplyr::select(year, word) %>%
    tidyr::nest(word) %>%  # with tidyr >= 1.0, write tidyr::nest(data = word)
    dplyr::mutate(
      text = purrr::map(data, unlist),
      text = purrr::map_chr(text, paste, collapse = " ")
    )
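The nest-and-collapse step is the crux of this transform, so here is a self-contained toy illustration of the same pattern (the three-word tibble is mine, not the chapter's data; I use the newer tidyr::nest(data = ...) syntax):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy one-token-per-row data frame, mimicking the tidy format
toy <- tibble::tibble(
  year = c(1862, 1862, 1862),
  word = c("compensated", "emancipation", "union")
)

# Nest the word column into a list-column, then paste each
# nested vector back into a single text cell per year
nested <- toy %>%
  tidyr::nest(data = word) %>%
  dplyr::mutate(
    text = purrr::map(data, unlist),
    text = purrr::map_chr(text, paste, collapse = " ")
  )

nested$text
# One cell per document: "compensated emancipation union"
```

The same pattern applied to sotu_tidy reconstructs each address as a single string, which is what the corpus constructors expect.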

This gives us the text with stop words removed and back in one cell. To put this in a corpus structure, the tm package is useful:

> myCorpus <- tm::Corpus(tm::VectorSource(nested_1862$text))

For this example of keywords in context, we'll look at where Lincoln discusses emancipation. An important argument to the function is how many words to the left and right of the keyword to show as the context of interest. Here is the abbreviated output:

> quanteda::kwic(x = myCorpus$content, pattern = "emancipation", window = 6)

[text1, 1462] paper respectfully recall attention called compensated | emancipation | nation consist territory people laws territory
[text1, 2076] plan mutual concessions plan adopted assumed | emancipation | follow article main emancipation length time
[text1, 2873] recommendation congress provide law compensating adopt | emancipation | plan acted earnestly renewed advance plan
[text1, 2939] slave concurrence obtained assurance severally adopting | emancipation | distant day constitutional terms assurance struggle

The output can be awkward to interpret at first. The first column is the document identifier within the corpus; with just one text cell, all output is text1. Next is the token position at which the keyword occurs (1462). What remains is the six words before the keyword and the six words after it. The first line of text would read like this: paper respectfully recall attention called compensated emancipation nation consist territory people laws territory. That might seem confusing, but the item of interest is the concept of compensating regions for emancipation. The full output, perhaps with more context words, can help convey Lincoln's problems and proposed solutions for emancipation. As historical background, Lincoln delivered the address on December 1, 1862, and the political opposition in the Union was in an uproar over the Emancipation Proclamation he had issued two and a half months before. Lincoln had to dance a political jig, in essence, moderating his stance by claiming that emancipation would be gradual and done with compensation. In short, looking at keywords in context can help you, and your customers, develop an informed interpretation of textual data.
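Widening the context is just a matter of increasing the window argument. A minimal sketch on a toy sentence follows; note that recent quanteda versions expect a tokens object rather than raw text, so I tokenize first (the toy string is mine, echoing the first match above):

```r
library(quanteda)

# Toy text standing in for the 1862 address
txt <- "paper respectfully recall attention called compensated emancipation nation consist territory"

# Newer quanteda requires tokenizing before kwic()
toks <- quanteda::tokens(txt)

# A wider window of ten words on each side of the keyword
hits <- quanteda::kwic(toks, pattern = "emancipation", window = 10)
hits
```

The returned object is a data frame-like structure with pre, keyword, and post columns, so you can also filter or sort the matches with ordinary dplyr verbs.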

We'll now take a look at implementing sentiment analysis in a tidyverse fashion.
