Word frequency in all addresses

To remove stop words in a tidy format, you can use the stop_words data frame provided in the tidytext package. Call that tibble into the environment, then perform an anti-join by word:

> library(tidytext)

> data(stop_words)

> sotu_tidy <- sotu_unnest %>%
    dplyr::anti_join(stop_words, by = "word")
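If you're dipping into this section on its own, note that sotu_unnest is assumed to have been created earlier by tokenizing the addresses into one-word-per-row format. A minimal sketch of that step, assuming the sotu package's sotu_meta and sotu_text objects were the data source:

> library(sotu)
> # attach the raw text to the metadata: one row per address
> sotu_raw <- dplyr::mutate(sotu_meta, text = sotu_text)
> # tokenize: one row per word
> sotu_unnest <- sotu_raw %>%
    tidytext::unnest_tokens(word, text)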

Notice that the length of the data dropped from 1.97 million observations to 778,161. Now you can go ahead and look at the top words. I don't do it in the following, but you can save the counts in a data frame if you so choose (see the sketch after the output):

> sotu_tidy %>%
    dplyr::count(word, sort = TRUE)
# A tibble: 29,558 x 2
   word            n
   <chr>       <int>
 1 government   7573
 2 congress     5759
 3 united       5102
 4 people       4219
 5 country      3564
 6 public       3413
 7 time         3138
 8 war          2961
 9 american     2853
10 world        2581
# ... with 29,548 more rows
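As mentioned previously, you can store these counts for later use rather than just printing them; a minimal example (the name word_counts is just an illustration):

> word_counts <- sotu_tidy %>%
    dplyr::count(word, sort = TRUE)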

We can pass this data to ggplot2, in this case plotting the words that occur more than 2,500 times:

> sotu_tidy %>%
    dplyr::count(word, sort = TRUE) %>%
    dplyr::filter(n > 2500) %>%
    dplyr::mutate(word = reorder(word, n)) %>%
    ggplot2::ggplot(ggplot2::aes(word, n)) +
    ggplot2::geom_col() +
    ggplot2::xlab(NULL) +
    ggplot2::coord_flip() +
    ggthemes::theme_igray()

The preceding code produces a horizontal bar chart of the words that occur more than 2,500 times, sorted by frequency.

We can also look at which addresses contain the most words:

> sotu_tidy %>%
    dplyr::group_by(year) %>%
    dplyr::summarise(totalWords = length(word)) %>%
    dplyr::arrange(desc(totalWords))
# A tibble: 225 x 2
    year totalWords
   <int>      <int>
 1  1981      18402
 2  1980      17553
 3  1946      12614
 4  1974      11813
 5  1979      11730
 6  1910      11178
 7  1907      10230
 8  1912      10215
 9  1911       9598
10  1899       9504
# ... with 215 more rows
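As a side note, length(word) works here because it just counts the rows in each group, but dplyr::n() is the idiomatic row counter inside summarise(), and dplyr::count() collapses the whole pipeline into a single call. A sketch of the equivalent, assuming the same sotu_tidy object:

> sotu_tidy %>%
    dplyr::count(year, name = "totalWords", sort = TRUE)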

How about that? The two longest addresses (1981 and 1980) were Jimmy Carter's final State of the Union messages, both delivered in writing rather than as speeches, which is why they run so much longer than the spoken addresses. Moving on, we'll take a look at Lincoln's top word frequencies, then create a word cloud for each of the separate addresses.
