Word frequency in all addresses

To remove stop words in a tidy format, you can use the stop_words data frame provided in the tidytext package. Call that tibble into the environment, then perform an anti-join by word:

> library(tidytext)

> data(stop_words)

> sotu_tidy <- sotu_unnest %>%
    dplyr::anti_join(stop_words, by = "word")
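If you're dipping into this section on its own, note that sotu_unnest is assumed to have been created earlier by tokenizing the addresses into one-word-per-row format. A minimal sketch of that step, assuming the sotu package's sotu_meta and sotu_text objects were the data source:

> library(sotu)
> # attach the raw text to the metadata: one row per address
> sotu_raw <- dplyr::mutate(sotu_meta, text = sotu_text)
> # tokenize: one row per word
> sotu_unnest <- sotu_raw %>%
    tidytext::unnest_tokens(word, text)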

Notice that the length of the data dropped from 1.97 million observations to 778,161. Now you can go ahead and look at the top words. I don't do it in the following, but you can save the counts in a data frame if you so choose (see the sketch after the output):

> sotu_tidy %>%
    dplyr::count(word, sort = TRUE)
# A tibble: 29,558 x 2
   word            n
   <chr>       <int>
 1 government   7573
 2 congress     5759
 3 united       5102
 4 people       4219
 5 country      3564
 6 public       3413
 7 time         3138
 8 war          2961
 9 american     2853
10 world        2581
# ... with 29,548 more rows
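As mentioned previously, you can store these counts for later use rather than just printing them; a minimal example (the name word_counts is just an illustration):

> word_counts <- sotu_tidy %>%
    dplyr::count(word, sort = TRUE)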

We can pass this data to ggplot2, in this case plotting the words that occur more than 2,500 times:

> sotu_tidy %>%
    dplyr::count(word, sort = TRUE) %>%
    dplyr::filter(n > 2500) %>%
    dplyr::mutate(word = reorder(word, n)) %>%
    ggplot2::ggplot(ggplot2::aes(word, n)) +
    ggplot2::geom_col() +
    ggplot2::xlab(NULL) +
    ggplot2::coord_flip() +
    ggthemes::theme_igray()

The preceding code produces a horizontal bar chart of the words that occur more than 2,500 times, sorted by frequency.

We can also look at which addresses contain the most words:

> sotu_tidy %>%
    dplyr::group_by(year) %>%
    dplyr::summarise(totalWords = length(word)) %>%
    dplyr::arrange(desc(totalWords))
# A tibble: 225 x 2
    year totalWords
   <int>      <int>
 1  1981      18402
 2  1980      17553
 3  1946      12614
 4  1974      11813
 5  1979      11730
 6  1910      11178
 7  1907      10230
 8  1912      10215
 9  1911       9598
10  1899       9504
# ... with 215 more rows
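As a side note, length(word) works here because it just counts the rows in each group, but dplyr::n() is the idiomatic row counter inside summarise(), and dplyr::count() collapses the whole pipeline into a single call. A sketch of the equivalent, assuming the same sotu_tidy object:

> sotu_tidy %>%
    dplyr::count(year, name = "totalWords", sort = TRUE)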

How about that? The two longest addresses (1981 and 1980) were Jimmy Carter's final State of the Union messages, both delivered in writing rather than as speeches, which is why they run so much longer than the spoken addresses. Moving on, we'll take a look at Lincoln's top word frequencies, then create a word cloud for each of the separate addresses.
