N-grams

Looking at combinations of words, such as bigrams or trigrams, can help you understand the relationships between words. Using tidy methods again, we'll create bigrams and explore those relationships to extract insights from the text. I will continue with President Lincoln as the subject, so you can compare what you gain with n-grams versus single words. Getting started is easy, as you just specify the number of words to join. Notice in the following code that I maintain word capitalization:

> sotu_bigrams <- sotu_meta %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    tidytext::unnest_tokens(bigram, text, token = "ngrams", n = 2,
                            to_lower = FALSE)
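If you wanted trigrams instead, the only change is n = 3 (plus an output column name of your choosing). Here is a minimal sketch under the same year filter, with sotu_trigrams and trigram as purely illustrative names:

> sotu_trigrams <- sotu_meta %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    tidytext::unnest_tokens(trigram, text, token = "ngrams", n = 3,
                            to_lower = FALSE)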

Let's take a look at the bigrams:

> sotu_bigrams %>%
    dplyr::count(bigram, sort = TRUE)
# A tibble: 17,687 x 2
   bigram            n
   <chr>         <int>
 1 of the          509
 2 to the          180
 3 in the          146
 4 by the           97
 5 for the          94
 6 have been        82
 7 United States    79
 8 and the          76
 9 has been         76
10 the United       73
# ... with 17,677 more rows

Those pesky stop words! Fear not, as we can deal with them in short order:

> bigrams_separated <- sotu_bigrams %>%
    tidyr::separate(bigram, c("word1", "word2"), sep = " ")

> bigrams_filtered <- bigrams_separated %>%
    dplyr::filter(!word1 %in% tidytext::stop_words$word) %>%
    dplyr::filter(!word2 %in% tidytext::stop_words$word)
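If you later want the filtered bigrams back in a single column, for example to plot their frequencies, tidyr::unite() reverses the separate() step. This isn't required for the counts that follow; it's just a small sketch, with bigrams_united as an illustrative name:

> bigrams_united <- bigrams_filtered %>%
    tidyr::unite(bigram, word1, word2, sep = " ")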

Now, it makes sense to look at Lincoln's bigrams:

> bigram_counts <- bigrams_filtered %>%
    dplyr::count(word1, word2, sort = TRUE)

> bigram_counts
# A tibble: 3,488 x 3
   word1   word2         n
   <chr>   <chr>     <int>
 1 United  States       79
 2 public  debt         11
 3 public  lands        10
 4 Great   Britain       9
 5 civil   war           8
 6 I       recommend     8
 7 naval   service       8
 8 annual  message       7
 9 foreign nations       7
10 free    colored       7
# ... with 3,478 more rows

This is interesting. I was surprised to see Great Britain appear nine times, but on reflection it makes sense: Britain was a political thorn in the Union's side. I'll spare you the details. You can create a visual representation of these word relationships with a network graph:

> bigram_graph <- bigram_counts %>%
    dplyr::filter(n > 4) %>%
    igraph::graph_from_data_frame()

> set.seed(1861) # for a reproducible graph layout

> ggraph::ggraph(bigram_graph, layout = "fr") +
    ggraph::geom_edge_link() +
    ggraph::geom_node_point() +
    ggraph::geom_node_text(ggplot2::aes(label = name), vjust = 1, hjust = 1)

The preceding code produces a network graph in which each frequent bigram is drawn as an edge between its two words, so related terms cluster together.
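One optional refinement: graph_from_data_frame() keeps any extra columns (here, n) as edge attributes, so you can map edge transparency to the bigram count and let the most common pairs stand out. A sketch of that variation:

> ggraph::ggraph(bigram_graph, layout = "fr") +
    ggraph::geom_edge_link(ggplot2::aes(edge_alpha = n)) +
    ggraph::geom_node_point() +
    ggraph::geom_node_text(ggplot2::aes(label = name), vjust = 1, hjust = 1)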

I think it is safe to say that n-grams can help you learn from text. Combined with the word-level tokenization we did earlier, we can start to see patterns and themes. However, we can take our understanding to the next level by building topic models.
