N-grams

Looking at combinations of words, such as bigrams or trigrams, can help you understand the relationships between words. Using tidy methods again, we'll create bigrams and explore those relationships to extract insights from the text. I will continue with President Lincoln as the subject, so you can compare what you gain with n-grams versus single words. Getting started is easy, as you just specify the number of words to join. Notice in the following code that I maintain word capitalization:

> sotu_bigrams <- sotu_meta %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    tidytext::unnest_tokens(bigram, text, token = "ngrams", n = 2,
                            to_lower = FALSE)
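If you wanted trigrams instead, the only change is n = 3 (plus an output column name of your choosing). Here is a minimal sketch under the same year filter, with sotu_trigrams and trigram as purely illustrative names:

> sotu_trigrams <- sotu_meta %>%
    dplyr::filter(year > 1860 & year < 1865) %>%
    tidytext::unnest_tokens(trigram, text, token = "ngrams", n = 3,
                            to_lower = FALSE)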

Let's take a look at the bigrams:

> sotu_bigrams %>%
    dplyr::count(bigram, sort = TRUE)
# A tibble: 17,687 x 2
   bigram            n
   <chr>         <int>
 1 of the          509
 2 to the          180
 3 in the          146
 4 by the           97
 5 for the          94
 6 have been        82
 7 United States    79
 8 and the          76
 9 has been         76
10 the United       73
# ... with 17,677 more rows

Those pesky stop words! Fear not, as we can deal with them in short order:

> bigrams_separated <- sotu_bigrams %>%
    tidyr::separate(bigram, c("word1", "word2"), sep = " ")

> bigrams_filtered <- bigrams_separated %>%
    dplyr::filter(!word1 %in% tidytext::stop_words$word) %>%
    dplyr::filter(!word2 %in% tidytext::stop_words$word)
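If you later want the filtered bigrams back in a single column, for example to plot their frequencies, tidyr::unite() reverses the separate() step. This isn't required for the counts that follow; it's just a small sketch, with bigrams_united as an illustrative name:

> bigrams_united <- bigrams_filtered %>%
    tidyr::unite(bigram, word1, word2, sep = " ")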

Now, it makes sense to look at Lincoln's bigrams:

> bigram_counts <- bigrams_filtered %>%
    dplyr::count(word1, word2, sort = TRUE)

> bigram_counts
# A tibble: 3,488 x 3
   word1   word2         n
   <chr>   <chr>     <int>
 1 United  States       79
 2 public  debt         11
 3 public  lands        10
 4 Great   Britain       9
 5 civil   war           8
 6 I       recommend     8
 7 naval   service       8
 8 annual  message       7
 9 foreign nations       7
10 free    colored       7
# ... with 3,478 more rows

This is interesting. I was surprised to see Great Britain appear nine times, but on reflection it makes sense: Britain was a political thorn in the Union's side. I'll spare you the details. You can create a visual representation of these word relationships with a network graph:

> bigram_graph <- bigram_counts %>%
    dplyr::filter(n > 4) %>%
    igraph::graph_from_data_frame()

> set.seed(1861) # for a reproducible graph layout

> ggraph::ggraph(bigram_graph, layout = "fr") +
    ggraph::geom_edge_link() +
    ggraph::geom_node_point() +
    ggraph::geom_node_text(ggplot2::aes(label = name), vjust = 1, hjust = 1)

The preceding code produces a network graph in which each frequent bigram is drawn as an edge between its two words, so related terms cluster together.
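One optional refinement: graph_from_data_frame() keeps any extra columns (here, n) as edge attributes, so you can map edge transparency to the bigram count and let the most common pairs stand out. A sketch of that variation:

> ggraph::ggraph(bigram_graph, layout = "fr") +
    ggraph::geom_edge_link(ggplot2::aes(edge_alpha = n)) +
    ggraph::geom_node_point() +
    ggraph::geom_node_text(ggplot2::aes(label = name), vjust = 1, hjust = 1)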

I think it is safe to say that n-grams can help you learn from text. Combined with the word-level tokenization we did earlier, we can start to see patterns and themes. However, we can take our understanding to the next level by building topic models.
