Further cleanup

There are still a few small glitches in the wordlist. We probably do not want to keep numbers in the package descriptions at all (or we might want to replace all numbers with a placeholder text, such as NUM), and there are some frequent technical words that can be ignored as well, for example, package. Showing the plural versions of nouns is also redundant. Let's improve our corpus with some further tweaks, step by step!

Removing the numbers from the package descriptions is fairly straightforward, building on the previous examples:

> v <- tm_map(v, removeNumbers)

To remove some frequent domain-specific words that carry little meaning, let's first see the most common words in the documents. To this end, we have to compute a TermDocumentMatrix object, which can then be passed to the findFreqTerms function to identify the most popular terms in the corpus based on frequency:

> tdm <- TermDocumentMatrix(v)

This object is basically a matrix that holds the words in its rows and the documents in its columns, where each cell shows the number of occurrences. For example, let's take a look at the occurrences of the first 5 words in the first 20 documents:

> inspect(tdm[1:5, 1:20])
<<TermDocumentMatrix (terms: 5, documents: 20)>>
Non-/sparse entries: 5/95
Sparsity           : 95%
Maximal term length: 14
Weighting          : term frequency (tf)

                Docs
Terms            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
  aalenjohansson 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
  abc            0 1 0 1 1 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0
  abcdefba       0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
  abcsmc         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
  aberrations    0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0

Extracting the overall number of occurrences for each word is fairly easy. In theory, we could compute the row sums of this sparse matrix, but it is simpler to call the findFreqTerms function, which does exactly what we are after. Let's show all the terms that appear in the descriptions at least 100 times:

> findFreqTerms(tdm, lowfreq = 100)
 [1] "analysis"     "based"        "bayesian"     "data"        
 [5] "estimation"   "functions"    "generalized"  "inference"   
 [9] "interface"    "linear"       "methods"      "model"       
[13] "models"       "multivariate" "package"      "regression"  
[17] "series"       "statistical"  "test"         "tests"       
[21] "time"         "tools"        "using"       

Manually reviewing this list suggests ignoring the words based and using, besides the previously mentioned package term:

> myStopwords <- c('package', 'based', 'using')
> v <- tm_map(v, removeWords, myStopwords)

Stemming words

Now, let's get rid of the plural forms of the nouns, which also show up in the preceding lists of the most common words! This is not as easy as it sounds. We could apply some regular expressions to cut the trailing s from the words, but this method has many drawbacks, as it ignores even the most evident English grammar rules.
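To see why the naive regular expression approach falls short, here is a quick base R illustration: cutting a trailing s also mangles words that legitimately end in s:

> gsub('s$', '', c('models', 'analysis', 'class'))
[1] "model"   "analysi" "clas"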

Instead, we can use a stemming algorithm, such as Porter's stemming algorithm, which is available in the SnowballC package. The wordStem function supports 16 languages (take a look at getStemLanguages for details) and can identify the stem of each element of a character vector as easily as calling the function:

> library(SnowballC)
> wordStem(c('cats', 'mastering', 'modelling', 'models', 'model'))
[1] "cat"    "master" "model"  "model"  "model"

The only drawback here is that Porter's algorithm does not return real English words in all cases:

> wordStem(c('are', 'analyst', 'analyze', 'analysis'))
[1] "ar"      "analyst" "analyz"  "analysi"

So we will have to tweak the results further later on, reconstructing the words with the help of a language lexicon database. The easiest way to construct such a database is to copy the words of the already existing corpus:

> d <- v

Then, let's stem all the words in the documents:

> v <- tm_map(v, stemDocument, language = "english")

Here we called the stemDocument function, which is a wrapper around the SnowballC package's wordStem function. We specified only one parameter, which sets the language of the stemming algorithm. Now, let's call the stemCompletion function on our previously defined dictionary, completing each stem to a matching word found in the database.
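Before transforming the whole corpus, it might help to see what stemCompletion does with a plain character vector; the stems below are arbitrary examples, and the actual completions depend on the words present in d:

> stemCompletion(c('model', 'analysi', 'statist'), dictionary = d)

This call returns a named character vector that maps each stem to a completed word found in the dictionary.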

Unfortunately, it's not as straightforward as the previous examples, as the stemCompletion function takes a character vector of words instead of the documents that we have in our corpus. Thus, we have to write our own transformation function with the previously used content_transformer helper. The basic idea is to split each document into words by a space, apply the stemCompletion function, and then concatenate the words into sentences again:

> v <- tm_map(v, content_transformer(function(x, d) {
+         paste(stemCompletion(
+                 strsplit(stemDocument(x), ' ')[[1]],
+                 d),
+         collapse = ' ')
+       }), d)

Tip

The preceding example is rather resource hungry, so please be prepared for high CPU usage for around 30 to 60 minutes on a standard PC. As you can (technically) run the forthcoming code samples without actually performing this step, feel free to skip to the next code chunk if you are in a hurry.
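If you do run the completion step, it might be worth caching the result so that the lengthy computation does not have to be repeated in later sessions; the file name below is only a placeholder:

> saveRDS(v, 'corpus-completed.rds')
> v <- readRDS('corpus-completed.rds')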

It took some time, huh? Well, we had to iterate through all the words in each document found in the corpus, but it's well worth the trouble! Let's see the most frequently used terms in the cleaned corpus:

> tdm <- TermDocumentMatrix(v)
> findFreqTerms(tdm, lowfreq = 100)
 [1] "algorithm"     "analysing"     "bayesian"      "calculate"    
 [5] "cluster"       "computation"   "data"          "distributed"  
 [9] "estimate"      "fit"           "function"      "general"      
[13] "interface"     "linear"        "method"        "model"        
[17] "multivariable" "network"       "plot"          "random"       
[21] "regression"    "sample"        "selected"      "serial"       
[25] "set"           "simulate"      "statistic"     "test"         
[29] "time"          "tool"          "variable"     

While previously the very same command returned 23 terms, out of which we removed 3, now we see more than 30 words occurring at least 100 times in the corpus. We got rid of the plural versions of the nouns and a few other similar variations of the same terms, so the density of the term-document matrix also increased:

> tdm
<<TermDocumentMatrix (terms: 4776, documents: 5880)>>
Non-/sparse entries: 27946/28054934
Sparsity           : 100%
Maximal term length: 35
Weighting          : term frequency (tf)
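The printed sparsity is rounded to the nearest percent. If you are curious about the exact density, it can be computed directly from the sparse representation used by tm (a quick check, not needed later), which yields roughly 0.1 percent for the numbers shown above:

> length(tdm$v) / prod(dim(tdm))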

We not only decreased the number of different words to be indexed in the next steps, but we also identified a few new terms to be ignored in our further analysis; for example, set does not seem to be an important word in the package descriptions.
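Such newly spotted terms can be handled just as before, by extending our custom stop word list and removing the words from the corpus; which terms to drop beyond set is, of course, a judgment call:

> myStopwords <- c(myStopwords, 'set')
> v <- tm_map(v, removeWords, myStopwords)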

Lemmatisation

While stemming terms, we removed characters from the end of words in the hope of finding the stem, which is a heuristic process that often results in non-existing words, as we have seen previously. We tried to overcome this issue by completing these stems to the shortest meaningful words by using a dictionary, which might result in a shift in the meaning of the term, for example, when the ness suffix is removed.

Another way to reduce the number of inflectional forms of different terms, instead of deconstructing and then trying to rebuild the words, is morphological analysis with the help of a dictionary. This process is called lemmatisation, which looks for the lemma (the canonical form of a word) instead of the stem.

The Stanford NLP Group created and maintains a Java-based NLP tool called Stanford CoreNLP, which supports lemmatisation besides many other NLP algorithms such as tokenization, sentence splitting, POS tagging, and syntactic parsing.

Tip

You can use CoreNLP from R via the rJava package, or you can install the coreNLP package, which includes some wrapper functions around the CoreNLP Java library, providing easy access to, for example, lemmatisation. Please note that after installing the R package, you have to use the downloadCoreNLP function to actually install and make accessible the features of the Java library.
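As a rough sketch of how this might look with the coreNLP package (assuming downloadCoreNLP has already been run, and using an arbitrary example sentence):

> library(coreNLP)
> initCoreNLP()
> anno <- annotateString('Fitting linear models and analysing time series')
> getToken(anno)$lemma

The getToken function returns a data frame with one row per token, and its lemma column holds the canonical form of each word.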
