Cleaning the corpus

One of the nicest features of the tm package is the variety of bundled transformations that can be applied to corpora. The tm_map function provides a convenient way of running these transformations on a corpus to filter out data that is irrelevant to the actual research. To see the list of available transformation methods, simply call the getTransformations function:

> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"     
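The examples below operate on a corpus v built earlier in the chapter. As a minimal sketch, assuming the documents are the Description fields of the CRAN package list (your corpus source may of course differ), it could be constructed like this:

```r
library(tm)

## Assumption: the corpus holds CRAN package descriptions;
## the fields argument is needed, as Description is not fetched by default
res <- available.packages(fields = "Description")
v <- VCorpus(VectorSource(res[, "Description"]))
```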

We should usually start by removing the most frequently used, so-called stopwords from the corpus. These are common, short function words that usually carry less meaning than the other terms in the corpus, especially the keywords. The package already includes such lists of words for different languages:

> stopwords("english")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"       

Skimming through this list confirms that removing these rather unimportant words will not really change the meaning of the R package descriptions, although there are rare cases in which removing the stopwords is not a good idea at all. Carefully examine the output of the following R command:

> removeWords('to be or not to be', stopwords("english"))
[1] "     "

Note

This does not suggest that the memorable quote from Shakespeare is meaningless, or that we can ignore all stopwords in every case. Sometimes these words play a very important role in context, and replacing them with a space is not just unhelpful but actively harmful. That said, I would suggest that in most cases, removing the stopwords is highly practical for keeping the number of words to process at a low level.
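If some stopwords do carry meaning in a given corpus (negations, for instance), one practical option is to drop them from the stopword list before filtering. A quick sketch, where the choice of words to keep is of course corpus-specific:

```r
library(tm)

## Keep negations, which can flip the meaning of a sentence,
## while still removing the rest of the English stopword list
mystop <- setdiff(stopwords("english"), c("no", "not", "nor"))
removeWords("this is not a good idea", mystop)
```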

To iteratively apply the previous call on each document in our corpus, the tm_map function is extremely useful:

> v <- tm_map(v, removeWords, stopwords("english"))

Simply pass the corpus and the transformation function, along with its parameters, to tm_map, which takes and returns a corpus of any number of documents:

> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
A3 Accurate Adaptable Accessible Error Metrics Predictive Models

[[2]]
<<PlainTextDocument (metadata: 7)>>
Tools Approximate Bayesian Computation ABC

[[3]]
<<PlainTextDocument (metadata: 7)>>
ABCDEFBA ABiologistCanDoEverything Flux Balance Analysis package

We can see that the most common function words and a few special characters are now gone from the package descriptions. But what happens if someone starts the description with an uppercase stopword? This is shown in the following example:

> removeWords('To be or not to be.', stopwords("english"))
[1] "To     ."

It's clear that the uppercase version of the common word to was not removed from the sentence, and the trailing dot was also preserved. To this end, we should usually simply transform the uppercase letters to lowercase and replace the punctuation marks with a space, to keep the clutter among the keywords at a minimal level:

> v <- tm_map(v, content_transformer(tolower))
> v <- tm_map(v, removePunctuation)
> v <- tm_map(v, stripWhitespace)
> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
[1] a3 accurate adaptable accessible error metrics predictive models

[[2]]
[1] tools approximate bayesian computation abc

[[3]]
[1] abcdefba abiologistcandoeverything flux balance analysis package

So, we first called the tolower function from the base package to transform all characters from upper to lower case. Please note that we had to wrap the tolower function in the content_transformer function, so that our transformation really complies with the tm package's object structure. This is usually required when using a transformation function from outside of the tm package.
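The same content_transformer wrapper works for any base R string function. For example, a hypothetical toSpace helper (not part of tm) that replaces a pattern with a space can be handy for separators you want to turn into word boundaries:

```r
library(tm)

## A hypothetical helper: wrap gsub with content_transformer so that
## it can be applied via tm_map just like the built-in transformations
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

d <- VCorpus(VectorSource("foo/bar-baz"))
d <- tm_map(d, toSpace, "[/-]")
content(d[[1]])
## [1] "foo bar baz"
```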

Then, we removed all the punctuation marks from the text with the help of the removePunctuation function. The punctuation marks are the ones referred to as [:punct:] in regular expressions, including the following characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~. Usually, it's safe to remove these separators, especially when we analyze the words on their own and not their relations.
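If hyphenated terms do matter in the corpus, removePunctuation can keep dashes inside words via its preserve_intra_word_dashes argument:

```r
library(tm)

## Drop punctuation, but keep dashes that sit inside words
removePunctuation("easy-to-use, fast!", preserve_intra_word_dashes = TRUE)
## [1] "easy-to-use fast"
```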

Finally, we collapsed the multiple whitespace characters in the documents, so that only a single space remains between the filtered words.
