One of the nicest features of the tm package is the variety of bundled transformations that can be applied to corpora. The tm_map function provides a convenient way of running these transformations on a corpus to filter out data that is irrelevant to the actual research. To see the list of available transformation methods, simply call the getTransformations function:
> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"
We should usually start by removing the most frequently used, so-called stopwords from the corpus. These are common, short function words that usually carry less meaning than the other expressions in the corpus, especially the keywords. The package already bundles such lists of words in several languages:
> stopwords("english")
  [1] "i"          "me"         "my"         "myself"     "we"
  [6] "our"        "ours"       "ourselves"  "you"        "your"
 [11] "yours"      "yourself"   "yourselves" "he"         "him"
 [16] "his"        "himself"    "she"        "her"        "hers"
 [21] "herself"    "it"         "its"        "itself"     "they"
 [26] "them"       "their"      "theirs"     "themselves" "what"
 [31] "which"      "who"        "whom"       "this"       "that"
 [36] "these"      "those"      "am"         "is"         "are"
 [41] "was"        "were"       "be"         "been"       "being"
 [46] "have"       "has"        "had"        "having"     "do"
 [51] "does"       "did"        "doing"      "would"      "should"
 [56] "could"      "ought"      "i'm"        "you're"     "he's"
 [61] "she's"      "it's"       "we're"      "they're"    "i've"
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"
[101] "who's"      "what's"     "here's"     "there's"    "when's"
[106] "where's"    "why's"      "how's"      "a"          "an"
[111] "the"        "and"        "but"        "if"         "or"
[116] "because"    "as"         "until"      "while"      "of"
[121] "at"         "by"         "for"        "with"       "about"
[126] "against"    "between"    "into"       "through"    "during"
[131] "before"     "after"      "above"      "below"      "to"
[136] "from"       "up"         "down"       "in"         "out"
[141] "on"         "off"        "over"       "under"      "again"
[146] "further"    "then"       "once"       "here"       "there"
[151] "when"       "where"      "why"        "how"        "all"
[156] "any"        "both"       "each"       "few"        "more"
[161] "most"       "other"      "some"       "such"       "no"
[166] "nor"        "not"        "only"       "own"        "same"
[171] "so"         "than"       "too"        "very"
Skimming through this list confirms that removing these rather unimportant words will not really change the meaning of the R package descriptions, although there are some rare cases in which removing the stopwords is not a good idea at all. Carefully examine the output of the following R command:
> removeWords('to be or not to be', stopwords("english"))
[1] " "
This does not suggest that the memorable quote from Shakespeare is meaningless, or that we can ignore stopwords in all cases. Sometimes these words play a very important role in the context, and replacing them with a space is not merely useless but actively harmful. That said, in most cases removing the stopwords is highly practical for keeping the number of words to process at a low level.
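To see why the quote collapses into blanks, it helps to mimic the behavior in base R. The following is only a sketch of what removeWords does, not tm's actual implementation: each listed word is deleted as a whole word, while the spaces around it stay in place.

```r
# Base-R sketch (not tm's actual implementation) of removeWords:
# every listed word is deleted as a whole word, but the separating
# spaces are left behind -- hence the quote collapses into blanks.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}
remove_words_sketch("to be or not to be", c("to", "be", "or", "not"))
## [1] "     "
```

The five remaining characters are the original word separators, which is exactly what a later whitespace cleanup step has to deal with.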
To apply the previous call iteratively to each document in our corpus, the tm_map function is extremely useful:
> v <- tm_map(v, removeWords, stopwords("english"))
Simply pass the corpus and the transformation function, along with its parameters, to tm_map, which takes and returns a corpus with any number of documents:
> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
A3 Accurate Adaptable Accessible Error Metrics Predictive Models

[[2]]
<<PlainTextDocument (metadata: 7)>>
Tools Approximate Bayesian Computation ABC

[[3]]
<<PlainTextDocument (metadata: 7)>>
ABCDEFBA ABiologistCanDoEverything Flux Balance Analysis package
We can see that the most common function words and a few special characters are now gone from the package descriptions. But what happens if a description starts with an uppercase stopword? This is shown in the following example:
> removeWords('To be or not to be.', stopwords("english"))
[1] "To ."
It's clear that the uppercase version of the common word to was not removed from the sentence, and the trailing dot was also preserved. To this end, we should usually transform the uppercase letters to lowercase and replace the punctuation marks with a space, to keep the clutter among the keywords at a minimal level:
> v <- tm_map(v, content_transformer(tolower))
> v <- tm_map(v, removePunctuation)
> v <- tm_map(v, stripWhitespace)
> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
[1] a3 accurate adaptable accessible error metrics predictive models

[[2]]
[1] tools approximate bayesian computation abc

[[3]]
[1] abcdefba abiologistcandoeverything flux balance analysis package
So, we first called the tolower function from the base package to transform all characters from upper to lower case. Please note that we had to wrap the tolower function in content_transformer, so that our transformation complies with the tm package's object structure. This is usually required when using a transformation function from outside of the tm package.
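The same content_transformer wrapper works for any plain string function, not just tolower. Here is a minimal sketch on a made-up two-document corpus; the helper name to_space and the sample texts are only illustrative:

```r
library(tm)

# Toy corpus for illustration only.
docs <- VCorpus(VectorSource(c("Error Metrics", "Text-Mining in R")))

# content_transformer() wraps an ordinary string function so that
# tm_map can apply it to each document's content. Here we replace a
# given pattern with a space -- a hypothetical custom transformation.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, to_space, "-")                # custom transformation
docs <- tm_map(docs, content_transformer(tolower)) # wrapped base function
as.character(docs[[2]])
## [1] "text mining in r"
```

Extra arguments after the function name (the "-" pattern above) are passed through by tm_map to the wrapped function, just as with removeWords earlier.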
Then, we removed all the punctuation marks from the text with the help of the removePunctuation function. The punctuation marks are the ones referred to as [:punct:] in regular expressions, which includes the following characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~. Usually, it's safe to remove these separators, especially when we analyze the words on their own and not their relations.
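One caveat worth knowing: stripping all punctuation also glues together hyphenated compounds. Recent versions of the tm package offer a preserve_intra_word_dashes argument for this (an assumption worth verifying against your installed tm version):

```r
library(tm)

# Removing punctuation outright merges hyphenated compounds:
removePunctuation("A state-of-the-art solution!")
## [1] "A stateoftheart solution"

# preserve_intra_word_dashes keeps hyphens inside words while still
# dropping the other punctuation marks:
removePunctuation("A state-of-the-art solution!",
  preserve_intra_word_dashes = TRUE)
## [1] "A state-of-the-art solution"
```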
And we also removed the multiple whitespace characters from the documents, so that only a single space remains between the filtered words.
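This last step is what cleans up the residue left by the earlier transformations. Recall that removeWords replaces each stopword with nothing but keeps the surrounding spaces; stripWhitespace then collapses those runs into a single space:

```r
library(tm)

# removeWords leaves runs of spaces behind; stripWhitespace collapses
# each run into a single space character.
stripWhitespace(removeWords("to be or not to be", stopwords("english")))
## [1] " "
```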