Before we start, let's perform some preliminary steps by running the following code:
URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"
download.file(URL, destfile = "reviews.tar.gz")
untar("reviews.tar.gz")
The first two lines download the data you will use as a compressed file. The URL must be typed on a single line, with no space or line break between the quotation marks. The last line then uncompresses the file into a folder called txt_sentoken in your working directory. Change your working directory to point to this folder by using the following line of code:
setwd("txt_sentoken")
The folder contains the subfolders pos and neg. The pos folder contains 1,000 positive film reviews, whereas the neg folder contains 1,000 negative film reviews. The reviews were collected by researchers at Cornell University. We will analyze these texts here. The first thing we will do is load both corpora into R.
For this purpose, and to accomplish most of the tasks we will deal with here, we will download and load the tm package:
install.packages("tm"); library(tm)
We will now load the two corpora separately into R: first the corpus containing the positive reviews, then the corpus containing the negative reviews. The pattern="cv" argument allows us to specify that we only want to load the files that contain cv in their name:
SourcePos = DirSource(file.path(".", "pos"), pattern="cv")
SourceNeg = DirSource(file.path(".", "neg"), pattern="cv")
pos = Corpus(SourcePos)
neg = Corpus(SourceNeg)
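To see what such a pattern does without touching the review data, the following minimal sketch uses only base R. It assumes that the pattern is interpreted as a regular expression in the same way as the pattern argument of list.files(); the temporary folder and the file names are made up for illustration.

```r
# Create a throwaway folder containing three hypothetical file names
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("cv000_29416.txt", "cv001_19502.txt", "README.txt")))

# Keep only the files whose name contains "cv"
matched <- list.files(demo_dir, pattern = "cv")
print(matched)

# A stricter regular expression: names starting with "cv" and ending in ".txt"
strict <- list.files(demo_dir, pattern = "^cv.*\\.txt$")
print(strict)
```

Because the pattern is a regular expression rather than a shell wildcard, "cv" matches anywhere in the file name; anchoring it with ^ is a way to be stricter if a folder contains unrelated files.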
There are other ways to load a corpus. We will not list them all, but we will provide the most common ones here so that you're all set if you wish to use other formats.
If we were using a data frame source or a vector source for the positive examples, instead of using DirSource, we would have written (do not run this code now):
SourcePos = DataframeSource(theDataframeName) # for a data frame corpus
SourcePos = VectorSource(theVectorName) # for a vector corpus
Both types of corpora can simply be created from CSV files.
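As a minimal sketch of the CSV route, the following base R code reads a small, made-up CSV (given inline here instead of in a file) into a data frame and extracts a character vector from it. The column names doc_id and text are an assumption on our part: to our knowledge, recent versions of tm expect a data frame source to have exactly these as its first two columns, whereas older versions were less strict.

```r
# A tiny, invented CSV; in practice this would be a file read with read.csv("file.csv")
csv_text <- 'doc_id,text
r1,"a wonderful film"
r2,"a dull, lifeless plot"'

# Read the CSV into a data frame, keeping the text as character strings
reviews_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Extract the texts as a plain character vector
reviews_vec <- reviews_df$text

print(reviews_df)
print(reviews_vec)

# With tm loaded, these objects could then be passed on (not run here):
# SourcePos = DataframeSource(reviews_df)
# SourcePos = VectorSource(reviews_vec)
```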
If we were using PDF files (which would require installing the application xpdf), or Word documents (which would require the application antiword), we would have specified the corresponding reader when creating the corpus (do not run this code now either):
pos = Corpus(DirSource(file.path(".", "pos")), readerControl=list(reader=readPDF)) # for PDF files
pos = Corpus(DirSource(file.path(".", "pos")), readerControl=list(reader=readDOC)) # for Word files
There are other sources available. For a full list, type:
getSources(); getReaders()
Going back to our analysis, we can check if our corpora have been loaded correctly:
pos
The following output shows that the positive reviews have been loaded correctly:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000
Let's examine the corpus of negative reviews:
neg
The following output shows the negative reviews have been loaded correctly:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000
We can now append the corpora and check whether this operation worked as well (it did, as displayed in the following output). Remember, the first 1,000 reviews are the positive ones and the other 1,000 reviews are the negative ones:
reviews = c(pos, neg)
reviews
The output shows we do have 2,000 cases here:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2000
We can see that the joint corpus contains 2,000 documents, as expected.
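Since the joint corpus keeps the positive reviews first and the negative reviews second, the sentiment of each document can be recovered from its position alone. The following sketch builds a matching label vector in base R; the names n_pos, n_neg, and sentiment are our own and serve only as an illustration.

```r
# Number of documents in each corpus, in the order they were appended
n_pos <- 1000
n_neg <- 1000

# One label per document: 1,000 times "positive" followed by 1,000 times "negative"
sentiment <- factor(rep(c("positive", "negative"), times = c(n_pos, n_neg)),
                    levels = c("negative", "positive"))

# Check the class counts and the labels around the boundary between the corpora
print(table(sentiment))
print(sentiment[c(1, 1000, 1001, 2000)])
```

With the corpora loaded, length(pos) and length(neg) could be used in place of the hard-coded counts.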