Loading the corpus

Before we start, let's perform some preliminary steps by running the following code:

URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"
download.file(URL, destfile = "reviews.tar.gz")
untar("reviews.tar.gz")

This downloads the data as a compressed archive and extracts it. The URL must be typed on a single line in your console or script window, with no spaces between the quotation marks. The archive is uncompressed into a folder called txt_sentoken in your working directory. Change your working directory to point to this folder by using the following line of code:

setwd("txt_sentoken")

The folder contains the subfolders pos and neg. The pos folder contains 1,000 positive film reviews, whereas the neg folder contains 1,000 negative film reviews. The reviews were collected by researchers at Cornell University. We will analyze these texts here. The first thing we will do is load both corpora into R.
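Before loading anything, it can be worth confirming that the archive was extracted as described. The following is a minimal sketch using base R only; count_reviews is a hypothetical helper name introduced here for illustration, not something provided by the dataset or by any package:

```r
# Hypothetical helper: count the files in the pos and neg subfolders
# of a given root folder (by default, the current working directory).
count_reviews = function(root = ".") {
  folders = c("pos", "neg")
  # sapply() keeps the folder names as names of the result
  sapply(folders, function(f) length(list.files(file.path(root, f))))
}

# With the working directory set to txt_sentoken:
count_reviews()
```

If the extraction succeeded, both counts should match the 1,000 reviews per folder mentioned above. A count of 0 usually means the working directory is not pointing at txt_sentoken.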

For this purpose, and to accomplish most of the tasks we will deal with here, we will download and load the tm package:

install.packages("tm"); library(tm)

We will now load the two corpora separately into R: first the corpus containing the positive reviews, then the corpus containing the negative reviews. The pattern="cv" argument restricts loading to the files whose names contain cv:

SourcePos = DirSource(file.path(".", "pos"), pattern="cv")
SourceNeg = DirSource(file.path(".", "neg"), pattern="cv")
pos = Corpus(SourcePos)
neg = Corpus(SourceNeg)
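The pattern argument is treated as a regular expression matched against file names, in the same way as the pattern argument of base R's list.files. As a rough sketch, you can preview which files a given pattern would select before building the source; preview_matches is a hypothetical helper name used only for this illustration:

```r
# Hypothetical helper: list the file names in a folder that match a pattern,
# mirroring the way a directory source filters files by name.
preview_matches = function(dir, pattern = "cv") {
  list.files(dir, pattern = pattern)
}

# Show the first few matching file names in the pos folder
head(preview_matches(file.path(".", "pos")))
```

An empty result means that no file name in the folder matches the pattern, or that the folder does not exist relative to the current working directory.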

There are other ways to load a corpus. We will not list them all, but we will provide the most common ones here so that you're all set if you wish to use other formats.

If we were using a data frame source or a vector source for the positive examples, instead of using DirSource, we would have written (do not run this code now):

Pos = DataframeSource(theDataframeName) # for a data frame corpus
Pos = VectorSource(theVectorName) # for a vector corpus

Both types of corpora can simply be created from CSV files.
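As an illustration of the CSV route, here is a small sketch that writes a toy CSV file, reads it back, and builds a corpus from it in both ways. The file contents and the object names (csv_path, reviews_df, vec_corpus, df_corpus) are made up for this example. The column names doc_id and text are an assumption that matters: recent versions of tm expect a data frame source to have these as its first two columns, whereas older versions may behave differently.

```r
library(tm)

# Toy data written to a temporary CSV file (made up for this example)
csv_path = tempfile(fileext = ".csv")
toy = data.frame(doc_id = c("r1", "r2"),
                 text = c("A great film.", "A dull film."),
                 stringsAsFactors = FALSE)
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV back into a data frame
reviews_df = read.csv(csv_path, stringsAsFactors = FALSE)

# Vector source: one document per element of the text column
vec_corpus = VCorpus(VectorSource(reviews_df$text))

# Data frame source: one document per row
df_corpus = VCorpus(DataframeSource(reviews_df))

length(vec_corpus)
length(df_corpus)
```

The vector route is the simpler of the two when the CSV holds only the text; the data frame route is useful when you want the other columns to travel with the documents.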

If we were using PDF files (which would require installing the application xpdf), or Word documents (which would require the application antiword), we would pass the appropriate reader through the readerControl argument when creating the corpus (do not run this code now either):

pos = Corpus(DirSource(file.path(".", "pos")), readerControl=list(reader=readPDF)) # for PDF files
pos = Corpus(DirSource(file.path(".", "pos")), readerControl=list(reader=readDOC)) # for Word files

There are other sources available. For a full list, type:

getSources(); getReaders()

Going back to our analysis, we can check if our corpora have been loaded correctly:

pos

The following output shows that the positive reviews have been loaded correctly:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000

Let's examine the corpus of negative reviews:

neg

The following output shows the negative reviews have been loaded correctly:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1000

We can now append the two corpora and check whether this operation worked as well (it did, as the output below shows). Remember, the first 1,000 reviews are the positive ones and the other 1,000 reviews are the negative ones:

reviews = c(pos, neg)
reviews

The output shows we do have 2,000 cases here:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2000

The joint corpus contains 2,000 documents, as expected.
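Because the corpus itself does not record which review is positive and which is negative, it can help to keep a label vector in the same order as the documents. The sketch below assumes the ordering described above (1,000 positive reviews followed by 1,000 negative ones); the names n_pos, n_neg, and polarity are hypothetical and introduced only for this example:

```r
# Assumed sizes, matching the corpora loaded above;
# with the corpora in memory you could use length(pos) and length(neg) instead.
n_pos = 1000
n_neg = 1000

# One label per document, in the same order as c(pos, neg)
polarity = factor(c(rep("pos", n_pos), rep("neg", n_neg)),
                  levels = c("neg", "pos"))

# Tabulate the labels to check the counts per class
table(polarity)
```

Keeping the labels in a separate vector only works as long as the documents are not reordered or dropped, so any later filtering should be applied to the corpus and to the labels together.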
