Case study 3 – IRT models for unsupervised sentiment scaling

The theoretical underpinnings of IRT models were set out in the previous chapter. Here, we briefly review them before demonstrating how to implement this class of models. However, readers should note that this class of model is cutting-edge, to the point of being considered experimental.

IRT models for text analysis start with the strong assumption that texts (or their authors) lie along a continuum, and that this continuum affects word choice in a monotonic way: if a word is likely to be used at one end of the spectrum, it is unlikely to be used at the other. These assumptions are somewhat restrictive; we can only scale texts that deal with moderately narrow topics and that exhibit such word choice differences. Furthermore, it is important that sentiment actually be the underlying continuum; otherwise, the model will estimate whatever continuum does underlie the data. A good example is the debate about the Affordable Care Act, in which liberals typically refer to it by its official name, while conservatives are more likely to refer to it as Obamacare. Note that for IRT methods to work, these differences only have to be probabilistic (that is, some liberals certainly call it Obamacare some of the time), and the word choice differences have to be consistent across a number of terms, not just one.

As an example, we can use the same Twitter data described in the section on Naive Bayes. We have good reason to believe that these tweets meet the assumptions just discussed: there is a single underlying continuum of sentiment about a single, moderately narrow topic, and that continuum likely affects the word choice of each author.

The preprocessing steps are very similar to those of the Naive Bayes application discussed in the previous section. To begin, take the same initial steps: collect tweets with one (or more) searchTwitter call(s), turn them into a data frame called twtsdf, and append an outcome variable called hash to this data frame. These steps are exactly the same as the ones discussed earlier, and are thus not repeated. One difference is that, before continuing, we will drop retweets from the data frame, because including a large number of retweets degrades the scaling. The best practice is to drop the retweets and then, after running the IRT model, assign all tweets with the same text the same scale position. Read in English, the following line of code says: put into twtsdf all of the rows of twtsdf where the variable isRetweet is FALSE.

> twtsdf <- twtsdf[twtsdf$isRetweet == FALSE,]

Next, we continue following the Naive Bayes preprocessing steps by dropping the unnecessary variables from the data frame. After that, we execute the for loop that generates our list of vectors of words, create a corpus, and turn it into a document-term matrix, just as before. We then recode all nonzero cells to one and drop infrequent bigrams, as in the Naive Bayes example.
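Because those steps are not repeated in the text, the following is a rough sketch of one way to carry them out. It uses the tm package's standard bigram-tokenizer recipe in place of the for loop described earlier; the text and screenName columns and the frequency threshold of five are illustrative assumptions rather than the book's exact code:

# a minimal sketch of the shared preprocessing, assuming the tm package
> require(tm)
# build a corpus from the tweet text (column name is an assumption)
> corp <- VCorpus(VectorSource(twtsdf$text))
# the standard tm/NLP bigram tokenizer recipe
> BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=FALSE)
# build the document-bigram matrix
> dtm <- DocumentTermMatrix(corp, control=list(tokenize=BigramTokenizer))
> dat.out <- as.matrix(dtm)
# recode all nonzero cells to one
> dat.out[dat.out > 1] <- 1
# drop infrequent bigrams (the threshold of five is illustrative)
> dat.out <- dat.out[, colSums(dat.out) > 5]
# label each row with the author's user name (column name is an assumption)
> rownames(dat.out) <- twtsdf$screenName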

The second difference in the preprocessing steps for IRT is that we need to aggregate all of the tweets by user, that is, collapse the document-term matrix such that there is only one row per user instead of one row per tweet. This step follows from the fact that IRT models assume that all of an author's word choices are affected by his or her position on the underlying continuum. Thus, we aggregate each author's tweets together as shown in the following code snippet:

# pull out the list of bigrams (kept for reference)
> words <- colnames(dat.out)
# aggregate by row name (that is, Twitter user name), summing rows with the same user name
> dat.agg <- aggregate(dat.out, list(rownames(dat.out)), sum)
# aggregating creates a grouping variable called Group.1; turn it back into the matrix row names
> names <- dat.agg$Group.1
> dat.agg <- as.matrix(dat.agg[, 2:dim(dat.agg)[2]])
# set cells greater than 1 back to 1, so the matrix stays binary
> dat.agg[dat.agg > 1] <- 1
> rownames(dat.agg) <- names

This bit of code similarly aggregates our hashtag vector, which we will use for model checking later:

# pull out the hashtag outcome and label each row with its author
> outcomes <- as.matrix(twtsdf$hash)
> rownames(outcomes) <- rownames(dat.out)
# aggregate by user name, averaging the hashtag scores of each user's tweets
> outcomes.agg <- aggregate(outcomes, list(rownames(outcomes)), mean)
# round so that each user gets a single hashtag score
> hashscores <- round(outcomes.agg$V1)
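Because aggregate() sorts its output by the grouping variable, the rows of outcomes.agg should line up with the rows of dat.agg, one per user. A quick sanity check along these lines (not part of the original workflow) might look like this:

# both aggregations grouped on the same user names, so the orderings should match;
# this should return TRUE before moving on
> all(outcomes.agg$Group.1 == rownames(dat.agg))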

The final preprocessing step mimics that of Naive Bayes: we drop users who use very few of the key bigrams. In Twitter examples, we find that keeping only users who employ more than a handful of key bigrams works well (here, the cutoff is four), though there is no hard-and-fast rule:

# count, for each user, how many bigrams he or she did NOT use
> num.zero <- rowSums(dat.agg == 0)

# explore the data by making a table; this can inform the choice of cutoff
> table(num.zero)
# the number of columns of the document-bigram matrix
> num_cols <- dim(dat.agg)[2]
# users must have used more than this many bigrams to be scaled
> cutoff <- 4
# create a list of authors to keep
> authors_tokeep <- which(num.zero < (num_cols - cutoff))
# keep only those users in the aggregated matrix
> dat.drop <- dat.agg[authors_tokeep,]
# similarly, drop those users from the vector of hashtag scores
> outcome <- hashscores[authors_tokeep]

After all that processing, we are ready to implement the model. The package we employ, pscl, was designed for political science applications, hence the mentions of legislators, votes, and roll calls. The rollcall function sets up the data matrix for estimation, while the ideal function estimates the logistic model described in the previous chapter as follows:

> require(pscl)
# set up the data in the format pscl expects
> rc <- rollcall(data=dat.drop)
# estimate the model (this may take several minutes)
> jmod <- ideal(rc, store.item=TRUE)

Now we have a new object named jmod. Its xbar element captures the scale position of every author for whom we had enough data to scale. One thing we may want is a list of all of the scaled positions, as shown in the following code snippet:

> scaled.positions <- data.frame(jmod$xbar)         # make a new data frame of the scale
> rownames(scaled.positions) <- rownames(dat.drop)  # make the row names sensible
> colnames(scaled.positions) <- "scale"             # give the single variable a good name
> head(scaled.positions)                            # list the first few scaled positions
# note: these are not real user names
                scale
exUser1   0.739675498
exUser2  -0.099127391
exUser3   0.466915634

Another way to look at this data is to draw a histogram of all of the scaled positions:

> hist(jmod$xbar, main= "Scaled Positions of Twitter Users", xlab= "relative position")

The histogram shows the overall distribution of positions in the data. Some care needs to be taken in interpreting these values. Foremost, these values are relative to one another. Put differently, these numbers describe distances between authors, not positions on any absolute scale. This means that zero, rather than meaning neutral, means something closer to central. Similarly, the scores only tell us that an author at zero is to the left of authors at 0.1 and to the right of authors at -0.1. Had we scaled only the prochoice tweets, we would still have obtained a distribution centered at zero, even though, substantively, all of the tweets being scaled were on the same side of the issue.

A second caution concerns authors scaled at or near zero. These may be genuinely centrist authors; however, authors who are difficult to scale also end up near the center, whether because they use few bigrams or because they use conflicting bigrams (some from the left and some from the right). The point is that zero may not mean centrist; it may also mean "I'm not sure".
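One rough way to tell these two cases apart is to look at how many key bigrams each near-zero author actually used. The following sketch is illustrative; the 0.1 window is an arbitrary choice:

# flag authors whose scale position sits near zero (the 0.1 window is arbitrary)
> near.zero <- abs(scaled.positions$scale) < 0.1
# list those authors along with the number of key bigrams they used
# (dat.drop is a 0/1 matrix, so row sums count bigrams)
> cbind(scaled.positions, n.bigrams=rowSums(dat.drop))[near.zero,]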

Additionally, the model has no sense of left or right; it just assigns numbers to authors. To figure out whether authors with scores near +1 are liberals or conservatives, we need to examine the tweets of a few authors from each end. First, identify a handful of authors with large, positive scale values, and then go back to the original data frame and read their tweets. This should give you a sense of which end of the scale is which. It should also tell you whether the estimated scale corresponds to the latent dimension you had hoped to capture, that is, sentiment. The following code lists the authors with large, positive positions:

# generate a list of authors with large, positive positions
> scaled.positions[scaled.positions$scale > 0.9,]
# these are just examples, not real users
MrTweeter          0.9137
AnotherTweeter     0.9442

Then, go back to the original data frame and pull up the tweet(s) from those users. This should tell you whether the people on the positive end of the scale are on the left or the right. You should check several users from both ends. People on the same end of the scale should, generally, hold the same views. If not, the model may have failed, or it may have picked up an underlying continuum other than the one you were looking for!

> twtsdf[twtsdf$screenName == "MrTweeter",]
1   This would be MrTweeter's original tweet and hashtag.

We may also be interested in the difficulty and discrimination parameters of each bigram. We can get a sense of both by plotting one against the other: jmod$betabar[,2] holds the difficulty parameters and jmod$betabar[,1] holds the discrimination parameters.

> plot(jmod$betabar[,2], jmod$betabar[,1], 
    xlab="difficulty", ylab="discrimination", main="Bigrams")

On this graph, each point represents a bigram. The x axis represents how hard it is for a user to use a bigram; essentially, this is a measure of the bigram's rarity. The y axis plots each bigram's discrimination, that is, the extent to which it is more likely to be used by those on one side of the scale than the other. For instance, bigrams with large, positive discrimination parameters are likely to be used by those on the right-hand side of the scale and unlikely to be used by those on the left-hand side. The sign determines left versus right, and the magnitude represents how strong the effect is. Bigrams with near-zero discrimination parameters are used equally by authors on all parts of the scale.

Examining this graph, we see that most bigrams do not discriminate between sides of the scale. Additionally, there is a strong relationship between difficulty and discrimination: bigrams that are used frequently do not discriminate much, whereas more infrequently used bigrams discriminate better. This is a classic pattern in scaling applications; the absence of this flying-V pattern is evidence that scaling has failed or that the model has picked up an underlying continuum that is bizarre or nonsensical.
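If you would rather quantify this relationship than eyeball it, a quick, optional check is to correlate each bigram's difficulty with the absolute value of its discrimination; given the sign conventions above, a clearly positive value is consistent with the flying-V pattern:

# rough numeric check of the difficulty/discrimination relationship
> cor(jmod$betabar[,2], abs(jmod$betabar[,1]))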

If you want to get a sense of the most discriminating bigrams, you can generate a list with the following code, which uses the plyr package for its convenient arrange() function:

# load plyr for its arrange() function
> require(plyr)
# identify which bigrams have large discrimination parameters
# abs() returns the absolute value
> t <- which(abs(jmod$betabar[,1]) > 1)
> twords <- colnames(dat.drop)[t]
> tnums <- jmod$betabar[,1][t]
# make a data frame of the discriminating bigrams, sorted by discrimination
> bigwords <- data.frame(twords, tnums)
> bigwords <- arrange(bigwords, desc(tnums))

Finally, since we know which side of the abortion issue each of our users was on, we can plot the scaled positions and color-code each point by the hashtag its author used (this is why we saved the hashtags during preprocessing). In a real, unsupervised sentiment analysis setting, you would not have this information, but since we do, we can use it to check the accuracy of the model as follows:

# make a vector, o, of the order of the scale positions
> o <- order(jmod$xbar)
# plot the scale against an arbitrary y value (one point per user),
# color coding each point according to its hashtag with the 'col' argument
> plot(jmod$xbar[o], seq_along(jmod$xbar), col=-outcome[o]+3)

We've put the results of this code in the following graph, and also added a text callout of a couple of the tweets and a legend for readability. The code for the extras is simple and can be found on this book's web page on the Packt Publishing website:


The x axis here represents the scale position, while the y axis is arbitrary. The green points represent users who tweeted #prolife, whereas the red points represent users who tweeted #prochoice. We can see that, in general, most of the red points are to the left of zero, while most of the green points are to the right of zero. Thus, the model seems to have scaled most of these users accurately. However, the model has a tough time discriminating among authors near the middle of the scale, which is common in scaling applications. Additionally, note the group of misclassified points in the upper-left corner. On further examination, these are users who tweeted prochoice-related phrases but used them ironically. The model, not recognizing irony, incorrectly placed these users with the prochoice users. This is a common hurdle in text analysis and represents a fruitful area for future research.
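The visual check can also be turned into rough counts by cross-tabulating the side of the scale each author landed on against the hashtag he or she used. This is an illustrative check that assumes hash was coded 0/1; because the model's left/right orientation is arbitrary, a flipped scale simply swaps the rows of the table:

# which side of zero did each scaled author land on?
> side <- ifelse(jmod$xbar > 0, "right of zero", "left of zero")
# cross-tabulate against the (aggregated) hashtag scores
> table(side, hashtag=outcome)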
