Unsupervised social media mining – Item Response Theory for text scaling

The techniques set out earlier for scaling or classifying sentiment in texts are fairly robust; that is, they tend to work well under a wide variety of conditions, such as heterogeneous text lengths and topic breadths. However, each of these methods requires substantial analyst input, such as labeling training data or creating a lexicon. Item Response Theory (IRT) is, strictly speaking, a theory, but we use the term here to refer to the class of statistical models built on it. These models provide a way to scale texts according to sentiment in the absence of labeled training data; that is, IRT models are unsupervised learning models.

IRT models were developed by psychologists for scoring complex tests and were later picked up by political scientists, who use them for scaling legislators. We will briefly explain the legislative context, as it will help readers build intuition about how these models work when applied to scaling texts. Consider a set of V voters, such as US Senators, who, over the course of a year, vote on B bills. For simplicity, assume each voter can only vote yes or no. We could then put all of the data into a matrix, where each row represents a voter and each column a bill. Each cell then represents a particular voter's decision on a particular bill, that is, yes (1) or no (0). Now, we need to make two related assumptions. The first is that all or most of these voters can be described as lying along a single underlying continuum. The second, more modest assumption is that this position influences their votes, at least on some bills. With these assumptions in place, we can estimate a statistical model that describes the probability of each cell in our data matrix being a one or a zero. The model is a function of how difficult each bill is to vote for (that is, how controversial it is), each voter's position on the underlying scale, and how strongly each bill's outcome depends on voters' locations on that scale. Technically, we estimate a logistic regression as follows:

Pr(y_vb = 1) = logit^-1(b1_b * x_v - b0_b)

Here, x_v is the scaled position of voter v, b0_b is the difficulty of voting yes on bill b, and b1_b is the degree to which a voter's position affects their proclivity to vote in favor of bill b. Positive values of b1 mean that voters to the right are more likely to vote in favor of that bill, and negative values of b1 mean that voters to the left are more likely to vote for it.
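
To make the mechanics concrete, the following is a minimal sketch, not a production implementation: it simulates a small vote matrix and fits the two-parameter model above by joint maximum likelihood, assuming NumPy and SciPy are available. All names (n_voters, n_bills, and so on) and the simulated data are illustrative only.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # inverse logit

    rng = np.random.default_rng(0)
    n_voters, n_bills = 20, 50

    # Simulate "true" voter positions and bill parameters, then draw votes
    x_true = rng.normal(size=n_voters)               # voter positions on the scale
    b1_true = rng.normal(size=n_bills)               # how strongly position matters (b1)
    b0_true = rng.normal(size=n_bills)               # difficulty of a yes vote (b0)
    p = expit(np.outer(x_true, b1_true) - b0_true)   # Pr(y_vb = 1)
    Y = rng.binomial(1, p)                           # observed voters-by-bills 0/1 matrix

    def neg_log_lik(params):
        # Unpack voter positions and bill parameters from one flat vector
        x = params[:n_voters]
        b1 = params[n_voters:n_voters + n_bills]
        b0 = params[n_voters + n_bills:]
        eta = np.outer(x, b1) - b0
        # Bernoulli negative log-likelihood, written to stay numerically stable
        return np.sum(np.logaddexp(0.0, eta) - Y * eta)

    init = rng.normal(scale=0.1, size=n_voters + 2 * n_bills)
    fit = minimize(neg_log_lik, init, method="L-BFGS-B")
    x_hat = fit.x[:n_voters]

    # The direction and units of the latent scale are arbitrary (x and b1 can be
    # flipped or rescaled together), which is why real IRT software adds priors or
    # constraints; here we simply check recovery of the positions up to sign.
    print("correlation with true positions:", round(abs(np.corrcoef(x_hat, x_true)[0, 1]), 2))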

As you will see, we can apply the same setup to the analogous case of text scaling. To do so, we create a matrix with rows representing authors or documents (instead of voters) and columns representing words or phrases (instead of bills). Each cell records whether or not a particular author used a particular word or phrase. We also modify the assumptions: authors lie along a sentiment continuum, and their placement on it affects their pattern of word use. The first part of this assumption is limiting. We can only apply the method to sets of documents that are sufficiently narrow to be usefully described by a single underlying continuum, and that continuum must essentially be the sentiment we are trying to measure. The results of this analysis are a continuous scaled measure of author (or document) location (x) as well as estimates of the weights for each word or phrase (b1). If the previous assumptions are met, this scaled measure of location (x) represents the author's sentiment towards the topic under study; a sketch of the matrix-building step follows.
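
As a hedged illustration of that step, the short sketch below builds a binary author-by-term matrix from a handful of invented documents using scikit-learn's CountVectorizer with binary=True. The example texts are made up for illustration; the resulting 0/1 matrix plays the same role as the vote matrix Y in the previous sketch and could be passed to the same fitting code.

    from sklearn.feature_extraction.text import CountVectorizer

    # A handful of invented product reviews standing in for authors' documents
    docs = [
        "love this phone, great battery and great camera",
        "great screen, love the camera",
        "terrible battery, total waste of money",
        "waste of money, awful screen and terrible support",
    ]

    # binary=True records only whether a term was used, mirroring a yes/no vote
    vectorizer = CountVectorizer(binary=True)
    Y = vectorizer.fit_transform(docs).toarray()   # documents-by-terms 0/1 matrix

    print(Y.shape)
    print(vectorizer.get_feature_names_out())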

The IRT-based method described here has mixed properties. It requires no training data and little subject-matter expertise to employ, it is language agnostic (that is, it could function on any language), and it generates a quantitative (rather than merely binary) measure of sentiment. However, the model can only be applied to documents that are all about the same topic, can only estimate a single underlying dimension, can be slow to estimate, and is not guaranteed to converge.
