Understanding the TweetHandler class

The TweetHandler class holds information about a specific tweet. It takes the raw JSON tweet and extracts the parts that are relevant to the application's needs. It also provides methods to process the tweet's text, such as converting it to lowercase, and to identify tweets that are not relevant so that they can be removed. The first part of the class is shown next:

public class TweetHandler { 
    private String jsonText; 
    private String text; 
    private Date date; 
    private String language; 
    private String category; 
    private String userName; 
    ... 
    public TweetHandler processJSON() { ... } 
    public TweetHandler toLowerCase(){ ... } 
    public TweetHandler removeStopWords(){ ... }     
    public boolean isEnglish(){ ... }     
    public boolean containsCharacter(String character) { ... }        
    public void computeStats(){ ... } 
    public void buildSentimentAnalysisModel() { ... } 
    public TweetHandler performSentimentAnalysis(){ ... } 
} 

The instance variables show the type of data retrieved from a tweet and processed, as detailed here:

  • jsonText: The raw JSON text
  • text: The text of the processed tweet
  • date: The date of the tweet
  • language: The language of the tweet
  • category: The tweet classification, which is positive or negative
  • userName: The name of the Twitter user

There are several other instance variables used by the class. The following are used to create and use a sentiment analysis model. The classifier static variable refers to the model:

private static String[] labels = {"neg", "pos"}; 
private static int nGramSize = 8; 
private static DynamicLMClassifier<NGramProcessLM>  
    classifier = DynamicLMClassifier.createNGramProcess( 
        labels, nGramSize); 
     

The default constructor is used to provide an instance to build the sentiment model. The single argument constructor creates a TweetHandler object using the raw JSON text:

    public TweetHandler() { 
        this.jsonText = ""; 
    } 
 
    public TweetHandler(String jsonText) { 
        this.jsonText = jsonText; 
    } 

The remainder of the methods are discussed in the following sections.

Extracting data for a sentiment analysis model

In Chapter 9, Text Analysis, we used DL4J to perform sentiment analysis. We will use LingPipe in this example as an alternative to our previous approach. Because we want to classify Twitter data, we chose a dataset with pre-classified tweets, available at http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip. We must complete a one-time process of extracting this data into a format we can use with our model before we continue with our application development.

This dataset exists in a large .csv file with one tweet and classification per line. The tweets are classified as either 0 (negative) or 1 (positive). The following is an example of one line of this data file:

95,0,Sentiment140, - Longest night ever.. ugh!    http://tumblr.com/xwp1yxhi6 

The first element represents a unique ID number which is part of the original data set and which we will use for the filename. The second element is the classification, the third is a data set label (effectively ignored for the purposes of this project), and the last element is the actual tweet text. Before we can use this data with our LingPipe model, we must write each tweet into an individual file. To do this, we created three string variables. The filename variable will be assigned either pos or neg depending on each tweet's classification and will be used in the write operation. We also use the file variable to hold the name of the individual tweet file and the text variable to hold the individual tweet text. Next, we use the readAllLines method with the Paths class's get method to store our data in a List object. We need to specify the charset, StandardCharsets.ISO_8859_1, as well:

try { 
    String filename; 
    String file; 
    String text; 
    List<String> lines = Files.readAllLines( 
        Paths.get("\\path-to-file\\SentimentAnalysisDataset.csv"),  
        StandardCharsets.ISO_8859_1); 
    ... 
 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Now we can loop through our list and use the split method to store our .csv data in a string array. We convert the element at position 1 to an integer and determine whether it is a 1. Tweets classified with a 1 are considered positive tweets and we set filename to pos. All other tweets set the filename to neg. We extract the output filename from the element at position 0 and the text from element 3. We ignore the label in position 2 for the purposes of this project. Finally, we write out our data:

for(String s : lines) { 
    // Limit the split to four fields so commas inside the tweet text are kept 
    String[] oneLine = s.split(",", 4); 
    if(Integer.parseInt(oneLine[1])==1) { 
        filename = "pos"; 
    } else { 
        filename = "neg"; 
    } 
    file = oneLine[0]+".txt"; 
    text = oneLine[3]; 
    Files.write(Paths.get( 
        "\\path-to-file\\txt_sentoken\\"+filename+"\\"+file), 
        text.getBytes()); 
} 

Notice that we created the neg and pos directories within the txt_sentoken directory. This location is important when we read the files to build our model.
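The per-line field extraction can be tried on the sample line without touching the file system. The following is a minimal sketch, not the book's code: the class name CsvLineSketch and its helper methods are hypothetical, and it assumes the tweet text is always the fourth and last column, so limiting the split to four parts keeps any commas in the text intact:

```java
class CsvLineSketch {
    // Splits "id,label,source,text" into at most four fields so that
    // commas inside the tweet text stay part of the last field.
    static String[] parseLine(String line) {
        return line.split(",", 4);
    }

    // Maps the classification column to the directory name used for training.
    static String directoryFor(String[] fields) {
        return Integer.parseInt(fields[1].trim()) == 1 ? "pos" : "neg";
    }

    public static void main(String[] args) {
        String line = "95,0,Sentiment140, - Longest night ever.. ugh!"
                + "    http://tumblr.com/xwp1yxhi6";
        String[] fields = parseLine(line);
        // Directory and filename the tweet would be written to
        System.out.println(directoryFor(fields) + "/" + fields[0] + ".txt");
        // The tweet text itself
        System.out.println(fields[3].trim());
    }
}
```

Keeping the parsing in small methods like these makes it easy to check a few awkward lines before running the one-time extraction over the whole file.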

Building the sentiment model

Now we are ready to build our model. We loop through the labels array, which contains neg and pos, and for each label we create a new Classification object. We then create a File object for that label's directory and use the listFiles method to obtain an array of the files it contains. Next, we will traverse these files using a for loop:

public void buildSentimentAnalysisModel() { 
    out.println("Building Sentiment Model"); 
     
    File trainingDir = new File("\\path to file\\txt_sentoken"); 
    for (int i = 0; i < labels.length; i++) { 
        Classification classification =  
            new Classification(labels[i]); 
        File file = new File(trainingDir, labels[i]); 
        File[] trainingFiles = file.listFiles(); 
        ... 
    } 
} 

Within the for loop, we read each tweet file into the string review. We then create a new Classified object using review and classification. Finally, we call the handle method to train the classifier on this labeled text:

for (int j = 0; j < trainingFiles.length; j++) { 
    try { 
        String review = Files.readFromFile(trainingFiles[j],  
            "ISO-8859-1"); 
        Classified<CharSequence> classified = new  
            Classified<>(review, classification); 
        classifier.handle(classified); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
 } 
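One detail worth guarding against in these loops is that File.listFiles returns null when the path does not exist or is not a directory, which would cause a NullPointerException at trainingFiles.length. The following is a minimal sketch of such a guard. The class TrainingFilesSketch and the method filesFor are hypothetical names, and the temporary directory exists only for the demonstration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TrainingFilesSketch {
    // Returns the files under trainingDir/label, or an empty array when the
    // directory is missing (File.listFiles returns null in that case).
    static File[] filesFor(File trainingDir, String label) {
        File[] files = new File(trainingDir, label).listFiles();
        return files == null ? new File[0] : files;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directory used only for this demonstration
        Path root = Files.createTempDirectory("txt_sentoken");
        Files.createDirectory(root.resolve("pos"));
        Files.write(root.resolve("pos").resolve("1.txt"),
                "great day".getBytes());

        // The pos directory holds one file; the neg directory was never created
        System.out.println(filesFor(root.toFile(), "pos").length);
        System.out.println(filesFor(root.toFile(), "neg").length);
    }
}
```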

For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.

Processing the JSON input

The Twitter data is retrieved in JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store them in the corresponding fields of the TweetHandler class.

The TweetHandler class's processJSON method does the actual data extraction. An instance of the JSONObject is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.

The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:

public TweetHandler processJSON() { 
    try { 
        JSONObject jsonObject = new JSONObject(this.jsonText); 
        ... 
    } catch (JSONException ex) { 
        // Handle exceptions 
    } 
    return this; 
} 

First, we extract the tweet's text as shown here:

this.text = jsonObject.getString("text"); 

Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:

  • EEE: Day of the week specified using three characters
  • MMM: Month, using three characters
  • d: Day of the month
  • HH:mm:ss: Hours, minutes, and seconds
  • Z: Time zone
  • yyyy: Year

The code follows:

SimpleDateFormat sdf = new SimpleDateFormat( 
    "EEE MMM d HH:mm:ss Z yyyy"); 
try { 
    this.date = sdf.parse(jsonObject.getString("created_at")); 
} catch (ParseException ex) { 
    // Handle exceptions 
} 
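The date pattern can be checked on its own with a sample string. The following is a minimal sketch under stated assumptions: the class name CreatedAtSketch is hypothetical, the sample value is made up to match the pattern, and passing Locale.ENGLISH is an addition (not part of the code above) so that the English day and month abbreviations parse regardless of the machine's default locale:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class CreatedAtSketch {
    // Parses a created_at value using the pattern described above.
    // Locale.ENGLISH is an assumption added for this sketch.
    static Date parseCreatedAt(String createdAt) {
        SimpleDateFormat sdf = new SimpleDateFormat(
                "EEE MMM d HH:mm:ss Z yyyy", Locale.ENGLISH);
        try {
            return sdf.parse(createdAt);
        } catch (ParseException ex) {
            // Surface malformed dates instead of silently ignoring them
            throw new IllegalArgumentException(
                    "Unparseable date: " + createdAt, ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample value in the created_at layout
        Date date = parseCreatedAt("Wed Aug 27 13:08:45 +0000 2008");
        // Prints the same moment as an ISO-8601 instant in UTC
        System.out.println(date.toInstant());
    }
}
```

Note that SimpleDateFormat is not thread-safe, so creating a new instance per call, as here, is the simplest safe choice if tweets are ever processed in parallel.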

The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:

this.language = jsonObject.getString("lang"); 
JSONObject user = jsonObject.getJSONObject("user"); 
this.userName = user.getString("name"); 

Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

The conversion of the text to lowercase letters is easily achieved as shown here:

    public TweetHandler toLowerCase() { 
        this.text = this.text.toLowerCase().trim(); 
        return this; 
    } 

Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:

    public boolean isEnglish() { 
        return this.language.equalsIgnoreCase("en"); 
    } 
     
    public boolean containsCharacter(String character) { 
        return this.text.contains(character); 
    } 
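To illustrate how these methods combine in a stream, the following is a minimal sketch rather than the application's actual pipeline. MiniTweet is a hypothetical stand-in for TweetHandler that carries only the text and language fields, and FilterSketch, relevantTexts, and getText are names invented for this example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for TweetHandler with only the fields the filters need
class MiniTweet {
    private String text;
    private final String language;

    MiniTweet(String text, String language) {
        this.text = text;
        this.language = language;
    }

    // Returns this so the call can be used in a map step
    MiniTweet toLowerCase() {
        this.text = this.text.toLowerCase().trim();
        return this;
    }

    boolean isEnglish() {
        return this.language.equalsIgnoreCase("en");
    }

    boolean containsCharacter(String character) {
        return this.text.contains(character);
    }

    String getText() {
        return this.text;
    }
}

class FilterSketch {
    // Keeps only English tweets whose lowercased text mentions the topic
    static List<String> relevantTexts(List<MiniTweet> tweets, String topic) {
        return tweets.stream()
                .map(MiniTweet::toLowerCase)
                .filter(MiniTweet::isEnglish)
                .filter(t -> t.containsCharacter(topic))
                .map(MiniTweet::getText)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<MiniTweet> tweets = Arrays.asList(
                new MiniTweet("  Loved the new Star Wars trailer ", "en"),
                new MiniTweet("J'adore Star Wars", "fr"),
                new MiniTweet("Nothing to see here", "en"));
        System.out.println(relevantTexts(tweets, "star wars"));
    }
}
```

Because the transforming methods return the object itself, they fit in map steps, while the boolean methods fit in filter steps.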

Numerous other cleaning operations can be easily added to the process such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.

Removing stop words

Stop words are those words that do not contribute to the understanding or processing of data. Typical stop words include the, and, a, and or. When they do not contribute to the data process, they can be removed to simplify processing and make it more efficient.

There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:

public TweetHandler removeStopWords() { 
    TokenizerFactory tokenizerFactory 
            = IndoEuropeanTokenizerFactory.INSTANCE; 
    tokenizerFactory =  
        new EnglishStopTokenizerFactory(tokenizerFactory); 
    ... 
    return this; 
} 

The tokenizer yields only the tokens that are not stop words, and a StringBuilder instance is used to join them into a string that replaces the original text:

Tokenizer tokens = tokenizerFactory.tokenizer( 
        this.text.toCharArray(), 0, this.text.length()); 
StringBuilder buffer = new StringBuilder(); 
for (String word : tokens) { 
    buffer.append(word + " "); 
} 
this.text = buffer.toString(); 

The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.
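To experiment with different stop word lists, the same idea can be sketched with the standard library alone. This is an illustrative alternative, not the LingPipe approach used above: the class StopWordSketch is hypothetical, the stop list is a tiny made-up sample rather than LingPipe's list, and tokens are split on whitespace only:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // Tiny illustrative stop list; this is not LingPipe's list
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "or", "is", "of"));

    // Splits on whitespace, drops stop words, and rejoins with single spaces
    static String removeStopWords(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word.toLowerCase()))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("the cat and a dog"));
    }
}
```

Swapping in a different set, or an empty one to disable removal, then becomes a one-line change.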

Performing sentiment analysis

We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a new Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to classify our text as either positive or negative. Finally, we set category to the result and return the TweetHandler object:

public TweetHandler performSentimentAnalysis() { 
    Classification classification =  
        classifier.classify(this.text); 
    String bestCategory = classification.bestCategory(); 
    this.category = bestCategory; 
    return this; 
} 

We are now ready to analyze the results of our application.

Analyzing the results

The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:

    private static int numberOfPositiveReviews = 0; 
    private static int numberOfNegativeReviews = 0; 

The computeStats method is called from the Java 8 stream and increments the appropriate variable:

public void computeStats() { 
    if(this.category.equalsIgnoreCase("pos")) { 
        numberOfPositiveReviews++; 
    } else { 
        numberOfNegativeReviews++; 
    } 
} 

Two static methods provide access to the number of reviews:

public static int getNumberOfPositiveReviews() { 
    return numberOfPositiveReviews; 
} 
 
public static int getNumberOfNegativeReviews() { 
    return numberOfNegativeReviews; 
} 
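Incrementing static int counters is not atomic, so the totals could be wrong if the stream were ever run in parallel. As a hedged alternative, the counting can be expressed as a stream collector. The following minimal sketch works on a plain list of category labels, and the class CategoryCountSketch and the method countByCategory are hypothetical names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class CategoryCountSketch {
    // Counts how often each category label occurs
    static Map<String, Long> countByCategory(List<String> categories) {
        return categories.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("pos", "neg", "pos", "pos");
        Map<String, Long> counts = countByCategory(categories);
        // Labels that never occur are absent from the map, so default to zero
        System.out.println("pos=" + counts.getOrDefault("pos", 0L)
                + ", neg=" + counts.getOrDefault("neg", 0L));
    }
}
```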

In addition, a simple toString method is provided to display basic tweet information:

public String toString() { 
    return "\nText: " + this.text 
            + "\nDate: " + this.date 
            + "\nCategory: " + this.category; 
} 

More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.
