Chapter 12. Bringing It All Together

We have demonstrated many ways of using Java to support data science tasks, but largely one technique at a time. It is one thing to use the techniques in isolation and another to combine them in a cohesive fashion. In this chapter, we will provide you with additional experience with these technologies and insights into how they can be used together.

Specifically, we will create a console-based application that analyzes tweets related to a user-defined topic. Using a console-based application allows us to focus on data-science-specific technologies and avoids having to choose a specific GUI technology that may not be relevant to us. It provides a common base from which a GUI implementation can be created if needed.

The application performs and illustrates the following high-level tasks:

  • Data acquisition
  • Data cleaning, including:
    • Removing stop words
    • Cleaning the text
  • Sentiment analysis
  • Basic data statistic collection
  • Display of results

More than one type of analysis can be used with many of these steps. We will show the more relevant approaches and allude to other possibilities as appropriate. We will use Java 8's features whenever possible.

Defining the purpose and scope of our application

The application prompts the user for a set of selection criteria, which include topic and sub-topic areas and the number of tweets to process. The analysis simply computes and displays the number of positive and negative tweets for a topic and sub-topic. We use a generic sentiment analysis model, which limits the quality of the sentiment analysis. However, other models and further analysis can be added.

We will use a Java 8 stream to structure the processing of tweet data. It is a stream of TweetHandler objects, as we will describe shortly.
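To make the idea of a stream of TweetHandler objects concrete, the following is a minimal sketch rather than the application's actual code. The TweetHandler shown here is a simplified stand-in: its fields, the methods cleanText, removeStopWords, and computeSentiment, the word lists, and the keyword-based sentiment rule are all assumptions made for illustration. A real sentiment model would replace the keyword rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TweetStreamSketch {

    // Assumed word lists, purely illustrative
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("i", "the", "is", "this", "a", "was", "but"));
    static final Set<String> POSITIVE = new HashSet<>(Arrays.asList("love", "great", "good"));
    static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList("awful", "bad", "hate"));

    // Hypothetical, simplified stand-in for the chapter's TweetHandler class
    static final class TweetHandler {
        final String userName;
        final String text;
        final String sentiment;

        TweetHandler(String userName, String text) {
            this(userName, text, "neutral");
        }

        private TweetHandler(String userName, String text, String sentiment) {
            this.userName = userName;
            this.text = text;
            this.sentiment = sentiment;
        }

        // Lowercase the text, replace non-letters with spaces, collapse whitespace
        TweetHandler cleanText() {
            String cleaned = text.toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z\\s]", " ")
                    .trim()
                    .replaceAll("\\s+", " ");
            return new TweetHandler(userName, cleaned, sentiment);
        }

        // Drop every word found in the stop-word list
        TweetHandler removeStopWords() {
            String kept = Arrays.stream(text.split("\\s+"))
                    .filter(word -> !STOP_WORDS.contains(word))
                    .collect(Collectors.joining(" "));
            return new TweetHandler(userName, kept, sentiment);
        }

        // Toy rule (not a trained model): +1 per positive word, -1 per negative word
        TweetHandler computeSentiment() {
            int score = Arrays.stream(text.split("\\s+"))
                    .mapToInt(w -> POSITIVE.contains(w) ? 1 : NEGATIVE.contains(w) ? -1 : 0)
                    .sum();
            String label = score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
            return new TweetHandler(userName, text, label);
        }
    }

    // The stream pipeline: clean, remove stop words, classify, then count per label
    static Map<String, Long> countBySentiment(List<TweetHandler> tweets) {
        return tweets.stream()
                .map(TweetHandler::cleanText)
                .map(TweetHandler::removeStopWords)
                .map(TweetHandler::computeSentiment)
                .collect(Collectors.groupingBy(t -> t.sentiment, Collectors.counting()));
    }

    // Made-up sample data standing in for acquired tweets
    static List<TweetHandler> sampleTweets() {
        return Arrays.asList(
                new TweetHandler("userA", "I LOVE this new phone!!!"),
                new TweetHandler("userB", "The battery is awful :("),
                new TweetHandler("userC", "Shipping was bad, but the screen is great"));
    }

    public static void main(String[] args) {
        countBySentiment(sampleTweets())
                .forEach((label, count) -> System.out.println(label + ": " + count));
    }
}
```

Each map step in this sketch returns a new TweetHandler instead of modifying the existing one, which keeps the pipeline free of side effects. That is a design choice of the sketch, not a requirement of the stream API.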

We use several classes in this application. They are summarized here:

  • TweetHandler: This class holds the raw tweet text along with the fields needed for processing, such as the username and similar attributes.
  • TwitterStream: This is used to acquire the application's data. Using a specific class separates the acquisition of the data from its processing. The class possesses a few fields that control how the data is acquired.
  • ApplicationDriver: This contains the main method, user prompts, and the TweetHandler stream that controls the analysis.
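The division of responsibilities among these classes might be sketched as follows. This is only an outline under stated assumptions: the field names topic, subTopic, and numberOfTweets, the acquire and promptForCriteria methods, and the canned sample text are all hypothetical, and no real Twitter API call is made. The user's answers are simulated with a Scanner over a string so that the example runs unattended. A console application would pass System.in instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.stream.Collectors;

public class DriverSketch {

    // Hypothetical stand-in for TwitterStream: its fields control what is acquired
    static class TwitterStream {
        private final String topic;
        private final String subTopic;
        private final int numberOfTweets;

        TwitterStream(String topic, String subTopic, int numberOfTweets) {
            if (numberOfTweets < 0) {
                throw new IllegalArgumentException("numberOfTweets must not be negative");
            }
            this.topic = topic;
            this.subTopic = subTopic;
            this.numberOfTweets = numberOfTweets;
        }

        // Returns canned sample text instead of contacting Twitter
        List<String> acquire() {
            List<String> canned = Arrays.asList(
                    "Java streams make pipelines readable",
                    "Learning Java lambdas today",
                    "Java streams and collectors are great",
                    "Python pandas for data cleaning");
            return canned.stream()
                    .filter(t -> mentions(t, topic) && mentions(t, subTopic))
                    .limit(numberOfTweets)
                    .collect(Collectors.toList());
        }

        // Case-insensitive substring match
        private static boolean mentions(String text, String term) {
            return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
        }
    }

    // Driver-side prompt: reads topic, sub-topic, and tweet count, one per line
    static TwitterStream promptForCriteria(Scanner in) {
        System.out.print("Topic: ");
        String topic = in.nextLine().trim();
        System.out.print("Sub-topic: ");
        String subTopic = in.nextLine().trim();
        System.out.print("Number of tweets: ");
        int count = Integer.parseInt(in.nextLine().trim());
        return new TwitterStream(topic, subTopic, count);
    }

    public static void main(String[] args) {
        // Simulated user input; use new Scanner(System.in) for interactive use
        Scanner in = new Scanner("java\nstreams\n5\n");
        TwitterStream source = promptForCriteria(in);
        System.out.println();
        source.acquire().forEach(System.out::println);
    }
}
```

Keeping acquisition behind its own class means the driver only needs a list of tweets to process, so the canned source in this sketch could later be swapped for a live one without touching the analysis code.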

Each of these classes will be detailed in later sections. However, we will present ApplicationDriver next to provide an overview of the analysis process and how the user interacts with the application.
