Chapter 2. Data Acquisition

It is never much fun to work with code that is not formatted properly or uses variable names that do not convey their intended purpose. The same can be said of data, except that bad data can result in inaccurate results. Thus, data acquisition is an important step in the analysis of data. Data is available from a number of sources but must be retrieved and ultimately processed before it can be useful. It is available from a variety of sources. We can find it in numerous public data sources as simple files, or it may be found in more complex forms across the Internet. In this chapter, we will demonstrate how to acquire data from several of these, including various Internet sites and several social media sites.

We can access data from the Internet by downloading specific files or through a process known as web scraping, which involves extracting the contents of a web page. We also explore a related topic known as web crawling, which involves applications that examine a web site to determine whether it is of interest and then follows embedded links to identify other potentially relevant pages.

We can also extract data from social media sites. These types of sites often hold a treasure trove of data that is readily available if we know how to access it. In this chapter, we will demonstrate how to extract data from several sites, including:

  • Twitter
  • Wikipedia
  • Flickr
  • YouTube

When extracting data from a site, many different data formats may be encountered. We will examine three basic types: text, audio, and video. However, even within text, audio, and video data, many formats exist. For audio data alone, there are 45 audio coding formats compared at https://en.wikipedia.org/wiki/Comparison_of_audio_coding_formats. For textual data, there are almost 300 formats listed at http://fileinfo.com/filetypes/text. In this chapter, we will focus on how to download and extract these types of text as plain text for eventual processing.

We will briefly examine different data formats, followed by an examination of possible data sources. We need this knowledge to demonstrate how to obtain data using different data acquisition techniques.

Understanding the data formats used in data science applications

When we discuss data formats, we are referring to content format, as opposed to the underlying file format, which may not even be visible to most developers. We cannot examine all available formats due to the vast number of formats available. Instead, we will tackle several of the more common formats, providing adequate examples to address the most common data retrieval needs. Specifically, we will demonstrate how to retrieve data stored in the following formats:

  • HTML
  • PDF
  • CSV/TSV
  • Spreadsheets
  • Databases
  • JSON
  • XML

Some of these formats are well supported and documented elsewhere. For example, XML has been in use for years and there are several well-established techniques for accessing XML data in Java. For these types of data, we will outline the major techniques available and show a few examples to illustrate how they work. This will provide those readers who are not familiar with the technology some insight into their nature.

The most common data format is binary files. For example, Word, Excel, and PDF documents are all stored in binary. These require special software to extract information from them. Text data is also very common.

Overview of CSV data

Comma Separated Values (CSV) files, contain tabular data organized in a row-column format. The data, stored as plaintext, is stored in rows, also called records. Each record contains fields separated by commas. These files are also closely related to other delimited files, most notably Tab-Separated Values (TSV) files. The following is a part of a simple CSV file, and these numbers are not intended to represent any specific type of data:

JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE 
10001,44,22,0.5 
10002,35,19,0.54 
10003,1,1,1 

Notice that the first row contains header data to describe the subsequent records. Each value is separated by a comma and corresponds to the header in the same position. In Chapter 3, Data Cleaning, we will discuss CSV files in more depth and examine the support available for different types of delimiters.

Overview of spreadsheets

Spreadsheets are a form of tabular data where information is stored in rows and columns, much like a two-dimensional array. They typically contain numeric and textual information and use formulas to summarize and analyze their contents. Most people are familiar with Excel spreadsheets, but they are also found as part of other product suites, such as OpenOffice.

Spreadsheets are an important data source because they have been used for the past several decades to store information in many industries and applications. Their tabular nature makes them easy to process and analyze. It is important to know how to extract data from this ubiquitous data source so that we can take advantage of the wealth of information that is stored in them.

For some of our examples, we will use a simple Excel spreadsheet that consists of a series of rows containing an ID, along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet looks like this:

ID

Minimum

Maximum

Average

12345

45

89

65.55

23456

78

96

86.75

34567

56

89

67.44

45678

86

99

95.67

In Chapter 3, Data Cleaning, we will learn how to extract data from spreadsheets.

Overview of databases

Data can be found in Database Management Systems (DBMS), which, like spreadsheets, are ubiquitous. Java provides a rich set of options for accessing and processing data in a DBMS. The intent of this section is to provide a basic introduction to database access using Java.

We will demonstrate the essence of connecting to a database, storing information, and retrieving information using JDBC. For this example, we used the MySQL DBMS. However, it will work for other DBMSes as well with a change in the database driver. We created a database called example and a table called URLTABLE using the following command within the MySQL Workbench. There are other tools that can achieve the same results:

CREATE TABLE IF NOT EXISTS `URLTABLE` ( 
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT, 
  `URL` text NOT NULL, 
  PRIMARY KEY (`RecordID`) 
); 

We start with a try block to handle exceptions. A driver is needed to connect to the DBMS. In this example, we used com.mysql.jdbc.Driver. To connect to the database, the getConnection method is used, where the database server location, user ID, and password are passed. These values depend on the DBMS used:

    try { 
        Class.forName("com.mysql.jdbc.Driver"); 
        Stri­ng url = "jdbc:mysql://localhost:3306/example"; 
        connection = DriverManager.getConnection(url, "user ID", 
            "password"); 
            ... 
    } catch (SQLException | ClassNotFoundException ex) { 
        // Handle exceptions 
    } 

Next, we will illustrate how to add information to the database and then how to read it. The SQL INSERT command is constructed in a string. The name of the MySQL database is example. This command will insert values into the URLTABLE table in the database where the question mark is a placeholder for the value to be inserted:

    String insertSQL = "INSERT INTO  `example`.`URLTABLE` " 
        + "(`url`) VALUES " + "(?);"; 

The PreparedStatement class represents an SQL statement to execute. The prepareStatement method creates an instance of the class using the INSERT SQL statement:

    PreparedStatement stmt = connection.prepareStatement(insertSQL); 

We then add URLs to the table using the setString method and the execute method. The setString method possesses two arguments. The first specifies the column index to insert the data and the second is the value to be inserted. The execute method does the actual insertion. We add two URLs in the next sequence:

    stmt.setString(1, "https://en.wikipedia.org/wiki/Data_science"); 
    stmt.execute(); 
    stmt.setString(1,  
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 
    stmt.execute(); 

To read the data, we use a SQL SELECT statement as declared in the selectSQL string. This will return all the rows and columns from the URLTABLE table. The createStatement method creates an instance of a Statement class, which is used for INSERT type statements. The executeQuery method executes the query and returns a ResultSet instance that holds the contents of the table:

    String selectSQL = "select * from URLTABLE"; 
    Statement statement = connection.createStatement(); 
    ResultSet resultSet = statement.executeQuery(selectSQL); 

The following sequence iterates through the table, displaying one row at a time. The argument of the getString method specifies that we want to use the second column of the result set, which corresponds to the URL field:

    out.println("List of URLs"); 
    while (resultSet.next()) { 
        out.println(resultSet.getString(2)); 
    }  

The output of this example, when executed, is as follows:

List of URLs
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly

If you need to empty the contents of the table, use the following sequence:

    Statement statement = connection.createStatement(); 
    statement.execute("TRUNCATE URLTABLE;"); 

This was a brief introduction to database access using Java. There are many resources available that will provide more in-depth coverage of this topic. For example, Oracle provides a more in-depth introduction to this topic at https://docs.oracle.com/javase/tutorial/jdbc/.

Overview of PDF files

The Portable Document Format (PDF) is a format not tied to a specific platform or software application. A PDF document can hold formatted text and images. PDF is an open standard, making it useful in a variety of places.

There are a large number of documents stored as PDF, making it a valuable source of data. There are several Java APIs that allow access to PDF documents, including Apache POI and PDFBox. Techniques for extracting information from a PDF document are illustrated in Chapter 3, Data Cleaning.

Overview of JSON

JavaScript Object Notation (JSON) (http://www.JSON.org/) is a data format used to interchange data. It is easy for humans or machines to read and write. JSON is supported by many languages, including Java, which has several JSON libraries listed at http://www.JSON.org/.

A JSON entity is composed of a set of name-value pairs enclosed in curly braces. We will use this format in several of our examples. In handling YouTube, we will use a JSON object, some of which is shown next, representing the results of a request from a YouTube video:

{ 
  "kind": "youtube#searchResult", 
  "etag": "etag", 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  ... 
} 

Accessing the fields and values of such a document is not hard and is illustrated in Chapter 3, Data Cleaning.

Overview of XML

Extensible Markup Language (XML) is a markup language that specifies a standard document format. Widely used to communicate between applications and across the Internet, XML is popular due to its relative simplicity and flexibility. Documents encoded in XML are character-based and easily read by machines and humans.

XML documents contain markup and content characters. These characters allow parsers to classify the information contained within the document. The document consists of tags, and elements are stored within the tags. Elements may also contain other markup tags and form child elements. Additionally, elements may contain attributes or specific characteristics stored as a name-and-value pair.

An XML document must be well-formed. This means it must follow certain rules such as always using closing tags and only a single root tag. Other rules are discussed at https://en.wikipedia.org/wiki/XML#Well-formedness_and_error-handling.

The Java API for XML Processing (JAXP) consists of three interfaces for parsing XML data. The Document Object Model (DOM) interface parses an XML document and returns a tree structure delineating the structure of the document. The DOM interface parses an entire document as a whole. Alternatively, the Simple API for XML (SAX) parses a document one element at a time. SAX is preferable when memory usage is a concern as DOM requires more resources to construct the tree. DOM, however, offers flexibility over SAX in that any element can be accessed at any time and in any order.

The third Java API is known as Streaming API for XML (StAX). This streaming model was designed to accommodate the best parts of DOM and SAX models by granting flexibility without sacrificing resources. StAX exhibits higher performance, with the trade-off being that access is only available to one location in a document at a time. StAX is the preferred technique if you already know how you want to process the document, but it is also popular for applications with limited available memory.

The following is a simple XML file. Each <text> represents a tag, labelling the element contained within the tags. In this case, the largest node in our file is <music> and contained within it are sets of song data. Each tag within a <song> tag describes another element corresponding to that song. Every tag will eventually have a closing tag, such as </song>. Notice that the first tag contains information about which XML version should be used to parse the file:

<?xml version="1.0"?> 
<music> 
   <song id="1234"> 
      <artist>Patton, Courtney</artist> 
      <name>So This Is Life</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
   <song id="5678"> 
      <artist>Eady, Jason</artist> 
      <name>AM Country Heaven</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
</music> 

There are numerous other XML-related technologies. For example, we can validate a specific XML document using either a DTD document or XML schema writing specifically for that XML document. XML documents can be transformed into a different format using XLST.

Overview of streaming data

Streaming data refers to data generated in a continuous stream and accessed in a sequential, piece-by-piece manner. Much of the data the average Internet user accesses is streamed, including video and audio channels, or text and image data on social media sites. Streaming data is the preferred method when the data is new and changing quickly, or when large data collections are sought.

Streamed data is often ideal for data science research because it generally exists in large quantities and raw format. Much public streaming data is available for free and supported by Java APIs. In this chapter, we are going to examine how to acquire data from streaming sources, including Twitter, Flickr, and YouTube. Despite the use of different techniques and APIs, you will notice similarities between the techniques used to pull data from these sites.

Overview of audio/video/images in Java

There are a large number of formats used to represent images, videos, and audio. This type of data is typically stored in binary format. Analog audio streams are sampled and digitized. Images are often simply collections of bits representing the color of a pixel. The following are links that provide a more in-depth discussion of some of these formats:

Frequently, this type of data can be quite large and must be compressed. When data is compressed two approaches are used. The first is a lossless compression, where less space is used and there is no loss of information. The second is lossy, where information is lost. Losing information is not always a bad thing as sometimes the loss is not noticeable to humans.

As we will demonstrate in Chapter 3, Data Cleaning, this type of data often is compromised in an inconvenient fashion and may need to be cleaned. For example, there may be background noise in an audio recording or an image may need to be smoothed before it can be processed. Image smoothing is demonstrated in Chapter 3, Data Cleaning, using the OpenCV library.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.141.75