Chapter 3. Data Cleaning

Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.

We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:

  • Validity: Ensuring that the data possesses the correct form or structure
  • Accuracy: Ensuring that the values in the data are correct and truly represent what they are supposed to measure
  • Completeness: There are no missing elements
  • Consistency: Changes to data are in sync
  • Uniformity: The same units of measurement are used

There are several techniques and tools used to clean data. We will examine the following approaches:

  • Handling different types of data
  • Cleaning and manipulating text data
  • Filling in missing data
  • Validating data

In addition, we will briefly examine several image enhancement techniques.

There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.

We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to give the reader insight into how it can be done. Sometimes, this will use core Java string classes, and at other times, it may use specialized libraries.

These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.

The basic text-based tasks include:

  • Data transformation
  • Data imputation (handling missing data)
  • Subsetting data
  • Sorting data
  • Validating data

In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.

Handling data formats

Data comes in all types of forms. We will examine the more commonly used formats and show how they can be extracted from various data sources. Before we can clean data, it needs to be extracted from a data source such as a file. In this section, we will build upon the introduction to data formats found in Chapter 2, Data Acquisition, and show how to extract all or part of a dataset. For example, from an HTML page we may want to extract only the text without markup, or perhaps we are only interested in its figures.

These data formats can be quite complex. The intent of this section is to illustrate the basic techniques commonly used with that data format. Full treatment of a specific data format is beyond the scope of this book. Specifically, we will introduce how the following data formats can be processed from Java:

  • CSV data
  • Spreadsheets
  • Portable Document Format, or PDF files
  • JavaScript Object Notation, or JSON files

There are many other file types not addressed here. For example, jsoup is useful for parsing HTML documents. Since we introduced how this is done in the Web scraping in Java section of Chapter 2, Data Acquisition, we will not duplicate the effort here.

Handling CSV data

A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to utilize this type of data in our analysis efforts. When we deal with CSV data, there are several issues to handle, including escaped data and embedded commas.

We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class to read in tokens and the String class split method to separate the data and store it in the array. Next, we will explore using the third-party library, OpenCSV, which offers a more efficient technique.

However, the first approach may only be appropriate for quick and dirty processing of data. We will discuss each of these techniques since they are useful in different situations.

We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean and the solutions shown next take into account the possibility for jagged arrays.

Note

A jagged array is an array where the number of columns may be different for different rows. For example, row 2 may have 5 elements while row 3 may have 6 elements. When using jagged arrays, you have to be careful with your column indexes.
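
For example, the following sketch builds a small jagged array and iterates over it safely by using each row's own length as the column bound:

String[][] jagged = new String[2][];
jagged[0] = new String[]{"a", "b", "c", "d", "e"};      // 5 elements
jagged[1] = new String[]{"a", "b", "c", "d", "e", "f"}; // 6 elements

for (String[] row : jagged) {
    // row.length differs from row to row, so use it as the bound
    for (int col = 0; col < row.length; col++) {
        out.print(row[col] + " ");
    }
    out.println();
}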

First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList, since we will not always know how many rows our data contains. The list is declared before the try block so that it remains in scope after the file has been read.

ArrayList<String> list = new ArrayList<String>();
try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    while (csvData.hasNextLine()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}

The list is converted to an array using the toArray method. This version of the method uses a String array as an argument so that the method will know what type of array to create. A two-dimensional array is then created to hold the CSV data.

String[] tempArray = list.toArray(new String[0]);
String[][] csvArray = new String[tempArray.length][]; 

The split method is used to create an array of Strings for each row. This array is assigned to a row of the csvArray.

for (int i = 0; i < tempArray.length; i++) {
    // The -1 limit keeps any trailing empty fields instead of discarding them
    csvArray[i] = tempArray[i].split(",", -1);
}

Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.

First, we need to create an instance of the CSVReader class. Notice that the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. If we want to read the entire file at one time, we use the readAll method.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
List<String[]> holdData = dataReader.readAll();
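
Reading a tab-delimited file, for example, only requires changing the delimiter. The following one-line sketch assumes a hypothetical Demographics.tsv file:

// Same API, different delimiter (hypothetical tab-delimited file)
CSVReader tsvReader = new CSVReader(new FileReader("Demographics.tsv"), '\t');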

Each element of the returned list is an array of Strings holding the tokens of one row, so no further splitting is required. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually, but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
String[] nextLine;
while ((nextLine = dataReader.readNext()) != null) {
    for (String token : nextLine) {
        out.println(token);
    }
}
dataReader.close(); 

We can now clean or otherwise process the array.
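
For example, a minimal cleaning sketch, assuming the csvArray created earlier, might trim stray whitespace from every token:

for (int i = 0; i < csvArray.length; i++) {
    for (int j = 0; j < csvArray[i].length; j++) {
        // Remove leading and trailing whitespace from each token
        csvArray[i][j] = csvArray[i][j].trim();
    }
}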

Handling spreadsheets

Spreadsheets have proven to be a very popular tool for processing numeric and textual data. Due to the wealth of information that has been stored in spreadsheets over the past decades, knowing how to extract information from spreadsheets enables us to take advantage of this widely available data source. In this section, we will demonstrate how this is accomplished using the Apache POI API.

OpenOffice also supports a spreadsheet application. OpenOffice documents are stored in an XML format, which makes them readily accessible using XML parsing technologies. However, the Apache ODF Toolkit (http://incubator.apache.org/odftoolkit/) provides a means of accessing data within a document without knowing the format of the OpenOffice document. This is currently an incubator project and is not fully mature. There are a number of other APIs that can assist in processing OpenOffice documents, as detailed on the Open Document Format (ODF) for developers (http://www.opendocumentformat.org/developers/) page.

Handling Excel spreadsheets

Apache POI (http://poi.apache.org/index.html) is a set of APIs providing access to many Microsoft products including Excel and Word. It consists of a series of components designed to access a specific Microsoft product. An overview of these components is found at http://poi.apache.org/overview.html.

In this section, we will demonstrate how to read a simple Excel spreadsheet using the XSSF component to access Excel 2007+ spreadsheets. The Javadocs for the Apache POI API are found at https://poi.apache.org/apidocs/index.html.

We will use a simple Excel spreadsheet consisting of a series of rows containing an ID along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet follows:

ID     Minimum   Maximum   Average
12345  45        89        65.55
23456  78        96        86.75
34567  56        89        67.44
45678  86        99        95.67

We start with a try-with-resources block to handle any IOExceptions that may occur:

try (FileInputStream file = new FileInputStream(
        new File("Sample.xlsx"))) {
    ...
} catch (IOException e) {
    // Handle exceptions
}

An instance of the XSSFWorkbook class is created using the spreadsheet. Since a workbook may consist of multiple sheets, we select the first one using the getSheetAt method.

XSSFWorkbook workbook = new XSSFWorkbook(file); 
XSSFSheet sheet = workbook.getSheetAt(0); 

The next step is to iterate through the rows, and then each column, of the spreadsheet:

for (Row row : sheet) {
    for (Cell cell : row) {
        ...
    }
    out.println();
}

Each cell of the spreadsheet may use a different format. We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell. In this example, we are only working with numeric and text data.

switch (cell.getCellType()) {
    case Cell.CELL_TYPE_NUMERIC:
        out.print(cell.getNumericCellValue() + "\t");
        break;
    case Cell.CELL_TYPE_STRING:
        out.print(cell.getStringCellValue() + "\t");
        break;
}

When executed we get the following output:

ID Minimum Maximum Average 
12345.0 45.0 89.0 65.55
23456.0 78.0 96.0 86.75
34567.0 56.0 89.0 67.44
45678.0 86.0 99.0 95.67

POI supports other, more sophisticated classes and methods for extracting data.
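
For example, individual cells can be addressed directly by row and column index. The following minimal sketch, assuming the sheet variable from the previous example, reads the Average value of the first data row:

// Row and cell indexes are zero-based; row 0 holds the headers
XSSFRow dataRow = sheet.getRow(1);
XSSFCell averageCell = dataRow.getCell(3);
out.println("First average: " + averageCell.getNumericCellValue());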

Handling PDF files

There are several APIs supporting the extraction of text from a PDF file. Here we will use Apache PDFBox (https://pdfbox.apache.org/), an open source API that allows Java programmers to work with PDF documents. In this section, we will illustrate how to extract simple text from a PDF document. The Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.

We will use a simple PDF document whose text follows:

This is a simple PDF file. It consists of several bullets:

  • Line 1
  • Line 2
  • Line 3

This is the end of the document.

A try block is used to catch IOExceptions. The PDDocument class represents the PDF document being processed. Its load method will load the PDF file specified by the File object:

try {
    PDDocument document = PDDocument.load(new File("PDF File.pdf"));
    ...
} catch (IOException ex) {
    // Handle exceptions
}

Once loaded, the getText method of the PDFTextStripper class will extract the text from the file. The text is then displayed as shown here:

PDFTextStripper stripper = new PDFTextStripper();
String documentText = stripper.getText(document);
out.println(documentText);

The output of this example follows. Notice that the bullets are returned as question marks.

This is a simple PDF file. It consists of several bullets: 
? Line 1 
? Line 2 
? Line 3 
This is the end of the document.

This is a brief introduction to the use of PDFBox. It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.
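
For example, when only part of a long document is needed, extraction can be limited to a page range. This minimal sketch assumes the document variable from the previous example:

// Extract text from the first page only (page numbers are one-based)
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(1);
out.println(stripper.getText(document));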

Handling JSON

In Chapter 2, Data Acquisition, we learned that certain YouTube searches return JSON formatted results. Specifically, the SearchResult class holds information relating to a specific search. In that section, we illustrated how to use YouTube-specific techniques to extract information. In this section, we will illustrate how to extract JSON information using the Jackson JSON implementation.

JSON supports three models for processing data:

  • Streaming API - JSON data is processed token by token
  • Tree model - The JSON data is held entirely in memory and then processed
  • Data binding - The JSON data is transformed to a Java object

Using the JSON streaming API

We will illustrate the first two approaches. The first approach is more efficient and is used when a large amount of data is processed. The second technique is convenient, but the data must not be too large. The third technique is useful when it is more convenient to use specific Java classes to process the data. For example, if the JSON data represents an address, then a specific Java address class can be defined to hold and process the data; a minimal sketch of this idea follows.
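
The following sketch shows what data binding might look like with Jackson's ObjectMapper, assuming a simple, hypothetical Person class whose public fields match the Person.json file introduced shortly:

// Hypothetical class matching the fields of Person.json
public class Person {
    public String firstname;
    public String lastname;
    public long phone;
    public String[] address;
}

ObjectMapper mapper = new ObjectMapper();
Person person = mapper.readValue(new File("Person.json"), Person.class);
out.println(person.firstname + " " + person.lastname);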

There are several Java libraries that support JSON processing, including Jackson, Gson, and JSON-P.

We will use the Jackson Project (https://github.com/FasterXML/jackson). Documentation is found at https://github.com/FasterXML/jackson-docs. We will use two JSON files to demonstrate how it can be used. The first file, Person.json, is shown next; it stores the data for a single person. It consists of four fields, where the last field is an array of location information.

{  
   "firstname":"Smith", 
   "lastname":"Peter",  
   "phone":8475552222, 
   "address":["100 Main Street","Corpus","Oklahoma"]  
} 

The code sequence that follows shows how to extract the values for each of the fields. Within the try-catch block, a JsonFactory instance is created, which then creates a JsonParser instance based on the Person.json file.

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Person.json")); 
    ... 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The nextToken method advances the parser and returns the next token, while the JsonParser object keeps track of the current token. The getCurrentName method returns the field name for the current token. The while loop terminates when the closing token of the object is reached.

while (parser.nextToken() != JsonToken.END_OBJECT) { 
    String token = parser.getCurrentName(); 
    ... 
} 

The body of the loop consists of a series of if statements that process each field by its name. Since the address field is an array, another loop will extract each of its elements until the ending array token is reached.

if ("firstname".equals(token)) { 
    parser.nextToken(); 
    String fname = parser.getText(); 
    out.println("firstname : " + fname); 
} 
if ("lastname".equals(token)) { 
    parser.nextToken(); 
    String lname = parser.getText(); 
    out.println("lastname : " + lname); 
} 
if ("phone".equals(token)) { 
    parser.nextToken(); 
    long phone = parser.getLongValue(); 
    out.println("phone : " + phone); 
} 
if ("address".equals(token)) { 
    out.println("address :"); 
    parser.nextToken(); 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        out.println(parser.getText()); 
    } 
} 

The output of this example follows:

firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma

However, JSON objects are frequently more complex than the previous example. Here a Persons.json file consists of an array of three persons:

{ 
   "persons": { 
      "groupname": "school", 
      "person": 
         [  
            {"firstname":"Smith", 
              "lastname":"Peter",  
              "phone":8475552222, 
              "address":["100 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"King", 
              "lastname":"Sarah",  
              "phone":8475551111, 
              "address":["200 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"Frost", 
              "lastname":"Nathan",  
              "phone":8475553333, 
              "address":["300 Main Street","Corpus","Oklahoma"] } 
         ] 
   } 
} 

To process this file, we use a set of code similar to that shown previously. We create the parser and then enter a loop as before:

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Persons.json"));
    while (parser.nextToken() != JsonToken.END_OBJECT) { 
        String token = parser.getCurrentName(); 
        ... 
    } 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

However, we need to find the persons field and then extract each of its elements. The groupname field is extracted and displayed as shown here:

if ("persons".equals(token)) { 
    JsonToken jsonToken = parser.nextToken(); 
    jsonToken = parser.nextToken(); 
    token = parser.getCurrentName(); 
    if ("groupname".equals(token)) { 
        parser.nextToken(); 
        String groupname = parser.getText(); 
        out.println("Group : " + groupname); 
        ... 
    } 
} 

Next, we find the person field and call a parsePerson method to better organize the code:

parser.nextToken(); 
token = parser.getCurrentName(); 
if ("person".equals(token)) { 
    out.println("Found person"); 
    parsePerson(parser); 
} 

The parsePerson method follows; it is very similar to the process used in the first example.

public void parsePerson(JsonParser parser) throws IOException { 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        String token = parser.getCurrentName(); 
        if ("firstname".equals(token)) { 
            parser.nextToken(); 
            String fname = parser.getText(); 
            out.println("firstname : " + fname); 
        } 
        if ("lastname".equals(token)) { 
            parser.nextToken(); 
            String lname = parser.getText(); 
            out.println("lastname : " + lname); 
        } 
        if ("phone".equals(token)) { 
            parser.nextToken(); 
            long phone = parser.getLongValue(); 
            out.println("phone : " + phone); 
        } 
        if ("address".equals(token)) { 
            out.println("address :"); 
            parser.nextToken(); 
            while (parser.nextToken() != JsonToken.END_ARRAY) { 
                out.println(parser.getText()); 
            } 
        } 
    } 
} 

The output follows:

Group : school
Found person
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
firstname : King
lastname : Sarah
phone : 8475551111
address :
200 Main Street
Corpus
Oklahoma
firstname : Frost
lastname : Nathan
phone : 8475553333
address :
300 Main Street
Corpus
Oklahoma

Using the JSON tree API

The second approach is to use the tree model. An ObjectMapper instance is used to create a JsonNode instance from the Persons.json file. The fieldNames method returns an Iterator, allowing us to process each field of the file.

try { 
    ObjectMapper mapper = new ObjectMapper(); 
    JsonNode node = mapper.readTree(new File("Persons.json")); 
    Iterator<String> fieldNames = node.fieldNames(); 
    while (fieldNames.hasNext()) { 
        ... 
        fieldNames.next(); 
    } 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Since the JSON file contains a persons field, we will obtain a JsonNode instance representing the field and then iterate over each of its elements.

JsonNode personsNode = node.get("persons"); 
Iterator<JsonNode> elements = personsNode.iterator(); 
while (elements.hasNext()) { 
    ... 
} 

Each element is processed one at a time. If the element type is a string, we assume that this is the groupname field.

JsonNode element = elements.next(); 
JsonNodeType nodeType = element.getNodeType(); 
 
if (nodeType == JsonNodeType.STRING) { 
    out.println("Group: " + element.textValue()); 
} 

If the element is an array, we assume it contains a series of persons where each person is processed by the parsePerson method:

if (nodeType == JsonNodeType.ARRAY) { 
    Iterator<JsonNode> fields = element.iterator(); 
    while (fields.hasNext()) { 
        parsePerson(fields.next()); 
    } 
}

The parsePerson method is shown next:

public void parsePerson(JsonNode node) { 
    Iterator<JsonNode> fields = node.iterator(); 
    while(fields.hasNext()) { 
        JsonNode subNode = fields.next(); 
        out.println(subNode.asText()); 
    } 
}

The output follows:

Group: school
Smith
Peter
8475552222
King
Sarah
8475551111
Frost
Nathan
8475553333

There is much more to JSON than we are able to illustrate here. However, this should give you an idea of how this type of data can be handled.
