Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

The nitty gritty of cleaning text

Strings are used to support text processing so using a good string library is important. Unfortunately, the java.lang.String class has some limitations. To address these limitations, you can either implement your own special string functions as needed or you can use a third-party library.

Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a simple code sequence to implement some functionality, but to do things right, you will need to test them. Third-party libraries have already been tested and have been used on hundreds of projects. They provide a more efficient way of processing text.

There are several text processing APIs in addition to those found in Java. We will demonstrate two of these:

Apache Commons: https://commons.apache.org/
Guava: https://github.com/google/guava

Java provides many supports for cleaning text data, including methods in the String class. These methods are ideal for simple text cleaning and small amounts of data but can also be efficient with larger, complex datasets. We will demonstrate several String class methods in a moment. Some of the most helpful String class methods are summarized in the following table:

Method Name	Return Type	Description
`trim`	`String`	Removes leading and trailing blank spaces
`toUpperCase`/`toLowerCase`	`String`	Changes the casing of the entire string
`replaceAll`	`String`	Replaces all occurrences of a character sequence within the string
`contains`	`boolean`	Determines whether a given character sequence exists within the string
`compareTo` `compareToIgnoreCase`	`int`	Compares two strings lexographically and returns an integer representing their relationship
`matches`	`boolean`	Determines whether the string matches a given regular expression
`join`	`String`	Combines two or more strings with a specified delimiter
`split`	`String[]`	Separates elements of a given string using a specified delimiter

Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.

A regular expression is simply a string itself. For example, the string Hello, my name is Sally can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \w will match any text that starts with Hello, my name is and ends with a word character.

We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note each must be double-escaped when used in a Java application.

Option	Description
`d`	Any digit: 0-9
`D`	Any non-digit
`s`	Any whitespace character
`S`	Any non-whitespace character
`w`	Any word character (including digits): A-Z, a-z, and 0-9
`W`	Any non-word character

The size and source of text data varies wildly from application to application but the methods used to transform the data remain the same. You may actually need to read data from a file, but for simplicity's sake, we will be using a string containing the beginning sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will assumed to be as shown next:

String dirtyText = "Call me Ishmael. Some years ago- never mind how"; 
dirtyText += " long precisely - having little or no money in my purse,"; 
dirtyText += " and nothing particular to interest me on shore, I thought";  
dirtyText += " I would sail about a little and see the watery part of the world.";

Using Java tokenizers to extract words

Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String class's split method is considered more efficient. While it does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
}

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...

StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StreamTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings and you can only specify one delimiter for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
}

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,",");

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
}

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistical toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing in information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\d[^\w\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\d[^\w\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
}

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words =    dirtyText.toLowerCase().trim().split("[\W\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText);

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words =    dirtyText.toLowerCase().trim().split("[\W\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText);

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words are there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods—and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new    ArrayList<String>(Arrays.asList((dirtyText)); 
out.println("Original clean text: " + words.toString());

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString());

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
}

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0;

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
}

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10));

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not empty:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
}

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll(toFind, replaceWith); 
      out.println(text); 
}

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X"));

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\W\s", " "); 
out.println(text);

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modify text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced);

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12);

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12);

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   }

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and create and call the Optional class ofNullable method. This method tests whether a particular value is null or not. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last());

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12 and 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
}

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second);

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString());

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString());

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordList and numsList are both ArrayList and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList());

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString());

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString());

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This is can be used where a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString());

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the names and ages of each Dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
}

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by Name and then by Age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
}

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
}

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
try{ 
      int validInt = Integer.parseInt(toValidate); 
      out.println(validInt + " is a valid integer"); 
}catch(NumberFormatException e){ 
      out.println(toValidate + " is not a valid integer"); 
 
}

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael");

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The Apache Commons contain an IntegerValidator class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
}

We again use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael");

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
}

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat));

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\." + 
          "[0-9]{1,3}\])|(([a-zAZ\-0-9]+\.)+[a-zA-Z]{2,}))$"; 
      Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
}

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail"));

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    }

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail"));

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
}

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
}

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123"));

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \p{L} provides this flexibility. We also use \s-', to allow spaces, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\p{L}\s-',]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
}

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau");

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for The nitty gritty of cleaning text

Create new playlist

Sign In

Sign Up

The nitty gritty of cleaning text

Using Java tokenizers to extract words

Java core tokenizers

Third-party tokenizers and libraries

Transforming data into a usable form

Simple text cleaning

Removing stop words

Finding words in text

Finding and replacing text

Data imputation

Subsetting data

Sorting text

Data validation

Validating data types

Validating dates

Validating e-mail addresses

Validating ZIP codes

Validating names

Table of Contents for
The nitty gritty of cleaning text