Using regular expressions

For research, you may need to download data from open-access websites or authentication-required databases. These data sources provide data in various formats, and most of the data supplied are very likely well-organized. For example, many economic and financial databases provide data in the CSV format, which is a widely supported text format to represent tabular data. A typical CSV format looks like this:

id,name,score 
1,A,20 
2,B,30 
3,C,25

In R, it is convenient to call read.csv() to import a CSV file as a data frame with the right header and data types because the format is a natural representation of a data frame.
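
As a minimal sketch of this, the following code reads the sample table from a string rather than from a file. The variable names csv_text and scores are hypothetical; the text argument of read.csv() stands in for a file path here:

```r
# Inline copy of the sample CSV shown above (assumption: no file on disk).
csv_text <- "id,name,score
1,A,20
2,B,30
3,C,25"

# read.csv() accepts a `text` argument in place of a file path.
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the header and the column types inferred from the text.
str(scores)
```

The header row becomes the column names, and the numeric columns are read as integers without any extra work.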

However, not all data files are well organized, and dealing with poorly organized data is painstaking. Built-in functions such as read.table() and read.csv() work in many situations, but they may not help at all for such format-less data.

For example, if you need to analyze raw data (messages.txt) organized in a CSV-like format as shown here, you had better be careful when you call read.csv():

2014-02-01,09:20:25,James,Ken,Hey, Ken! 
2014-02-01,09:20:29,Ken,James,Hey, how are you? 
2014-02-01,09:20:41,James,Ken, I'm ok, what about you? 
2014-02-01,09:21:03,Ken,James,I'm feeling excited! 
2014-02-01,09:21:26,James,Ken,What happens?

Suppose you want to import this file as a data frame in the following format which is nicely organized:

      Date      Time     Sender   Receiver   Message 
1  2014-02-01  09:20:25  James    Ken        Hey, Ken! 
2  2014-02-01  09:20:29  Ken      James      Hey, how are you? 
3  2014-02-01  09:20:41  James    Ken        I'm ok, what about you? 
4  2014-02-01  09:21:03  Ken      James      I'm feeling excited! 
5  2014-02-01  09:21:26  James    Ken        What happens?

However, if you blindly call read.csv(), you will see that it does not work out correctly. This dataset is somewhat special in the message column: it contains extra commas that are mistakenly interpreted as separators in a CSV file. Here is the data frame translated from the raw text:

read.csv("data/messages.txt", header = FALSE)
##           V1       V2    V3    V4                   V5               V6
## 1 2014-02-01 09:20:25 James   Ken                  Hey             Ken!
## 2 2014-02-01 09:20:29   Ken James                  Hey     how are you?
## 3 2014-02-01 09:20:41 James   Ken               I'm ok  what about you?
## 4 2014-02-01 09:21:03   Ken James I'm feeling excited!
## 5 2014-02-01 09:21:26 James   Ken        What happens?

There are various methods to tackle this problem. You may consider calling strsplit() on each line, keeping the first four elements as they are, and pasting the remaining elements back together to restore the message. But one of the simplest and most robust ways is to use the so-called Regular Expression (https://en.wikipedia.org/wiki/Regular_expression). Don't worry if the terminology sounds unfamiliar. Its usage is very simple: describe the pattern that matches the text and extract the desired part from that text.
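
To make the strsplit() alternative concrete, here is a rough sketch of it. It assumes that every line has exactly four fields before the message; the names lines, parts, rows, and manual_df are hypothetical, and two sample lines are typed inline instead of being read from messages.txt:

```r
# Two lines from the raw data, typed inline for illustration.
lines <- c(
  "2014-02-01,09:20:25,James,Ken,Hey, Ken!",
  "2014-02-01,09:20:29,Ken,James,Hey, how are you?"
)

# Split every line at each comma.
parts <- strsplit(lines, ",", fixed = TRUE)

# Keep the first four pieces as they are and glue the rest back
# together, restoring the commas that belong to the message.
rows <- lapply(parts, function(p) {
  c(p[1:4], paste(p[-(1:4)], collapse = ","))
})

manual_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
colnames(manual_df) <- c("Date", "Time", "Sender", "Receiver", "Message")
manual_df
```

This works, but the logic depends on counting fields by hand, which is why a pattern-based approach is usually easier to maintain.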

Before we apply the technique, we need some basic knowledge. The best way to motivate yourself is to look at a simpler problem and consider what is needed to solve it.

Suppose we are dealing with the following text (fruits.txt) that describes the number or status of some fruits:

apple: 20 
orange: missing 
banana: 30 
pear: sent to Jerry 
watermelon: 2 
blueberry: 12 
strawberry: sent to James

Now, we want to pick out all fruits with a number rather than with status information. Although we can easily finish the task visually, it is not that easy for a computer. With more than two thousand lines, however, the task remains easy for a computer that applies the appropriate technique but becomes hard, time-consuming, and error-prone for a human.

The first thing that should come to our mind is that we need to distinguish fruits with numbers and fruits with no numbers. In general, we need to distinguish texts that match a particular pattern from the ones that do not. Here, regular expression is definitely the right technique to work with.

Regular expressions solve problems using two steps: the first is to find a pattern to match the text and the second is to group the patterns to extract the information in need.

Finding a string pattern

To solve the problem, our computer does not have to understand what fruits actually are. We only need to find a pattern that describes what we want. Literally, we want to get all lines that start with a word followed by a colon and a space, and end with an integer rather than words or other symbols.

Regular expressions provide a set of symbols to represent patterns. The preceding pattern can be described with ^\w+:\s\d+$ where meta-symbols are used to represent a class of symbols:

  • ^: This symbol matches the beginning of the line
  • \w: This symbol represents a word character
  • \s: This symbol is a space character
  • \d: This symbol is a digit character
  • $: This symbol matches the end of the line

Moreover, \w+ means one or more word characters, : is exactly the symbol we expect to see after the word, and \d+ means one or more digit characters. This pattern represents all the cases we want and excludes all the cases we don't want.

More specifically, this pattern matches lines such as abc: 123 but excludes lines otherwise. To pick out the desired cases in R, we use grep() to get which strings match the pattern:

fruits <- readLines("data/fruits.txt")
fruits
## [1] "apple: 20" "orange: missing" 
## [3] "banana: 30" "pear: sent to Jerry" 
## [5] "watermelon: 2" "blueberry: 12" 
## [7] "strawberry: sent to James"
matches <- grep("^\\w+:\\s\\d+$", fruits)
matches
## [1] 1 3 5 6

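Note that \ in R should be written as \\ to avoid escaping: inside an R string literal the backslash is itself the escape character, so the regular expression \w must be typed as "\\w". Then, we can filter fruits by matches: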

fruits[matches]
## [1] "apple: 20" "banana: 30" "watermelon: 2" "blueberry: 12"

Now, we successfully distinguish desirable lines from undesirable ones. The lines that match the pattern are chosen, and those that do not match the pattern are omitted.
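
As a small variation, grep() can also return the matching strings directly, or the non-matching ones, through its value and invert arguments. The sketch below types the fruit lines inline so that it runs without the data file; with_number and without_number are hypothetical names:

```r
# The same lines as in fruits.txt, typed inline for illustration.
fruits <- c("apple: 20", "orange: missing", "banana: 30",
            "pear: sent to Jerry", "watermelon: 2", "blueberry: 12",
            "strawberry: sent to James")

pattern <- "^\\w+:\\s\\d+$"

# value = TRUE returns the matching strings instead of their positions.
with_number <- grep(pattern, fruits, value = TRUE)

# invert = TRUE selects the strings that do NOT match the pattern.
without_number <- grep(pattern, fruits, value = TRUE, invert = TRUE)

with_number
without_number
```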

Note that we specify a pattern that starts with ^ and ends with $ because we don't want a partial matching. In fact, regular expressions perform partial matching by default, that is, if any part of the string matches the pattern, the whole string is considered to match the pattern. For example, the following code attempts to find out which strings match the two patterns respectively:

grep("\\d", c("abc", "a12", "123", "1"))
## [1] 2 3 4
grep("^\\d$", c("abc", "a12", "123", "1"))
## [1] 4

The first pattern matches strings that include any digit (partial matching), while the second pattern with ^ and $ matches only strings that consist of exactly one digit.

Once the pattern works correctly, we go to the next step: using groups to extract the data.

Using groups to extract the data

In the pattern string, we can use parentheses to mark the parts we want to extract from the texts. In this problem, we can modify the pattern to (\w+):\s(\d+), where two groups are marked: one is the fruit name matched by \w+ and the other is the number of the fruit matched by \d+.

Now, we can use this modified version of the pattern to extract the information we want. Although it is perfectly possible to use built-in functions in R to do the job, I strongly recommend using functions in the stringr package. This package makes it substantially easier to use regular expressions. We call str_match() with the modified pattern with groups:

library(stringr)
matches <- str_match(fruits, "^(\\w+):\\s(\\d+)$")
matches
##      [,1]            [,2]         [,3]
## [1,] "apple: 20"     "apple"      "20"
## [2,] NA              NA           NA  
## [3,] "banana: 30"    "banana"     "30"
## [4,] NA              NA           NA  
## [5,] "watermelon: 2" "watermelon" "2" 
## [6,] "blueberry: 12" "blueberry"  "12"
## [7,] NA              NA           NA

This time, matches is a matrix with more than one column. The groups in parentheses are extracted from the text and put into columns 2 and 3. Now, we can easily transform this character matrix into a data frame with the right header and data types:

# transform to data frame
fruits_df <- data.frame(na.omit(matches[, -1]), stringsAsFactors = FALSE)
# add a header
colnames(fruits_df) <- c("fruit","quantity")
# convert type of quantity from character to integer
fruits_df$quantity <- as.integer(fruits_df$quantity)

Now, fruits_df is a data frame with the right header and data types:

fruits_df
##        fruit quantity
## 1      apple       20
## 2     banana       30
## 3 watermelon        2
## 4  blueberry       12

If you are not sure about the intermediate results in the preceding code, you can run the code line by line and see what happens in each step. Finally, this problem is perfectly solved with regular expressions.
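
For comparison, the same extraction can be sketched with built-in functions only, using regexec() and regmatches(). This is one possible base R route, not necessarily the most convenient one; found and fruits_base are hypothetical names, and the fruit lines are typed inline:

```r
# The same lines as in fruits.txt, typed inline for illustration.
fruits <- c("apple: 20", "orange: missing", "banana: 30",
            "pear: sent to Jerry", "watermelon: 2", "blueberry: 12",
            "strawberry: sent to James")

pattern <- "^(\\w+):\\s(\\d+)$"

# regexec() locates the whole match and each group;
# regmatches() extracts the corresponding substrings as a list.
found <- regmatches(fruits, regexec(pattern, fruits))

# Lines that do not match yield a zero-length vector; drop them.
found <- found[lengths(found) > 0]

# Element 1 is the whole match, 2 is the fruit name, 3 is the quantity.
fruits_base <- data.frame(
  fruit = vapply(found, `[`, character(1), 2),
  quantity = as.integer(vapply(found, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
fruits_base
```

The result should be equivalent to fruits_df, at the cost of a few more lines than the str_match() version.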

From the previous example, we see that the magic of regular expressions is but a group of identifiers used to represent different kinds of characters and symbols. In addition to the meta-symbols we have mentioned, the following are also useful:

  • [0-9]: This symbol represents a single digit from 0 to 9
  • [a-z]: This symbol represents a single lowercase letter from a to z
  • [A-Z]: This symbol represents a single uppercase letter from A to Z
  • .: This symbol represents any single character
  • *: This symbol means that the preceding pattern may appear zero, one, or more times
  • +: This symbol means that the preceding pattern appears one or more times
  • {n}: This symbol means that the preceding pattern appears exactly n times
  • {m,n}: This symbol means that the preceding pattern appears at least m times and at most n times
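
The following sketch tries a few of these meta-symbols on a made-up vector. The vector codes and the three result names are hypothetical and serve only to illustrate the symbols:

```r
# Made-up strings for illustration.
codes <- c("ab12", "AB12", "a1", "abc123", "ab-12")

# Exactly two lowercase letters followed by exactly two digits.
two_by_two <- grepl("^[a-z]{2}[0-9]{2}$", codes)

# One to three letters of either case, then one or more digits.
letters_then_digits <- grepl("^[a-zA-Z]{1,3}[0-9]+$", codes)

# Two lowercase letters, any single character, then zero or more digits.
with_any_char <- grepl("^[a-z]{2}.[0-9]*$", codes)

data.frame(codes, two_by_two, letters_then_digits, with_any_char)
```

Printing the three logical vectors side by side makes it easy to see how each quantifier widens or narrows the set of accepted strings.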

With these meta-symbols, we can easily check or filter string data. For example, suppose we have some telephone numbers from two countries that are mixed together. If the pattern of telephone numbers in one country is different from that of the other, regular expressions can be helpful to split them into two categories:

telephone <- readLines("data/telephone.txt") 
telephone
## [1] "123-23451" "1225-3123" "121-45672" "1332-1231" "1212-3212" "123456789"

Note that there is an exception in the data: the last number has no - in the middle. For unexceptional cases, it should be easy to figure out the pattern of the two types of telephone numbers:

telephone[grep("^\\d{3}-\\d{5}$", telephone)]
## [1] "123-23451" "121-45672"
telephone[grep("^\\d{4}-\\d{4}$", telephone)]
## [1] "1225-3123" "1332-1231" "1212-3212"

To find out the exceptional cases, grepl() is more useful because it returns a logical vector to indicate whether each element matches the pattern. Therefore, we can use this function to choose all records that do not match the given patterns:

telephone[!grepl("^\\d{3}-\\d{5}$", telephone) & !grepl("^\\d{4}-\\d{4}$", telephone)]
## [1] "123456789"

The preceding code basically says that all records that do not match the two patterns are considered exceptional. Imagine we have millions of records to check. Exceptional cases may be in any format, so it is more robust to use this method: excluding all valid records to find out invalid records.
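
Building on the same idea, the records can be labeled in one pass. In this sketch the labels "A", "B", and "invalid" are hypothetical names for the two country formats and the exceptional cases, and the numbers are typed inline instead of being read from telephone.txt:

```r
# The same numbers as in telephone.txt, typed inline for illustration.
telephone <- c("123-23451", "1225-3123", "121-45672",
               "1332-1231", "1212-3212", "123456789")

# One logical vector per known format.
is_a <- grepl("^\\d{3}-\\d{5}$", telephone)
is_b <- grepl("^\\d{4}-\\d{4}$", telephone)

# Anything that matches neither format is labeled invalid.
category <- ifelse(is_a, "A", ifelse(is_b, "B", "invalid"))

# Group the numbers by their label.
split(telephone, category)
```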

Reading data in customizable ways

Now, let's go back to the problem we faced at the very beginning of this section. The procedure is exactly the same as in the fruits example: finding the pattern and making groups.

First, let's look at a typical line of the raw data:

2014-02-01,09:20:29,Ken,James,Hey, how are you?

It is obvious that all lines are based on the same format, that is, date, time, sender, receiver, and message are separated by commas. The only special thing is that commas may appear in the message, but we don't want our code to interpret them as separators.

Regular expressions serve this purpose just as well as they did in the previous example. To represent one or more symbols that follow the same pattern, just place a plus sign (+) after the symbolic identifier. For example, \d+ represents a string consisting of one or more digit characters between "0" and "9": "1", "23", and "456" all match this pattern, while "word" does not. There are also situations where a pattern may or may not appear at all. Then, we need to place a * after the symbolic identifier to mark that this particular pattern may appear one or more times, or may not appear at all, in order to match a wider range of texts.
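
The difference between + and * can be checked with a short sketch. The vector samples and the result names are hypothetical; the last line shows how \s* absorbs optional spaces in front of a message:

```r
# Made-up strings, including an empty one.
samples <- c("1", "23", "456", "word", "")

# + requires at least one digit, so the empty string does not match.
one_or_more <- grepl("^\\d+$", samples)

# * also accepts zero digits, so the empty string matches as well.
zero_or_more <- grepl("^\\d*$", samples)

# \s* swallows any spaces after the comma, whether present or not.
trimmed <- sub("^,\\s*(.+)$", "\\1", c(",Hey", ",  Hey"))

one_or_more
zero_or_more
trimmed
```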

Now, let's go back to our problem. We need to recognize a sufficiently general pattern of a typical line. The following is the pattern with grouping we should figure out:

(\d+-\d+-\d+),(\d+:\d+:\d+),(\w+),(\w+),\s*(.+)

Now, we need to import the raw texts in exactly the same way as we did in the fruits example using readLines():

messages <- readLines("data/messages.txt")

Then, we need to work out the pattern that represents the text and the information we want to extract from the text:

pattern <- "^(\\d+-\\d+-\\d+),(\\d+:\\d+:\\d+),(\\w+),(\\w+),\\s*(.+)$"
matches <- str_match(messages, pattern)
messages_df <- data.frame(matches[, -1])
colnames(messages_df) <- c("Date", "Time", "Sender", "Receiver", "Message")

The pattern here looks like some secret code. Don't worry. That's exactly how regular expression works, and it should make some sense now if you go through the previous examples.

The regular expression works perfectly. The messages_df data frame looks like this:

messages_df
##      Date        Time    Sender   Receiver    Message 
## 1 2014-02-01   09:20:25  James    Ken         Hey, Ken! 
## 2 2014-02-01   09:20:29  Ken      James       Hey, how are you? 
## 3 2014-02-01   09:20:41  James    Ken         I'm ok, what about you? 
## 4 2014-02-01   09:21:03  Ken      James       I'm feeling excited! 
## 5 2014-02-01   09:21:26  James    Ken         What happens?
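
In real data, some lines may not follow the format at all, so it can be worth checking which lines failed to match before building the data frame. The sketch below uses base R only and adds one deliberately malformed line; that line, the Timestamp column, and the UTC time zone are assumptions made for illustration:

```r
messages <- c(
  "2014-02-01,09:20:25,James,Ken,Hey, Ken!",
  "2014-02-01,09:20:41,James,Ken, I'm ok, what about you?",
  "this line does not follow the format"  # hypothetical malformed record
)

pattern <- "^(\\d+-\\d+-\\d+),(\\d+:\\d+:\\d+),(\\w+),(\\w+),\\s*(.+)$"
found <- regmatches(messages, regexec(pattern, messages))

# Lines with no match come back as zero-length vectors.
unmatched <- messages[lengths(found) == 0]

# Bind the matched lines into a matrix and drop the whole-match column.
ok <- do.call(rbind, found[lengths(found) > 0])
messages_df <- data.frame(ok[, -1, drop = FALSE], stringsAsFactors = FALSE)
colnames(messages_df) <- c("Date", "Time", "Sender", "Receiver", "Message")

# Combine date and time into a single date-time column (time zone assumed).
messages_df$Timestamp <- as.POSIXct(
  paste(messages_df$Date, messages_df$Time), tz = "UTC"
)

unmatched
messages_df
```

Keeping the unmatched lines in a separate vector makes it easy to inspect them and refine the pattern.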

The pattern we use is comparable to a key. The hard part of any regular expression application is to find the key. Once we get it, we are able to open the door and extract as much information as we want from the messy texts. Generally speaking, how difficult it is to find that key depends largely on the difference between the positive cases and negative cases. If the difference is quite obvious, a few symbols will solve the problem. If the difference is subtle and many special cases are involved, as in most real-world problems, you need more experience, harder thinking, and plenty of trial and error to work out the solution.

Through the motivating examples mentioned earlier, you should now grasp the idea of regular expressions. You don't have to understand how it works internally, but it is very useful to become familiar with the related functions, whether they are built in or provided by certain packages.

If you want to learn more, RegexOne (http://regexone.com/) is a very good place to learn the basics in an interactive manner. To learn more specific examples and the full set of identifiers, this website (http://www.regular-expressions.info/) is a good reference. To find out good patterns to solve your problem, you can visit RegExr (http://www.regexr.com/) to test your patterns interactively online.
