Chapter 4. Data wrangling: from capture to domestication

This chapter covers

  • Ways to wrangle data
  • Helpful tools and techniques
  • Some common pitfalls

One definition of wrangling is “having a long and complicated dispute.” That sounds about right.

Data wrangling is the process of taking data and information in difficult, unstructured, or otherwise arbitrary formats and converting it into something that conventional software can use. Like many aspects of data science, it’s not so much a process as it is a collection of strategies and techniques that can be applied within the context of an overall project strategy. Wrangling isn’t a task with steps that can be prescribed exactly beforehand. Every case is different and takes some problem solving to get good results. Before I discuss specific techniques and strategies of data wrangling, as shown in figure 4.1, I’ll introduce a case study that I’ll use to illustrate those techniques and strategies throughout the chapter.

Figure 4.1. The third step of the preparation phase of the data science process: data wrangling

4.1. Case study: best all-time performances in track and field

While I was in graduate school, the Jamaican sprinter Usain Bolt began to astonish the world with amazing performances in both the 100 m and 200 m dashes. In the 2008 Olympic Games in Beijing, Bolt ran a world record 9.69 seconds in the 100 m dash despite celebrating his victory with outstretched arms well before he crossed the finish line. A few days later, Bolt’s time of 19.30 sec in the 200 m final broke Michael Johnson’s world record, which had been widely considered one of the best marks in the entire sport of track and field. Usain Bolt had claimed his position as the greatest sprinter in history, but he continued to improve.

At the 2009 World Championships in Berlin, Usain Bolt broke both of his own world records. He ran 9.58 sec in the 100 m dash and 19.19 sec in the 200 m dash, making all of his competitors, the fastest people in the world, appear average. Specifically, in the 100 m dash, American Tyson Gay set a national record—and the USA is traditionally good at sprinting—but was meters behind Bolt at the finish line. I was so impressed by Bolt’s performances, and so ubiquitous were discussions of “How good are Usain Bolt’s marks compared to other world records?” and “No other athlete has dominated like Bolt has,” that I decided to quantify the improbability of what Bolt had done. I wanted to settle the score for all of the armchair track and field enthusiasts out there who mostly made guesses and used anecdotes and heuristic methods to show how rare a given performance is.

4.1.1. Common heuristic comparisons

Armchair track and field enthusiasts tend to compare world records in different events by how old they are or how close anyone else has come to breaking them. Michael Johnson’s 200 m dash world record was 12 years old when Usain Bolt broke it, whereas the 100 m world record was less than a year old when Usain Bolt broke it the first time in early 2008, before breaking it again at the 2008 Olympics and the 2009 World Championships. The age of a world record probably does indicate some amount of strength of the record, but it’s certainly not a perfect measure. Was Bolt’s 19.19 sec mark for the 200 m worse than his 19.30 sec mark because the previous record was younger? Certainly not.

Sometimes people will cite the percentage improvement of a mark over the second-best mark as evidence that it’s good. The world record javelin throw by Jan Zelezný of 98.48 m is 2.9% farther than his second-best throw and 5.8% farther than anyone else’s ever. Usain Bolt’s still-current world record of 9.58 sec was 1.1% faster than his second-best time and 1.3% faster than the fastest mark by another person, which happened to be the second-place finisher in the same race, Tyson Gay. Again, this is a reasonable indicator of a good performance but is nowhere near perfect because of the high variance of those second-best performances. If, for some reason, the second-best performance had never happened, the percentage could change dramatically.

4.1.2. IAAF Scoring Tables

More sophisticated methods existed at the time I began this project in 2009, but they also had their shortcomings. The most widely accepted method for comparing performances between events in track and field among coaches, fans, and many others is a set of points tables called the IAAF Scoring Tables of Athletics. Typically every few years, an updated set of points tables is published by the International Association of Athletics Federations (IAAF). The IAAF also publishes the combined events points tables that are used in the multidiscipline events such as the men’s decathlon and women’s heptathlon, in which competitors are given points for their performances in each of the disciplines, and the winner is the athlete with the most total points. In combined events, the points tables are the basis for the competition itself. The Scoring Tables for individual events have little effect on competition, with the exception of certain track and field meetings that award prizes based on the tables.

The 2008 Scoring Tables, the most recent tables at the time Usain Bolt ran 9.58 sec in the 100 m dash, gave the world-record performance a score of 1374 (the tables’ highest score listed is 1400 points). In the next update to the Scoring Tables, in 2011, Usain Bolt’s 2009 performance received a score of 1356. This was quite a dramatic change in score, and such a large change did not occur in most events. For reference, according to the 2008 tables, 9.58 sec 100 m was equivalent to a 42.09 sec 400 m dash, whereas according to the 2011 tables, it was equivalent to a 42.37 sec 400 m. In an Olympic 400 m final, 0.28 sec is easily the difference between a gold medal and no medal. According to the Scoring Tables, Usain Bolt’s world record was getting worse.

There is a reason for this. The IAAF’s Combined Scoring Tables are known to be based on a relatively small set of the best performances in each event. The small set might be the top 10 or top 25, but I’m merely guessing, because the methods aren’t fully disclosed. If the tables are based heavily on a small set of the top performances, then a dramatic change in this set will dramatically affect the scores in the next update to the tables. The 2008 and 2009 track and field seasons produced some incredible 100 m performances, even without Usain Bolt. These performances affected the next set of scores, released in 2011. In some sense, through the Scoring Tables, Usain Bolt’s world-record performances and the strong performances of his rivals eventually made themselves less impressive.

By recounting all of this, I mean only to show how basing models on too little data can significantly distort results and conclusions. In the Scoring Tables, a few great performances within a span of two or three years drastically changed our impression of how good such performances are. I made it my goal to use all the data I could find to generate a scoring method that would not only be less sensitive to changes in the best performances but would also be a good predictor of future levels of performance, which represent a good set of out-of-sample data.

4.1.3. Comparing performances using all data available

My biggest question when starting out: how much data can I find? All manner of searching the internet and asking friends led me to conclude that alltime-athletics.com provides the most complete set of elite track and field performances available. The site provides lists of the top performances ever in all Olympic track and field events; for some events, the lists contain thousands of performances. I felt that if I used all of this data and a little statistical knowledge, I could improve on the robustness and predictive power of the IAAF’s Scoring Tables.

The first step was to wrangle the data. The lists of the top performances were on a website but weren’t available in a convenient format such as CSV, so I’d have to resort to web scraping in some way. In addition to that, I’d need to compare my scores and results with those from my leading competitor, the IAAF Scoring Tables, which were available only in PDF. Neither web pages nor PDFs are ideal for programmatic parsing, and both can get quite messy when considering HTML structure in web pages and page headers, footers, and numbers in PDFs. Simply put, it would take some wrangling.

Throughout the following sections, I’ll continue to refer to the two wrangling tasks in this track and field project:

  • Wrangling the lists of top performances from the website alltime-athletics.com
  • Wrangling the IAAF Scoring Tables from the PDF that contains them

4.2. Getting ready to wrangle

Some people like to dive in immediately and start throwing data around, but that’s not my style. I’m a bit more deliberate. I like to look around a bit before I write any code or commit to trying any strategies. I do a few things to gather some good information about the data that can help me wrangle more effectively.

In this section, first I show you what it might look like to have messy data that needs to be wrangled. Then I list a few steps that you can take before you begin to wrangle that are helpful for figuring out what you have, what you should do, and how much trouble you might be in.

4.2.1. Some types of messy data

The thing about messy data sets is that each set is messy in its own way. If all messy data looked the same, we would find a way to parse it and use it quickly and efficiently. Although I can’t possibly enumerate every way that data is messy and in need of wrangling, I can describe a few ways and hope that it helps you get an idea of the ways things can go wrong as you dive into a new data set and how you might be able to prepare for it.

If you’ve been working in data science for a few years, I’m sure you’ve seen cases like these—or worse. As of 2016, we still haven’t entered the Era of Clean Data, and I’m beginning to wonder if we’ll ever get there.

Data to be scraped

I mentioned web scraping in chapter 3, but you might also scrape data from PDFs and other unconventional sources. Scraping is the process of programmatically pulling selected elements from sources that weren’t designed to be accessed programmatically. The data to be scraped is usually poorly structured. But if you write a sophisticated scraper, the data may be neat and tidy at the end.

Corrupted data

Sometimes you may find data that’s not only poorly formatted but downright corrupted. This usually means that some aspect of the file is either missing or has been obfuscated because of a disk error or other low-level problem. It can be like losing the instructions for building a model plane or accidentally mixing up a stack of index cards that are supposed to be kept in order. A commonly corrupted file format is PST, an email archive. But thankfully we have many tools for recovering most of the data from such a corrupted file. (That’s one of the few benefits of having a file format that’s often corrupted: someone developed an anti-corruption tool!)

Poorly designed databases

Databases have growing pains too, and sometimes that means that two databases that weren’t intended to be used together are now your best sources of information. Sources of error include database values or keys that don’t match each other and inconsistencies in scope, depth, APIs, or schemas.

4.2.2. Pretend you’re an algorithm

I’ve already discussed web scraping in general, and given that the data I wanted for my track and field project was available on a website, scraping seemed like a good choice. In the Google Chrome browser I used the option View Page Source to see the raw HTML that composes the page that lists the top men’s 100 m dash results of all time. This is what a programmatic scraper will be reading.

Sometimes it’s easy to parse through HTML to extract the elements you want, but sometimes it takes a bit of imagination. I tend toward role playing; pretending to be a wrangling script—a bit of code that programmatically turns messy data into nice data—helps me figure out how I might write a script later, and it gives me a good idea of how difficult that task might be and what problems I might encounter.

The first step in wrangling, and in pretending to be a script, is to look at the raw data (for example, using View Page Source on a web page or looking at HTML in a text editor). This is what any code that you write will be seeing, and so it’s what you should look at when figuring out how to deal with it. After some header lines and other material at the top of the page, the section of the page containing the men’s 100 m dash data looks like this:

...
<A HREF="#7">faulty wind gauge - possibly wind assisted</a><br></FONT>
<center><p>
<A name="1"><H1>All-time men's best 100m </H1></a><P>
<PRE>
1    9.58    Usain Bolt    JAM    Berlin      16.08.2009
2    9.63    Usain Bolt    JAM    London      05.08.2012
3    9.69    Usain Bolt    JAM    Beijing     16.08.2008
3    9.69    Tyson Gay     USA    Shanghai    20.09.2009
3    9.69    Yohan Blake   JAM    Lausanne    23.08.2012
6    9.71    Tyson Gay     USA    Berlin      16.08.2009
...

This is the section of the page where the list of the best men’s 100 m performances starts. Because I’m writing this in 2016, there are performances on this list now that were not there when I first analyzed this data in 2011. But the format is the same.

Stepping into my role as a wrangling script, I imagine what I would do if I encountered this chunk of HTML. Nothing before this section is useful to me, so the main goal is to start capturing the top marks at this point. I recognize the line containing Usain Bolt’s 9.58 sec performance as the world record and beginning of the data I want to capture, but how will a script recognize it?

One way to recognize an athlete’s performance is to test each line of the file to see if it looks like this:

[INTEGER]    [NUMBER]    [NAME]    [COUNTRY]    [CITY]    [DATE]

Any method that tries to match each line of this HTML would have to be able to recognize integers, numbers, names, countries, and the like. This task is often more complex than it initially may seem. How, for example, would you test for a name in the third column? Would you test for capitalized words? Two words separated by a space? Would Joshua J. Johnson meet your criteria? What about Leonard Miles-Mills or the city Kuala Lumpur? All of these appear in the list. This process is less simple than it looks, and so I usually try a different strategy before resorting to pattern recognition.
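If you did go down the pattern-matching road, a rough sketch in Python 3 might look like the following. The pattern is an assumption pieced together from the sample rows above, not a tested rule for the whole site.

import re

# Assumed layout: rank, mark, name, country, city, date, separated by whitespace.
row_pattern = re.compile(
    r"^\s*(\d+)"                      # rank: an integer
    r"\s+(\d+\.\d+)"                  # mark: e.g. 9.58
    r"\s+(.+?)"                       # name: non-greedy, allows spaces, dots, hyphens
    r"\s{2,}([A-Z]{3})"               # country: a three-letter code
    r"\s{2,}(.+?)"                    # city: non-greedy, allows spaces (Kuala Lumpur)
    r"\s+(\d{2}\.\d{2}\.\d{4})\s*$"   # date: dd.mm.yyyy
)

line = "3    9.69    Tyson Gay     USA    Shanghai    20.09.2009"
m = row_pattern.match(line)
if m:
    rank, mark, name, country, city, date = m.groups()
    print(name, mark, city)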

Document structure, particularly in HTML or XML, can provide good clues to wrangling scripts about where the valuable data starts. What comes right before the data? In this case, the data is preceded by a <PRE> tag. Regardless of what this tag means, it’s worth checking to see if it appears often on the page or only right before the data set starts. Investigating this, I see that in the page, the first time the <PRE> tag appears is right before the data. And, thankfully, this is true of the pages for all track and field events, so I could safely use the <PRE> tag to denote the beginning of the data set for each of the events.

Imagining myself as a wrangling script, I would begin reading the HTML line by line, and I’d look for a <PRE> tag. When I find a <PRE> tag, I know that the next line will be a row of data in the particular table format, where the second column is the measured performance, the third column is the name, the fourth is the country, and so on. The text parser in your favorite scripting language can read that line, separate the columns into fields using either tab characters or multiple consecutive spaces, and store each text field in a variable or data structure.
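As a minimal sketch of that approach in Python 3 (the file name and the UTF-8 encoding are assumptions; the six-column layout comes from the sample above):

import re

# One saved copy of the raw page source (hypothetical file name and encoding).
with open("mens_100m.html", encoding="utf-8") as f:
    lines = f.read().splitlines()

# Find the first <PRE> tag; the data rows start on the next line.
start = next(i for i, line in enumerate(lines) if "<PRE>" in line.upper())

first_row = lines[start + 1]
fields = re.split(r"\t|\s{2,}", first_row.strip())   # split on tabs or runs of spaces
rank, mark, name, country, city, date = fields
print(name, mark)   # Usain Bolt 9.58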

4.2.3. Keep imagining: what are the possible obstacles and uncertainties?

Now that you know where the valuable data starts within each HTML page and how to begin capturing lines of data, you still have to continue playing the role, in your mind, of the wrangling script, as it continues line by line through the file.

As I scroll downward through the raw HTML of men’s 100 m performances, most lines look like they should: performance, name, country, and so on, all in their appropriate places. But every now and then there’s a funny character sequence. Sometimes in the city column I see the text &uuml; or &eacute; between some other letters. These entries seem weird, and I worry that a script (wait, that’s me!) might not know what to do. I look for the corresponding places in the rendered HTML (the web page itself) and realize that these character sequences are the HTML representations of ü and é, respectively. I also could perform an internet search to figure out what those character sequences mean.
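If you do end up working with the raw HTML, Python 3’s standard library can translate these entity sequences for you:

import html

# html.unescape converts HTML entities back into the characters they represent.
print(html.unescape("Z&uuml;rich"))      # Zürich
print(html.unescape("San Jos&eacute;"))  # San José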

Zürich and San José occur fairly often in the list of locations for some of the top performances, so it’s obvious I need to worry about characters that are represented by special character sequences in HTML. Are there others I haven’t seen yet? Should I be wary whenever an ampersand occurs in any field? The answer to each of these questions is likely yes, but I’d like to draw attention to the fact that I don’t know yet.

This is an important point in data wrangling: lots of things can happen that you might not expect. Therefore, if there is one tenet of wrangling, it is this:

Double-check everything.

A manual second check is preferred, but a programmatic check can work, too, if you’re careful. If I don’t have an absurd amount of data, as a second check I like to scroll through the post-wrangle, supposedly clean data files. Sometimes a quick scroll-through can reveal obvious mistakes that would otherwise take an hour of software debugging to find. What if a column of data shifted somewhere? One extra tab character can mess up a lot of parsing algorithms, including some standard R packages and the occasional Excel import, among others.
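A programmatic second check can be as simple as verifying the shape of every wrangled row. A sketch, assuming the wrangled output is a tab-separated file with six columns and the mark in the second column (the file name is hypothetical):

with open("wrangled_100m.tsv", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            print("line %d: expected 6 fields, got %d" % (lineno, len(fields)))
        else:
            try:
                float(fields[1])   # the mark should parse as a number
            except ValueError:
                print("line %d: mark %r is not a number" % (lineno, fields[1]))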

What else might go wrong while trying to wrangle some data programmatically? What in the file(s) looks a little weird? What checks can you do afterward that might uncover some important mistakes in your wrangling? Every case is different, but every case deserves careful and deliberate thought about any of the possibilities for parsing error. Yet again, awareness is the most important thing a data scientist can have at this point.

4.2.4. Look at the end of the data and the file

If I’m pretending to be a wrangling script, I’m going to start at the beginning of the file and finish at the end. Many unexpected things can happen in the middle, so there are no guarantees that I’ll arrive at the end of the file in the expected state. I may have wrangled people’s names in the place of cities and dates in the place of countries of origin. Who knows, unless I’ve had the amazing luck to write a bugless script for a data set without irregularities? Does such a script or such a data set exist? Assuming that’s not the case—and assuming I’ve made at least one mistake in my current conception of a wrangling script—I probably should examine the wrangled data file(s) at the beginning, at the end, and at least a few places in the middle.

It takes a lot of scrolling through the list of men’s 100 m performances to find the first line of HTML that doesn’t belong in the data set. It took me an embarrassingly long time before I realized that there were additional, nonstandard lists of best performances at the bottom of the pages. For the men’s 100 m page, there are lists for rolling start and manual timing after the main list. These lists look identical to the main list but are separated from the main list by a few tags of HTML. The transition from the main list to the first auxiliary list looks like this:

...
2132  10.09    Richard Thompson    TTO    Glasgow      11.07.2014
2132  10.09    Kemarley Brown      JAM    Sundsvall    20.07.2014
2132  10.09    Keston Bledman      TTO    Stockholm    21.08.2014
</pre></center>
2401 total
<p><center>
<A name="2"><H3>rolling start</H3></A><P>
<PRE>
1     9.91    Dennis Mitchell      USA    Tokyo        25.08.1991
2     9.94    Maurice Greene       USA    Berlin       01.09.1998
3     9.98    Donovan Bailey       CAN    Luzern       27.06.2000
...

The HTML tag that denoted the beginning of the desired data, <PRE>, is closed at the end of the main list with the </pre> tag closure. (Note that HTML tags are generally not case sensitive.) This would be a good way to end the parsing of the data set. As a wrangling script, that’s what I’d do, unless I want to capture the auxiliary lists as well. Do I? In this case, no I don’t. I want only completely legal marks that abide by all world-record-eligible rules, because I want to compare world records and every performance that came close. It’s probably not helpful to consider performances under conditions that are nonstandard and inherently different from the core set of legal performances.

If my wrangling script ignored the end of the useful data set, in this case the </pre> tag closure, and assumed that the data continued until the end of the file, the script would probably collect the nonstandard results at the bottom of the page. Or the script wouldn’t know what to do when the data stopped fitting the appropriate column format that was established. In this case and many others, looking at the end of the wrangled data file is crucial to determining whether the data wrangling was successful. It’s up to you, the data scientist, to decide which aspects of wrangling are most important and to make sure those aspects are completed properly. But it’s almost always important to scan the file to the end, if for no other reason than to make sure that nothing weird happened right before the script finished.
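Extending the earlier sketch, the stop condition is one extra check inside the loop; the six-field test is an assumption based on the sample rows:

import re

def extract_main_list(lines):
    """Sketch: keep only the rows between the first <PRE> tag and the
    </pre> closure that ends the main list, skipping the auxiliary lists."""
    rows = []
    in_data = False
    for line in lines:
        tag_line = line.upper()
        if not in_data:
            in_data = "<PRE>" in tag_line
            continue
        if "</PRE>" in tag_line:          # end of the main list; stop here
            break
        fields = re.split(r"\t|\s{2,}", line.strip())
        if len(fields) == 6:              # ignore blank or malformed lines
            rows.append(fields)
    return rows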

4.2.5. Make a plan

Now that I’ve discussed the process of imagining yourself as a wrangling script, parsing through raw data and extracting the parts that are needed, the next step is to consider all the information that you’ve gathered so far and to make a plan for wrangling.

Based on what you’ve seen in the track and field data, a good option would be to download all the web pages covering the Olympic track and field events and parse them as we discussed, using HTML structure where appropriate and double-checking everything. One complication I haven’t yet mentioned, if you’re going to use a pure web-scraping strategy, is that you need a list of the web addresses of the individual event pages in order to download them programmatically. Sometimes such a list is easy to generate, but here each of the 48 pages (24 Olympic events each, for men and women) has a unique address that has to be copied or typed by hand. So you’d need to manually assemble a list of 48 web addresses and then write the wrangling script that parses the HTML, as already discussed. That’s one potential plan. But because each page address has to be copied or typed manually anyway, programmatically visiting those 48 addresses wouldn’t save much time over simply downloading each page by hand. Those were two reasonable options, but they weren’t the only ways to wrangle the data I needed.

In the end, I decided not to go with web scraping. Parsing through HTML structure wouldn’t have been incredibly hard, as you’ve seen, but I realized there was another way. The pages of alltime-athletics.com aren’t heavy with HTML. The HTML and the text that appears on the web page as rendered HTML aren’t too different. There’s not much styling on the page, particularly in the part of the page containing the data I needed. In both cases, the top performances were given as a whitespace-separated table, with one performance per line. In that way, there was little difference between raw HTML and rendered HTML. As I’ve already mentioned, however, some individual characters appeared differently in raw HTML than they did in rendered HTML. For example, in raw HTML, a city might appear as Z&uuml;rich but in rendered HTML it would appear as Zürich.

I decided to take the post-HTML, already rendered web page version so I didn’t have to worry about character rendering or any other HTML-parsing issues. I needed data from 48 pages, and instead of writing (or pretending to be) a script that would download each of these pages and parse the HTML for each Olympic event, I decided to copy the text from each of the web pages myself.

I would visit each of the 48 web pages with Chrome, my web browser, press Ctrl-A to select all text on each page, press Ctrl-C to copy the text, and then press Ctrl-V to paste the text from each event’s page into a separate flat file. It was some manual work to copy the pages, but I wouldn’t have to worry about translating HTML or scripting the downloading of the pages. That was the plan I settled on: skip the HTML parsing, copy all the pages manually, and then write a simple script that would pull the track and field performance data into a usable format. It might not be the best plan for every project, but it’s certainly good in this scenario.
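The “simple script” for one pasted page might look something like the following sketch. It assumes the six-column layout from the sample above and uses the “N total” line, visible in the raw HTML shown earlier, as the end of the main list; the file names are hypothetical.

import re

def rows_from_pasted_page(path):
    """Sketch: keep lines shaped like a data row and stop at the 'N total'
    line that closes the main list (worth double-checking for each event)."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if re.match(r"^\s*\d+ total\s*$", line):
                break
            fields = re.split(r"\t|\s{2,}", line.strip())
            if len(fields) == 6 and fields[0].isdigit():
                rows.append(fields)
    return rows

# One manually copied flat file per event (hypothetical names).
event_rows = {name: rows_from_pasted_page(name + ".txt")
              for name in ["mens_100m", "mens_200m", "womens_100m"]}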

In general, the choice of data wrangling plan should depend heavily on all of the information you discover while first investigating the data. If you can imagine parsing the data or accessing it in some hypothetical way—I try to play the role of a wrangling script—then you can write a script that does the same thing. Pretend you’re a wrangling script, imagine what might happen with your data, and then write the script later. Data wrangling is such an uncertain process that it’s always best to explore a bit and to make a wrangling plan based on what you’ve seen.

4.3. Techniques and tools

Data wrangling is an incredibly abstract process with uncertain outcomes at nearly every step. There’s no one way or one tool to accomplish the goal of making messy data clean. If someone tells you they have a tool that can wrangle any data, then either that tool is a programming language or they’re lying. Many tools are good for doing many things, but no one tool can wrangle arbitrary data. Honestly, I don’t think that will ever be possible, though we can certainly make progressively better tools. Data exists in so many forms and for so many purposes that it’s likely that no one application can ever exist that’s able to read arbitrary data with an arbitrary purpose. Simply put, data wrangling is an uncertain thing that requires specific tools in specific circumstances to get the job done.

4.3.1. File format converters

Among HTML, CSV, PDF, TXT, and any other common formats a file might take, it’s helpful to know when it’s possible to convert from one file format directly to another.

The IAAF Scoring Tables are published as a PDF, a format that’s not conducive to data analysis. But there are file format converters that can take PDFs and produce other file formats, such as text files and HTML. The Unix application pdf2txt extracts text from PDFs and saves it to a text file; pdf2html is a similar converter that produces HTML instead. Tools like these are worth knowing about for most data scientists.

Many file format converters are available—many of them are free or open source—and it’s certainly worth a Google search to figure out whether the format you have is easily convertible to a format you want.

4.3.2. Proprietary data wranglers

In 2016, there’s no shortage of companies that want to wrangle your data for you—for a price. In some ways, it’s unsettling how many software products claim to be able to do this, when I know from personal experience that many of them are severely limited. On the other hand, some of these products work well.

If you can find any proprietary products that can convert the data you have into the data you want, then they’re well worth the consideration. Spending a little money on these proprietary tools may be worth it if your project gets done much earlier, but the industry is young and changing too fast for me to give any sort of meaningful survey here.

4.3.3. Scripting: use the plan, but then guess and check

Earlier in this chapter, I talked about pretending that I was a wrangling script and described what I would do in that role. This is exactly the basis for writing a real script for wrangling data.

As you imagine being a script and reading through files, recognizing where useful data begins and ends and how to parse that data, you should get an idea of how complex the task is and what the main features might be. For simpler tasks, simple tools often suffice. The Unix command line, for instance, is helpful for tasks like these:

  • Extracting all lines from a file that contain a certain word (grep is one such tool)
  • Converting occurrences of a certain word or character into another (tr, sed, awk)
  • Cutting files into smaller chunks (split)

Those are a few, but if your wrangling plan entails a few such simple steps, the command line might be the way to go.
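For comparison, the same three chores take only a few lines in a scripting language as well; a rough Python sketch with hypothetical file names and chunk size:

with open("results.txt", encoding="utf-8") as f:
    lines = f.readlines()

bolt_lines = [l for l in lines if "Usain Bolt" in l]                  # like grep
cleaned = [l.replace("&uuml;", "ü") for l in lines]                   # like sed/tr
chunks = [lines[i:i + 1000] for i in range(0, len(lines), 1000)]      # like split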

But if you have a more complex set of operations to perform, and none of the aforementioned file format converters or proprietary wranglers fits the bill, then a scripting language is usually your best bet. Both Python and R are common languages for writing wrangling scripts, and it’s usually best to choose whichever language you’re most comfortable with, because you’ll probably be trying several different techniques in a short amount of time.

Writing a wrangling script, at least the first version, is usually not a well-orchestrated affair. In fact, it fits nicely with a current usage of the word hacking to mean trying a bunch of things until you find a few that get the job done. Being able to load, manipulate, write, and transform data quickly is the most important capability you should strive for when choosing your scripting languages or tools.

Because you’ve already imagined being a script and parsing the data, you should have a good general idea what the final script will do, but I’m sure that some of the details are still unclear. You can make some informed decisions about how best to wrangle, or you can guess and check if that seems more time efficient. Now is also a good time to revisit the manual-versus-automate question: can you wrangle manually in a shorter time than you can write a script? And will you reuse the script? Go ahead and take whichever path seems like the biggest payoff for the lowest cost in time and effort.

Awareness is important while writing a wrangling script. This includes staying aware of the status of the data, the script, the results, what the goals are, and what each of the potentially many wrangling steps and tools is gaining you.

4.4. Common pitfalls

With messy data, any wrangling script conceived based on looking at only part of that data is bound to have omitted something important. Even if you were observant and thorough in considering all possibilities, there’s still a risk that something might go wrong. In this section, I outline a few of these pitfalls and give some noticeable symptoms that your wrangling script may have fallen into one.

4.4.1. Watch out for Windows/Mac/Linux problems

When I first began wrangling data, in 2005 or so, major data incompatibilities between the most popular operating systems were a problem that I didn’t think would still exist in 2016. But the three major OSs still haven’t agreed on a single way to denote line endings in text files. Any day now, all specs for text files might magically be unified across OSs, but until then there will be issues when moving files from one operating system to another.

Since time immemorial (since the 1970s), Unix, and later Linux by inheritance, has used the line feed (LF) character to denote a new line, whereas classic Mac OS, up through version 9, used the carriage return (CR) character. With the introduction of Mac OS X, the Mac joined the other Unix derivatives in using the line feed to denote a newline, but Microsoft Windows continues to use a hybrid CR+LF line ending.

I won’t attempt to address any specific problems here; they can be varied and nuanced. But I should stress that they can sneak up on you. Improperly parsing line endings can lead to various problems, so it pays to be wary of the output of the parsing. In addition to this and other OS complexities, each programming language has its own faculties for reading files of various types. Depending on which package or native method you choose to read files, make sure you pay attention to how it parses text, line endings, and other special characters. This information should be present in the language documentation.
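For example, Python 3’s default text mode already translates all three conventions to LF when reading, but when in doubt you can normalize the bytes yourself before parsing. A minimal sketch, with a hypothetical file name and an assumed UTF-8 encoding:

with open("scores_table.txt", "rb") as f:
    raw = f.read()

# Order matters: convert CR+LF first, then any remaining lone CR.
text = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").decode("utf-8")
lines = text.split("\n")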

When looking at your newly wrangled data, some signs that you may have encountered an issue with OS file formats include the following:

  • You seem to have far more lines of text than you think you should have.
  • You seem to have far too few lines of text.
  • Weird-looking characters are interspersed through the file.

4.4.2. Escape characters

When dealing with text, some characters have special meaning. For example, on the most common Unix (or Linux) shells (command-line interpreters), the character * is a wildcard that represents all files in the current directory. But if you precede it with the common escape character as in \*, then the special meaning is removed, representing only the simple asterisk character. The line

rm *

removes all files in the current directory, whereas

rm \*

removes the file named simply *, if it exists. In this case, the backslash character is said to escape the wildcard character.

Such escape characters can occur in text files and text/string variables in programming languages. For instance, let’s say you want to read a text file that looks like this in a text editor:

this is line 1
this is line 2
A    B    C    D

This is three lines of text, followed by a line containing no characters. The third line contains tab characters in between the letters.

Many common programming languages, including Python and R, upon reading this file with the simplest file reader (or file stream), would see this:

this is line 1\nthis is line 2\nA\tB\tC\tD\n

Notice that the line breaks have been replaced by \n and the tab characters by \t. This contains the same information as before, but it’s encoded in a single string, simply by using ASCII characters (n and t) with escape characters (backslashes) to represent whitespace elements like the newline and the tab.

Another complication: creating string variables in languages such as Python or R includes the use of quotation marks, and quotation marks can affect escaping in various ways. Let’s put the previous string in double quotes, assign it to a variable in Python (the >>> represents the Python interpreter prompt), and then check the contents of the variable, first by entering only the variable name itself and then by using the print command:

>>> s = "this is line 1
this is line 2
A	B	C	D
"

>>> s
'this is line 1
this is line 2
A	B	C	D
'

>>> print s
this is line 1
this is line 2
A    B    C    D

Note that the two checks of the variable contents confirm that it’s a string (first shown in single quotation marks) and that this string renders as in the original file when making use of the print command, which translates the escaped characters as the appropriate newline and tab characters. In this way, a single string variable can represent the data from an entire file, with each line separated by the escaped newline character.

The fun begins when using escape characters within quotations or quotations within quotations. Let’s say you have a second file that looks like this in a text editor:

I call this "line 1"
I call this "line 2"
There are tabs in this quote: "A    B    C"

Storing this file in a single Python variable and then checking its contents looks like this:

>>> t = "I call this "line 1"
I call this "line 2"
There are Tabs in this quote: "A	B	C""

>>> t
'I call this "line 1"
I call this "line 2"
There are Tabs in this quote: "A	B	C"'

>>> print t
I call this "line 1"
I call this "line 2"
There are Tabs in this quote: "A    B    C"

That’s clearly a bit more complicated than the previous example. Because the text itself contains quotation marks, these need to be escaped within the string variable.

As a final example of escape characters, let’s assume you have a data set containing some emails. People could have written anything they want in the emails, including quotation marks, tabs, or anything else. For an unspecified reason, you want to store the email data in a text file, with one email per line. One email reads:

Dear Edison,

I dislike "escaped characters".

-N

This email could be encoded as the following string:

Dear Edison,\n\nI dislike "escaped characters".\n\n-N\n

And it could be stored in Python as a string variable like this, escaping the internal quotations:

>>> s = "Dear Edison,

I dislike "escaped characters".

-N
"

Now you want to write this email to one line of a file, so you check the contents of the variable and then print/render them (the print command replicates in the terminal what would be written to a file):

>>> s
'Dear Edison,\n\nI dislike "escaped characters".\n\n-N\n'

>>> print s
Dear Edison,

I dislike "escaped characters".

-N

This isn’t what you intended. The print process is rendering the newlines, and this data is no longer on a single row. It looks like you have to escape the newline characters again, as shown here:

>>> s = "Dear Edison,\n\nI dislike "escaped characters".\n\n-N\n"
>>> print a
Dear Edison,

I dislike "escaped characters".

-N

Now it’s printing on a single line, as intended, because you used double backslashes. Keep in mind that in the assignment to the variable s, the sequence \\n represents a literal backslash (only one; the first backslash escapes the second) plus the character n. Sometimes this can be confusing, so it’s best to be careful when dealing with complex escapes.

It doesn’t have to stop at two backslashes. The backslash is a common escape character, and you may not have to worry much about it if you’re using smart file readers and writers, but if you’re using many nested quotation marks and newlines, escaping can be problematic. One time in R, I was using five or six backslashes in a row because of some nuanced escaping. It was quite astonishing for me to realize that I needed that many backslashes in front of a single n to accomplish my task, because none of the file reader’s parameters seemed to get everything right.
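As a small illustration of how the layers stack up, here is roughly what the counting looks like at a Python 3 prompt:

s = "\\n"          # source code: backslash, backslash, n
print(len(s))      # 2 -- the string holds one backslash and one letter n
print(s)           # \n
print("a\\nb")     # a\nb  (a literal backslash and an n, not a line break)
print("a\nb")      # a, then a real line break, then b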

Symptoms that you might have a problem with escaping may include the following:

  • Some lines or strings are too long or too short.
  • You try to read a file line by line but end up with one long line.
  • You find text inside quotation marks that doesn’t belong there.
  • You get errors while reading or writing files.

4.4.3. The outliers

Sometimes a data point contains a perfectly valid value, from a logical perspective, but that value isn’t realistic from the perspective of a subject matter expert. In our track and field example, at some point on the list of top 100 m dash times, the number 8.9 may have appeared. Now 8.9 is a perfectly logical number from a statistical perspective, but anyone who knows anything about track knows that no one has ever run 8.9 seconds in the 100 m dash. It might be that the 8.9 resulted from a typo somewhere along the line, or perhaps one row wasn’t formatted right, or any other imaginable reason for a data point to be in the wrong place. But the point is that sometimes incorrect data can sneak its way into your project without causing an error or otherwise making itself obvious. This is where summary statistics and exploratory graphs can be helpful.

In this case, merely checking the range of values—minimum to maximum—could have caught the error. During this project I plotted histograms of all of the data, not only to check for errors but also to gain some awareness about what the data sets looked like. A histogram probably would have highlighted a data error like this as well. It’s usually a good idea to spend a little time generating some summaries—statistical or visual—even if they don’t seem necessary. They can prevent errors and promote awareness. I cover data assessment, including basic descriptive statistics, summaries, and diagnostics, in chapter 5, and some of those techniques can be used here as well, to make sure the wrangling has been successful.
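As a rough sketch of that kind of check, with a handful of the 100 m marks from earlier standing in for the full wrangled list (the 9.4-second cutoff is just an arbitrary sanity threshold for this event, not an official figure):

import matplotlib.pyplot as plt

marks = [9.58, 9.63, 9.69, 9.69, 9.69, 9.71]   # replace with the full wrangled list

print("min:", min(marks), "max:", max(marks))
print("implausibly fast:", [m for m in marks if m < 9.4])

plt.hist(marks, bins=50)
plt.xlabel("100 m time (seconds)")
plt.ylabel("count")
plt.show()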

4.4.4. Horror stories around the wranglers’ campfire

Any seasoned data scientist has a story or two to tell: data ghost stories that haunt a data scientist for years, The One That Got Away, the near-miss when disaster was averted by a last-minute detection of The Bug, the time “the data was SO bad... (how bad was it?).” I have nothing more to add here; you’ll get most of my stories throughout this book, but don’t be afraid to ask your local veteran data scientists about theirs. The stories are probably pretty nerdy, but hey, aren’t we all?

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, imagine these scenarios:

1.

You’re about to begin pulling data from FMI’s internal database, which is a relational database. List three potential problems that you would be wary of while accessing and using the data.

2.

In the internal database, for each financial transaction you find that a field called description containing a string (plain text) seems to provide some useful information. But the string doesn’t seem to have a consistent format from entry to entry, and no one you’ve talked to can tell you exactly how that field is generated or any more information about it. What would be your strategy for trying to extract information from this field?

Summary

  • Data wrangling is the process of capturing the useful parts of raw, unkempt data, often under less-than-ideal conditions.
  • Think like a computer program in order to create a good data-wrangling script.
  • It helps to familiarize yourself with some typical tools and strategies and some common wrangling pitfalls.
  • Good wrangling comes down to solid planning before wrangling and then some guessing and checking to see what works.
  • Spending a little extra time on data wrangling can save you a lot of pain later.