Chapter 12. When Databases Attack: A Guide for When to Stick to Files

Tim McNamara

My Masters dissertation still feels like a personal defeat. At least four months of the nine-month project were sabotaged because I didn’t understand the implications of the technology choices that I was making. This chapter will be a bit of a postmortem of the project and a walk-through of a strategy I should have used: storing plain-text data on disk, instead of in a database.

History

But first, a little more about my story. In 2010, I was undertaking a Masters in Public Policy at Victoria University of Wellington. I was focusing my efforts on the arguments surrounding open data within the science sector. Specifically, I wanted to know: do the arguments made by academics, officials, politicians, and the public align? I had the sense that open data and open government meant quite different things to different people, and I wanted to quantify that.

Getting access to information about what officials thought about open data was fairly easy. There was a major review of New Zealand’s publicly funded research bodies being undertaken at the time. Policy advice between departments is available under the Official Information Act 1982, which is New Zealand’s freedom of information statute. Information from politicians is even easier to find, as they talk all the time. All press releases are syndicated via scoop.co.nz. But what about the views of the public at large?

The political blogosphere is very active. While blogs have diminished in quality and depth with the maturity of social media, passionate people seem to draw themselves willingly into the flame of political debate.

Building My Toolset

I needed to learn text analysis and natural language processing (NLP) techniques, but my supervisor and I were confident that it would be possible to draw some inferences from this large corpus of material. The general approach was to crawl the New Zealand political blogosphere, identify discussion threads about open data, identify people who comment on this topic, then pull out the themes of these comments. I also intended to spend some time classifying commenters. Perhaps we could see whether there was something special about commenters interested in open data.

The dissertation itself was a success, but none of my work from the blogosphere could be included. Partway through the project, I believed the web crawling would be my greatest challenge. It looked easy at first. Blogs have articles. I wanted those articles. Sadly, they also have more than articles. They have comments, dates, locations, tags, and links. Commenters often have counts associated with them. Comments themselves have ratings.

This complexity in the data extraction process really slowed the development of the crawlers. I had several problems with missing data and exceptions. A breakage in the extraction process would break the crawl, and a broken crawl that was not cached required restarting the crawl from the beginning. I was able to clean up and streamline my extraction processes, though, so the crawling became less of a problem over time.

As it turned out, the real problem was storage, because I needed more ad-hoc query support than the tool could handle.

The Roadblock: My Datastore

The database that I chose to use was CouchDB. There were several reasons why I made this decision:

  • I do not have a computer science background, and I liked the idea of sending JSON strings to the database and having it figure out how to store things appropriately.

  • It used an HTTP API, which I felt like I could understand.

  • The MapReduce paradigm promised high performance.

  • Documents are versioned.

  • CouchDB’s schemaless nature was a good fit, as I was consistently adding new fields to the JSON I generated from sucking it out of web pages.

  • I was quite interested in getting a deep understanding of a database that looked likely to become rather influential, if not preeminent, within the web world.

CouchDB was approachable, friendly, and fast to get data into. The wider community seemed to be doing some very impressive things with it. While I am sure that fashion played a part, there were genuinely good reasons, from a newcomer’s point of view, why CouchDB looked like the right system. MapReduce is purportedly very scalable, and as CouchDB supports MapReduce queries, I had assumed that this would be the perfect fit. That said, there were two hurdles that I wasn’t able to surmount.

CouchDB provided a poor debugging experience for developing mappers and reducers. I had no idea how to interpret the Erlang/OTP stacktraces. Errors tended to pop up only after a number of problems had already occurred, and I found little guidance about what was wrong with my JavaScript. The error messages were, effectively, “something broke.” So I would try again. And again. Finally, something would work. That said, the only reduce functions I ever implemented were sums; I could not do anything more complex. Because errors in reduce functions would only appear once all of the maps had been undertaken, the develop/run/break/debug/develop cycle would take (at least) several minutes per iteration. In the end, I decided to create only mapping functions and to write the reducers manually against the resulting views.

The debugging story can be improved by slicing a small portion of the database out and developing new queries there. That way, with orders of magnitude fewer documents, problems arise much more quickly.
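
For example, a minimal sketch of this approach in Python, assuming a local CouchDB instance reachable over HTTP and the third-party requests library (the database names here are hypothetical):

import requests

COUCH = "http://localhost:5984"          # assumed local CouchDB instance
SOURCE, SCRATCH = "crawl", "crawl_dev"   # hypothetical database names

# Create the scratch database (CouchDB responds with 412 if it already exists).
requests.put("{}/{}".format(COUCH, SCRATCH))

# Fetch a small sample of documents from the full database.
resp = requests.get(
    "{}/{}/_all_docs".format(COUCH, SOURCE),
    params={"include_docs": "true", "limit": 500},
)
docs = [row["doc"] for row in resp.json()["rows"]]

# Strip revision tokens so the documents can be inserted as brand-new copies.
for doc in docs:
    doc.pop("_rev", None)

# Bulk-insert the sample; new views can now be developed against the small copy.
requests.post("{}/{}/_bulk_docs".format(COUCH, SCRATCH), json={"docs": docs})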

The real problem came when I wanted to run a new query on the dataset I had created. There ended up being about 70GB worth of blog posts, comments, and metadata within the database. (You may recall, I’d already stripped out the HTML formatting during the crawl stage. This was 70GB of actual text.) I learned that, to conduct NLP work within CouchDB’s mapping functions, building an index took days on my inadequate hardware. Worse still, things could still fail right near the end. That’s what happened to me. At that point, I decided to cut my losses. The data still sits on an old hard drive somewhere, silently taunting me.

This is not a chapter about the failings of CouchDB. It’s a chapter about how not understanding the tools that you’re running on can have very negative consequences. CouchDB is designed for a read-heavy workload, such as a high-volume website. It processes new data for a view once that view has been requested by a client. On a high-volume website, new hits arrive all the time, so serving any given request means processing only a few new documents at most. Under a web crawling scenario, this is not the case: an enormous number of documents is inserted before the view is ever requested, so all of that processing is deferred until the first query.

In case you were wondering, my dissertation passed well using the material from traditional sources.

In the sections that follow, I’ll explain how to use files and the filesystem as your database. Surprising though it may be, the filesystem can make for a very useful datastore.

Consider Files as Your Datastore

Sometimes, you’re better off skipping the database and using the filesystem as your datastore. Consider the following:

Files Are Simple!

Simplicity and ubiquity are virtues that are very hard to find in technology products. There is simply less bureaucracy to deal with when you use files.

Files Work with Everything

Just about everything will (sort of) know how to process your data. Utilities will need coaching to understand how to extract structure from your files. This is generally not a significant barrier to entry, though. If you stick with a few conventions, such as using common and consistent delimiters, then you will be fine.

Compare this to a traditional database system, for which you will often need a specific adapter, which may in turn have its own API. If you’re using an abstraction layer, such as an object-relational mapper (ORM), then you need to understand its unique syntax. Recent databases often provide an HTTP API, but this is still more of a burden than opening and reading a file.

Being able to have everything read your data makes life as a data wrangler easier. It’s quite easy to swap between programming languages and tools. (Perhaps you want to use R’s ggplot2 for plotting, Python’s NLTK for natural language processing, Octave for familiar matrix processing, and Fortran for numerical efficiency.) Hadoop and other implementations of MapReduce will typically work with files. It is easy to serve the data in an ad-hoc fashion. A single line of Python can create a (temporary) web server: python -m SimpleHTTPServer 8080 (python -m http.server 8080 under Python 3) will make the current directory contents available, via web browser, on your machine’s port 8080.

Also, OS tools can replicate functionality you may miss when you forego a traditional database. For example, the inotifywait tool can watch for changes and run a command when a given file is modified.

Files Can Contain Any Data Type

Traditional database engines do not handle large, unstructured file data (often referred to as binary blobs) well. Databases prefer to work with fairly regularly sized records in order to efficiently store tables and documents on disk. If records vary greatly in size, it is far more difficult to optimize how things are stored. Binary blobs get in the way.

By comparison, a file can contain any type of data, structured or unstructured.

This flexibility does, however, incur a cost when storing or retrieving typed data. For example, while a person will see that 3.14159 is a number, in a file this data is just a series of characters. Something has to tell your program that this is a number, and that may involve some translation overhead. Similarly, you can store objects in a file by defining a serialization process, but you then need to write a routine to deserialize that data in your application. That can be a burden, both in terms of developer time and processing time. Despite this, files can be surprisingly useful for getting off the ground.
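
As a small illustration in Python (standard library only), the same value read straight from a text file arrives as characters, while a serialization format such as JSON carries enough information for the parser to restore the type:

import json

record = {"name": "Tim McNamara", "pi": 3.14159}

# Serialize: the float is written out as a plain series of characters.
with open("record.json", "w") as f:
    json.dump(record, f)

# Read the raw contents back: at this level, everything is just text.
with open("record.json") as f:
    raw = f.read()
print(type(raw))                 # str: just characters

# Deserialize: json restores the structure and the number's type.
restored = json.loads(raw)
print(type(restored["pi"]))      # float: the type has been recovered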

Data Corruption Is Local

In case of a hard shutdown, any corruption to a database’s files can create a deep mess. Damage to ordinary files, by contrast, is generally localized to those files. Furthermore, file systems work hard to prevent damage in the case of power failure. (Disk controllers will sometimes lie to the file system and tell it that they’ve already written to disk when they haven’t. This can cause problems, but they’re reasonably rare.)

File systems are really smart, and they take care of many things for you. Journaling, for example, records what is about to be written before the write happens, which helps keep the filesystem consistent after a crash. File systems also handle data that varies greatly in size much better than databases do. Database relations (aka tables) are optimized for rows of roughly consistent size, which is what enables their search strategies; file systems simply don’t care about data types. Everything is just bytes.

They Have Great Tooling

When you use a database system, you miss out on the powerful, built-in OS utilities to cut and slice the contents. Many such tools are heavily tested (having been run billions of times over the years), reliable, and are generally implemented in C for performance. grep is one of the fastest ways to hunt for things. Remote replication is only an rsync away. A word count for a file takes two characters: wc. A version control system such as Subversion or Git will help you maintain a history of your changes over time.

There’s No Install Tax

Working with files has a very low barrier to entry. You can eliminate a lot of friction by not needing to install client libraries, compile drivers, or worry about a schema.

Files certainly aren’t ideal for every situation. They’re best suited for cases that are read-heavy, require few modifications, and incur minimal cost to translate the data into a typed representation. That makes logs (such as webserver logs), recorded events, web crawls, large binary data, and sensor readings all suitable use cases. Sensor readings do not change over time; history is fixed, so there is often little need for features such as multiwrite concurrency. A similar case is serving static web pages. Dynamic websites that have been converted to files by a static site generator scale much better than sites serving dynamically generated content.

File Concepts

Working with file datastores isn’t all roses. You’d do well to consider different file types, formats, and possible issues before you move to an all-file setup.

Encoding

An encoding is a standard for converting zeros and ones to characters. Traditional encodings, such as ASCII, have a translation table that converts a bit sequence such as 1000001 (65 in decimal) to the capital letter A. Unfortunately, a single byte can only distinguish 256 values. Unicode adds some complexity, but offers a lot more flexibility. One Unicode encoding, UTF-8, is variable length and can represent accented characters either directly or by annotating a base character with combining marks carried in supplementary bytes.
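
A quick illustration in Python: plain ASCII characters take a single byte in UTF-8, characters outside the ASCII range take more, and an accent can also be stored as a separate combining character in supplementary bytes:

# One byte per character for plain ASCII...
print("A".encode("utf-8"))          # b'A'

# ...but two bytes for a character outside the ASCII range.
print("é".encode("utf-8"))          # b'\xc3\xa9'

# The same visible glyph can also be built from a base letter plus a
# combining accent (U+0301), carried in supplementary bytes.
print("e\u0301".encode("utf-8"))    # b'e\xcc\x81'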

Text Files

Text files are, simply, files that adhere to an encoding. They are typically printable, in that the human eye can read their contents.

Binary Data

Think of this as files that do not conform to a well-known encoding for representing text. While bytes from binary files may happen to contain printable characters, printing them as text will produce gibberish. Some binary formats require special tools to translate the raw bytes into meaningful representations.

Memory-Mapped Files

Memory-mapped files mirror the in-memory representation of data, to improve performance. While the CPU does not distinguish between data that comes from memory and data that comes from the disk, your users will, because the former can be tens of thousands of times faster than the latter.
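
Python exposes memory-mapped files through its standard library; here is a minimal sketch (the filename is hypothetical):

import mmap

# Map an existing file into memory; the operating system pages data in on
# demand, so large files can be accessed without reading them up front.
with open("crawl.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)        # length 0 maps the whole file
    header = mm[:16]                     # slice it like a bytearray
    position = mm.find(b"Tim McNamara")  # search without an explicit read loop
    mm.close()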

File Formats

Formats set standards for how data is represented within a stream of consecutive bytes. Table 12-1 includes a few different representations of some public data about me:

Table 12-1. Format overview

Python literal syntax:
{'country': 'New Zealand', 'name': 'Tim McNamara'}

JSON:
{"country": "New Zealand", "name": "Tim McNamara"}

YAML:
{country: New Zealand, name: Tim McNamara}

CSV:
name,country
Tim McNamara,New Zealand

Python pickle format:
(dp0
S'country'
p1
S'New Zealand'
p2
sS'name'
p3
S'Tim McNamara'
p4
s.

S-expressions:
((country, "New Zealand"), (name, "Tim McNamara"))

Each of these representations performs roughly the same function, differing in many details at the edges. There are several choices for what might be optimal. This lends itself to trivial debate, sometimes known as bikeshedding.[59]

One way to decide which format should be used is to consider the needs of its audience. If people are expected to read and potentially even edit the files, then you should optimize for human readability and editability; syntax errors are frustrating to experienced programmers and intimidating to novices. If the audience for your data is computer parsers, then you should use something like Google Protocol Buffers,[60] Apache Thrift,[61] or JSON. If your audience is your future self or someone looking at the data in ten years’ time, you should use an open format that has existed for decades. When you use open formats, you increase the chances that you’ll actually be able to read the files in the future.

A file format is a method for translating files into information. There is no “best” file format. None of the representations above are very useful for storing image data, for example.

Lastly, it is often useful to wrap one format into another. When interacting with a web API, the JSON objects you receive are likely to be sent in a compressed state, such as gzip. (Please review the following sidebar for things to consider before you adopt JSON.)
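
For example, here is a minimal sketch of wrapping JSON inside gzip with Python’s standard library (the filename is hypothetical):

import gzip
import json

record = {"country": "New Zealand", "name": "Tim McNamara"}

# Wrap: serialize to JSON, then compress the text with gzip.
with gzip.open("record.json.gz", "wt", encoding="utf-8") as f:
    json.dump(record, f)

# Unwrap the layers in reverse order: decompress, then parse the JSON.
with gzip.open("record.json.gz", "rt", encoding="utf-8") as f:
    restored = json.load(f)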

Delimiters

Delimiters separate fields and records within a file. For example, in a comma-separated file, the newline character separates records, and commas separate fields.

Picking a delimiter lives in a class of problems that sucks more energy than is possible to imagine. I know of a production system that uses four bars (||||) to delimit strings, to work around not being able to modify the database schema and a fear of modifying the underlying data to escape text.

Paying homage to http://xkcd.com/927/, the ASCII standard includes delimiter characters that are almost never used. ASCII 30 is the record separator and ASCII 31 is designated for delimiting fields. In principle, user-facing systems reading in this data could neatly align things together in a tabular form. In practice, there are no user-facing systems that I’m aware of that even support these delimiters, let alone do something nice for the user.

Probably the biggest reason for the lack of adoption is that it’s impossible to type the delimiters on a keyboard. CSV files look ugly and are misaligned, but we can all find the comma key.

You are best off splitting records on newlines and using either tabs or commas as the field delimiter. Escaping with double quotes and backslashes is very common and likely to be easy enough to implement in bespoke systems. Whatever you do, try to avoid delimiter creep. If someone recommends |||| because it is “unlikely to cause problems,” then please steer them towards the light of common standards.
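
Python’s csv module already implements these conventions; a small sketch:

import csv

rows = [
    ["name", "country"],
    ["Tim McNamara", "New Zealand"],
    ["Someone, Else", "Somewhere"],   # a field containing the delimiter
]

# Write comma-delimited records, one per line, quoting only when necessary.
with open("people.csv", "w", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Reading reverses the escaping, so embedded commas survive the round trip.
with open("people.csv", newline="") as f:
    print(list(csv.reader(f)))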

Finally, though, whitespace delimiters are fairly prone to error and are generally clunky. In an increasingly international world, whitespace itself won’t always work at all: many Asian languages do not use whitespace to delimit words. Perhaps it could be time to start thinking about actually implementing an old standard.

A Web Framework Backed by Files

As an example, consider a hypothetical content management system that uses files for its underlying datastore. While seemingly ludicrous, the concept may be quite suitable for small teams of technically able people. While we lose out on an attractive admin interface, we gain speed, security, and scalability in exchange.

The concept is quite simple. We create plain text documents like this:

Title: Music to my ears

Was excited to read that people didn't simply move to the next greatest
database and build a web framework to create a new website that no one
will read.

The computer then turns those documents into something that looks like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Music to my ears</title>
  </head>
  <body>
    <p>Was excited to read that people didn't simply move to the next
      greatest database and build a web framework to create a new website
      that no one will read.</p>
  </body>
</html>

Let’s discuss why this might be a useful thing to do, and then talk a little bit about how to actually implement it.

Motivation

Why would we want to do this? Here are a few answers:

Speed

Flat files imply that we will be serving static content. Static content is very easy and fast for web servers to serve to a client, in part because there’s no need to call a separate library, program, or remote system to generate content on-the-fly.

Static content is also very easy to cache. The file system provides a lot of metadata for its files, such as the time they were last modified. This can be provided to web clients, which will only download the page again if it has been modified since. The same rules apply for scaling out. If you are having difficulty handling requests, a content delivery network (CDN) will be very happy to serve your static files for you. Dynamic content is much more difficult to hand off to an external, third-party service.

Many web developers will mention that the database is the bottleneck. Let’s take the bottleneck away: we’ll do the processing up front, rather than hitting the database with reads every time the website has a new request.

Security

Because there is no way to touch the database from the web content, our website will be resistant to intrusion, defacement, or worse.

Many security vulnerabilities occur because the outside world can manipulate services that are running on the server. For example, SQL injection is a means to run arbitrary commands on the database via things like web forms. The static file architecture prevents this. (Granted, someone could hack into the server itself—either through another, unrelated service running on the machine, or by infiltrating the data center—but this is a vulnerability of any computer system.)

Familiar tools

People are accustomed to using text editors. They know how to save files to disk. This process enables users to retain the tools with which they are familiar, rather than having to fiddle with clunky WYSIWYG editors built in JavaScript.

Remote users may have a few difficulties at first. They will need to use a tool like FTP, scp, or rsync to push their local content to the server. That might be overwhelming for some. However, these commands are fairly trivial for anyone who is comfortable working on a command line. You could provide these people with any cloud-based service for syncing files between computers.

Easy backup

Backups and restores are now easy, because the backup system only needs to talk to the standard filesystem. Every backup tool can handle plain text files.

By comparison, backing up and restoring databases can be a pain. It typically requires that you use special, database-specific tools, and some of those create special formats.

Implementation

Given what we’ve outlined above, the implementation is fairly straightforward. We have a directory of plain text documents that store our raw data. That directory is monitored by inotifywait or an equivalent tool. When the contents of the directory change, inotifywait calls a processing script with the file that has changed as an argument. That script needs to create a new HTML document from the file’s contents whenever a file is created or modified, and remove the corresponding HTML document when a file is deleted.
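
Here is a minimal sketch of such a processing script in Python. It is not the reference implementation linked in the tip below; the directory names, the Title: convention from the earlier example, and the inotifywait hookup shown in the comment are assumptions about one possible setup.

#!/usr/bin/env python
# Rebuild or remove the HTML page that corresponds to a source document.
# Intended to be called by inotifywait, for example:
#   inotifywait -m -e close_write -e delete --format '%w%f' content/ |
#       while read path; do python publish.py "$path"; done
import html
import os
import sys

OUTPUT_DIR = "public"   # hypothetical directory served by the web server

def render(path):
    with open(path, encoding="utf-8") as f:
        header, _, body = f.read().partition("\n\n")
    title = header.split(":", 1)[1].strip()   # the "Title: ..." line
    return (
        "<!DOCTYPE html>\n<html>\n  <head>\n"
        "    <title>{}</title>\n  </head>\n  <body>\n"
        "    <p>{}</p>\n  </body>\n</html>\n"
    ).format(html.escape(title), html.escape(body.strip()))

def main(path):
    target = os.path.join(OUTPUT_DIR, os.path.basename(path) + ".html")
    if os.path.exists(path):          # the source was created or modified
        with open(target, "w", encoding="utf-8") as f:
            f.write(render(path))
    elif os.path.exists(target):      # the source was deleted
        os.remove(target)

if __name__ == "__main__":
    main(sys.argv[1])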

Tip

A working implementation of this system is available at http://github.com/timClicks/baddata/.

Reflections

In many regards, this chapter has been about conservatism. A negative experience with new and unfamiliar tools sent me back to what I know: files are approachable and everywhere.

It is very difficult to know if a particular technology will be suited for your use case. If you discover that it’s not, you can be too far down a particular track to escape and move to something else. The project can feel futile. You will develop workarounds for problems with the original stack, which may themselves need to be worked around at some later date. And so on.

Files are not a perfect solution, but they can be a good enough solution. Hopefully, if you start your data analysis project using files, you’ll buy yourself some time to evaluate several other tools. Sometimes waiting until you have collected the data you are interested in analyzing can be the best way to discover how to store it. Until then, you may be able to make do with files.

I hope that you have been able to extract lessons from my experience. My aim with this chapter has been to provide a different view on how to structure a data mining project. You have not found the universal truth here. Hopefully, what you have found is a series of useful tools that you can apply bits of in future projects. Thank you for reading. Kia kaha.
