Chapter 12. Dealing with Large Data

In the previous chapter, we spoke of solutions to common problems that fall under the umbrella term of messy data. In this chapter, we are going to solve some of the problems related to working with large datasets.

When working with large datasets, problems can occur in R for a few reasons. For one, R (and most other languages, for that matter) was developed during a time when commodity computers only had one processor/core. This means that vanilla R code can't exploit multiple processors or cores, which can offer substantial speed-ups. Another salient reason why R might run into trouble analyzing large datasets is that R requires the data objects it works with to be stored completely in RAM. If your dataset exceeds the capacity of your RAM, your analyses will slow down to a crawl.
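If you want a quick sense of how these two constraints apply to your own machine and data, base R can already tell you; the following is just an illustrative sketch, and example.df is a made-up object standing in for your own data:

# How many logical cores could R potentially use?
library(parallel)
detectCores()

# How much RAM does a data object actually occupy?
example.df <- data.frame(matrix(rnorm(1e6), ncol = 10))
print(object.size(example.df), units = "MB")

detectCores() reports the cores available for the parallelization techniques we'll see later, and object.size() gives the in-memory footprint that has to fit comfortably within your RAM.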

When one thinks of problems related to analyzing large datasets, they may think of Big Data. One can scarcely be involved (or even interested) in the field of data analysis without hearing about big data. I stay away from that term in this chapter for two reasons: (a) the problems and techniques in this chapter will still be applicable long after the buzzword begins to fade from public memory, and (b) problems related to truly big data are relatively uncommon, and often require specialized tools and know-how that are beyond the scope of this book.

Some have suggested that the definition of big data be data that is too big to fit in your computer's memory at one time. Personally, I call this large data—and not just because I have a penchant for splitting hairs! I reserve the term big data for data that is so massive that it requires many hundreds of computers and special consideration in order to be stored and processed.

Sometimes, problems related to high-dimensional data are considered large data problems, too. Unfortunately, solving these problems often requires a mathematical background beyond the scope of this book, and we will not be discussing high-dimensional statistics. This chapter is more about optimizing R code to squeeze higher performance out of it, so that calculations and analyses with large datasets become computationally tractable.

So, perhaps this chapter should more aptly be named High Performance R. But that title is more ostentatious, and wouldn't fit the naming pattern established by the previous chapter.

Each of the top-level sections in this chapter will discuss a specific technique for writing higher performing R code.

Wait to optimize

Prominent computer scientist and mathematician Donald Knuth famously stated:

Premature optimization is the root of all evil.

I, personally, hold that money is the root of all evil, but premature optimization is definitely up there!

Why is premature optimization so evil? Well, there are a few reasons. First, programmers are sometimes pretty bad at identifying the bottleneck of a program (the routine or routines with the slowest throughput) and end up optimizing the wrong parts. Bottlenecks are most accurately identified by profiling your code after it has been completed in an un-optimized form.
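To make that concrete, here is a minimal sketch of profiling with base R's Rprof() and system.time(); slow.analysis is a hypothetical stand-in for whatever un-optimized routine you've actually written:

# A deliberately naive routine standing in for your real code
slow.analysis <- function() {
  total <- 0
  for (i in 1:1e5) total <- total + log(i)
  total
}

# Rprof samples the call stack while the code runs...
Rprof("profile.out")
for (rep in 1:100) slow.analysis()
Rprof(NULL)                     # ...and Rprof(NULL) stops the profiler

# summaryRprof reports where the time actually went
summaryRprof("profile.out")

# A coarser alternative: just measure total elapsed time
system.time(slow.analysis())

The functions that dominate the summaryRprof() output are the ones worth optimizing; everything else is usually not worth the trouble.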

Secondly, clever tricks and shortcuts for speeding up code often introduce subtle bugs and unexpected behavior. The speedup of the code (if there is any!) must then be weighed against the time it took to complete the bug-finding-and-fixing expedition; occasionally, a net negative amount of time has been saved when all is said and done.

Lastly, since premature optimization necessitates writing your code in a way that is different than you normally would, it can have deleterious effects on the readability of the code and your ability to understand it when you look back on it after some period of time. According to Structure and Interpretation of Computer Programs, one of the most famous textbooks in computer science, "Programs must be written for people to read, and only incidentally for machines to execute." This reflects the fact that the bulk of the time spent updating or expanding existing code goes into a human reading and understanding it, not into the computer executing it. When you prematurely optimize, you may be trading a huge reduction in readability for a marginal gain in execution time.

In summary, you should probably wait to optimize your code until it is done and its performance is demonstrably inadequate.
