Chapter 8. Software: statistics in action

This chapter covers

  • The basics of some statistical software applications
  • Introductions to a few useful programming languages
  • Choosing the appropriate software to use or build
  • How to think about getting statistics into your software

Figure 8.1 shows where we are in the data science process: building statistical software. In the last chapter, I introduced statistics as one of the two core concepts of data science. Knowledge of software development and application is the other. If statistics is the framework for analyzing and drawing conclusions from data, then software is the tool that puts this framework into action. In few cases would a data scientist be able to go without software during a project, but I suppose it’s possible when the data set is very small.

Figure 8.1. An important aspect of the build phase of the data science process: statistical software and engineering

Beyond going without, a data scientist must make many software choices for any project. If you have a favorite program, that’s often a good choice, if for no other reason than your familiarity with it. But there can be good reasons to pick something else. Or if you’re new to data science or statistical software, it can be hard to find a place to start. Therefore, in this chapter I give a broad overview of different types of software that might be used in data science before providing some guidelines for choosing from among them for a project. As in chapter 7, I intend to provide only a high-level description of relevant concepts plus some examples.

Experienced software developers probably won’t like this chapter much, but if you’re a statistician or a beginner with software, I think these high-level descriptions are a good way to start. For more information on any specific topic, many in-depth references are available both on the internet and in print.

8.1. Spreadsheets and GUI-based applications

For anyone who has spent significant time using Microsoft Excel or another spreadsheet application, a spreadsheet is often the first choice for performing any sort of data analysis. Particularly if the data is in a tabular form, such as CSV, and there’s not too much of it, getting started with analysis in a spreadsheet can be easy. Furthermore, if the calculations you need to do aren’t complex, a spreadsheet might even be able to cover all the software needs for the project.

8.1.1. Spreadsheets

For the few who may be uninitiated, a spreadsheet is a piece of software that represents data in a row-and-column tabular format. It typically allows analysis of that data via sets of functions—such as average, median, sum—that can operate on the data and answer some questions. Microsoft Excel, OpenOffice and LibreOffice Calc, and Google Sheets are popular examples of spreadsheet applications.

Spreadsheets can be quite complex when they contain multiple sheets, cross-references, table lookups, and functions/formulas. The most sophisticated spreadsheet I ever made was for a college finance class on the topic of real estate. As part of the class, we participated in a simulation of a real estate market in which each student owned an apartment building, which required decisions regarding financing and insurance, among other things. During the simulation, there would be random events such as disasters that might cause damage to the building, as well as operating costs, vacancies, and interest rate fluctuation. The goal within the simulation was to have the highest rate of return on the initial cash paid for the apartment building, assuming we then sold the building after five years.

By far, the most important decision was the specific choice of mortgage used to finance the purchase of the building. We were given a choice of eight mortgage structures that differed in their term/duration, fixed or variable interest rates and their margins, points paid at purchase, and temporary teaser rates. There was a stated purchase price for the apartment building, but as in real life, we had a choice about how much money we’d like to borrow for the purchase. The amount we didn’t borrow via the mortgage was the cash outlay on which we’d be calculating the rate of return that measured our success.

With all the variables, random or chosen, included in the simulation, it was obvious that running through some calculations would be of great benefit when making decisions, particularly regarding financing. At the time, I had written programs only on my graphing calculator, and none of those analyzed data. Short of learning another tool for the purpose, using Microsoft Excel was my only choice. I have a mild dislike for Excel, as do many, but it got the job done.

For each of the 8 mortgage structures and 12 different possible amounts to borrow, I calculated cash flows for each of the 5 years of the simulation, and from these cash flows I also calculated an expected rate of return. I made use of several of Excel’s formulas, including the standard SUM formula for addition, the PMT formula for calculating a mortgage payment, and the IRR formula for calculating an internal rate of return for cash flows. At that point, I could see the rates of return for all possibilities under the expected conditions. As I mentioned, though, there would be random events during each year of the simulation, and so the expected outcomes almost certainly wouldn’t be the ones observed. To account for this, I added to the spreadsheet some values representing simulated disasters, interest rate fluctuations, and vacancies, among others. I then was able to enter some possible random outcomes and see how they affected the rates of return and the optimal choices.
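
As an aside for readers who later move such calculations into code (the subject of section 8.2), Excel’s IRR formula isn’t magic: it solves for the discount rate that makes the net present value of the cash flows zero. Here’s a minimal sketch of that idea in Python, using made-up cash flows rather than the ones from my spreadsheet:

def npv(rate, cashflows):
    # Net present value: each cash flow discounted back to today
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, low=-0.99, high=1.0, tol=1e-6):
    # Internal rate of return by bisection; assumes the NPV changes sign
    # between the low and high rates
    while high - low > tol:
        mid = (low + high) / 2
        if npv(low, cashflows) * npv(mid, cashflows) <= 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2

# A made-up investment: 100 paid out today, five yearly inflows afterward
print(irr([-100, 20, 25, 25, 30, 40]))   # roughly 0.11, an 11% rate of return

The SUM and PMT formulas have similarly direct equivalents once you’re writing code.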

Ultimately, my spreadsheet contained two sheets, one that did the heavy calculation and one that summarized results (shown in figure 8.2), random variables and their fluctuations, and the decisions that needed to be made. The sheet with the heavy calculations contained 96 different statements of cash flow, one for each of the eight mortgage types for 12 different amounts borrowed. Each statement of cash flow used the random variable values from the summary sheet to calculate five years of cash flows from several types of income and expenses based on that particular mortgage’s parameters. Each cash flow calculation resulted in a rate of return figure that the summary sheet then referenced. Looking at all 96 rates of return—using Excel’s conditional formatting option to highlight the highest among them—and how each of them changed if the random variables turned out differently, I then felt confident in choosing a mortgage that would result in not only one of the highest expected rates of return but that also wouldn’t be too risky if a disaster hit.

Figure 8.2. First page of the spreadsheet I used to simulate the management of an apartment building in my college finance class

I don’t think my classmates spent as much time analyzing choices as I did. I don’t believe any of them created a spreadsheet, either. I ended up winning the contest by more than a full percentage point, which the financially inclined among you will recognize as quite a large margin of victory. If I recall correctly, I earned over 9% on my initial investment, when everyone else earned less than 8%—a figure that would have amounted to several thousand dollars had the money been real. I did win real money also from the contest: $200 cash as the first prize, which, though not the thousands of dollars I earned on the fictional apartment building, was no small sum for me as a college student.

I probably learned more about spreadsheets during that single finance project than in the rest of my whole life. One advantage of spreadsheets, particularly Excel, became obvious: the number of built-in formulas is astronomical. For nearly any static calculation that you might care to apply to your data in a spreadsheet, there is a formula or a combination thereof that can accomplish it for you. Here, I use the word static to refer to calculations that happen in one step, without iteration or complex interdependency between values.

One major disadvantage of spreadsheets is that even moderately complex formulas might look something like this:

=-((-PMT((F15)/12,$B20*12,F$14,1))*12-(F$14-((1-((1+(F15)/12)^12-1)/((1+(F15)/12)^($B20*12)-1))*F$14)))

Or worse. I took that one directly from my real estate simulation spreadsheet, though it’s not immediately clear exactly how—even to me—this formula was once intended to calculate the interest expense for one year of the cash flow statement for one of the mortgages. Needless to say, this formula is difficult to read and understand—and it’s nowhere near the worst that I’ve seen or written—mostly because the entire formula is on a single line, and all the variables are referenced by unhelpful combinations of letters and numbers. Don’t get me started on the parentheses. Readability of such calculations can be a serious concern. Frankly, I’m not surprised that Excel formula errors have been implicated in many missteps by big banking organizations, such as the 2013 London Whale case at JP Morgan, in which, apparently, an Excel formula error caused dramatic underestimation of the risk of a particular investment portfolio, ultimately resulting in a $6 billion loss. That’s an expensive mistake and a compelling reason to use a statistical tool with readable calculation instructions.
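
For contrast, here is roughly the same calculation written as code with named variables and comments, in Python (one of the languages covered in section 8.2). This is a sketch of standard amortization arithmetic, not a line-for-line translation of the formula above, but it makes the point about readability:

def monthly_payment(annual_rate, years, principal):
    # Standard fixed-rate mortgage payment, assuming monthly compounding
    monthly_rate = annual_rate / 12
    n_payments = years * 12
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)

def interest_expense_for_year(annual_rate, years, principal, year):
    # Interest paid during one year, found by amortizing month by month
    monthly_rate = annual_rate / 12
    payment = monthly_payment(annual_rate, years, principal)
    balance = principal
    interest_paid = 0.0
    for month in range(1, years * 12 + 1):
        month_interest = balance * monthly_rate
        balance -= payment - month_interest
        if (month - 1) // 12 + 1 == year:
            interest_paid += month_interest
    return interest_paid

print(interest_expense_for_year(0.06, 30, 250000, 1))   # year-1 interest on a made-up mortgage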

Spreadsheets aren’t limited to rows, columns, and formulas. One other feature of note is the solver in Excel, a tool that can help you find the optimal solution for a complex equation or optimize parameters to meet some objective goal. To use it, you must tell the solver which values are allowed to change, and you also have to tell it which value is the objective that is to be maximized or minimized; the solver then uses optimization techniques to find the best solution. I used the solver, for example, during another college project in which I was trying to maximize the number of people a set of elevators could transport to their desired floors on a busy day. The optimal solution that I found involved limiting the floors that each elevator would visit; for example, one elevator would service floors 1–10 and another 10–15. By letting Excel’s solver change the floors that each elevator might visit, it found a solution that maximized the total number of people transported.
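
Optimization like this isn’t unique to Excel; scientific libraries offer the same kind of machinery. As a rough illustration (a toy pricing problem I made up, not the elevator model), here’s how a bounded maximization might look in Python with the scipy package, both covered later in this chapter. Maximizing is done by minimizing the negative of the objective:

from scipy.optimize import minimize_scalar

def negative_revenue(price):
    # Toy model: demand falls as price rises; revenue = price * demand.
    # We minimize the negative because the solver minimizes by default.
    demand = 100 - 2 * price
    return -(price * demand)

result = minimize_scalar(negative_revenue, bounds=(0, 50), method="bounded")
print(result.x)   # the revenue-maximizing price, 25 for this toy model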

Another significant feature of most spreadsheet applications is the macro. Macros, in applications like Excel, are mini-programs that users can write themselves. In Excel, macros are written in Visual Basic for Applications (VBA) and can generally accomplish anything that can be accomplished in Excel. Instead of pointing and clicking to do something, you could theoretically write a macro that does the same thing. This can be helpful if you typically do that thing many times. Running one macro can be much faster and easier than doing a set of steps manually, depending on the complexity of those steps. I created a macro for my real estate spreadsheet, one that copied changes that I had made to the first cash flow statement and pasted them into all of the 95 other cash flow statements. Notably, I didn’t write the macro in VBA, but instead Excel was able to record me performing the manual steps of copying and pasting and then convert that recording into a macro I could use. This is a handy feature of Excel if you do a specific thing often, you want to create a macro for it, and you don’t want to write code in VBA.

8.1.2. Other GUI-based statistical applications

I consider spreadsheet applications like Excel to be level 1 software for data analysis; even beginners aren’t truly afraid to open a spreadsheet and poke around a little bit. Level 10 is writing your own flawless software from scratch; it’s not for the faint of heart. The levels in between are populated by any number of software applications that possess varying amounts of ease of use, like spreadsheets, and versatility, like programming. If you’re finding your favorite spreadsheet application lacking in sophistication, but you’re not ready to jump head first into a programming language, then some in-between solutions might work for you.

I was at a big statistics conference recently and was surprised by the number of vendors offering mid-level statistical applications. I saw booths for SPSS, Stata, SAS (and its JMP product), and Minitab, among others. Each of these applications is popular in some circles, and I’d used three out of those four at some point in my life. I decided to play a bit of a game by visiting a couple of the booths when they weren’t busy and asking each company’s representatives to explain to me why their software product is better than the others. It was an honest question, more than anything, but I also thought it funny to ask the representatives, mostly statisticians themselves, so directly to disparage the competition. One rep was kind enough to admit straight away that the companies were all producing roughly equivalent products. That is not to say that they are equal, but only that none is vastly or definitely better than the others. Preference for one or the other is a matter of specific use and personal taste.

What unites the various companies’ core products is a similar experience in doing statistical analysis. It seemed to me that the graphical user interface (GUI) of each application was based on that of a spreadsheet. If you take a typical spreadsheet application and divide the screen into panels, with one panel showing the data, one displaying some graph or visualization of the data, and one describing the results of a regression or component analysis, that’s the basic experience of these mid-level statistical applications. I know I sound dismissive in my description, so please don’t infer that I think these applications aren’t worthwhile—quite the contrary. In fact, everything I know about them tells me that they’re incredibly useful. But software companies, as companies will, sometimes use hyperbole to make you think that their product is head and shoulders above the competitors’ and that it will solve your data problems in a jiffy. This is almost never true, though I’ll admit we’re doing much better in the current decade than in decades past. One thing does remain true: to use a statistical application successfully, you have to understand statistics, at least a little. None of these applications will teach you statistics, so approach them with a cool head and a discerning eye. Often the best choice, if you want to choose one, is the one that your friends use, because they can help you if you run into questions. Following the crowd here can be a good thing.

These mid-level statistical applications do much more than look like a spreadsheet with panels. Their facilities for performing many different statistical analyses are generally far greater than Excel or any other spreadsheet program. It’s my impression that these software tools were built specifically to enable point-and-click data analysis with a level of sophistication that spreadsheets can’t approach. If you want to use a statistical technique that you can’t find in Excel’s menus, then you may want to level up to one of the aforementioned applications. Regression, optimization, component analysis, factor analysis, ANOVA (analysis of variance), and any number of other statistical methods are often done far better in a mid-level tool than in a spreadsheet, if it’s even possible in the latter.

Beyond point-and-click statistical methods, these mid-level tools offer far greater versatility through their associated programming languages. Each of the proprietary tools I’ve mentioned has its own language for performing statistical analysis, which can be superior to point and click in both versatility and repeatability. With a programming language, you can do more than with clicking, and you can also save the code for future use, so you know exactly what you’ve done and can repeat it or modify it as necessary. A file containing a sequence of commands that can be run is generally called a script, a common concept in programming, though not every language supports this style of scripting. Scripting languages are particularly useful in data science because data science is a process consisting of steps: actions performed on the data and models.

Learning the programming language of one of these mid-level tools can be a good step toward learning a real programming language, if that’s a goal of yours. These languages can be quite useful on their own. SAS, in particular, has a wide following in statistical industries, and learning its language is a reasonable goal unto itself.

8.1.3. Data science for the masses

Recent years have seen much ado about bringing data science to the analyst, the person who can directly use the intelligence gleaned from the data science itself. I’m wholly in favor of enabling everyone to gain insight from data, but I’m still not convinced that folks not trained in statistics should be applying statistical methods on their own. Perhaps it’s the bias of a statistician’s ego, but in the face of uncertainty—as always in data science—recognizing that something in the analysis is going wrong takes training and experience.

With that said, in this age of information, data, and software startups, it seems that someone is constantly claiming they make data science and analysis easy for everyone, even for beginners, and so you probably don’t need to hire a data scientist. To me, this is akin to saying that a web compendium of medical information can take the place of all medical professionals. Sure, in many cases, the web compendium or the new data science product can give equal or even superior results to those of the professionals, but only the professionals have the knowledge and experience to check the necessary conditions and recognize when a particular strategy goes wrong. I’ve emphasized in this book that awareness of possibilities in the face of uncertainty is a key aspect of data science, and I don’t think this will ever change. When the stakes matter, experience is incredibly important.

It’s tempting to list a few of the new additions to the buffet of statistical applications in order to illustrate what I mean. But I’m not confident that most of them will be around in a few years, and besides that, the names don’t matter. In the past decade, data science has become big business and a highly sought-after skill. Whenever a skill makes news and money is spent acquiring it, there are always companies that claim to make it easy, with varying degrees of success. I challenge you, as a data scientist, to challenge newcomers to the software product industry whose claims trivialize our work. I write this while trying to set aside my own ego: can they really accomplish with software alone what we can accomplish, maybe, only with careful consideration of the needs and wants of the project at hand, plus software and statistics?

I digress, but I do want to share my (healthy, I think) skepticism for the analytic software industry. If anyone claims to have a magic pill for a seriously challenging task, please be skeptical. There are many good statistical software applications out there; it’s extremely unlikely that a new one can gain a huge advantage, but hey, I’ve been proven wrong before. New products are almost always variations on an old theme or purpose-built tools that are very good at one thing. Both can be useful, but figuring out the true range of applicability of something new can take some effort.

8.2. Programming

When an out-of-the-box software tool can’t—or doesn’t—cut it, you need to make your own. I mentioned in the previous section that the most popular mid-level statistical applications possess their own programming languages that can be used to extend functionality arbitrarily. With the possible exception of SAS and its tool set, none of these programming languages is usually considered as a language independent of its parent application—the language exists because of and in tandem with the application.

There also exists a converse to this—otherwise standalone programming languages for which a GUI-based statistical application has been built. Both IPython for Python and RStudio for R are popular examples. The market of statistical applications can get a little confusing; I often see internet forum posts asking whether, for example, SPSS or R is better to learn. I can see how, maybe, the RStudio GUI and the SPSS GUI can seem roughly equivalent, but the real functionality of those two tools is vastly different. The choice between the two—or any mid-level statistical tool and a programming language—boils down to your desire to learn and use a programming language. If you don’t want to write any code, don’t choose any tool based on R or any other programming language.

Programming languages are far more versatile than mid-level statistical applications. Code in any popular language has the potential to do most anything. These languages can execute any number of instructions on any machine, can interact with other software services via APIs, and can be included in scripts and other pieces of software. A language that’s tied to its parent application is severely limited in these capacities.

If you’re intimidated by code, programming isn’t as difficult as it might seem. Other than some short programs on my graphing calculator in high school, I didn’t write any code until a summer internship during college, and I didn’t begin in earnest until grad school. I sometimes wish I had begun earlier, but the fact remains that it isn’t hard to pick up some programming skills if you’re diligent.

8.2.1. Getting started with programming

I wrote my first programs on a graphing calculator during high school, but they weren’t sophisticated. The only programming course I took in college was called Object-Oriented Programming; I did fine in the course, but for some reason I didn’t program much at all until a summer internship at the Department of Defense, where I used MATLAB to analyze some image data. Between then and graduate school, during which I used R and MATLAB in bioinformatics applications, I slowly learned about these languages and programming in general. It’s not as if there was one point in my life when I decided to learn how to program, but rather I learned various aspects of programming as I needed them for various projects.

I don’t necessarily recommend this method of learning. In fact, there were plenty of times in which my lack of formal training led me to reinvent the wheel, as they say. It certainly helps to know what languages, tools, conventions, and resources are available for you to use before you begin. But there’s so much available that it can be hard to know where to start if you know little about software engineering in general.

It was only after graduate school, when I started working at a software company, that I gained valuable experience with Java, object-oriented programming, several different types of databases, REST APIs, and various useful coding conventions and good practices that most software developers probably already know and that I’ve found helpful. And all of it was real-life data science in the wild, so to speak, so it was doubly helpful for me to learn. For those who were like me a few years ago and don’t have much pure software development experience, I’ll share the things that I wish I had known as I was getting started. Hopefully, this section will help beginning programmers understand how certain aspects of programming relate to others and feel more comfortable using them, discussing them, and searching for more information about them.

Scripting

A program can be as simple as a list of commands to be executed in order. Such a list of commands is usually called a script, and the act of writing one is called scripting. In MATLAB (or the open-source clone GNU Octave), a script can be as simple as this:

filename = "timeseries.tsv";
dataMatrix = dlmread(filename);
dataMatrix(2,3)

The lines of this script set a variable named filename to the value "timeseries.tsv", load the file whose name is contained in filename into the variable named dataMatrix, and then print on the screen the value contained in the second row and third column of dataMatrix. After installing MATLAB or Octave, there’s very little preventing a beginning programmer from writing a simple script like this. The file timeseries.tsv needs to be in a particular format—in this case, tab-separated values (TSV)—in order for the built-in function dlmread to be able to read it properly. And you’d have to know about dlmread and a little about the syntax of the language, but it’s easy to find good examples online and elsewhere.

My main point in showing this extremely simple example is that it’s not that hard to get started with programming. If you have a spreadsheet, you can export a sheet of numeric values to TSV or CSV format and load the data as I’ve shown here, and you can immediately interact with the data via such a script, adding or removing commands to accomplish what you want.

An important thing to note is that the commands of such a script can be run in both an interactive language shell as well as an operating system’s shell. That means you have a choice of writing a script and then telling your OS (how exactly depends on your OS) to run the whole script, or opening the interactive language shell and entering the commands directly within that shell. The OS method of running the script is generally more portable, and the interactive shell has the advantage of allowing line-by-line execution interspersed with other commands, edits, and checks that you might find helpful. For example, let’s say you forgot what’s in the file timeseries.tsv. In the interactive shell (to get to the interactive shell, open MATLAB, and the shell prompt appears immediately on the screen) you could run the first two lines of the previous script to load the file and then you could type dataMatrix at the prompt and press Enter to display the contents of that variable, which were loaded from timeseries.tsv. For a small file, this can be handy. An alternative is to open the file in Excel and look at the values there, which for some people is at least as appealing. But let’s say you wanted to see the two-millionth line of a table contained in a CSV. Excel would probably have serious trouble loading the file; as of this writing, no version of Excel can handle files with two million lines. But MATLAB would have no problem with it. To inspect the two-millionth line, load the file as shown earlier, and enter dataMatrix(2000000,:) on the interactive shell prompt.
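
For comparison, the same kind of inspection is just as easy in Python (introduced later in this chapter) with the numpy package. This is my own rough equivalent of the MATLAB commands above, keeping in mind that Python numbers rows from 0:

import numpy as np

# Load the tab-separated file into a numeric array
dataMatrix = np.loadtxt("timeseries.tsv", delimiter="\t")
print(dataMatrix[1999999, :])   # the two-millionth row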

I spent most of my early programming years on the interactive shells of MATLAB and R. My typical workflow would be this:

  1. Load some data into a variable in the interactive environment.
  2. Play around with the data, inspecting and calculating some results.
  3. Copy the useful commands from step 2 into a file (script) for reuse later.

As my commands became more complex, the script file became longer and more complex. Most days, as I was analyzing data, and particularly in an exploratory phase, I would end up with a script containing exactly the commands that could return me to the state I was in so that I could continue my work on a later date. The script would load into the interactive environment whatever data sets I needed, make data transformations or calculations, and generate graphs and results.

For years this is the way I wrote programs: scripts that were created almost as a side effect of poking and prodding at data in an interactive shell environment. This haphazard way of writing software isn’t ideal—I’ll come back to some good coding conventions later in this section. But I do still encourage beginners to use it, at least at first, because it’s so easy to get started this way, and you can learn a lot about a language by trying various commands from an interactive shell.

I still use scripts quite often, but they have their limits and disadvantages. If you find yourself not being able to read and comprehend your own scripts because they’re too long or complex, it may be time to use other styles or conventions. Likewise, if you’re copying and pasting a section of a script into another script, you should probably look for an alternative—what happens when you change the copied section in one script but forget to make the same change in the other scripts? If a script or set of scripts seems complex or difficult to manage, it’s probably time to consider using functions or objects in your code; I’ll discuss those later in this section.

Switching from spreadsheets to scripts

The most common functionality of Excel and other spreadsheets can be replicated in scripting languages by using built-in commands such as sum and sort (versions of these are in every language) as well as using logical constructs involving if and else to check data for certain conditions. Beyond these, each language has thousands of functions (just like in Excel) that you can use in your commands. It’s a matter of figuring out which function does what you want and how it works.
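
To make that concrete, here are a few spreadsheet staples and rough scripting equivalents, sketched in Python with a short made-up list of values:

values = [4.0, 15.0, 8.5, 23.0]

total = sum(values)        # like Excel's SUM
ordered = sorted(values)   # like sorting a column

# A simple conditional check on each value, like Excel's IF formula
flags = ["high" if v > 10 else "ok" for v in values]
print(total, ordered, flags)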

There’s one basic type of command that programming languages handle well and spreadsheets don’t: iteration. Iteration is repetition of a set of steps, possibly using different values each time. Let’s say that timeseries.tsv is a table of numeric values, where each row represents the purchase of some items from a store. The first column gives the item number, ranging from 1 to some number n, which uniquely identifies which item was purchased. The second column gives the quantity of the item purchased, and the third column gives the total amount paid. The file might look like this:

3    1    8.00
12   3    15.00
7    2    12.50

But it’s probably longer if the store is successful.

Let’s assume you want the total quantity sold of item 12, and the file has thousands of rows. In Excel, you could sort by item number, scroll to the rows for item 12, and then use the SUM formula to add up the quantities. That might be quickest, but it’s a decent amount of manual work (sorting, scrolling, and summing), particularly if you expect to do this with other items as well, after item 12. Another option in Excel is to create a new fourth column and use the IF formula in each cell of that column to test the first column for being equal to 12, giving 1 or 0 as a result (for true or false), multiplying the second and fourth columns row by row, and then adding up all the row products. This is also a fairly fast option, but if you want to see the sums for other items as well, you have to be clever about what formula you use and make sure you put the item number in a cell by itself—and refer to it in each of column 4’s formulas—so that you can change the item number only in one place in order to get its total quantity.

Either of those Excel solutions can work if you want the quantity for only one or a few items, but what if you wanted total quantities for all item numbers? The sorting strategy would get quite tedious, as would the creating of extra columns, regardless of whether you created one extra column and then changed the item number or you created a new column for each item number. There’s probably a better solution in Excel involving conditional lookups or searching for values, but I can’t come up with it offhand. If you know Excel inside and out, you probably know much better than I do how to solve this problem. The point I’m trying to make is that this type of problem is easy to do in most programming languages. In MATLAB, for example, after reading the file into the variable dataMatrix as shown earlier, you need only write the following:

[ nrows, ncolumns ] = size(dataMatrix);
totalQuantities = zeros(1,1000);
for i = 1:nrows
  totalQuantities( dataMatrix(i,1) ) = totalQuantities( dataMatrix(i,1) ) + dataMatrix(i,2);
end

This code first finds the number of rows, nrows, in dataMatrix and then initializes a vector totalQuantities of 1000 values, starting at zero, to which the quantities will be added. The kth entry in totalQuantities, accessed by the command totalQuantities(k), will give the total quantity sold of item k. If item numbers are above 1000, you’d need to make this vector longer. The for loop, which is the common name for the type of code structure between for and end, iterates through the rows of dataMatrix (in each iteration, the current row is given by the variable i), adding the value contained in row i, column 2 of dataMatrix to its proper place in totalQuantities, which is given by the value in row i, column 1 of dataMatrix.

To some people who haven’t written code before, this example may seem more complicated than using Excel or some other tool they know. But I think it should be clear that almost anyone can get started with a scripting language and interactive shell and be producing results with only a bit of knowledge.

Functions

I mentioned previously that scripts can sometimes grow long, complicated, and difficult to manage. You might also end up with multiple scripts that have some common steps or commands—if, for example, different scripts need data loaded in the same way, or different scripts load different data but process the data sets in the same way. If you find yourself copying and pasting code from one file to another and you intend to continue using both, it’s time to think about creating a function.

Let’s say you need to use the previous data file, timeseries.tsv, in multiple scripts. You can create a function in MATLAB by creating a file called loadData.m with the following contents:

function [data] = loadData()
  filename = "timeseries.tsv";
  data = dlmread(filename);
end

And then you can include the following line in a script

dataMatrix = loadData()

in order to load the data from timeseries.tsv into the variable dataMatrix as before. In my scripts, I’ve merely replaced two lines with one, but if you change the filename or its location, among other things, now you’ll have to do that only in the function and not in various scripts. This also becomes more useful when you have more than two lines of code that you’re sharing between scripts.

Functions, by design and like mathematical functions, can take inputs and give outputs. In the previous case, there are no inputs, but there is an output, the variable data; it’s specified as the output on the first line of the function file. Within a script, functions also don’t change any variables except the one to which the function output is assigned—in this case, dataMatrix. In that way, using functions is a good way to isolate code for the purpose of reuse between scripts.

Functions are also good for understanding what code does. Within your script, it’s easy to read and understand the function call loadData() when most of the time you probably don’t care how exactly the data is loaded, as long as it’s working correctly. If loading data consisted of many steps, pushing those steps into a function would greatly improve the readability of that part of the script. If you have many chunks in your script that work together for a single purpose, like loading data, then you might want to push each of them into its own function and turn your script into a sequence of well-named function calls. Doing so is usually much easier on the eyes.
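
As a tiny illustration in Python, with hypothetical function names of my own, a script reorganized this way reads almost like a plain-language description of what it does:

def load_data(filename):
    # Read a tab-separated file into a list of rows
    with open(filename) as f:
        return [line.rstrip("\n").split("\t") for line in f]

def count_purchases(rows):
    return len(rows)

def print_summary(n_purchases):
    print("purchases in file:", n_purchases)

rows = load_data("timeseries.tsv")
print_summary(count_purchases(rows))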

For your reference, functional programming is a somewhat popular programming paradigm—in contrast to object-oriented programming—in which functions are first-class citizens; functional programs emphasize the creation and manipulation of functions, even to the point of having anonymous functions and treating functions as variables. The main thing to know about functional programming is that, under the strict paradigm, functions have no side effects. If you call a function, there are inputs and there are outputs, and no other variables from the calling environment are affected. It’s as if the inner workings of a function occur somewhere else, in a completely separate environment, until the function returns its output.
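
A quick toy example in Python shows the distinction:

def scaled(values, factor):
    # A pure function: the result depends only on the inputs,
    # and nothing outside the function is touched
    return [v * factor for v in values]

def scale_in_place(values, factor):
    # Not pure: it modifies the list that was passed in (a side effect)
    for i in range(len(values)):
        values[i] *= factor

prices = [1.0, 2.0, 3.0]
doubled = scaled(prices, 2)   # prices is unchanged
scale_in_place(prices, 2)     # prices itself is now [2.0, 4.0, 6.0]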

Whether or not you care about the specifics of the functional paradigm and its theoretical implications, functions themselves are useful for encapsulating generally cohesive blocks of code.

Object-oriented programming

Functions and functional programming are concerned with actions, whereas object-oriented programming is concerned with things, though those things can also perform actions. In object-oriented programming, objects are entities that may contain data as well as their own function-like instructions called methods. An object usually has a cohesive purpose, much like a function would, but it behaves differently.

For example, you could create an object class called DataLoader and use it to load your data. I haven’t used much object-oriented functionality in MATLAB, so I’m going to switch to Python here; more on Python later in this chapter. In Python, the class file named dataLoader.py might look like this:

import csv

class DataLoader:

  def __init__(self,filename):
    self.filename = filename

  def load(self):
    self.data = []
    with open(self.filename) as file:
      for row in csv.reader(file,delimiter='\t'):
        self.data.append(row)

  def getData(self):
    return self.data

The first line imports the csv package that’s used to load the data from the file. After that, you see the word class and the name of the class, DataLoader. Inside the class, indented, are three method definitions.

The first input for each of the methods is the variable self, which represents this instance of the DataLoader object itself. The methods need to be able to refer to the object itself because methods are able to set and change the object’s attributes, or state, which is an important aspect of object-oriented programming, as I’ll illustrate.

The first method defined, __init__, is the one that’s called when the object is created, or instantiated. Any object attributes that are essential to the existence and functionality of the object should be set here. In the case of DataLoader, the __init__ method takes a parameter called filename, which is the filename from which the object will load the data. When a DataLoader object is instantiated in a script via a command such as

dl = DataLoader('timeseries.tsv')

the single input parameter is passed to the __init__ method for use by the object. The definition of __init__ shows that __init__ assigns this input parameter to the object attribute self.filename. Object attributes are set and accessed this way in Python. After instantiating the object via the __init__ method, the object, which is now called dl in the script (or interactive shell environment), has its self.filename attribute set to timeseries.tsv, but the object contains no data.

To load data from the named file into the object, you must use the load method via the command

dl.load()

which, according to the method definition, creates an attribute for the data as an empty list, self.data, opens the file named by self.filename, and then loads the data line by line into self.data. After this method is executed, all data from the file has now been loaded into the self.data attribute of the DataLoader object called dl.

If you want to then access and work with the data, you can use the third method defined, getData, which uses a return statement to pass the data as output to the calling script, much like a function.

Given the object class definition, a script that loads and gets the data for use might look like the following:

from dataLoader import DataLoader

dl = DataLoader('timeseries.tsv')
dl.load()
dataList = dl.getData()

Even though this accomplishes the same thing as my original script in MATLAB or one that uses functions, the way in which things are accomplished in this script, which uses an object, is fundamentally different. The first thing that happens after importing the class definition is the creation of a DataLoader object called dl. Inside dl is a separate environment where the object’s attributes are preserved, conceptually out of sight of the main script. Functions don’t have attributes and are therefore stateless, whereas an object’s attributes—its state—can be used during method calls to affect the results or output of the method. Objects and their methods can have side effects.

Side effects and preserving state attributes can have big advantages, but they can also be dangerous if you’re not careful. Functions, strictly speaking, should never change the value of an input variable, but objects and their methods can do so. On some occasions, I’ve constructed objects with methods that manipulated their input variables in some way that was convenient for calculating the method’s output. Several times in my life, before I learned to be more careful, I didn’t realize that in manipulating the values of the input within a method I was also affecting their values outside the method and object, in the calling script or environment. Mathematicians and other functionally minded people don’t always consider that type of unintended side effect at first.
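
Here’s a small example of the kind of surprise I mean, using a toy Python class of my own (not the DataLoader from earlier):

class QuantityTallier:
    def tally(self, purchases):
        # Sorting here changes the caller's list too, a side effect
        # that a mathematically minded user might not expect
        purchases.sort()
        return sum(quantity for item, quantity in purchases)

purchases = [(12, 3), (3, 1), (7, 2)]
total = QuantityTallier().tally(purchases)
print(total)       # 6
print(purchases)   # now sorted: [(3, 1), (7, 2), (12, 3)]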

One big advantage of using an object is that objects offer a good way to gather a bunch of closely related data and functions into a single self-contained entity. In the case of loading data, the main script probably doesn’t care where the data is coming from or how it’s being loaded. In this scenario, in which certain groups of variables or values never interact directly with one another, according to the object-oriented paradigm they should probably be sequestered from each other in some way, and containing them within their own objects can be a good option to accomplish this. Likewise, any functions that deal almost exclusively with such a group of variables should probably be converted into object methods in the object that contains those variables.

The main advantages of separating program data, attributes, functions, and methods into purposefully cohesive objects are primarily in readability, extensibility, and maintenance. As with well-written functions, well-constructed objects allow code to be more easily understood, more easily expanded or modified, and more easily debugged.

Between functional and object-oriented programming, neither is absolutely better than the other. Each has advantages and disadvantages, and you can often borrow from both paradigms to write code that works best for your purposes. It’s important, however, to be careful with the states of your objects, because if you’re mixing functions and objects, it might be easy to treat an object method like a function when it has side effects.

One thing I’d like to note: throughout this section, I talk about a calling script or main script as if there is always one master script that makes use of a set of functions and objects. This is not always the case. I wrote that way for clarity for beginners, but in practice functions can call functions and use objects, objects can call functions and use objects, and there might not even be a script at all, which is the topic of the next section.

Applications

In this context, I use the term application in contrast with script. A script, as I’ve discussed, is a sequence of commands to be executed. I use the word application to mean something that, when started or opened, becomes ready for use and to perform some task or action for the user. In that way, an application is conceptually similar to an object in programming, because an object can be created and then used in various ways by whatever created it. A script, on the other hand, is more similar to a function, because it merely completes a sequence of actions in a straightforward fashion.

Spreadsheets are applications, as are websites and apps on mobile devices. They’re all applications in the sense I described earlier, because when they’re started, they initialize but might not do much until the user interacts with them. It’s primarily in this interaction with the user that applications are useful, unlike scripts, which generally produce a tangible result that provides some usefulness.

It would be hard to provide enough useful information here to get you started into application development. It’s somewhat more complicated than scripting, so I’ll point out a few concepts that might be useful to a data scientist. For more information, you’ll find plenty of references and examples on the internet.

The main reason application development has become useful for data scientists is that sometimes delivering a static report to your customer isn’t enough. I’ve often delivered to a customer reports that I feel are quite thorough, only to be asked for more detail on certain points. Sure, I could provide the customer with more detail in specific areas, maybe in an additional report, but only after going back to the data myself and extracting those details. I might have been better off delivering a piece of software that allowed my customer to delve deeper into each aspect of the report that they found interesting. I could have accomplished this perhaps by setting up a database and a web application that allowed the customer to click around through nice representations of the results and data in a standard web browser. This is precisely how most analytic software companies are delivering their product these days. All the analytics are behind the scenes, whereas a web application delivers the results in a way that’s friendlier and more useful than a report.

Data scientists also create applications that consume and analyze data in real time or otherwise interactively. Google, Twitter, Facebook, and many other websites use fairly complex analytic methods that are continually delivered to the web via applications. Trending topics, top stories, and search results are all the product of data-science-heavy applications.

But, as I said, creating an application isn’t as straightforward as writing a script, so I’ll save the space, skip the detail here, and suggest that you consult references written by someone far better informed than I am. But if you’re interested and you already have some experience with a programming language, have a look at these frameworks for building web applications in data-friendly languages:

  • Flask for Python (a minimal sketch follows this list)
  • Shiny for R
  • Node.js for JavaScript, plus D3.js for awesome data-driven graphics
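
To give a flavor of the first of these, here’s a minimal sketch of a Flask application with one made-up endpoint that returns a result as JSON. In a real project the numbers would come from your models or a database rather than a hardcoded dictionary:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/item/<int:item_id>/total")
def item_total(item_id):
    # Placeholder results standing in for real analytic output
    fake_totals = {3: 1, 7: 2, 12: 3}
    return jsonify(item=item_id, total_quantity=fake_totals.get(item_id, 0))

if __name__ == "__main__":
    app.run(port=5000)

A customer could then open a URL like http://localhost:5000/item/12/total in a browser instead of waiting for a new report.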

8.2.2. Languages

I’ll now introduce three scripting languages that I’ve used for data science and related programming tasks, to compare and contrast them, not to describe them in great detail. The languages are GNU Octave (an open-source clone of MATLAB), R, and Python. For these languages, I give example code, in each case using the item sales quantity example from section 8.2.1, and a timeseries.tsv data file in the same format I described. I define one function for loading the data, and then the scripts tally the quantities sold for all items (up to 1000 of them) and print to the screen the total quantity sold of item 12.

After discussing the three scripting languages, I talk a little about a language that isn’t a scripting language (so an example script isn’t possible) but that is important enough to software development in general and data science as well that I feel I should mention it. This language is Java.

MATLAB and Octave

MATLAB is a proprietary software environment and programming language that’s good at working with matrices. MATLAB costs quite a bit—over $2000 for a single software license as of this writing—but there are significant discounts for students and other university-affiliated people. Some folks decided to replicate it in an open-source project called Octave. As Octave has matured, it has become closer and closer to MATLAB in available functionality and capability. Excepting code that uses add-on packages (a.k.a. toolboxes), the vast majority of code written in MATLAB will work in Octave and vice versa, which is nice if you find yourself with some MATLAB code but no license. In my experience, you’ll most likely need to change some function calls for compatibility but not necessarily many. I once got several hundred lines of my own MATLAB code running in Octave after finding approximately 10 lines of code containing incompatibilities.

The near-total mutual compatibility, when not using add-on packages, is great, but Octave also falls short of MATLAB in performance. From what I can gather from the internet, MATLAB can be two or three times faster than Octave for numerical operations, which sounds about right from my own experience, though I haven’t done a direct comparison. Because both MATLAB and Octave are designed for matrix and vector operations, you have to write vectorized code if you want to take full advantage of the languages’ efficiencies. Apparently, if you have code that isn’t vectorized but should be, such as using for loops to multiply matrices, MATLAB is better at recognizing what’s happening and compiling the code so that it’s almost as fast as if it was explicitly vectorized. Octave might not be able to do this yet. In any case, making use of the efficiencies of vectorized code when working with vectors and matrices will always make your code faster, sometimes dramatically. This is true in MATLAB, Octave, R, and Python, among others.

Writing vectorized code in MATLAB and Octave is extremely easy for anyone familiar with matrix operations, because the code looks exactly like the equivalent mathematical expression. This isn’t true for the other languages I discuss here. For example, if you have two matrices, A and B, of compatible dimensions, multiplying them in the standard sense (not entry-wise) can be accomplished by writing

A * B

whereas an entry-wise product for matrices A and B of identical dimensions is performed via

A .* B

Both of these operations are vectorized. Note that if the mathematical expression calls for transposing a matrix or vector, you likewise have to transpose it in the code or you may get an error or an incorrect result. The transpose operator is a single quote following the matrix, as in A'.

For reference, a non-vectorized equivalent of standard matrix multiplication would probably involve at least two nested for loops over the rows and columns of the matrices. The vectorized versions are both easier to write and faster to execute, so it’s best to vectorize whenever possible. Both R and Python can use vectorization as well, but multiplying matrices in those languages defaults to the entry-wise version of multiplication, though alternatives are provided for standard matrix multiplication.
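
For example, with numpy arrays in Python (covered later in this chapter), the asterisk is entry-wise and standard matrix multiplication is written differently:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = A * B       # entry-wise product
matrix_product = A @ B    # standard matrix multiplication (also np.dot(A, B))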

As a slightly more informative introduction to MATLAB and Octave syntax, you can implement the item sales quantity example by creating a file named loadData.m containing the function loadData by writing the following lines:

function [data] = loadData()
  data = dlmread("timeseries.tsv");
end

In the same directory, create a script file named itemSalesScript.m containing the following lines:

dataMatrix = loadData();
[ nrows, ncolumns ] = size(dataMatrix);
totalQuantities = zeros(1,1000);
for i = 1:nrows
  totalQuantities( dataMatrix(i,1) ) = totalQuantities( dataMatrix(i,1) ) + dataMatrix(i,2);
end

totalQuantities(12)

You can run this script in Octave from a Unix/Linux/Mac OS command line via the command

user$ octave itemSalesScript.m

and the output written to the command line should be the total quantity sold of item 12 appearing in your data file. To run the same code at the MATLAB or Octave prompt, copy and paste the contents of itemSalesScript.m to the prompt and press Enter.

I would use MATLAB or Octave in the following situations:

  • If I’m working with large matrices or large numbers of matrices.
  • If I know that a particular add-on package, particularly in MATLAB, will be greatly useful.
  • If I have a MATLAB license and I like the matrix-friendly syntax.

I would not use MATLAB or Octave in these circumstances:

  • If I have data that isn’t well represented by tables or matrices.
  • If I want my code to integrate with other software. Integration can be difficult and complicated because of MATLAB’s relatively narrow set of intended applications, though many types of integrations are possible.
  • If I want to include my code in a software product to be sold. MATLAB’s license in particular can make this difficult legally.

Overall, MATLAB and Octave are great for engineers (in particular electrical) who work with large matrices in signal processing, communications, image processing, and optimization, among others.

For a look at some of my Octave code (ported from MATLAB once upon a time) from a few years ago, see my bioinformatics project on gene interaction on GitHub at https://github.com/briangodsey/bacon-for-gene-networks. The code is pretty messy, but please don’t fault me for past sins. In a way, that code and that of my other bioinformatics projects represent a kind of snapshot of my hybrid scripts-and-functions coding style at the time.

R

My first note about R: if you want to search for help on the internet, putting the letter R in the search box can lead you to some funny places, though search engines are getting smarter every day. If you have trouble getting good search results, try putting the acronym CRAN in the box as well; CRAN is the Comprehensive R Archive Network and can help Google (and other search engines) direct you to appropriate sites.

R is based on the S programming language that was created at Bell Labs. It’s open source, but its license is somewhat more restrictive than some other popular languages like Python and Java, particularly if you’re building a commercial software product.

R has some idiosyncrasies and differences from the other languages I present. It typically uses the symbol <- to assign a value to a variable, though the equals sign, =, was later added as an alternative for the convenience of people who preferred it. In contrast to MATLAB, R uses square brackets instead of parentheses for indexing lists or matrices, but MATLAB is the weird one there; most languages use square brackets for indexing. Whereas both MATLAB and Python allow the creation of objects like lists, vectors, or matrices beginning with a square bracket, R does not. For example, in MATLAB you can write A = [ 2 3 ] and in Python A = [2, 3] to create a vector/list containing a 2 and a 3, but in R, you’d need to use A <- c(2,3) to accomplish something similar. It’s not a huge difference, but it’s something I forget if I’ve been away from R for a while.

Compared to MATLAB, in R it’s easier to load and handle different types of data. MATLAB is good at handling tabular data but, generally speaking, R is better with tables with headers, mixed column types (integer, decimal, strings, and so on), JSON, and database queries. I’m not saying that MATLAB can’t handle these but that it’s generally more limited or difficult in implementation. In addition, when reading tabular data, R tends to default to returning an object of the type data frame. Data frames are versatile objects containing data in columns, where each column can be of a different data type—for example, numeric, string, or even matrix—but all entries within a single column must be of the same type. Working with data frames can be confusing at first, but their versatility and power are certainly evident after a while.

One of the advantages of R being open source is that it’s far easier for developers to contribute to language and package development wherever they see fit. These open-source contributions have helped R grow immensely and expand its compatibility with other software tools. Thousands of packages are available for R from the CRAN website. I think this is the single greatest strength of the R language; chances are you can find a package that helps you perform the type of analysis you’d like to do, so some of the work has been done for you. MATLAB also has packages, but not nearly as many, though they’re usually very good. R has good ones and bad ones and everything in between. You’ll also find tons of R code that’s freely available in public repos but that might not have made it to official package status.

During my years of bioinformatics research, R was the most commonly used language by my colleagues and our peers at other institutions. Most research groups developing new statistical methods for bioinformatics create an R package or at least make their code available somewhere, as I have for one of my projects, an algorithm called PEACOAT, on GitHub at https://github.com/briangodsey/peacoat.

You can implement the item sales quantity example in R by creating the file itemSalesScript.R containing the following code:

loadData <- function() {
  # Read the tab-delimited file; there is no header row
  data <- read.delim('timeseries.tsv', header=FALSE)
  return(data)
}

data <- loadData()
nrows <- nrow(data)
totalQuantities <- rep(0, 1000)     # one running total per item ID
for( i in 1:nrows ) {
  # Column 1 holds the item ID and column 2 the quantity sold
  totalQuantities[data[i,1]] <- totalQuantities[data[i,1]] + data[i,2]
}

totalQuantities[12]                 # total quantity sold for item 12

You can run this script in R from a Unix/Linux/Mac OS command line via this command:

user$ Rscript itemSalesScript.R

Or, from the prompt of an interactive R session, you can copy and paste the contents of itemSalesScript.R and press Enter.

Aside from the differences in syntax and function names between MATLAB and R, you may notice that the basic structure is the same. R uses curly braces, {}, to delimit function definitions and for loops, whereas MATLAB uses an end command to mark the end of a code block. The R function also uses an explicit return statement, which isn’t present in MATLAB.

I would use R in these situations:

  • If I’m working in a field for which there are many R packages.
  • If I’m working in academia, particularly bioinformatics or social sciences.
  • If I’d like to load, parse, and manipulate varied data sets quickly.

I would not use R in these circumstances:

  • If I’m creating production software.
  • If I’m creating software to be sold. The GPL license has implications.
  • If I’d like to integrate my code into software in other languages.
  • If I’d like to use object-oriented architecture. It’s not great in R.

Overall, R is a good choice for statisticians and others whose work is data-heavy and exploratory rather than focused on building production software for, say, the analytic software industry.

Python

First and foremost, Python is the only one of the three scripting languages I present here that wasn’t intended to be primarily a statistical language. In that way, it lends itself more naturally to non-statistical tasks like integrating with other software services, creating APIs and web services, and building applications. Python is also the only language of the three that I’d seriously consider using for creating production software, though in that respect Python still falls short of Java, which I’ll discuss next.

Python, like any language, has its idiosyncrasies. The most obvious one is its lack of braces to denote code blocks, without even an end command like MATLAB’s to say when a for loop or function definition has ended. Python uses indentation to denote such code blocks, to the eternal chagrin of many programmers everywhere. Indenting code blocks is a common convention, but Python is one of the few languages that force you to do so, and it’s certainly the most popular among them. The gist is that if you want such a code block to end, instead of typing end as in MATLAB or using a closing brace, }, as in R, Java, and many others, you stop indenting your code. Likewise, you must indent your code immediately following a for command or a function definition line containing def; otherwise you’ll get an error.

Because Python was designed as a general-purpose programming language, it has a robust framework for object-oriented design. By contrast, the object-oriented features of R and MATLAB seem like an afterthought. I’ve grown to like object-oriented design, even for simple tasks, and I use it quite often now that Python has become my primary programming language over the last few years.

Although Python wasn’t originally intended to be a heavily statistical language, several packages have been developed for Python that elevate it to compete with R and MATLAB. The numpy package for numerical methods is indispensable when working with vectors, arrays, and matrices. The packages scipy and scikit-learn add functionality in optimization, integration, clustering, regression, classification, and machine learning, among other techniques. With those three packages, Python rivals the core functionality of both R and MATLAB, and in some areas, such as machine learning, Python seems to be more popular among data scientists.
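To give a feel for how these packages fit together, here’s a minimal sketch of my own (with made-up toy data, not an example from this book’s files) that touches numpy, scipy, and scikit-learn in a few lines:

import numpy as np
from scipy import optimize
from sklearn.cluster import KMeans

x = np.array([1.0, 2.0, 3.0, 4.0])              # a numpy vector
print(x.mean(), x.dot(x))                       # basic vector arithmetic

# scipy: minimize a simple one-dimensional function
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)
print(result.x)                                 # approximately 2.0

# scikit-learn: cluster four 2-D points into two groups
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
print(KMeans(n_clusters=2, n_init=10).fit_predict(points))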

For data handling, the pandas package has become incredibly popular. It was influenced somewhat by the notion of a data frame in R but has since surpassed it in functionality. Admittedly, I had some trouble figuring out how to make good use of pandas when I first tried it, but after some practice it became very handy. My impression is that pandas data frames act as optimized, in-memory data stores inside Python. If your data set is big enough to slow down calculations but small enough to fit in your computer’s memory, then pandas might be for you.
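As a rough illustration, here’s how the item sales quantity example from earlier might look with pandas. This is my own sketch, assuming the same two-column timeseries.tsv file (item ID, then quantity):

import pandas as pd

# Read the tab-delimited file; there's no header row, so name the columns
data = pd.read_csv('timeseries.tsv', sep='\t', header=None,
                   names=['item', 'quantity'])

# Group the rows by item ID and total the quantities for each item
totalQuantities = data.groupby('item')['quantity'].sum()

print(totalQuantities[12])    # total quantity sold for item 12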

One of the most notable Python packages in data science, however, is the Natural Language Toolkit (NLTK). It’s easily the most popular and most robust tool for natural language processing (NLP). These days, if someone is parsing and analyzing text from Twitter, newsfeeds, the Enron email corpus, or somewhere else, it’s likely that they’ve used NLTK to do so. It makes use of other NLP tools such as WordNet and various methods of tokenization and stemming to offer the most comprehensive set of NLP capabilities found in one place.
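As a small taste of NLTK, the following sketch (my own, with a made-up sentence) tokenizes some text and stems the resulting tokens; it assumes you’ve installed NLTK and downloaded its 'punkt' tokenizer data:

import nltk
from nltk.stem import PorterStemmer

sentence = "The data scientists were analyzing thousands of tweets."
tokens = nltk.word_tokenize(sentence)         # split the text into word tokens
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]     # reduce each token to its stem

print(tokens)
print(stems)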

As for core functionality, the item sales quantity example written in Python in a file called itemSalesScript.py might look like this:

import csv

def loadData():
  data = []
  # Open the tab-delimited file and read it row by row
  with open('timeseries.tsv', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
      data.append(row)
  return data

dataList = loadData()
nrows = len(dataList)
totalQuantities = [0] * 1000
for i in range(nrows):
  # Column 0 holds the item ID and column 1 the quantity sold
  totalQuantities[ int(dataList[i][0]) ] += int(dataList[i][1])

print(totalQuantities[12])

You can run this script in Python from a Unix/Linux/Mac OS command line via the command

user$ python itemSalesScript.py

or copy and paste the contents of the file to a Python prompt and press Enter.

Note how indentation is used to denote the end of the function definition and the for loop. Also notice that, like R, Python uses square brackets to select items from lists/vectors, but that Python uses zero-based indexing. To get the first item in a list called dataList in Python, you’d use dataList[0] instead of the dataList[1] you’d use in R or the dataList(1) you’d use in MATLAB. That tripped me up a few times when I was learning Python, so watch out for it. Then again, most software developers are used to zero-based indexing in languages like Java and C, so they’d be more likely to be tripped up by R and MATLAB than by Python.

One final note about the code example: in two places I had to use the int function to coerce a value from a string to an integer, because the csv package treats all values as strings unless told otherwise. There are surely better ways to handle this, not least converting the data to arrays using the numpy package, which is what I’d do anyway if I were working with the data more intensively, but I left that out to keep the example clear.
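For instance, a numpy-based version of the aggregation might look like the following sketch (my own, again assuming the same timeseries.tsv file with integer quantities); np.loadtxt handles the string-to-integer conversion, and np.add.at accumulates the quantities by item ID:

import numpy as np

# Read the tab-delimited file as integers: column 0 is the item ID,
# column 1 is the quantity sold
data = np.loadtxt('timeseries.tsv', dtype=int)

totalQuantities = np.zeros(1000, dtype=int)
np.add.at(totalQuantities, data[:, 0], data[:, 1])   # sum quantities per item ID

print(totalQuantities[12])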

I would use Python in these situations:

  • If I’m creating an analytic software application, prototype, or maybe production software.
  • If I’m doing machine learning or NLP.
  • If I’m integrating with another software service or application.
  • If I’m doing a lot of non-statistical programming.

I wouldn’t use Python in these circumstances:

  • If I worked in a field where most people use another language and share their code.
  • If Python’s packages in my field were inferior to those of another language, like R.
  • If I wanted to generate graphs and plots quickly and easily. R’s plotting packages are significantly better.

I’ve mentioned that Python is now my language of choice, after switching from R a few years ago. I made the switch because I’ve been programming production proprietary software, which involves a lot of non-statistical code, and for that I find Python vastly better. Python’s license freely allows the sale of your software without requiring you to provide the source code. Overall, I recommend Python for people who want to do some data science as well as some pure, non-statistical software development. It’s the only popular, robust language I know of that can do both well.

Java

Though not a scripting language and as such not well suited for exploratory data science, Java is one of the most prominent languages for software application development, and because of this it’s used often in analytic application development. Many of the same reasons that make Java bad for exploratory data science make it good for application development.

For one, Java has strong, static typing, which means that you have to declare what type of object a variable is when you create it, and that type can never change. Java methods and fields also carry modifiers—public, private, static, final, and so on—and choosing the appropriate ones helps ensure that a method is used correctly and only at suitable times. Variable scope and object inheritance rules are also quite strict, at least compared to Python and R. All these strict rules make writing code slower, but the resulting application is typically more robust and far less prone to error. I sometimes wish I could impose some of these restrictions on my Python code, because every so often a particularly nasty bug can be traced back to a dumb thing I did that could have been prevented by one of Java’s strict rules.

Java isn’t great for exploratory data science, but it can be great for large-scale or production code based on data science. Java has many statistical libraries for doing everything from optimization to machine learning. Many of these are provided and supported by the Apache Software Foundation.

I would use Java in the following situations:

  • If I’m creating an application that needs to be very robust and portable.
  • If, being already familiar with Java, I know it has the capabilities I need.
  • If I’m working on a team that uses mainly Java, and using another language would be a hardship for the overall development effort.

I wouldn’t use Java in the following circumstances:

  • If I’m doing a lot of exploratory data science.
  • If I didn’t know much about Java.
  • If I don’t need a truly robust, portable application.

Though I haven’t provided much detail about Java, I do want to convey the popularity of this language in data science–related applications and also say that, for most experienced developers I know, it would be their first choice for attempting to build a bulletproof piece of analytic software.

Table 8.1 summarizes when I’d use each programming language for projects in data science.

Table 8.1. A summary of when I would use each programming language for projects in data science

MATLAB/Octave
When I would use it:
  • If I’m working with large matrices or large numbers of matrices.
  • If I know that a particular add-on package, particularly in MATLAB, will be greatly useful.
  • If I have a MATLAB license and I like the matrix-friendly syntax.
When I would not use it:
  • If I have data that isn’t well represented by tables or matrices.
  • If I want my code to integrate with other software; it can be difficult and complicated, though there are various options.
  • If I want to include my code in a software product to be sold. MATLAB’s license in particular can make this difficult legally.

R
When I would use it:
  • If I’m working in a field for which there are many R packages.
  • If I’m working in academia, particularly bioinformatics or social sciences.
  • If I’d like to load, parse, and manipulate varied data sets quickly.
When I would not use it:
  • If I’m creating production software.
  • If I’m creating software to be sold. The GPL license has implications.
  • If I’d like to integrate my code into software in other languages.
  • If I’d like to use object-oriented architecture. It’s not great in R.

Python
When I would use it:
  • If I’m creating an analytic software application, prototype, or maybe production software.
  • If I’m doing machine learning or NLP.
  • If I’m integrating with another software service or application.
  • If I’m doing a lot of non-statistical programming.
When I would not use it:
  • If I worked in a field where most people use another language and share their code.
  • If Python’s packages in my field are inferior to those of another language, like R.
  • If I want to generate graphs and plots quickly and easily. R’s plotting packages are significantly better.

Java
When I would use it:
  • If I’m creating an application that needs to be very robust and portable.
  • If, being already familiar with Java, I know it has the capabilities I need.
  • If I’m working on a team that uses mainly Java, and using another language would be a hardship for the overall development effort.
When I would not use it:
  • If I’m doing a lot of exploratory data science.
  • If I didn’t know much about Java.
  • If I don’t need a truly robust, portable application.

8.3. Choosing statistical software tools

So far in this chapter I’ve talked about the basics of some statistical applications and programming languages, and I hope I’ve given you a good idea of the range of tools available for implementing the statistical methods discussed in the last chapter. If that chapter served its purpose, you’ve related your project and your data to some appropriate mathematical or statistical methods or models. If so, you can compare those methods or models with the software options available to implement them, arriving at a good option or two. In choosing software tools, there are various things to consider and some general rules to follow. I’ll outline those here.

8.3.1. Does the tool have an implementation of the methods?

Sure, you can always code the methods yourself, but if you’re using a fairly common method, then many tools probably already have an implementation, and it’s probably better to use one of those. Code that’s been used by many people already is usually relatively error free compared to some code that you wrote in a day and used only once or twice.

Depending on your ability to program and your familiarity with various statistical tools, you may have a readily available implementation in one of your favorite tools that you could put to use quickly. If Excel has it, then most likely every other tool does too. If Excel doesn’t, then maybe the mid-level tools do, and if they don’t, then you’re probably going to have to write a program. If writing one yourself isn’t an option, the only remaining choice is to use a different statistical method.

If you do decide to go with a programming language, remember that not all packages or libraries are created equal, so make sure the programming language and the package that you intend to use can do exactly what you want. It might be helpful to read the documentation or some examples that are relatively similar to the analysis you want to do.

8.3.2. Flexibility is good

In addition to being able to perform the main statistical analysis that you want, it’s often helpful if a statistical tool can perform some related methods. Often you’ll find that the method you chose doesn’t quite work as well as you had hoped, and what you’ve learned in the process leads you to believe that a different method might work better. If your software tool doesn’t have any alternatives, then you’re either stuck with the first choice or you’ll have to switch to another tool.

For example, if you have a statistical model and you want to find the optimal parameter values, you’ll be using a likelihood function and an optimization technique. In chapter 7, I outlined a few types of methods for finding optimal parameters from a likelihood function, including maximum likelihood (ML), maximum a posteriori (MAP), expectation-maximization (EM), and variational Bayes (VB). Although Excel has a few different specific optimization algorithms, they’re all ML methods, so if you think you can get away with ML but you’re not sure, you may want to level up to a more sophisticated statistical tool that has more options for optimization.

There are multiple types of regression, clustering, component analysis, and machine learning, among others, and some tools may offer one or more of those methods. I tend to favor those statistical tools that offer a few from each of these method categories in case I need to switch or to try another.

8.3.3. Informative is good

I’ve stressed that awareness in the face of uncertainty is a primary aspect of data science; this carries over into selection of statistical software tools. Some tools might give good results but don’t provide insight into how and why those results were reached. On one hand, it’s good to be able to deconstruct the methods and the model so that you understand the model and the system better. On the other hand, if your methods make a mistake in some way, and you find yourself looking at a weird, unexpected result, then more information about the method and its application to your data can help you diagnose the specific problem.

Some statistical tools, particularly higher-level ones like statistical programming languages, offer the capability to see inside nearly every statistical method and result, even black box methods like machine learning. These internals aren’t always user friendly, but at least they’re available. In my experience, spreadsheets like Excel don’t offer much insight into their methods, so it’s difficult to deconstruct or diagnose problems for statistical models that are more complicated than, say, linear regression.

8.3.4. Common is good

With many things in life—music, television, film, news articles—popularity doesn’t always indicate quality, and in fact it often does the contrary. With software, more people using a tool means more people have tried it, gotten results, examined the results, and probably reported the problems they had, if any. In that way, software, notably open-source software, has a feedback loop that fixes mistakes and problems in a reasonably timely fashion. The more people participating in this feedback loop, the more likely it is that a piece of software is relatively bug free and otherwise robust.

This is not to say that the most popular thing right now is the best. Software goes through trends and fads like everything else. I tend to look at popularity over the past few years of use by people who are in a similar situation to me. In a general popularity contest of statistical tools, Excel would obviously win. But if you consider only data scientists, and maybe only data scientists in a particular field—excluding accountants, finance professionals, and other semi-statistical users—you’d probably see its popularity fade in favor of the more serious statistical tools.

A tool must meet these criteria if I’m going to use it:

  • The tool must be at least a few years old.
  • The tool must be maintained by a reputable organization.
  • Forums, blogs, and literature must show that many people have been using the tool for quite some time and without many significant problems recently.

8.3.5. Well documented is good

In addition to being in common use, a statistical software tool should have comprehensive and helpful documentation. It’s very frustrating when I’m trying to use a piece of software and I have a question that I feel should have a straightforward answer, but I can’t find that answer anywhere.

It’s a bad sign if you can’t find answers to some big questions, such as how to configure inputs for doing linear regression or how to format the features for machine learning. If the answers to big questions aren’t in the documentation, then it’s going to be even harder to find answers to the more particular questions that you’ll inevitably run into later.

Documentation is usually a function of the age and popularity of the software. The official documentation for the tool should be on the maintaining organization’s web page, and it should contain informative instructions and specifications in plain language that you can understand. It’s funny to me how many software organizations avoid plain language in their documentation or make their examples overly complicated. Perhaps it’s my aversion to unnecessary jargon, but I shy away from software whose documentation I don’t readily understand.

Along with determining whether a tool is common enough, I also check forums and blog posts to determine whether there are sufficient examples and questions with answers that support the official documentation. No matter how good the documentation is, it almost certainly has gaps and ambiguities somewhere, so it’s helpful to have informal documentation as a backup.

8.3.6. Purpose-built is good

Some software tools or their packages were built for a specific purpose, and then other functionality was added on later. For example, the matrix algebra routines in MATLAB and R were of primary concern when the languages were built, so it’s safe to assume that they’re comprehensive and robust. In contrast, matrix algebra wasn’t of primary concern in the initial versions of Python and Java, and so these capabilities were added later in the form of packages and libraries. This isn’t necessarily bad; Python and Java happen to have robust matrix functionality now, but the same can’t be said for every language that claims to be able to handle matrices efficiently.

In cases where the statistical methods I want to use are a package, library, or add-on to the software tool that I want to use, I place the same scrutiny on that package that I would on the tool itself: is it flexible, informative, commonly used, well documented, and otherwise robust?

8.3.7. Interoperability is good

Interoperability is a sort of converse of being purpose-built, but they’re not mutually exclusive. Some software tools play well with others, and in these you can expect to be able to integrate functionalities, import data, and export results, all in generally accepted formats. This is helpful in projects where other software is being used for related tasks.

If you’re working with a database, it can be helpful to use a tool that can interact with the database directly. If you’re going to build a web application based on your results, you might want to choose a tool that supports web frameworks—or at least one that can export data in JSON or some other web-friendly format. Or if you’ll use your statistical tool on various types of computers, then you’ll want the software to be able to run on the various operating systems. It’s not uncommon to integrate a statistical software method into a completely different language or tool. If this is the case, then it’s good to check whether, for example, you can call Python functions from Java (you can, with some effort).

R was purpose-built for statistics, and interoperability was something of an afterthought, although there’s a vast ecosystem of packages supporting integration with other software. Python was built as a general programming language and statistics was an afterthought, but as I said, the statistical packages for Python are some of the best available. Choosing between them and others is a matter of vetting all languages, applications, and packages you intend to use.

8.3.8. Permissive licenses are good

Most software has a license, either explicit or implied, that states what restrictions or permissions exist on the use of the software. Proprietary software licenses are usually pretty obvious, but open-source licenses often aren’t as clear.

If you’re using commercial software for commercial purposes, it can be legally risky to do so under an academic or student license. It can also be dangerous to sell someone else’s commercial software, modified or not, without confirming that its license doesn’t prohibit this.

When I do data science using an open-source tool, my main question is this: can I create software using this tool and sell it to someone without divulging the source code? Some open-source licenses allow this, and some don’t. It’s my understanding (though I’m not a lawyer) that I can’t sell someone an application I’ve written in R without also providing the source code; in Python and Java, doing so is generally permitted, which is one reason why production applications aren’t generally built in R or languages with similar licenses. There are usually legal paths around this, such as hosting the R code yourself and providing its functionality as a web service. In any case, it’s best to check the license and consult a legal expert if you suspect you might violate a software license.

8.3.9. Knowledge and familiarity are good

I put this general rule last, though I suspect that most people, me included, consider it first. I’ll admit: I tend to use what I know. There might be nothing wrong with using the tool you know best, as long as it works reasonably well with the previous rules. Python and R, for example, are pretty good at almost everything in data science, and if you know one better than the other, by all means use that one again.

On the other hand, many tools out there aren’t the right tool for the job. Trying to use Excel for machine learning, for example, isn’t usually the best idea, though I hear this is changing as Microsoft expands its offerings. In cases like this one, where you might be able to get by with a tool that you know, it’s definitely worth considering learning one that’s more appropriate for your project.

In the end, it’s a matter of balancing the time you’ll save by using a tool you know against the time and quality of results you’ll lose by using a tool that isn’t appropriate. The time constraints and requirements of your project are often the deciding factors here.

8.4. Translating statistics into software

Putting math into your code is no small task. Many people seem to think that doing math on data is as easy as importing a statistics library and clicking Go. Maybe if they’re lucky, it’ll work, but only until the uncertainty sneaks up on them and their lack of awareness of what’s going on in the statistical methods and code fails to prevent a problem of some kind. I know this is a contrived scenario, but I concoct such a heinous straw man to stress the importance of understanding the statistical methods that you’ve chosen and how they relate to the software that you’re using or creating.

8.4.1. Using built-in methods

Any of the mid-level statistical tools should have proper instructions for how to apply their various statistical methods to your data. Although I don’t have much recent experience, I expect that either the in-application guidance or documentation available online should suffice for anyone to figure out how to apply a standard statistical method.

Programming languages are usually a bit more complicated, and I’ve found that it’s often quite hard to find bare-basics instructions and examples on how to implement and perform even the simplest statistical analyses. It’s my impression that most documentation assumes a fair amount of knowledge of the language, which can make it confusing for beginners. Therefore, I’ll present two examples here that illustrate how linear regression can be applied, one in R, and one in Python.

Linear regression in R

R typically uses a functional style, as in this example:

# Create a data frame with two input variables, X1 and X2, and one output, y
data <- data.frame(X1 = c( 1.01, 1.99, 2.99, 4.01 ),
                   X2 = c( 0.0, -2.0, 2.0, -1.0 ),
                   y  = c( 3.0, 5.0, 7.0, 9.0 ))

# Fit a linear model predicting y from X1 and X2, and summarize the fit
linearModel <- lm(y ~ X1 + X2, data)
summary(linearModel)

# Predict y-values for the same data used to fit the model
predict(linearModel, data)

First, this script creates a data frame object containing three variables, X1, X2, and y. A data frame is an object, but here it’s constructed by a function called data.frame, which returns a data frame object that’s stored in the variable data.

The assumed task here is that you want to perform linear regression such that X1 and X2 are inputs, and y is the output. You want to be able to use X1 and X2 to predict y, and you want to find a good linear model that can do this. The second command in the script specifies that you want to create a linear model via the function lm whose parameters are, first, a formula, y ~ X1 + X2, and second, a data frame, data, containing the data. Formulas are curious but useful constructs in R that I’ve hardly seen anywhere else. They’re intended to represent a sort of mathematical relation between the variables. As you can probably guess, this formula tells the lm function that you want to predict y using X1 and X2. The intercept (y-value when all input variables are zero) is also automatically added, unless you explicitly remove it. You can add more input variables to the formula or add combinations of variables, such as the product of two variables, the square of a variable, and so on. You have numerous possibilities for constructing a formula in R, and it can get quite complicated, so I suggest consulting the documentation before you write your own.

The variables named in the formula must match variables included in the data frame passed to lm. The lm function performs the regression and returns a fitted linear model, which is stored in the variable linearModel. It’s important to note that I constructed the data in this example so that the regression results would be easy to anticipate. Each of the four data points has a value for X1, X2, and y. The data frame data looks like this (the > is the R prompt):

> data
    X1 X2 y
1 1.01  0 3
2 1.99 -2 5
3 2.99  2 7
4 4.01 -1 9

If you study the numbers closely, you may notice that in each row the y-value is pretty close to 2*X1 + 1, and that the X2 values don’t contribute much information about what the value of y will be. You’d say that X1 is predictive of y but X2 is not, and you’d expect the results of the linear regression to reflect this.

The command summary(linearModel) prints output to the screen, giving information about the linear model that was fit to the data, which can be seen here:

> summary(linearModel)

Call:
lm(formula = y ~ X1 + X2, data = data)

Residuals:
       1        2        3        4
-0.02115  0.02384  0.01500 -0.01769

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.001542   0.048723  20.556  0.03095 *
X1          1.999614   0.017675 113.134  0.00563 **
X2          0.002307   0.013361   0.173  0.89114
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03942 on 1 degrees of freedom
Multiple R-squared:  0.9999,       Adjusted R-squared:  0.9998

You can see in this output that the coefficient estimate for X1 is very nearly 2, that for X2 is almost zero, and that of the intercept is very nearly 1, so the results meet your expectations.

The rest of the output—standard errors, p-values, R-squared, and so on—describes the goodness of fit, the significance of the coefficients, and other statistics about the model that indicate whether or not it’s a good model.

The final line of the script predicts y-values for the data points provided in the input data frame, which in this case is the same as the data with which you trained the model. The predict function takes the variables X1 and X2 in data and outputs y-values according to the model in linearModel. Here’s the printed output:

> predict(linearModel,data)
       1        2        3        4
3.021152 4.976159 6.985002 9.017687

Each of these four values is the model’s prediction for one of the data points (rows) in data. As you can see, the values are pretty close to the y-values on which the model was trained.

Linear regression in Python

Python has multiple packages that offer linear regression methods, which makes it a good example of balancing the general rules for choosing software that I outlined previously. From what I’ve gathered, the LinearRegression object in the sklearn package might be the most popular, but its summary of the fitted model isn’t nearly as informative as the output of R’s lm function. The OLS class in the linear_model module of the statsmodels package, on the other hand, readily provides informative output, so I use that here.
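For comparison, here’s a minimal sketch (my own illustration, not part of the book’s example files) of what the scikit-learn version might look like; it fits the same data and gives coefficient estimates and predictions, but no summary table like R’s:

from sklearn.linear_model import LinearRegression

X = [[1.01,  0.0],
     [1.99, -2.0],
     [2.99,  2.0],
     [4.01, -1.0]]
y = [3.0, 5.0, 7.0, 9.0]

model = LinearRegression().fit(X, y)   # the intercept is fitted automatically
print(model.coef_)                     # roughly [2.0, 0.0]
print(model.intercept_)                # roughly 1.0
print(model.predict(X))                # predictions for the training points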

The following code uses an object-oriented style, and the two main objects created are the linear model, which I called linearModel in the script, and a results object, which I called results. Methods of those objects are called to create the model, fit the model, summarize results, and make the same predictions you made in the R script. If you’re not familiar with object-oriented coding, it can seem a bit weird at first:

import statsmodels.regression.linear_model as lm

# Input data: columns are X1, X2, and a column of 1s for the intercept
X = [ [ 1.01,  0.0, 1 ],
      [ 1.99, -2.0, 1 ],
      [ 2.99,  2.0, 1 ],
      [ 4.01, -1.0, 1 ] ]
y = [ 3.0, 5.0, 7.0, 9.0 ]

linearModel = lm.OLS(y, X)       # create the ordinary least squares model
results = linearModel.fit()      # fit the model to the data
print(results.summary())         # print a summary of the fitted model

print(results.predict(X))        # predict y-values from the original X values

Notice how I created the variables X and y. They’re lists of values that, when fitting the model, are coerced into the appropriate array/matrix form. Data frames from the package pandas can also be used, but I elected not to use them here. I also added a column of 1s to the right of the X data because the model doesn’t automatically add an intercept value. There are other, more elegant ways to add an intercept, but I didn’t use them for the sake of clarity.
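One of those more elegant ways, sketched here as my own variation on the example, is to let statsmodels append the intercept column for you with add_constant:

import statsmodels.api as sm

X_raw = [[1.01,  0.0],
         [1.99, -2.0],
         [2.99,  2.0],
         [4.01, -1.0]]
y = [3.0, 5.0, 7.0, 9.0]

X = sm.add_constant(X_raw)        # prepends a column of 1s for the intercept
results = sm.OLS(y, X).fit()
print(results.params)             # intercept coefficient first, then X1 and X2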

After creating the data objects, the script performs the same steps that the R script does: create the model, using the OLS class in the statsmodels package; fit the model, using the object method call linearModel.fit(); print a summary of results; and predict y values from the original X values.

The printed output from the method call results.summary() is shown here:

OLS Regression Results
===========================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                     6436.
Date:                Sun, 03 Jan 2016   Prob (F-statistic):            0.00881
Time:                        13:45:54   Log-Likelihood:                 10.031
No. Observations:                   4   AIC:                            -14.06
Df Residuals:                       1   BIC:                            -15.90
Df Model:                           2
Covariance Type:            nonrobust
===========================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------

It shows many of the same statistics as the summary in R plus some others. More importantly, it gives identical results.

8.4.2. Writing your own methods

As a researcher in academia, I mostly developed algorithms for analyzing various systems and data in the field of bioinformatics. Because they were new, I didn’t have the luxury of applying a method in a software package, though in some cases I used some available methods to aid in things like optimization and model fitting.

Creating new statistical methods can be time consuming, and I don’t recommend doing it unless you know what you’re doing or it’s your job. But if you must, it helps to know where to start. Generally speaking, I begin with the mathematical specification of the statistical model, which might look something like the model I described in the last chapter. That model contained several specifications of the probability distributions of the data and model parameters, such as this:

x_{n,g} ~ N( μ_g , 1/λ )

If the model and all of its parameters and variables have been specified like that, then you need to convert the specification into a likelihood function that you can use to find optimal parameter values or distributions.

To get a likelihood function, which is a function of the parameter values given the observed data values, you take the probability distribution function evaluated at each individual data point and multiply these together. Because probability densities are usually small, and multiplying many of them together produces an extremely small number that can cause numerical problems, it’s usually better to take the logarithm of the likelihood for each data point and then add the logarithms together. Because a function and its logarithm have their maxima at the same point, maximizing the sum of the log-likelihoods is equivalent to maximizing the product of the likelihoods. In fact, most software and algorithms designed for working with probabilities and likelihoods take advantage of the logarithm in this way.

Once you have the mathematical specification of the joint likelihood and log-likelihood, you can convert this directly into software by using whatever mathematics libraries the language offers. It should be straightforward to write a software function that takes as input the parameter values and gives as output a log-likelihood based on the data. This function should do in software exactly what the log-likelihood function would do in mathematics.

Now that you have a software version of the joint likelihood function, you need to find optimal parameter values or distributions using one of the algorithms discussed in chapter 7: maximum likelihood estimation (MLE), maximum a posteriori (MAP), expectation-maximization (EM), variational Bayes (VB), or Markov chain Monte Carlo (MCMC). Most statistics software has methods for MLE, which would entail maximizing your joint likelihood function using an optimization routine. This can be easy for simple models but tricky for complex ones.
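To make the last two paragraphs concrete, here’s a minimal sketch of my own (with made-up data) for the normal model above, assuming a single group: a log-likelihood function written with scipy, and an MLE found by minimizing its negative with a general-purpose optimizer:

import numpy as np
from scipy import optimize, stats

x = np.array([4.8, 5.1, 5.3, 4.9, 5.2])    # hypothetical observed data

def log_likelihood(params, x):
    mu, lam = params                        # mean and precision of the normal model
    if lam <= 0:
        return -np.inf                      # the precision must be positive
    # Sum of log-densities: one N(mu, 1/lambda) term per data point
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=1.0 / np.sqrt(lam)))

# MLE: maximize the log-likelihood by minimizing its negative
result = optimize.minimize(lambda p: -log_likelihood(p, x),
                           x0=[0.0, 1.0], method='Nelder-Mead')
mu_hat, lam_hat = result.x
print(mu_hat, lam_hat)                      # estimated mean and precision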

Using MAP methods would mean going back to your model equations and deriving, mathematically, a specification of the posterior distribution of the parameters. Then you could maximize this posterior distribution in much the same way as you would the likelihood for MLE. Formulating a posterior distribution isn’t trivial, so you may want to consult a reference on Bayesian models before trying it.

EM, VB, and MCMC also typically depend on the same posterior distribution that MAP does. Many software tools have an implementation of MCMC, so you might be able to apply those directly to get an estimate of the parameter posterior distribution, but with EM and VB, you usually have to code the model-fitting algorithm yourself, though there have been efforts to create software that simplifies the process. The difficulty of developing algorithms like EM and VB is probably one of the main reasons why MCMC is so popular. It can be tricky to get MCMC to work, but once it does, it lets raw computational power take the place of the human programming required for the other two algorithms.
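Libraries such as PyMC and Stan provide production-grade MCMC samplers, but the core idea fits in a few lines. The following is a bare-bones Metropolis sampler (my own sketch, again with made-up data) that draws from the posterior of μ in the same single-group normal model, assuming the precision λ is known and the prior on μ is flat:

import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.3, 4.9, 5.2])    # hypothetical observed data
lam = 25.0                                  # assume the precision is known

def log_posterior(mu):
    # With a flat prior, the log-posterior equals the log-likelihood up to a constant
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=1.0 / np.sqrt(lam)))

rng = np.random.default_rng(0)
samples, mu = [], 0.0
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.5)                   # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal                                       # accept the proposed value
    samples.append(mu)

print(np.mean(samples[1000:]))    # posterior mean estimate after discarding burn-in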

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, try the following:

1. What are your two top choices of software for performing the calculations necessary for forecasting in this project and why? What’s a disadvantage for each of these?

2. Do your two choices in question 1 have built-in functions for linear regression or other methods for time-series forecasting? What are they?

Summary

  • Statistical software is an implementation of theoretical statistical models; understanding the relationship between the two is important for awareness of your project.
  • A wide range of software is available for doing data science, from spreadsheets to mid-level statistical tools to statistical programming languages and libraries.
  • Sometimes even spreadsheets can be useful to data scientists, for simpler tasks.
  • Several good mid-level statistical tools are on the market, each with strengths and limitations.
  • Programming isn’t that hard, but it does take some time to learn, and it offers maximum flexibility in doing statistics on your data.