Chapter 9. Supplementary software: bigger, faster, more efficient

This chapter covers

  • Non-statistical software that can help you do statistics more efficiently
  • Some popular and ubiquitous software concepts related to analytic software
  • Basic guidelines for using supplementary software

Figure 9.1 shows where we are in the data science process: optimizing a product with supplementary software. The software tools covered in chapter 8 can be very versatile, but there I focused mainly on the statistical nature of each. Software can do much more than statistics. In particular, many tools are available that are designed to store, manage, and move data efficiently. Some can make almost every aspect of calculation and analysis faster and easier to manage. In this chapter I’ll introduce some of the most popular and most beneficial software for making your life and work as a data scientist easier.

Figure 9.1. An important aspect of the build phase of the data science process: using supplementary software to optimize the product

9.1. Databases

I discussed the concept of a database in chapter 3 as one form of data source. Databases are common, and your chances of running across one during a project are fairly high, particularly if you’re going to be using data that’s used by others quite often. But instead of merely running into one as a matter of course, it might be worthwhile to set up a database yourself to aid you in your project.

9.1.1. Types of databases

Many types of databases exist, each designed to store data and provide access to it in its own way. But all databases are designed to be more efficient than standard file-based storage, at least for some applications.

More types (and subtypes) exist, but the two most common categories of databases today are relational and document-oriented. Though I'm certainly not an expert in database models and theory, I'll describe the two types from the perspective of how I typically think about them and interact with them.

Relational

Relational databases are all about tables. A table in a relational database can usually be visualized as a two-dimensional sheet such as those found in spreadsheets: the sheet contains rows and columns, with data elements in the cells.

The powerful thing about relational databases is that they can hold many tables and can, behind the scenes, relate the tables to each other in clever ways. This way, even the most complicated queries spanning multiple tables and data types can be executed efficiently, often saving enormous amounts of time compared with a primitive scan-the-tables approach to finding data that matches a query.

Among relational databases, which have been popular for decades, one dominant language has emerged for formulating queries: Structured Query Language (SQL, usually pronounced "sequel"). SQL is practically ubiquitous, though other query languages for relational databases do exist. In short, you can use SQL to query many different kinds/brands/subtypes of relational databases, so if you're familiar with SQL, you can begin to work with an unfamiliar database without learning a new query syntax. On the other hand, not all SQL-based databases use exactly the same syntax, so some relatively minor adaptations to a specific query may be necessary for it to work on a new database.
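
To make this concrete, here's a minimal sketch using Python's built-in sqlite3 module; the table and data are invented for illustration, but the same SELECT statement would work, with at most minor dialect tweaks, on most relational databases.

  import sqlite3

  # Create a throwaway in-memory relational database and a small table
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
  conn.executemany("INSERT INTO customers (name, country) VALUES (?, ?)",
                   [("Alice", "DE"), ("Bob", "US"), ("Carla", "DE")])

  # The query is plain SQL, portable across relational databases
  for row in conn.execute("SELECT name FROM customers WHERE country = 'DE' ORDER BY name"):
      print(row[0])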

Document-oriented

In some sense, document-oriented databases are the antithesis to relational databases. Whereas relational databases have tables, document-oriented databases have, well, documents. No surprise there.

A document in this case can be a set of so-called unstructured data, like the text of an email, along with a set of structured identifying information, such as the email's sender and time sent. It's closely related to the key-value concept of data storage, wherein data is stored and catalogued for easy retrieval according to a small set of keys. Keys are usually chosen to be the fields of a data point by which you would find it when querying—for example, ID, name, address, date, and so on. The values of a data point can be thought of as a sort of payload that's stored alongside the keys but that generally isn't used to find the data point in a query. The value can be a messy pile of data if you want, because you're not usually querying the data by value (for a notable exception, keep reading).

Things like raw text, lists of unknown length, JSON objects, or other data that doesn’t seem well suited for fitting into a table can usually fit easily into a document-oriented database. For efficient querying, each piece of unstructured data (a possible value) would ideally be matched with a few bits of structured identifying information (potential keys), because almost without exception databases handle structured data much more efficiently than unstructured.
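
As a concrete (and entirely hypothetical) illustration, here's what one such document might look like in Python: a few structured, key-like fields you'd query by, plus an unstructured body as the payload.

  import json

  # One document: structured identifying fields (the likely keys) plus
  # an unstructured payload (the value) stored alongside them
  email_doc = {
      "id": "msg-00042",
      "sender": "alice@example.com",
      "sent_at": "2016-03-14T09:26:00Z",
      "tags": ["finance", "forecast"],
      "body": "Hi Bob, attached is the quarterly forecast we discussed...",
  }
  print(json.dumps(email_doc, indent=2))

A JSON object like this can usually be inserted into a document-oriented database as is, with no table schema to define up front.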

Because document-oriented databases are a sort of antithesis of relational databases and tables, the term NoSQL is often applied to them. You’ll find other types of NoSQL databases, but document-oriented is the largest subclass.

Besides being generally more flexible and probably less efficient than relational databases, document-oriented databases can have their own strengths. An example of such a strength can be seen in the popular Elasticsearch data store. Elasticsearch is an open-source document-oriented database built on top of the (also open-source) Apache Lucene text search engine. Lucene and Elasticsearch are good at parsing text, finding certain words and word combinations, and generating statistics about the occurrences of those words. Therefore, if you’re working with a large number of text documents, and you’ll be studying the occurrence of words and phrases, few if any databases (relational or not) will be as efficient as Elasticsearch.

Querying an Elasticsearch (or similar) database by raw text is a notable exception to the general rule that you should query by key and not by value. Because Lucene does such a good job of indexing text, querying by terms in the text behaves more like searching by key than in most other databases.
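
Here's a minimal sketch of such a query against Elasticsearch's REST API, assuming a local node and a hypothetical index named emails with a full-text field called body.

  import requests

  # A full-text "match" query; Elasticsearch returns the best-matching
  # documents ranked by relevance score
  query = {"query": {"match": {"body": "quarterly forecast"}}}
  resp = requests.post("http://localhost:9200/emails/_search", json=query)
  for hit in resp.json()["hits"]["hits"]:
      print(hit["_score"], hit["_source"].get("subject"))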

Other database types

If you’re working with a specific type of data that can’t be represented easily as a set of tables or documents—and so isn’t ideal for either a relational or document-oriented database—it might be worth searching for a database that suits that data type. For instance, graph data such as I’ve used in social-network analysis projects can often benefit from the efficiencies of a graph database. Neo4j is a popular graph database that represents connections between things (such as people in a social network) in a way that makes storage, querying, and analysis of graph data easier. There are many other examples of databases that cater to very specific data types, but I won’t attempt to survey them here. A quick online search should lead you in the right direction.

9.1.2. Benefits of databases

Databases and other related types of data stores can have a number of advantages over storing your data on a computer’s file system. Mostly, databases can provide arbitrary access to your data—via queries—more quickly than the file system can, and they can also scale to large sizes, with redundancy, in convenient ways that can be superior to file system scaling. Here I give brief descriptions of some of the major advantages that databases can offer.

Indexing

A database index is a set of software tricks that generate a sort of map of all the data so that anything can be found quickly and easily. Indexing is the process of building such a map, and it often makes clever use of hardware—disk and memory—to improve overall efficiency.

The price of having an index (versus no index) is some disk and memory space, because the index itself takes up room. Usually you have a choice of creating a very efficient index that takes up more space or a less efficient index that takes up less space. The optimal choice depends on what you’re trying to accomplish.
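
A quick sketch in SQLite shows the trade-off: creating the index costs some space, and EXPLAIN QUERY PLAN (a SQLite-specific command) confirms that a query on the indexed column uses the index rather than scanning the whole table.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")
  conn.execute("CREATE INDEX idx_events_user ON events (user_id)")  # the index takes extra space

  # The reported plan is a SEARCH using the index, not a full table SCAN
  for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 123"):
      print(row)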

Caching

Caching, in the general sense, is the setting aside of data that's accessed very often so that it's readily available in a special, fast-to-reach location. If the often-used data has very short access times, it doesn't matter much that the occasional rarely accessed item takes a bit longer to find, so the overall average access time drops. Databases try to recognize the data that's used most often and hold it close at hand, in memory or other fast storage, instead of putting it back among the rest of the data. Like indexing, caching takes up space, but you usually have a choice of how much space to dedicate to the cache, which in turn determines how effective it is.
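
Databases manage their caches internally, but the same idea is easy to see at the application level. In this sketch the sleep stands in for any slow fetch, such as a query against a remote database, and the cache size is the space-versus-speed trade-off.

  import time
  from functools import lru_cache

  @lru_cache(maxsize=1024)          # how much space to dedicate to the cache
  def lookup_profile(user_id):
      time.sleep(0.1)               # simulate a slow disk read or remote query
      return {"user_id": user_id, "segment": user_id % 5}

  lookup_profile(42)   # slow: the first access pays the full cost
  lookup_profile(42)   # fast: served from the cache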

Scaling

Many types of databases in existence today can be distributed over many machines. Obviously, this isn’t a direct advantage over storing your data in files on a disk, because if you have access to many machines, you have access to many disks. The advantage, then, of a distributed database over a distributed file system is the coordination.

If you have data on many disks on many machines, you have to keep track of what you’re keeping where. Distributed databases are designed to do this automatically. Distributed databases typically consist of shards, or chunks of data that each exist in a single location. A central server (or multiple servers) manages access and transfer between shards. Additional shards can be used to increase the potential size of the database or to replicate data that exists elsewhere, according to the chosen database configuration.
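
Conceptually, routing a record to a shard can be as simple as hashing its key, as in the simplified sketch below; real distributed databases use more sophisticated schemes and handle replication and rebalancing for you.

  import hashlib

  N_SHARDS = 4   # a hypothetical four-shard cluster

  def shard_for(key):
      # Hash the key and map it to one of the shards
      digest = hashlib.md5(key.encode("utf-8")).hexdigest()
      return int(digest, 16) % N_SHARDS

  print(shard_for("user-1001"))   # the shard (machine) where this record lives
  print(shard_for("user-1002"))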

Concurrency

If two different computer processes try to change the same data point at the same time, the changes are said to be concurrent, and the general problem of handling simultaneous operations correctly is referred to as concurrency. Databases generally handle this better than the file system does. If two different processes try to create or edit the same file at the same time, any number of errors may occur, or none at all, which is sometimes a bigger problem because the damage goes unnoticed. Generally speaking, you want to avoid concurrency at all costs on a file system, whereas certain types of databases provide convenient mechanisms for resolving such conflicts.
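
Here's a small sketch of how a database resolves the conflict. Two SQLite connections stand in for two processes, and the BEGIN IMMEDIATE transaction makes a second writer wait for the first instead of silently corrupting the data.

  import sqlite3

  # isolation_level=None means transactions are managed explicitly;
  # timeout tells a blocked writer how long to wait for the lock
  conn_a = sqlite3.connect("shared.db", isolation_level=None, timeout=10)
  conn_b = sqlite3.connect("shared.db", isolation_level=None, timeout=10)
  conn_a.execute("CREATE TABLE IF NOT EXISTS counters (name TEXT PRIMARY KEY, value INTEGER)")
  conn_a.execute("INSERT OR IGNORE INTO counters VALUES ('visits', 0)")

  for conn in (conn_a, conn_b):          # two "processes" updating the same row
      conn.execute("BEGIN IMMEDIATE")    # take the write lock
      conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'visits'")
      conn.execute("COMMIT")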

Aggregations

A database’s index can be applied in tasks other than finding data matching a query. Often databases provide functionality for performing aggregations of the data matching a query or all data. A database might be able to add up, multiply, or summarize data much faster than your code would, and so it could be helpful to push this summarization to the database and increase overall efficiency.

For example, Elasticsearch makes it easy to calculate the frequency of certain search terms within a database. If Elasticsearch didn’t provide this functionality, you’d have to query for all occurrences of the term, count the number of occurrences, and divide by the total number of documents. That may not seem like a problem, but if you’re doing this thousands or millions of times, allowing the database to calculate the frequencies in an optimized, efficient way can save a considerable amount of time.
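
Continuing the Elasticsearch example, the sketch below asks the database itself to do the counting; the index and field names are hypothetical, and size 0 means we want only the aggregated counts, not the matching documents themselves.

  import requests

  query = {
      "size": 0,
      "query": {"match": {"body": "invoice"}},
      "aggs": {"per_sender": {"terms": {"field": "sender.keyword"}}},
  }
  resp = requests.post("http://localhost:9200/emails/_search", json=query)
  for bucket in resp.json()["aggregations"]["per_sender"]["buckets"]:
      print(bucket["key"], bucket["doc_count"])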

Abstracted query language

Querying a database for certain data involves formulating the query in a query language, such as SQL, that the database understands. Although it can be annoying to have to learn a new query language for a new database, these languages offer abstraction from the search algorithm that underlies the query. If your data was stored in files on the file system and you weren’t using a database, every time you wanted to search for data points meeting certain criteria, you’d have to write an algorithm that goes through all your files—all the data points—and checks to see if they meet your criteria. With a database, you don’t have to worry about the specific search algorithm because the database handles it. The query language provides a concise, often readable description of what you’re looking for, and the database finds it for you.
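
The contrast looks something like the sketch below, assuming a hypothetical data/ folder of CSV files on one side and a database already holding the same records on the other.

  import csv, glob, sqlite3

  # Without a database: scan every file and check every row yourself
  matches = []
  for path in glob.glob("data/*.csv"):
      with open(path, newline="") as f:
          for row in csv.DictReader(f):
              if row["country"] == "DE" and float(row["amount"]) > 100:
                  matches.append(row)

  # With a database: describe what you want and let it find the rows
  conn = sqlite3.connect("transactions.db")
  rows = conn.execute(
      "SELECT * FROM transactions WHERE country = 'DE' AND amount > 100"
  ).fetchall()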

9.1.3. How to use databases

Most software tools, Excel included, can interface with databases, but some are better at it than others. The most popular programming languages all have libraries or packages for accessing all the most popular databases. Learning how it's done is a matter of checking the documentation. Generally speaking, you'll have to know how to do the following (a minimal sketch of these four steps appears after the list):

  • Create the database.
  • Load your data into the database.
  • Configure and index the database.
  • Query the data from your statistical software tool.
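
Here's that minimal sketch of the four steps, using Python's built-in sqlite3 module as the database and pandas as the statistical tool; the table and data are invented for illustration.

  import sqlite3
  import pandas as pd

  # 1. Create the database (here, a single file on disk)
  conn = sqlite3.connect("project.db")
  conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

  # 2. Load your data into the database
  rows = [("north", 120.0), ("south", 87.5), ("north", 43.2)]
  conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
  conn.commit()

  # 3. Configure and index the database for the queries you'll run most often
  conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")

  # 4. Query the data from your statistical software tool
  df = pd.read_sql_query("SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
  print(df)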

Each database is a bit different, but once you get used to a couple of them, you’ll see similarities and learn more of them quickly. It seems that today there’s a book in publication for every type of database out there, so it’s a matter of finding it and putting it to use. For NoSQL databases, the offerings can be particularly broad, diverse, and overwhelming, so a book like Making Sense of NoSQL (McCreary and Kelly, Manning, 2013) can help you sort through all the capabilities and options.

9.1.4. When to use databases

If accessing your data from the file system is slow and awkward, it's probably time to try a database. Whether a database will help also depends on how you're accessing your data.

If your code is often searching for specific data—thousands or millions of times—a database can greatly speed up access times and the overall execution time of your code. Sometimes code can become orders of magnitude faster upon switching from file system storage to a database. One of my projects once sped up by 1000 times when I first made the switch.

If you have data on the file system and you mostly proceed through it top to bottom, or if you don’t search often, then a database might not help you much. Databases are best for finding on-demand data that matches specific criteria, and so if you don’t need to query, search, or jump around in the data, the file system might be the best choice.

One reason why I sometimes resist using a database is that it adds some complexity to the software and the project. Having a live, running database adds at least one more moving part to all the things you need to keep your eye on. If you need to transport your data to multiple machines or locations, or if you worry that you don’t have the time to configure, manage, and debug yet another piece of software, then maybe creating a database isn’t the best idea. It certainly requires at least a little maintenance work.

9.2. High-performance computing

High-performance computing (HPC) is the general term applied to cases where there’s a lot of computing to do and you want to do it as fast as possible. In some cases you need a fast computer, and in others you can split up the work and use many computers to tackle the many individual tasks. There is also some middle ground between these two.

9.2.1. Types of HPC

Beyond the question of having one computer or many, you may also consider using computers that are good at certain tasks or compute clusters that are configured and organized in particularly useful ways. I describe a few of the options here.

Supercomputers

A supercomputer is an extremely fast computer. There’s something of a worldwide competition for the fastest supercomputer, a title that carries more prestige than anything else. But the technological challenges for taking the title are not small, and neither are the results.

A new supercomputer can be millions of times faster than a standard personal computer, so it could probably compute your results in a tiny fraction of the time your PC would need. If you have access to one—and not many people have such access—it might be worth considering.

Most universities and large, data-oriented organizations with an IT department don't have a supercomputer, but they do have a powerful computer somewhere. Computing your results 100 or 1,000 times faster might be possible, if only you ask the right people for access.

Computer clusters

A computer cluster is a bunch of computers that are connected with each other, usually over a local network, and configured to work well with each other in performing computing tasks. More so than with a supercomputer, computing tasks may need to be explicitly parallelized or otherwise split into separate tasks so that each computer in the cluster can perform some part of the work.

Depending on the cluster, the various computers and the tasks they’re executing may be able to communicate with each other efficiently, or they may not. Some types of commodity computer clusters (HTCondor is a popular software framework for unifying them) focus less on optimizing the individual machines and more on maximizing the total amount of work the cluster can do. Other cluster types are highly optimized for performance that, in aggregate, resembles a supercomputer.

One shortcoming of a cluster when compared to a supercomputer is usually the available memory. In a supercomputer, there's usually one giant pool of memory, so extremely large and complex structures can be held entirely in memory, which is much, much faster than trying to store those structures on a disk or in a database. In a cluster, each computer has only its own memory, so it might be able to load only a small piece of a complex structure at one time. Writing to and reading from disk can cost time and overall performance, but how much depends highly on the specific calculations being done. Highly parallel calculations are more suitable for clusters.

GPUs

Graphics processing units (GPUs) are circuits designed to process and manipulate video images on a computer screen. Nearly every computing device with a screen has a GPU, whether integrated into the main processor or on a dedicated video card.

The nature of video manipulation has resulted in GPU designs that are very good at performing highly parallelizable calculations. In fact, some GPUs are so good at certain types of calculations that they’re preferred over standard CPUs. For a while, several years ago, researchers were buying and building clusters out of video game systems such as the Sony PlayStation because the computing power available from the systems’ GPUs was greater than that of other computers of similar price.

9.2.2. Benefits of HPC

The one and only benefit of HPC is quite simple to state: speed. HPC can do your computing faster than standard computing, also known as low-performance computing. If you have access—and this is a big if—then HPC is a good alternative to waiting for your PC to calculate all the things that need to be calculated. Cloud computing, which I discuss later in this chapter, makes HPC available to everyone—for a price. The benefit of using a cloud HPC offering—and some pretty powerful machines are available—must be weighed against the monetary cost before you opt in.

9.2.3. How to use HPC

Using a supercomputer, computer cluster, or GPU can be quite similar to using your own personal computer, assuming you know how to make use of multiple cores of your machine. The statistical software tools and languages that you’re using typically have a method to use multiple cores of a personal computer, and these methods usually transfer nicely to HPC.

In the R language, I used to use the multicore package for parallelizing my code and using multiple cores. In Python, I use the multiprocessing package for the same purpose. With each of these, I can specify the number of cores I'd like to use, and each has some notion of sharing objects and information between the processes running on the various cores. Sharing objects and information between processes can be tricky, so, particularly as a beginner, you should shy away from doing it. A purely parallel approach, with no shared state, is much easier on the code and on the brain, if you're able to implement your algorithm that way.
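
Here's a minimal sketch of the purely parallel case using Python's multiprocessing package; the simulate function is a placeholder for whatever independent, expensive task you need to run many times.

  from multiprocessing import Pool

  def simulate(seed):
      # Placeholder for an expensive, independent calculation
      total = 0
      for i in range(1_000_000):
          total += (seed * i) % 7
      return total

  if __name__ == "__main__":
      with Pool(processes=4) as pool:               # choose how many cores to use
          results = pool.map(simulate, range(100))  # one task per input, no shared state
      print(len(results))

The if __name__ == "__main__" guard matters on platforms that start worker processes by launching a fresh interpreter.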

In my experience, submitting my code to a computer cluster was similar to running it on my own machine. I asked my colleagues at the university where I was working what the basic command was for submitting a job to the cluster queue, and then I adapted my code to conform. I could specify the number of computer cores I would like as well as the amount of memory, both of which affected my status in the queue. The cluster at this particular university, as at most, was in high demand, and queuing was both a necessity and a bit of a game.

Sometimes, particularly with GPUs, it’s necessary to modify your code to make explicit use of special hardware capabilities. It’s usually best to consult an expert or wade through the documentation.

9.2.4. When to use HPC

Because HPC is faster than the alternative, the rule is: if you have access, use it. If there’s no cost to you, and you don’t have to change your code much to take advantage of it, the question is a no-brainer. But it isn’t always that simple. If I have the option of using some HPC solution, I think first about the code changes and other legwork I’ll have to do in order to use HPC and then I compare that to the computing time I’ll save. Sometimes, if you’re not in a hurry, HPC isn’t worth it. Other times, it can give you results in an hour that would have otherwise taken a week or longer.

9.3. Cloud services

Cloud services were all the rage a few years ago. They're still very popular, but they're growing more mature and becoming less of a novel technology. It's safe to say, however, that they're here to stay. In short, cloud services rent out, by the hour, capabilities you could otherwise get only by buying and managing a rack of servers yourself.

The largest providers of cloud services are mostly large technology companies whose core business is something else. Companies like Amazon, Google, and Microsoft already had vast amounts of computing and storage resources before they opened them up to the public. But they weren’t always using the resources to their maximum capacity, and so they decided both to rent out excess capacity and to expand their total capacity, in what has turned out to be a series of lucrative business decisions.

9.3.1. Types of cloud services

Services offered are usually roughly equivalent to the functionality of a personal computer, computer cluster, or local network. All are available in geographic regions around the world and are accessible over the internet via standard protocols and, usually, a web browser interface.

Storage

All the major cloud providers offer file-storage services, usually paid per gigabyte per month. There are often also various tiers for storage, and you may pay more if you want faster reading or writing of your files.

Computers

This is probably the most straightforward of cloud offerings: you can pay by the hour for access to a computer with given specifications. You can choose the number of cores, the amount of machine memory, and the size of the hard disk. You can rent a big one, fire it up, and treat it like your supercomputer for a day or a week. Better computers cost more, naturally, but the prices are falling every year.

Databases

As an extension of the storage offered by cloud providers, there are also cloud-native database offerings, meaning you can create and configure databases without ever knowing which computers or disks they're running on.

This machine agnosticism can save some headaches when maintaining your databases, because you don’t have to worry about configuring and maintaining the hardware as well. In addition, the databases can scale almost infinitely; the cloud provider is the one that has to worry about how many machines and how many shards are involved. The price, a literal one, is that you will often be charged for each access to the database—reads and writes—as well as for the volume of data stored.

Web hosting

Web hosting is like renting a computer and then deploying a web server to it, but it comes with a few more bells and whistles. If you want to deploy a website or other web server, cloud services can help you do so without worrying much about the individual computers and machine configurations. They typically offer platforms under which, if you conform to their requirements and standards, your web server will run and scale with usage without much hassle. For example, Amazon Web Services has platforms for deploying web servers built with Python's Django framework or with Node.js.

9.3.2. Benefits of cloud services

There are two major benefits of using cloud services, as compared to using your own resources, particularly if you'd have to purchase the local resources. First, cloud resources require zero commitment: you pay only for what you use, which can save tons of money if you're not sure yet how much capacity you'll need. Second, cloud services have a far greater capacity than anything you might buy yourself, unless you're a Fortune 500 corporation. If you're not yet sure about the size of your project, cloud services give you extreme flexibility in the amount of storage and computing power, among other things, that you can access at a moment's notice.

9.3.3. How to use cloud services

With the incredible variety of cloud services available, there are almost unlimited ways you might combine them. The first step is always to create an account with the provider and then to try out the basic level of the service, which is usually offered for free. If you find it useful, then scaling up is a matter of using it more and paying the bill. It's often worth comparing similar services from different providers before diving in.
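
As one small example of what "using it more" looks like in practice, here's a sketch of pushing a results file to cloud storage with Amazon's boto3 library; the bucket name is hypothetical, and it assumes you've already created an account and configured your credentials. Comparable client libraries exist for the other major providers.

  import boto3

  s3 = boto3.client("s3")

  # Upload a local file to a bucket, then pull it back down elsewhere
  s3.upload_file("results.csv", "my-project-bucket", "experiments/results.csv")
  s3.download_file("my-project-bucket", "experiments/results.csv", "results_copy.csv")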

9.3.4. When to use cloud services

If you don’t own enough resources to adequately address your data science needs, it’s worth considering a cloud service. If you’re working with an organization that has its own resources, it may be cheaper to exhaust the local options before paying for the cloud. On the other hand, even if you have considerable resources locally, the cloud certainly has more; if you continually run into local resource limits, remember that the cloud provides virtually limitless capacity.

9.4. Big data technologies

If, in the analytic software industry, there was a phrase more often spoken than cloud computing in the last 10 years, it was big data. It’s a shame that the phrase and the technologies it describes were understood far less often than the phrase was spoken.

I’m going to take some liberties in talking about big data because I don’t feel that it ever possessed anything resembling a concrete definition. Everyone in the software industry from developers to salespeople used the phrase to pump up the impression of the software they were building or peddling, and not all the usages agreed with each other. I’m going to describe here not what I think every single person means when they use the phrase big data but what I mean when I say it. I think my own meaning is important because I tried to distill the concept down to the core ideas and technologies that were somewhat revolutionary when they came to market in the mid-2000s.

I don’t use big data to mean “lots of data.” Such a usage is doomed to become obsolete, and quickly, as we argue about what lots means. In my personal experience, 10 years ago 100 gigabytes was a lot of data; now 100 terabytes is routine. The point is that the word big will always be relative, and so any definition I concoct for big data must likewise be relative.

Therefore, my own personal definition of big data is based on technologies, not necessarily the size of data sets: big data is a set of software tools and techniques that was designed to address cases in which data transfer was the limiting factor in computational tasks. Whenever the data set is too big to move, in some sense, and special software is used to avoid the necessity of such data movement, the phrase big data is applicable.

Perhaps an example can illustrate the concept best. Google, arguably one of the first forces behind big data technologies, processes a tremendous amount of data on a regular basis in order to support its main business: a search engine that’s supposed to find anything on the internet, which is obviously a vast place, and systems that place advertisements intelligently onto web pages. Maintaining the best search results involves analyzing the number and strength of links from all pages on the internet to all other pages. I don’t know how big this data is right now, but I know it’s not small. Certainly, the data is spread across many servers, probably in many different geographic locations. Analyzing all the data in order to generate a basis for internet-wide search results is a task of complex coordination among all the data servers and data centers, involving a huge amount of data transfer.

Google, being smart, realized that data transfer was a major issue that was slowing down its calculations considerably. It figured minimizing such transfer was probably a good idea. How to minimize it, however, was a different question. The following explanation of what Google did, and most of the preceding description as well, is what I’ve inferred from Google’s release, years ago, of information regarding its MapReduce technology and other technologies it inspired, such as Hadoop. I don’t know what happened at Google, and I can’t claim to have read all papers and articles that have been published on the topic, but I do think the following hypothetical explanation is enlightening for anyone wondering how big data technologies work. It definitely would have been enlightening to me a few years ago.

In retrospect, what I would have done, had I worked at Google when it realized data transfer was killing analytic efficiency, was design a three-stage algorithm with the goal of minimizing data transfer while still performing all the calculations I wanted to perform.

The first step in the algorithm was to perform an initial calculation on each of the data points on the servers local to each of the databases. This local calculation resulted in, among other things, an attribute that indicated to which group of data points this particular data point belonged. In online search terms, this attribute corresponded to the corner of the internet in which this data point, probably a web page, would be found. Web pages tend to link to other pages within the same corner and not as much to pages in other corners of the internet. For each data point, once attributes specifying the corner(s) of the internet were determined, Google’s algorithm proceeded to the second step. Within the MapReduce framework, this is the map step.

Step two surveyed the new attributes for the data points and minimized the transfer of data from one geographical place to another. If most of Corner X’s data was on Server Y, step two would send all Corner X data to Server Y, so only a fraction of Corner X data would need to be transferred at all; most of the data was already there. This step is colloquially referred to as the shuffle step and, if done cleverly, provides one of the major advantages of using the most popular big data technologies.

Step three, then, is to take all the data points with a common attribute and analyze them all at once, generating some common results and/or some individual results that take into account the other data with the same attribute. This step analyzes all the web pages in Corner X and gives results not only about Corner X but also about all the pages in Corner X and how they relate to each other. This is called the reduce step.

The general summary of the three steps is this: some calculations are done locally on each data point, and the data is mapped to an attribute; for each attribute, all data points are collected, while data transfer (shuffling) is minimized; finally, all data points for each attribute are reduced to a set of useful results. Conceptually, this is the MapReduce paradigm, which is the basis for many other big data technologies, though it certainly doesn't encompass all of them.
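
The classic illustration of the paradigm is counting words, and a single-machine sketch in plain Python makes the three steps explicit. In a real system the mapped pairs live on many servers, and the shuffle is the part that's engineered to move as little data as possible.

  from collections import defaultdict

  pages = [("pageA", "data science data"), ("pageB", "science of data")]  # toy input

  # Map: a local calculation on each record, emitting (key, value) pairs
  mapped = []
  for page_id, text in pages:
      for word in text.split():
          mapped.append((word, 1))

  # Shuffle: collect all values belonging to the same key
  grouped = defaultdict(list)
  for key, value in mapped:
      grouped[key].append(value)

  # Reduce: combine each key's values into a useful result
  counts = {key: sum(values) for key, values in grouped.items()}
  print(counts)   # {'data': 3, 'science': 2, 'of': 1}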

9.4.1. Types of big data technologies

Hadoop is an open-source implementation of the MapReduce paradigm. It has been very popular but seems to have lost steam in the last couple of years. Hadoop was originally a tool for batch processing, and since it matured, other big data software tools that claim to operate in real time have begun to supplant it. They all share the notion that too much data transfer is detrimental to the process, so data-local computation should be favored whenever possible.

Some big data concepts have led to the development of databases that make explicit use of the MapReduce paradigm and its implementations like Hadoop. The Apache Software Foundation’s open-source projects HBase and Hive, among others, rely explicitly on Hadoop to power databases that are designed to function well at extremely large scales, whatever that means to you in whatever year you’re reading this.

9.4.2. Benefits of big data technologies

Big data technologies are designed not to move data around much. This saves time and money when the data sets are on the very large scales for which the technologies were designed.

9.4.3. How to use big data technologies

This varies greatly depending on the technology. But they generally mimic the non–big data versions, at least at small scales. You can get started with a big data database much as you would a standard database, but perhaps with a bit more configuration.

Other technologies, Hadoop in particular, require a little more effort. Hadoop and other implementations of MapReduce require you to specify mappers for the map step and reducers for the reduce step; the shuffle in between is handled by the framework. Experienced software developers won't have a problem coding basic versions of these, but some tricky peculiarities in implementation and configuration might cause problems, so take some care.
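
With Hadoop Streaming, for example, a mapper and a reducer can be ordinary scripts that read from standard input and write tab-separated key-value pairs; the word-count pair below is a sketch of the shape they take. Hadoop sorts the mapper output by key before it reaches the reducer.

  # mapper.py
  import sys
  for line in sys.stdin:
      for word in line.split():
          print(word + "\t1")

  # reducer.py (receives the mapper output, sorted by key)
  import sys
  current_key, total = None, 0
  for line in sys.stdin:
      key, value = line.rstrip("\n").split("\t")
      if key != current_key:
          if current_key is not None:
              print(current_key + "\t" + str(total))
          current_key, total = key, 0
      total += int(value)
  if current_key is not None:
      print(current_key + "\t" + str(total))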

9.4.4. When to use big data technologies

Whenever computational tasks are data-transfer bound, big data can give you a boost in efficiency. But more so than the other technologies described in this chapter, big data software takes some effort to get running with your software. You should make the leap only if you have the time and resources to fiddle with the software and its configurations and if you’re nearly certain that you’ll reap considerable benefits from it.

9.5. Anything as a service

This obviously isn’t a real thing, but I often feel that it is. It’s sometimes hard to read software descriptions without coming across the phrase software as a service (SaaS), platform as a service (PaaS), infrastructure as a service (IaaS), or any other something as a service. Though I make fun of it, it’s very much a boon to the software industry that so many things are offered as a service. The purpose of services for hire is to replace the things that we would do ourselves, and it’s our hope that the service provided is better or more efficient than what we would have done ourselves.

I’m a big fan of letting other people do any standard task that I don’t want to do myself, in real life as well as in software. An increasing number of such tasks are available as a service in today’s internet-connected economy, and I see no reason to expect that trend to slow down any time soon. Though I won’t discuss any specific technologies in this section, I emphasize that you may be able to simplify greatly your software development and maintenance tasks by hiring out some of its more common aspects. From hardware maintenance to data management, application deployment, software interoperability, and even machine learning, it’s possible to let someone else handle some of the less-concerning aspects of whatever you’re building. The caveat is that you should trust those you hire to do a good job, and that trust may take some effort to build. A simple online search can provide some worthwhile candidates for offloading some of your work.

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, try these:

1. What are three supplementary (not strictly statistical) software products that might be used during this project, and why?

2. Suppose that FMI's internal relational database is hosted on a single server, which is backed up every night to a server at an offsite location. Give a reason why this could be a good architecture and one reason why it might be bad.

Summary

  • Some technologies don’t fall under the category of statistical software, but they’re useful in making statistical software faster, scalable, and more efficient.
  • Well-configured databases, high-performance computing, cloud services, and big data technologies all have their place in the industry of analytical software, and each has its own advantages and disadvantages.
  • When deciding whether to begin using any of these auxiliary technologies, it’s usually best to ask the question: are there any gross inefficiencies or limitations in my current software technologies?
  • It takes time and effort to migrate to a new technology, but it can be worth it if you have a compelling reason.
  • There’s been a lot of hype surrounding cloud services and big data technologies; they can be extremely useful but not in every project.