Building a machine learning application

Machine learning applications, especially those focused on classification, usually follow the same high-level workflow. The workflow comprises two phases: training the classifier and classifying new instances. Both phases share common steps, as you can see in the following diagram:

First, we use a set of Training data, select a representative subset as the training set, preprocess missing data, and extract features. A selected supervised learning algorithm is used to train a model, which is deployed in the second phase. The second phase puts a new data instance through the same Pre-processing and Feature extraction procedure and applies the learned model to obtain the instance label. If you are able to collect new labeled data, periodically rerun the learning phase to retrain the model, and replace the old one with the retrained one in the classification phase.
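The two phases can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the nearest-centroid model, the mean imputation of missing values, and the names `preprocess`, `extract_features`, `train`, and `classify` are all assumptions chosen for brevity. The point it shows is that the classification phase reuses exactly the same preprocessing and feature extraction as the training phase.

```python
from statistics import mean

def preprocess(rows, fill_values=None):
    """Replace missing values (None) with per-column fill values.

    In training, the fill values are computed from the data (column means).
    In classification, the values learned during training are reused.
    """
    if fill_values is None:
        columns = list(zip(*rows))
        fill_values = [mean(v for v in col if v is not None) for col in columns]
    cleaned = [[fill_values[i] if v is None else v for i, v in enumerate(row)]
               for row in rows]
    return cleaned, fill_values

def extract_features(row):
    """Hypothetical feature extraction: the raw values plus their sum."""
    return list(row) + [sum(row)]

def train(rows, labels):
    """Phase 1: preprocess, extract features, fit a nearest-centroid model."""
    cleaned, fill_values = preprocess(rows)
    features = [extract_features(r) for r in cleaned]
    centroids = {}
    for label in set(labels):
        members = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return {"fill_values": fill_values, "centroids": centroids}

def classify(model, row):
    """Phase 2: same preprocessing and features, then apply the model."""
    cleaned, _ = preprocess([row], model["fill_values"])
    features = extract_features(cleaned[0])

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model["centroids"],
               key=lambda label: squared_distance(model["centroids"][label]))

# Toy training set with missing values (None); labels are made up.
training_rows = [[1.0, 2.0], [1.5, None], [8.0, 9.0], [None, 8.5]]
training_labels = ["low", "low", "high", "high"]

model = train(training_rows, training_labels)
print(classify(model, [1.2, 1.9]))
print(classify(model, [7.5, None]))
```

Retraining then amounts to calling `train` again on the enlarged labeled set and swapping the returned `model` into the classification phase.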

Traditional machine learning architecture

Structured data, such as transactional, customer, analytical, and market data, usually resides within a local relational database. With a query language such as SQL, we can retrieve the data needed for processing, as shown in the workflow in the previous diagram. Usually, all of the data fits in memory and can be processed further with a machine learning library such as Weka, Java-ML, or MALLET.
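As a rough sketch of this setup, the snippet below queries a relational database with SQL and loads the whole result into memory as feature and label lists. An in-memory SQLite database stands in for the local relational database, and the `transactions` table and its columns are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the local relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, is_fraud INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 20.0, 0), (1, 35.5, 0), (2, 990.0, 1), (3, 12.25, 0)],
)

# Query only the columns needed for learning and pull everything into memory.
rows = conn.execute(
    "SELECT amount, is_fraud FROM transactions ORDER BY customer_id, amount"
).fetchall()
conn.close()

# Split the result into a feature matrix and a label vector.
features = [[amount] for amount, _ in rows]
labels = [is_fraud for _, is_fraud in rows]
print(features, labels)
```

From here, `features` and `labels` could be handed to any single-machine learning library.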

A common practice in architecture design is to create data pipelines, in which the different steps of the workflow are split into separate stages. For instance, in order to create a client data record, we might have to scrape the data from different data sources. The record can then be saved in an intermediate database for further processing.
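A pipeline of this kind might look like the following sketch. Everything here is illustrative: the two dictionaries stand in for separate data sources, an in-memory SQLite table plays the role of the intermediate database, and the function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical data sources standing in for separate systems.
crm_source = {42: {"name": "Ada"}}
billing_source = {42: {"balance": 120.5}}

def build_client_record(client_id):
    """Stage 1: collect one client's data from every source into one record."""
    record = {"client_id": client_id}
    record.update(crm_source.get(client_id, {}))
    record.update(billing_source.get(client_id, {}))
    return record

def save_record(conn, record):
    """Stage 2: persist the record in the intermediate database."""
    conn.execute(
        "INSERT OR REPLACE INTO client_records VALUES (?, ?)",
        (record["client_id"], json.dumps(record)),
    )

def load_records(conn):
    """Stage 3: a later step reads the records back for further processing."""
    cursor = conn.execute(
        "SELECT payload FROM client_records ORDER BY client_id"
    )
    return [json.loads(payload) for (payload,) in cursor]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client_records (client_id INTEGER PRIMARY KEY, payload TEXT)"
)
save_record(conn, build_client_record(42))
records = load_records(conn)
print(records)
```

Because each stage only talks to the intermediate store, stages can be rerun or replaced independently.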

To understand how the high-level aspects of big data architecture differ, let's first clarify when data is considered big.

Dealing with big data

Big data existed long before the phrase was invented. For instance, banks and stock exchanges have been processing billions of transactions daily for years, and airline companies have worldwide real-time infrastructure for the operational management of passenger booking. So what is big data really? Doug Laney (2001) suggested that big data is defined by three Vs: volume, velocity, and variety. Therefore, to answer the question of whether your data is big, we can translate this into the following three subquestions:

  • Volume: Can you store your data in memory?
  • Velocity: Can you process new incoming data with a single machine?
  • Variety: Is your data from a single source?

If you answered yes to all the questions, then your data is probably not big. Do not worry; you have just simplified your application architecture.

If your answer to all the questions was no, then your data is big! However, if you have mixed answers, then it's complicated. Some may argue that one V matters most, while others may favor a different one. From the machine learning point of view, there is a fundamental difference between implementing an algorithm that processes data in memory and one that processes data from distributed storage. Therefore, a rule of thumb is as follows: if you cannot store your data in memory, then you should look into a big data machine learning library.
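The three questions and the rule of thumb can be written down as a small decision function. This is only a sketch of the reasoning above; the function name `assess_data` and its return format are assumptions made for illustration.

```python
def assess_data(fits_in_memory, single_machine_keeps_up, single_source):
    """Apply the three V questions and the memory rule of thumb.

    Each argument is the yes/no answer to one question:
    volume, velocity, and variety, in that order.
    """
    answers = [fits_in_memory, single_machine_keeps_up, single_source]
    if all(answers):
        verdict = "not big"
    elif not any(answers):
        verdict = "big"
    else:
        verdict = "mixed"
    # Rule of thumb: only the volume answer decides the choice of library.
    needs_big_data_library = not fits_in_memory
    return verdict, needs_big_data_library

print(assess_data(True, True, True))
print(assess_data(False, True, True))
```

Note that a mixed verdict can still point to a single-machine library, as long as the data fits in memory.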

The exact answer depends on the problem that you are trying to solve. If you're starting a new project, I'd suggest you start off with a single-machine library and prototype your algorithm, possibly with a subset of your data if the entire dataset does not fit into memory. Once you've established good initial results, consider moving to something more heavy-duty such as Mahout or Spark.
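One way to obtain such a subset without loading everything is reservoir sampling, which draws a uniform random sample of fixed size in a single pass over the data. The text does not prescribe a sampling method, so treat this as one possible approach; the function name and the `seed` parameter are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of up to k items in one pass.

    Only k items are held in memory at any time, so the stream itself
    can be far larger than the available memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# A range stands in for a dataset that is read record by record.
subset = reservoir_sample(range(1_000_000), k=1000, seed=0)
print(len(subset))
```

The resulting `subset` is small enough to prototype against with a single-machine library.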

Big data application architecture

Big data, such as documents, weblogs, social networks, and sensor data, is typically stored in a NoSQL database, such as MongoDB, or a distributed filesystem, such as HDFS. If we deal with structured data, we can deploy database capabilities using systems such as Cassandra or HBase, the latter built atop Hadoop. Data processing follows the MapReduce paradigm, which breaks data processing problems into smaller subproblems and distributes the tasks across processing nodes. Finally, machine learning models are trained with libraries such as Mahout and Spark.
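To make the MapReduce paradigm concrete, here is a single-process sketch of a word count. It only imitates the three stages (map, shuffle, reduce) in plain Python; a real framework such as Hadoop would run the map and reduce tasks on different nodes. The function names are illustrative and not part of any framework's API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]

# Each document could be mapped on a different node; here we loop instead.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only, which is what allows the work to be distributed.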


