Scoring new data

Once a model is developed, whether it is a single model or a combination of models, it must be applied to new data to make predictions. In data mining, the act of using a model on new data is called deployment. There are many options for deploying Modeler models. In these final demonstrations, we will focus on how to score new records. It is important to note that scoring is only part of the Deployment phase, as deployment can be quite complex and can even involve migrating your model to a different software platform.

For instance, the IBM product Collaboration and Deployment Services (C&DS) is such a platform. Cloud-based deployment in environments such as Amazon Web Services (AWS) is also possible. It is quite common to translate models (especially decision tree models) into another language such as SQL. There is even a special language, the Predictive Model Markup Language (PMML), that is specifically designed for migrating a model from one environment to another. Many companies, including IBM, belong to a group that supports PMML to facilitate this kind of collaboration between systems. All of this is under consideration during deployment because final deployment might involve automatically pulling new data nightly from a database and automatically populating an employee's screen to support a sales call. These complexities are beyond the scope of our Essentials guide, but they serve to put scoring into a greater context.
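To give a feel for what such a translation looks like, here is a minimal, hypothetical PMML fragment describing a one-split decision tree. The field names, split value, and scores are invented for illustration; they are not taken from our CHAID model.

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary numberOfFields="2">
    <DataField name="Hours_per_week" optype="continuous" dataType="double"/>
    <DataField name="Income_category" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="IncomeTree" functionName="classification">
    <MiningSchema>
      <MiningField name="Hours_per_week"/>
      <MiningField name="Income_category" usageType="target"/>
    </MiningSchema>
    <!-- Root predicts "Low"; one child overrides with "High" above 40 hours -->
    <Node score="Low">
      <True/>
      <Node score="High">
        <SimplePredicate field="Hours_per_week" operator="greaterThan" value="40"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Because the file is plain XML, any PMML-aware scoring engine, in any environment, can read it and reproduce the tree's predictions.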

Scoring is simply using our model (our model diamond) to calculate the model score on a record that was not present when the model was built. You do not need the modeling node for this part of the process; all you need is the generated model. It is helpful to reflect on the following fact: you can actually score a single record. You cannot build a model on a single record, but you can score one, because scoring is simply plugging new values into a formula. To illustrate this process, we have created a typical deployment stream. Let's open and examine this stream:

  1. Open the Scoring stream:

This stream contains only the essential nodes: a source node for the data, Select and Distinct nodes to clean the data, Reclassify nodes to modify existing fields, Derive nodes to create new fields, a Type node to instantiate the data, and the second generated CHAID model (we will pretend this is the model we decided to keep). If we would like to use Modeler to make predictions on new data, this stream is adequate.
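As noted above, scoring a single record is just plugging its values into the model's rules. A minimal sketch in Python, where the rule set is an invented stand-in for a generated tree model (the field names, split values, and confidences are hypothetical):

```python
# Hypothetical sketch: a tiny rule set standing in for a generated tree model.
# Field names, split values, and confidences are invented for illustration.

def score(record):
    """Return (prediction, confidence) for a single record."""
    if record["Hours_per_week"] > 40:
        if record["Age"] > 35:
            return ("High", 0.82)    # illustrative leaf: majority class, proportion
        return ("Medium", 0.64)
    return ("Low", 0.71)

# A model cannot be *built* from one record, but one record can be scored:
prediction, confidence = score({"Hours_per_week": 45, "Age": 29})
print(prediction, confidence)   # Medium 0.64
```

No training data is needed at this point; the rules alone determine the score, which is why scoring is so much cheaper than model building.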

Suppose that we are given new data and we want to make predictions on the Income_category field for each person. This new data has the same structure as the previous file. Of course, there is no Income_category field in this data file because that information is precisely what we want to predict.

  2. Run the Table node at the beginning of the stream:

Notice that this new data file contains only 1535 cases. Also, notice that there is no Income_category field. At this point, we are ready to make predictions for the new data. There is nothing special to do other than to run the data through the generated CHAID model. We'll look at the predictions with a Table node.

  3. Run the Table node at the end of the stream:

New fields have been added to the data, including the model prediction ($R-Income_category), and the model confidence ($RC-Income_category). Model deployment can be as simple as this with Modeler. This type of deployment is often called batch scoring because we are making predictions for a group of records, rather than scoring individual records one at a time.
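Conceptually, batch scoring is just a loop over the new records that appends the prediction and confidence fields. A minimal sketch in Python, where the score function is an invented stand-in for the generated CHAID model (only the $R-/$RC- field-naming convention is taken from Modeler):

```python
# Hypothetical sketch of batch scoring: apply a scoring function to every
# record and append Modeler-style prediction and confidence fields.

def score(record):
    # Invented stand-in rule for the generated CHAID model.
    if record["Hours_per_week"] > 40:
        return ("High", 0.82)
    return ("Low", 0.71)

def batch_score(records):
    scored = []
    for rec in records:
        prediction, confidence = score(rec)
        out = dict(rec)                            # keep the original fields
        out["$R-Income_category"] = prediction     # model prediction
        out["$RC-Income_category"] = confidence    # model confidence
        scored.append(out)
    return scored

new_data = [{"Hours_per_week": 45}, {"Hours_per_week": 30}]
for row in batch_score(new_data):
    print(row)
```

Scoring individual records one at a time (real-time scoring) uses the same logic; only the delivery mechanism differs.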

The trick to scoring is to make sure that all the fields are named exactly as they were when the model was created. The generated model, and any nodes that create additional fields, look for fields with exactly the names they had initially.
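A quick pre-flight check of field names can catch such mismatches before scoring. A hypothetical sketch, where the field and target names are invented for illustration:

```python
# Hypothetical sketch: verify that the scoring data carries the field names
# the generated model expects (the target field is legitimately absent).

def check_fields(training_fields, scoring_fields, target="Income_category"):
    """Return (missing, extra) field-name sets relative to the model's inputs."""
    expected = set(training_fields) - {target}
    missing = expected - set(scoring_fields)
    extra = set(scoring_fields) - expected
    return missing, extra

missing, extra = check_fields(
    ["Age", "Hours_per_week", "Income_category"],
    ["Age", "Hours_per_wk"],          # a renamed field will be flagged
)
print("missing:", missing)   # {'Hours_per_week'}
print("extra:", extra)       # {'Hours_per_wk'}
```

Catching a renamed or dropped field this way is far easier than diagnosing a stream that runs but silently produces the wrong predictions.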
