Chapter 10

Building a Predictive Model

IN THIS CHAPTER

Defining your business objective

Preparing your data

Developing, testing, and evaluating the model

Deploying and maintaining the model

Some customers will churn. Some transactions are fraudulent. Some investments will be a loss. Some employees will leave. But the burning question on everyone's mind is: Which ones?

Building a predictive analytics model can help your business answer such questions. The model will look at the data you have about your customers, for example, and tell you the probability of a customer churning. But such questions merely touch upon the surface of what predictive analytics can do; the potential applications of this fascinating discipline are endless.

As mentioned earlier in the book, a model is a mathematical representation of a real-world phenomenon we're interested in making sense of. For example, you can use the data you have to build a model that mimics the stock market in which your firm is actively engaged in all sorts of trades — and then your job is to sort out the winning trades from the losing ones. In such a case, your model helps you select a strategy to make money by trading on the stock market.

Building a predictive analytics model becomes vital when the consequences of not acting — or of making the wrong decision — would be costly. Fraudulent transactions, for example, can drain resources such as time, money, and personnel; they can break the financial health of a company. Using predictive analytics to detect and counteract fraud then becomes part of your risk management strategy — and again, this function only scratches the surface of the potential value that a predictive model can bring to your company.

Getting Started

A predictive model combines mathematics and data to solve a business problem. The goal is to train the model to learn and create a mapping function between input data and the desired output or a target variable. It's a three-step process:

  1. Clearly define the business problem you're trying to solve.
  2. Collect all the historical data you can get your hands on.

    That data will need some preprocessing to be used as part of the training and test data for your model. There are many algorithms and techniques for modeling. All commercial and open-source tools come equipped with the most common ones (see Chapter 6 and Chapter 7). You need to choose which method (or combination of methods) best fits your business need.

  3. Evaluate the model and measure its accuracy.

Predictive analytics aims at finding answers to business questions by examining data and presenting a range of possible outcomes, ranked with a score for each outcome. It helps organizations predict future outcomes and trends with confidence. It improves an organization's ability to plan, adopt, and execute strategies that improve its competitive edge. After your organization has put a predictive model in place, it then has the responsibility to act upon those findings.

Creating a successful predictive model involves these general steps:

  1. Define the business objectives. (See Chapter 8.)
  2. Prepare the data for use in the model. (See Chapter 9.)
  3. Apply statistical and/or data-mining algorithms to the data. (See Chapters 6 and 7.)
  4. Build, test, deploy, and maintain the model. (This chapter delves into this stage.)

Building a predictive model is the essence of predictive analytics. The specific model-building stage of the whole process is when you're ready to run some mathematical algorithms to see what interesting patterns and relationships you can find in the data. As you do so, keep these questions in mind:

  • How can you answer business questions?
  • What can you contribute to business decision-making?
  • How can you increase return on investment?

For you to be at this stage, you must already have convinced your management of the worthiness of predictive analytics. You've sealed the business case; met with management and stakeholders to ask all the relevant questions; and chosen to address one or a few related business questions. Then you've gathered a talented group of people: data scientists, IT personnel, and business experts, and formed your data-analytics group.

Of course, the analytics team (which may consist of one person running from cubicle to office, verifying and asking questions at each step) has already done mighty deeds of preparation:

  • You've identified the data sources that you'll use to run the model. (See Chapters 3 and 9.)
  • You've performed any required data preprocessing (such as cleansing and integration), and created any derived data that you expect to have predictive power. (See Chapters 9 and 15.)
  • You've selected and identified all the variables. (See Chapter 9.)

At last, it's time to run the mathematical algorithm and see what you can learn from it.

The essential steps of the preparatory process leading up to this point are making the business case and preparing your data. Now for the fun part: running those specialized algorithms and seeing what you can learn.

We address a few practical questions here:

  • What is the process of running the algorithm?
  • How do I go about it?
  • Which algorithm do I choose?
  • How do I test my model?
  • What's next?

Building a predictive analytics model starts with clearly defining the business objectives you want it to attain, and then identifying and preparing the data that will be used to train and test the model.

Defining your business objectives

A predictive analytics model aims at solving a business problem or accomplishing a desired business outcome. Those business objectives become the model's goals. Knowing those goals ensures the business value of the model you build — which isn't to be confused with the accuracy of the model. After all, you could hypothetically build an accurate model to solve an imaginary business problem — but it's a whole other task to build a model that contributes to attaining business goals in the real world.

Defining the problem or the business need you want your model to address is a vital first step in this process. A relevant and realistic definition of the problem helps ensure that the model you build, once it's put to use, will add value to your business.

  • Which business problems would your stakeholders like to solve? Here are some immediately useful examples:
    • Classify transactions into legitimate versus fraudulent.
    • Identify the customers who are most likely to respond to a marketing campaign.
    • Identify what products to recommend to your customers.
    • Solve operational issues such as the optimal scheduling of employees' workdays or hours.
    • Cluster patients according to the stages of their disease.
    • Identify individualized treatments for patients.
    • Pick the next best-performing stock for today, the quarter, or the year.

The preceding list can be addressed through the creation of supervised models that predict specific outcomes as they pertain to the business needs.

Clearly defining the business problem you're trying to solve will help you verify the result or the outcome produced by your model. It will give business stakeholders and data scientists a clear understanding of what they are after, allowing them to better evaluate the quality of the solution.

In addition to defining the business objectives and the overall vision for your predictive analytics model, you need to define the scope of the overall project. Here are some general questions that must be answered at this stage:

  • If you develop your predictive model as a solution, there's a range of questions to address:
    • What would stakeholders do with that solution?
    • How would they use the model?
    • What is the current status with no model in place?
    • How is this business problem handled today?
    • What are the consequences of predicting the wrong solution?
    • What is the cost of a false positive?
    • How will the model be deployed?
    • Who is going to use the model?
    • How will the output of the model be represented?

Preparing your data

When you've defined the objectives of the model, the next step is to identify and prepare the data you'll use to build your model. (Chapter 9 addresses this step in detail.) This section touches upon the most important activities. The general sequence of steps looks like this:

  1. Identify your data sources.

    Data could be in different formats or reside in various locations.

  2. Identify how you will access that data.

    Sometimes you may need to acquire third-party data, or data owned by a different division of your organization.

  3. Consider which variables to include in your analysis.

    tip One standard approach is to start off with a wide range of variables and eliminate the ones that offer no predictive value for the model.

  4. Determine whether to use derived variables (See Chapter 9).

    In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock prices) has a greater direct impact on the model than the raw variable it's derived from.

  5. Explore the quality of your data, seeking to understand both its state and limitations.

    The accuracy of the model's predictions is directly related to the variables you select and the quality of your data. You would want to answer some data-specific questions at this point:

    • Is the data complete?
    • Does it have any outliers?
    • Does the data need cleansing?
    • Do you need to fill in missing values, keep them as they are, or eliminate them altogether?
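
These data-quality checks can be scripted before any modeling begins. Here's a minimal pure-Python sketch — the records, field names, and outlier threshold are invented for illustration — that counts missing values and flags values that sit far from the mean:

```python
import statistics

# Hypothetical customer records; None marks a missing value.
records = [
    {"tenure_months": 12, "monthly_spend": 80.0},
    {"tenure_months": 24, "monthly_spend": 75.0},
    {"tenure_months": None, "monthly_spend": 78.0},
    {"tenure_months": 18, "monthly_spend": 900.0},  # suspiciously large
    {"tenure_months": 30, "monthly_spend": 82.0},
]

def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] = counts.get(field, 0) + 1
    return counts

def outliers(rows, field, threshold=1.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    values = [r[field] for r in rows if r[field] is not None]
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

print(missing_counts(records))            # {'tenure_months': 1}
print(outliers(records, "monthly_spend"))  # [900.0]
```

Whether to drop, keep, or impute the flagged records is exactly the judgment call the questions above are asking you to make.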

Understanding your data and its properties can help you choose the algorithm that will be most useful in building your model. For example:

  • Regression algorithms can be used to analyze time-series data.
  • Classification algorithms can be used to analyze discrete data.
  • Association algorithms can be used for data with correlated attributes.

Individual algorithms and predictive techniques have different weaknesses and strengths. Most important, the accuracy of the model relies on having both a great quantity and quality of data. Your data should have a sufficient number of records to provide statistically meaningful results.

Gathering relevant data (preferably many records over a long time period), preprocessing it, and extracting the features with the most predictive value is where you'll spend the majority of your time. But you still have to choose your algorithm wisely; it should be well suited to the business problem.

Data preparation is specific to the project you're working on and the algorithm you choose to employ. Depending on the project’s requirements, you will prepare your data accordingly and feed it to the algorithm as you build your model to address the business needs.

The dataset used to train and test the model must contain relevant business information to answer the problem you're trying to solve. If your goal is (for example) to determine which customer is likely to churn, then the dataset you choose must contain information about customers who have churned in the past in addition to customers who have not.

remember Some models created to mine data and make sense of its underlying relationships — for example, those built with clustering algorithms — needn't have a particular end result in mind.

Two problems arise when dealing with data as you're building your model: underfitting and overfitting.

Underfitting

Underfitting is when your model can't detect any relationships in your data. This is usually an indication that essential variables — those with predictive power — weren't included in your analysis.

If the variables used in your model don’t have high predictive power, then try adding new domain-specific variables and re-run your model. The end goal is to improve the performance of the model on the training data.

Another issue to watch for is seasonality: when your data has a seasonal pattern, failing to analyze multiple seasons can get you into trouble. For example, a stock analysis that includes only data from a bull market (where overall stock prices are going up) doesn't account for crises or bubbles that can bring major corrections to the overall performance of stocks. Failing to include data that spans both bull and bear markets (when overall stock prices are falling) keeps the model from producing the best possible portfolio selection.

Overfitting

Overfitting occurs when your model latches onto patterns that have no predictive power and are specific only to the dataset you're analyzing. Noise — random variations in the dataset — can find its way into the model, such that running the model on a different dataset produces a major drop in the model's predictive performance and accuracy. The accompanying sidebar provides an example.

If your model performs well on a particular dataset but underperforms when you test it on a different dataset, suspect overfitting. (For more about overfitting, see Chapter 15.)
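
A quick way to see that symptom in practice is to compare a model's accuracy on its own training data against its accuracy on held-out data. The sketch below uses invented data and a deliberately over-flexible one-nearest-neighbor "memorizer" to show the telltale gap:

```python
import random

random.seed(42)

# Hypothetical data: one numeric feature, a label mostly (but not perfectly)
# determined by whether the feature exceeds 0.5. The 20% label noise is
# exactly what an overfit model ends up memorizing.
def make_data(n):
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:  # inject noise
            label = 1 - label
        data.append((x, label))
    return data

train, test = make_data(200), make_data(200)

def predict_1nn(train_set, x):
    """Memorize the training set: answer with the label of the closest point."""
    return min(train_set, key=lambda p: abs(p[0] - x))[1]

def accuracy(train_set, dataset):
    hits = sum(predict_1nn(train_set, x) == y for x, y in dataset)
    return hits / len(dataset)

train_acc = accuracy(train, train)  # the model has seen (memorized) these points
test_acc = accuracy(train, test)    # unseen data exposes the overfitting
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

The memorizer scores perfectly on the data it has seen and noticeably worse on fresh data; that gap, not the training score, is the warning sign.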

Choosing an algorithm

Various statistical, data-mining, and machine-learning algorithms are available for use in your model. You're in a better position to select an algorithm after you've defined the objectives of your model and selected the data you'll work on. Some of these algorithms were developed to solve specific business problems, enhance existing algorithms, or provide new capabilities — which may make some of them more appropriate for your purposes than others. You can choose from a range of algorithms to address business concerns such as the following:

  • For customer segmentation and/or community detection in the social sphere, for example, you'd need clustering algorithms.
  • For customer retention or to develop a recommender system, you'd use classification algorithms.
  • For credit scoring or predicting the next outcome of time-driven events, you'd use a regression algorithm.

As time and resources permit, you should run as many algorithms of the appropriate type as you can. Comparing different runs of different algorithms can bring surprising findings about the data or the business intelligence embedded in the data. Doing so gives you more detailed insight into the business problem, and helps you identify which variables within your data have predictive power.

Some predictive analytics projects succeed best by building an ensemble model, a group of models that operate on the same data. An ensemble model uses a predefined mechanism to gather outcomes from all its component models and provide a final outcome for the user.

Models can take various forms — a query, a collection of scenarios, a decision tree, or an advanced mathematical analysis. In addition, certain models work best for certain data and analyses. You can (for example) use classification algorithms that employ decision rules to decide the outcome of a given scenario or transaction, addressing questions like these:

  • Is this customer likely to respond to our marketing campaign?
  • Is this money-transfer likely to be part of a money-laundering scheme?
  • Is this loan applicant likely to default on the loan?

You can use unsupervised clustering algorithms to find what relationships exist within your dataset. (For more about the use of unsupervised clustering, see Chapter 6.) You can use these algorithms to find different groupings among your customers, determine what services can be grouped together, or decide for example which products can be upsold.

Regression algorithms can be used to forecast continuous data, such as predicting the trend for a stock movement given its past prices.
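
As a toy illustration of that idea — invented prices, ordinary least squares against the time index — a regression line fitted to past prices can extrapolate the next point in the trend:

```python
def linear_fit(ys):
    """Ordinary least squares fit of y against the time index 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical closing prices for five trading days
prices = [100.0, 102.0, 104.0, 106.0, 108.0]
slope, intercept = linear_fit(prices)
next_day = slope * len(prices) + intercept
print(next_day)  # 110.0 — the model extrapolates the upward trend
```

Real price series are far noisier than this, which is why the seasonality and overfitting cautions above matter so much for time-series models.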

Decision trees, support vector machines, neural networks, and logistic and linear regression are some of the most common algorithms, explained in detail in Chapter 6 and Chapter 7. Although their mathematical implementations differ, these predictive models generate comparable results. Decision trees are especially popular because they're easy to understand; you can follow the path to a given decision.

Classification algorithms are great for the type of analysis when the target is known (such as identifying spam emails). On the other hand, when the target variable is unknown, clustering algorithms are your best bet. They allow you to cluster or group your data into meaningful groups based on the similarities among the group members.

These algorithms are widely popular. There are many tools, both commercial and open-source, that implement them. With data accumulation thriving and accelerating (that is, big data), and cost-efficient hardware and platforms (such as cloud computing and Hadoop), predictive analytics tools are experiencing a boom.

Data and business objectives aren't the only factors to consider when you're selecting an algorithm. The expertise of your data scientists is of tremendous value at this point; picking an algorithm that will get the job done is often a tricky combination of science and art. The art part comes from experience and proficiency in the business domain, which also plays a critical role in identifying a model that can serve business objectives accurately.

Developing and Testing the Model

Let the magic begin! Model development starts at this stage, followed by testing the results of the runs. You use training and test datasets to align the model more closely to its business objectives, and refine the model's output through careful selection of variables, further training, and evaluating the output.

Developing the model

Developing a predictive model is almost never a one-time deal; it requires an iterative process. You have to narrow the list of variables you're working with, starting with more variables than you think you'll need and using multiple runs on the training dataset to narrow down the variables to those that truly count.

tip Run each algorithm several times while tweaking and changing the input variables fed to that model. With each run, you're examining a new hypothesis, changing input variables, and digging deeper for better solutions and more accurate predictions.

Okay, an iterative process can be overwhelming; you can easily lose track of what you've changed, or of what combinations of hypotheses and variables you've already run. Be sure to document each experiment thoroughly: include the inputs, the algorithm, and the outputs of each experiment. In addition, document any relevant observations you may have, such as the specific assumptions made, your initial assessment, and the next step planned. This way you can avoid duplicated effort.
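
One lightweight way to keep that documentation honest is to record every run in a structured log. This is just one possible sketch; the field names and values below are made up for illustration:

```python
import json

experiment_log = []

def log_experiment(algorithm, variables, params, metrics, notes=""):
    """Record one modeling run so hypotheses aren't accidentally repeated."""
    entry = {
        "run": len(experiment_log) + 1,
        "algorithm": algorithm,
        "variables": variables,
        "params": params,
        "metrics": metrics,
        "notes": notes,
    }
    experiment_log.append(entry)
    return entry

log_experiment(
    "decision_tree",
    ["tenure", "monthly_spend"],
    {"max_depth": 4},
    {"test_accuracy": 0.81},
    notes="Baseline; try a derived spend-per-tenure variable next.",
)
print(json.dumps(experiment_log, indent=2))
```

Dumping the log as JSON (or writing it to a file) gives the whole team a searchable record of what has already been tried.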

remember Consulting with business domain experts at this stage can help you keep your model relevant while you're building it. Your domain experts can

  • Identify the variables that have greater predictive power.
  • Provide you with business language necessary for you to report your findings.
  • Help you explain the business significance of your preliminary results.

Testing the model

To test the model, you need to split your dataset into two sets: a training dataset and a test dataset. These datasets should be selected at random and should be a good representation of the actual population.

tip Split your data: 70 percent to train the model and 30 percent to test the model. This ensures that the model is tested against data it hasn't seen before.

The following are guidelines as you split your data between training and test datasets:

  • Similar data should be used for both the training and test datasets.
  • Normally the training dataset is significantly larger than the test dataset.
  • Using the test dataset helps you avoid errors such as overfitting.
  • The trained model is run against test data to see how well the model will perform.

Some data scientists prefer to have a third dataset with characteristics similar to those of the first two: a validation dataset. The idea is that if you're actively using your test data to refine your model, you should use a separate (third) set to check the accuracy of the model. Having a validation dataset that wasn't used as part of the development process helps ensure a neutral estimate of the model's accuracy and efficacy.

tip If you've built multiple models using various algorithms, the validation sample can also help you evaluate which model performs best.

remember Make sure you double-check your work when developing and testing the model. In particular, be skeptical if the performance or the accuracy of the model seems too good to be true. Errors can happen where you least expect them. Incorrectly calculating dates for time-series data, for example, can lead to erroneous results.

Employing cross-validation

Cross-validation is a popular technique you can use to evaluate and validate your model. The same principle of using separate datasets for testing and training applies here: The training data is used to build the model; the model is run against the testing set to predict data it hasn't seen before, which is one way to evaluate its accuracy.

In cross-validation, the historical data is split into X subsets. Each time one subset is chosen to serve as test data, the rest of the subsets are used as training data. Then, on the next run, the former test set becomes one of the training sets and one of the former training sets becomes the test set. The process continues until every one of the X subsets has been used as a test set.

For example, imagine we have a dataset that we've divided into 5 sets numbered 1 to 5. In the first run, we use set 1 as the test set and sets 2, 3, 4, and 5 as the training set. On the second run, we use set 2 as the test set and sets 1, 3, 4, and 5 as the training set. We continue this process until every one of the 5 sets has been used as a test set.
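
That rotation can be sketched in a few lines of pure Python; the round-robin fold assignment used here is just one simple way to form the subsets:

```python
def k_fold_splits(rows, k=5):
    """Yield (train, test) pairs; each of the k subsets serves once as test data."""
    folds = [rows[i::k] for i in range(k)]  # round-robin assignment into k subsets
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

data = list(range(20))  # stand-in for 20 historical records
for fold_num, (train, test) in enumerate(k_fold_splits(data), start=1):
    print(f"fold {fold_num}: {len(train)} training rows, {len(test)} test rows")
```

Across the five folds, every record appears in exactly one test set, so no data point is wasted.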

Cross-validation allows you to use every data point in your historical data for both training and testing. This technique is more effective than just splitting your historical data into two sets, using the set with the most data for training, using the other set for testing, and leaving it at that. When you cross-validate your data, you're protecting yourself against randomly picking test data that's too easy to predict — which would give you the false impression that your model is accurate. Or, if you happen to pick test data that's too hard to predict, you might falsely conclude that your model isn't performing as you had hoped.

tip Cross-validation is widely used not only to validate the accuracy of models but also to compare the performance of multiple models.

Balancing bias and variance

Bias and variance are two sources of errors that can take place as you're building your analytical model.

Bias is the result of building a model that significantly oversimplifies the representation of the relationships among data points in the historical data used to build the model.

Variance is the result of building a model that is overly specific to the particular data used to build it.

Achieving a balance between bias and variance — by reducing the variance and tolerating some bias — can lead to a better predictive model. This trade-off usually leads to building less complex predictive models. A model with high complexity tends to have high variance and low bias. On the other hand, a very simple model tends to have high bias and low variance. Your goal is to build predictive models that have both low bias and low variance.

Many data-mining algorithms have been created to take into account this trade-off between bias and variance.

Troubleshooting ideas

When you're testing your model and you find yourself going nowhere, here are a few ideas to consider that may help you get back on track:

  • Always double-check your work. You may have overlooked something you assumed was correct but isn't. Such flaws could show up (for example) among the values of a predictive variable in your dataset, or in the preprocessing you applied to the data.
  • If the algorithm you chose isn't yielding any results, try another algorithm. For example, you can try several of the available classification algorithms; depending on your data and the business objectives of your model, one of them may perform better than the others.
  • Try selecting different variables or creating new derived variables. Always be on the lookout for variables that have predictive power.
  • Frequently consult with the business domain experts who can help you make sense of the data, select variables, and interpret the model's results.

Evaluating the model

At this stage, you're trying to make sure that your model is accurate, can meet its business objective, and can be deployed.

Before presenting your findings, make sure that the steps you took to build the model are all correct. Verification of those steps, from data processing to data analysis, is essential.

After partitioning your data into training and test sets and performing a cross-validation, you still need to evaluate whether the model meets its business objectives and interpret its results in familiar business terms; domain experts can help with that.

It's important to evaluate the outcome of your model and make sure that it meets the business needs you set out to achieve. It's equally important to be able to explain the results of your model in business terms that management can relate to. You should be able to explain how the predictions will affect the business, how stakeholders can benefit from the insights, and how they can use them to make informed decisions.

Here's where you address the model's performance in terms of its speed and accuracy when deployed. You want to know how well your model will run against larger datasets — especially in the production environment.

To determine what measures you can take so as to judge the quality of your model, start by comparing the outputs of multiple models (or multiple versions of the same model). You want to have confidence that the model you built is sound, and that you can defend its findings. Be sure you can explain and interpret the results in business terms that the stakeholders can readily understand and apply.

tip You should be able to pinpoint why your model gives a particular recommendation or prediction. Doing so makes the model's outputs transparent, which allows business stakeholders to transform the predictions more easily into actionable decisions. If you're lucky, you may stumble over some new insights that only make sense after the model brings them to light. Those are rare, but when you get such results, the rewards can be substantial.

Going Live with the Model

After developing the model and successfully testing it, you're ready to deploy it into the production environment.

Deploying the model

The ultimate goal of a predictive analytics project is to put the model you build into the production process so it becomes an integral part of business decision-making.

The model can be deployed as a standalone tool or as part of an application. Either way, deploying the model can bring its own challenges.

  • Because the goal is to make use of the model's predictive results and act upon them, you have to come up with an efficient way to feed data to the model and retrieve results from the model after it analyzes that data.
  • Not all predictive decisions are automated; sometimes human intervention is needed. If a model flags a claim as fraudulent or high-risk, a claim processor may examine the claim more thoroughly and find it to be sound, saving the company from the loss of opportunity.

    tip The higher the stakes of a predictive decision, the more necessary it is to incorporate human oversight and approval into those decisions.

The model accrues real value only when it's incorporated into business processes — and when its predictions are turned into actionable decisions that drive business growth higher. This becomes especially useful when the deployed model provides recommendations, some of them in real time, during customers' interactions — or assesses risk during transactions, or evaluates business applications. This business contribution is especially powerful when repeated across several transactions, seamlessly.

The Predictive Model Markup Language (PMML) is a predictive-model interchange format that allows you to use your favorite tool to develop your model and then directly deploy it. It's widely supported by analytics tools. PMML offers you the flexibility to develop a model in one environment and then deploy it into another without the need to write code or modify the application. As such, PMML is a standard way to represent your model, or your predictive analytics solution, independently of the environment in which it was built.

Monitoring and maintaining the model

The longer the model is deployed, the more likely it will lose its predictive relevance as the environment changes. Business conditions are constantly changing, new data keeps coming in, and new trends are evolving. To keep your model relevant, monitor its performance and refresh it as necessary:

  • Run the deployed model on the newly acquired data.
  • Use new algorithms to refine the model's output.

A model tends to degrade over time. A successful model must be revisited, reevaluated in light of new data and changing conditions, and probably retrained to account for the changes. Refreshing the model should be an ongoing part of the overall planning process, and can range from simply tweaking the deployed model to completely rebuilding it.
