CHAPTER 5

Descriptive versus Predictive Analytics

What Is Predictive Analytics and How Is It Different from Descriptive Analytics?

The case on descriptive analytics discussed in Chapter 4 explained how data can be explored to reveal information. One of the best ways to learn from data is to use visual techniques to create charts, graphs, and numerical summaries. In the case of big data, dashboards are excellent tools for obtaining relevant information from the data. Data exploration provides invaluable information for future planning and for setting goals for the advancement of a company. Descriptive analytics is the process of exploring and explaining data to reveal trends and patterns and to obtain information not apparent otherwise. The objective is to obtain useful information that can help an organization achieve its goals. Predictive analytics is about identifying future business trends and creating and describing predictive models to explore those trends and relationships. While descriptive analytics tools are useful in visualizing some of the trends and relationships among the variables, predictive analytics provides information on what types of predictive models can be used to predict future business outcomes.

Exploring the Relationships between the Variables—Qualitative Tools

Before developing data-driven models for prediction, it is important to establish and understand the relationships between variables. A number of qualitative or graphical tools—commonly known as quality tools—can be used to study the relationships between the factors, or independent variables, and the response, or dependent variable(s). Some of the tools used for this purpose are cause-and-effect diagrams, influence diagrams, and others. These are visual tools and provide an easy way to gain a first look at the data. They are known as logic-driven models and are very helpful before a formal data-driven predictive model is developed. Here we describe some of these tools.

An Example of a Logic-Driven Model—Cause-and-Effect Diagram

Figure 5.1 shows a cause-and-effect diagram related to the production of a product. The probable reasons behind each category of causes are shown in Figure 5.1. This diagram is an excellent brainstorming tool for studying the possible causes of a problem and finding their solutions.


Figure 5.1 Logic-driven model of predictive analytics

Data-Driven Predictive Models and Their Applications—Quantitative Models

The other type of model in predictive analytics is the data-driven model. Predictive analytics uses advanced statistical tools, including a number of regression and forecasting models, data mining and machine learning, as well as information technology and operations research methods, to identify the key independent variables, or predictors, and build models that use them to predict future business behavior. Many of the trends and relationships among key factors obtained from descriptive analytics can be used to build predictive models that forecast future trends.

Prerequisites and Background for Predictive Analytics

The data-driven models in predictive analytics use a number of statistical and analytical methods. Here we provide a list of commonly used predictive models and their applications. These topics are prerequisites to predictive modeling and provide the background necessary to apply the models. Figure 5.2 briefly outlines these predictive analytics models and tools. We will introduce each of these topics and their applications in detail in subsequent chapters. It is important to note that each of the topics is studied as a separate chapter in an analytics course. A brief description of the topics, along with possible applications, is given in Table 5.1.


Figure 5.2 Prerequisite and models for predictive analytics

Table 5.1 Statistical models and prerequisites for predictive modeling (columns: Statistical Tools and Models; Brief Description; Application Areas)

Probability concepts

One of the main reasons for applying statistics in analytics is that statistics allows us to draw conclusions using limited data, that is, we can draw conclusions about the population using sample data.

The process of making inferences about the population using the sample involves uncertainty.

Probability is used in situations where uncertainty exists; it is the study of random phenomena or events. A random event is an event whose outcome cannot be predicted in advance.

Probability can tell us the likelihood of the random event(s).

In the decision-making process, uncertainty almost always exists. A question of frequent concern when making decisions under uncertainty is the probability that the outcome will be a success.

Probability is used to answer the questions in the following situations:

What is the probability that the Reserve Bank will start raising the interest rate soon?

What is the probability that the Dow Jones stock index will go up by 15% by the end of this year?

What is the probability of me winning the Powerball lottery?

What is the probability that a customer will default on a loan and is a potential risk?
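
The last question above can be approached with a simple relative-frequency estimate. The following is a minimal sketch; the counts used here are hypothetical illustration values, not data from the text.

```python
# Minimal sketch: estimating the probability of a random event (a loan default)
# from historical data using relative frequency. The figures are hypothetical.
defaults = 42          # number of customers who defaulted (assumed)
total_loans = 1_000    # number of loans examined (assumed)

p_default = defaults / total_loans
print(f"Estimated probability of default: {p_default:.3f}")   # 0.042

# Probability that a customer does NOT default (complement rule)
print(f"Probability of no default: {1 - p_default:.3f}")      # 0.958
```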

Probability distributions

Discrete and continuous probability distributions

Most processes produce random outcomes that can be described using a random variable.

A random variable can be discrete or continuous.

The random variable that can assume only a countable number of values is called a discrete random variable (e.g., number of defective products or number of voters). Random variables that are not countable but correspond to the points in an interval are known as continuous random variables (e.g., delivery time, length, diameter, volume).

The possible values of a continuous random variable are infinite and uncountable.

These random variables are described using either a discrete or a continuous probability distribution, depending on the nature of the variable.

Distributions have wide applications in data analysis, decision making, and computer simulation.

Although probabilities and the rules of probability give us a way of dealing with uncertainty, the concept and understanding of probability distributions is also critical in modeling and decision making in analytics.

The probability distribution assigns probabilities to each value of a random variable.

A random variable is a variable that takes numerical values associated with the random outcomes of an experiment or a process under study. A random variable is usually an outcome of a statistical experiment or a process that generates random outcomes and can be modeled using a probability distribution—discrete or continuous, depending on the variable(s) of interest. These distributions are used to determine the probabilities of outcomes and to draw conclusions from the data.

The distributions are applied based on the trend or pattern in the data or when certain conditions are met. For example, a bell-shaped pattern in the data is usually described using a normal distribution, whereas customer arrivals or calls coming into a call center are random and can be modeled using a Poisson distribution.

Computer simulation is often used to study the behavior of a call center or the drive-through of a fast-food restaurant. In such applications, the arrival of calls at a call center and the arrival of customers at a drive-through are modeled using some type of distribution.

The number of customers or calls arriving is a random phenomenon that can be modeled using a discrete distribution known as the Poisson distribution. In a fast-food drive-through, the customer waiting time and service time can be described using a continuous distribution known as the exponential distribution.

A very widely used continuous distribution in the real world is the normal, or Gaussian, distribution.

Some examples where this distribution can be applied are the length of time to assemble an electronic appliance, the life span of a satellite power source, the fuel consumption in miles per gallon of a new car model, the inside diameter of a manufactured cylinder, and the waiting time of patients at an outpatient clinic.

The normal distribution is a continuous distribution, whereas the Poisson distribution falls in the category of discrete distributions. There are a number of distributions used in data analytics.
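
The probabilities described above can be computed directly from these distributions. The following sketch uses scipy.stats; the arrival rates, service times, and assembly-time parameters are assumed values chosen for illustration.

```python
# Sketch of the distributions mentioned above using scipy.stats.
# All rates and parameters are assumed values for illustration only.
from scipy.stats import poisson, expon, norm

# Poisson: probability of exactly 3 calls arriving in a minute
# when calls arrive at an average rate of 2 per minute (assumed rate).
print(poisson.pmf(3, mu=2))           # P(X = 3)

# Exponential: probability a service time is under 4 minutes
# when the mean service time is 3 minutes (assumed mean).
print(expon.cdf(4, scale=3))          # P(T < 4)

# Normal: probability an assembly time is under 20 minutes, assuming a
# bell-shaped pattern with mean 18 minutes and standard deviation 4 minutes.
print(norm.cdf(20, loc=18, scale=4))  # P(X < 20)
```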

Sampling and sampling distribution

Sampling is a systematic way of selecting a few items from the population. Samples are analyzed to draw conclusion(s) about the entire population.

In data analysis, we almost always rely on sample data to draw conclusions about the population from which the data are collected.

A manufacturer of computers and laser printers has determined that the assembly time of one of its computers is normally distributed with mean μ = 18 minutes and standard deviation σ = 4 minutes.

In most cases, the parameters of the population are unknown and we can estimate a population parameter using a sample statistic.

Sampling distribution is the probability distribution of a sample statistic.

Different sampling techniques are used to collect sample data. Sample statistics are then calculated to draw conclusions about the population. A sample statistic may be a sample mean x̄, a sample variance s², a sample standard deviation s, or a sample proportion p̂.

The central limit theorem has major applications in sampling and other areas of statistics and data analysis. It tells us that if we take a large sample, that is, a sample of size 30 or more (n ≥ 30), we can use the normal distribution to calculate probabilities and draw conclusions about the population parameter.
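
The effect of the central limit theorem can be seen with a short simulation. This is an illustrative sketch only: the population below is an assumed, deliberately skewed (exponential) population, yet the means of repeated samples of size 30 cluster around the population mean in a roughly normal pattern.

```python
# Illustrative simulation of the central limit theorem: means of repeated
# samples (n >= 30) drawn from a skewed population behave approximately normally.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=100_000)  # skewed population (assumed)

sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

print("Population mean:", round(population.mean(), 2))
print("Mean of sample means:", round(np.mean(sample_means), 2))        # close to population mean
print("Std. error (theory):", round(population.std() / np.sqrt(30), 2))
print("Std. of sample means:", round(np.std(sample_means), 2))         # close to theoretical value
```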

A sample is a part of a population. One of the obvious reasons for using samples in statistics and data analysis is that, in most cases, the population can be huge and it is not practical to study the entire population.

Samples are used to make inferences about the population and this can be done through a sampling distribution.

Here we will try to answer questions related to sampling and surveys conducted in the real world.

In sampling theory, we need to consider several factors and answer questions such as: Why do we use samples? Why do we need a homogeneous sample? What are the different ways of taking samples? What is a sampling distribution, and what is its purpose?

To keep the sampling process representative and free of bias, a random sample is used.

Analysis of the sample data can answer questions such as: What is the probability that the mean assembly time will exceed 19 minutes?
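
A quick sketch of that calculation follows. The population mean (18 minutes) and standard deviation (4 minutes) come from the assembly-time example above; the sample size of n = 36 is an assumption added here purely for illustration.

```python
# Worked version of the assembly-time question above, using the sampling
# distribution of the mean. n = 36 is an assumed sample size.
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 18, 4, 36
std_error = sigma / sqrt(n)                       # standard error of the sample mean

p = 1 - norm.cdf(19, loc=mu, scale=std_error)     # P(sample mean > 19)
print(f"P(mean assembly time > 19 min) = {p:.4f}")   # about 0.0668
```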

The other example involves predicting a presidential election using poll results. Polls are conducted to gather sample data, and a great deal of planning goes into this.

Refer to the Gallup polls at http://www.gallup.com; Gallup conducts and reports numerous poll results. It uses sample data to predict poll outcomes (the percent or proportion of voters who favor different candidates). This is an example of using a sample of voters to draw conclusions about the population of voters who favor certain candidates.

In data analysis, sample data are used to draw conclusions about the population.

A population is described by its parameters (population parameters) and a sample is described by its statistics (sample statistics). It is important to note that a population parameter is always a constant, whereas a sample statistic is a random variable. Similar to other random variables, each sample statistic can be described using a probability distribution.

The first example studied the mean, whereas the poll example is about studying proportion.

Inference procedure: estimation and confidence intervals

Statistical inference: The objective of statistical inference is to draw conclusions or make decisions about a population based on the samples selected from the population.

The idea of drawing conclusions about population parameters such as the population mean (μ), the population variance (σ²), and the population proportion (p) comes under estimation, where these population parameters are estimated using the sample statistics referred to as the sample mean (x̄), the sample variance (s²), and the sample proportion (p̂).

The reason for estimating these parameters is that the values of the population parameters are unknown, and therefore, we must use the sample data to estimate them.

The parameters of a process are generally unknown; they change over time and must be estimated. This is done using inference procedures.

Statistical inference is an extremely important area of statistics and data analysis and is used to estimate the unknown population parameters.

For example, estimating the mean, μ, or the standard deviation, σ, using the corresponding sample statistic (the sample mean, x̄, or the sample standard deviation, s). There are two major tools of inferential statistics: estimation and hypothesis testing. These techniques are the basis for many of the methods of data analysis and statistical quality control. Here we explain the concept of estimation.

A recent poll report entitled "Link between exercising regularly and feeling good about appearance" came to the following conclusion: 56% of Americans who exercise two days per week feel good about their looks. This jumps to the 60% range among those who exercise three to six times per week. The results are based on telephone interviews conducted as part of the Gallup-Healthways Well-Being Index survey from January 1 to June 23, 2014, with a random sample of 85,143 adults aged 18 and older, living in all 50 U.S. states.

For the results based on the total sample of national adults, the margin of sampling error is ±5.0 percentage points at the 95% confidence level.

Estimation is the simplest form of inferential statistics in which a sample statistic is used to draw conclusions regarding the unknown population parameter.

Two types of estimates are used in parameter estimation: point estimate and interval estimate or confidence interval.

Parts of the claims made here may not make much sense, and perhaps you are wondering about some of the statements. For example, what do a margin of sampling error of ±5.0 percentage points and a 95% confidence level mean? Also, how can a sample covering only a small fraction of the population allow a conclusion to be drawn about the entire population?

Estimation and confidence intervals answer the above questions. They enable us to draw conclusion(s) about the entire population using the sample from the population.
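
A confidence interval for a proportion, in the spirit of the poll example above, can be sketched as follows. The sample proportion and sample size used here are hypothetical illustration values, not the Gallup figures.

```python
# Minimal sketch of a 95% confidence interval for a population proportion.
# p_hat and n are assumed illustration values.
from math import sqrt
from scipy.stats import norm

p_hat, n = 0.56, 1_000                 # assumed sample proportion and sample size
z = norm.ppf(0.975)                    # critical value for 95% confidence (about 1.96)

margin = z * sqrt(p_hat * (1 - p_hat) / n)   # margin of sampling error
print(f"Margin of error: +/- {margin:.3f}")
print(f"95% CI for the proportion: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```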

Inference procedure: hypothesis testing

Hypothesis testing is a major tool of inferential statistics that uses the information in sample data to make a decision about a hypothesis. The hypothesis may be about a mean, a proportion, a variance, and so on. Suppose the Department of Labor claims that the average salary of graduates with a Data Analytics degree is $80,000; this claim can be written as a hypothesis.

To verify this claim, a sample of recent graduates may be evaluated and a conclusion can be reached about the validity of this claim. A hypothesis may test a claim, a design specification, a belief, or a theory, and sample data are used to verify these.

Here we extend the concept of inferential statistics to hypothesis testing. Hypothesis tests enable us to draw conclusions about a population by analyzing the information obtained from a sample.

A hypothesis test involves a statement about a population parameter (such as a population mean or a population proportion). The test specifies a hypothesized value, or region of values, for the parameter. Using the sample data, we must decide whether the hypothesis is consistent with, or supported by, the sample data.

Let us look into a real-world example. In recent years, there has been a great deal of interest in hybrid cars. Consumers are attracted to hybrids because of the high miles per gallon (mpg) these cars claim to provide. If you are interested in purchasing a hybrid, there are many makes and models from different manufacturers to choose from. It seems that just about every manufacturer offers a hybrid to compete in the growing market of hybrid cars. The following are claims made by some of the manufacturers of hybrid cars: The Toyota Prius claims to provide about 50 mpg in the city and 48 mpg on the highway. It tops the list of fuel-efficient hybrids. The estimated annual fuel cost is less than $800.

Ford Fusion Hybrid claims to provide 41 mpg in the city and 36 mpg on the highway. The average annual fuel cost of less than $1,000 makes it attractive to customers.

Honda Civic Hybrid claims to provide 40 mpg in the city and 45 mpg on the highway. Estimated annual fuel costs are less than the Ford Fusion.

These days we find many claims like the ones above in consumer magazines and television commercials. Should consumers believe these claims? Hypothesis testing may provide the answer to such questions. Hypothesis testing enables us to make inferences about a population parameter by analyzing the difference between the stated population parameter value and the results obtained from the sample data. It is a very widely used technique in the real world.
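
One common way to check a mileage claim like the Prius figure above is a one-sample t-test. The sketch below uses made-up mpg observations (not actual test results) and assumes SciPy 1.6 or later for the one-sided alternative.

```python
# Sketch of a one-sample t-test of a mileage claim. The observed mpg values
# are hypothetical sample data invented for illustration.
from scipy.stats import ttest_1samp

claimed_mpg = 50                                    # manufacturer's claimed city mpg
observed_mpg = [48.6, 50.2, 47.9, 49.1, 50.5,       # hypothetical test-drive results
                48.3, 49.7, 47.5, 49.0, 48.8]

# H0: true mean mpg = 50   vs.   Ha: true mean mpg < 50
t_stat, p_value = ttest_1samp(observed_mpg, claimed_mpg, alternative="less")
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

# A small p-value (e.g., below 0.05) would suggest the sample data do not
# support the claimed 50 mpg; otherwise the claim cannot be rejected.
```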

Correlation analysis

Correlation is a numerical measure of the linear association between two variables. It provides the degree of association between two variables of interest and tells us how weak or strong the relationship is between the variables.

The coefficient of correlation is a numerical measure of the linear association between two variables x and y. The correlation coefficient is denoted by rxy, and its value lies between −1 and +1. The value of rxy tells us the degree of association between the two variables and how strong or weak the correlation is.

Suppose we are interested in investigating the strength of the relationship between sales and profit, or between sales and advertising expenditures, for a company. Similarly, we can study the relationship between home-heating cost and average temperature using correlation analysis. Usually, the first step in correlation analysis is constructing a scatter plot.

These plots are very useful in visualizing whether the relationship between the variables is positive or negative and linear or nonlinear. The next step is calculating a numerical measure. For example, a calculated coefficient of correlation of r = +0.902 shows a very strong positive correlation between sales and advertising. Scatter plots are very helpful in describing bivariate relationships, or relationships between two quantitative variables, and can be easily created using computer packages. They are very helpful in data analysis and model building.
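
The two steps described above (scatter plot, then a numerical measure) can be sketched as follows. The advertising and sales figures are hypothetical illustration values.

```python
# Sketch of a correlation analysis for sales versus advertising expenditure.
# The data values are hypothetical and are used only to show the steps.
import numpy as np
import matplotlib.pyplot as plt

advertising = np.array([10, 14, 18, 22, 26, 30, 34, 38])    # in $1,000s (assumed)
sales = np.array([105, 118, 131, 150, 158, 176, 183, 201])  # in $1,000s (assumed)

# Step 1: visualize the bivariate relationship with a scatter plot
plt.scatter(advertising, sales)
plt.xlabel("Advertising expenditure ($000)")
plt.ylabel("Sales ($000)")
plt.title("Sales vs. advertising")
plt.show()

# Step 2: numerical measure of linear association
r = np.corrcoef(advertising, sales)[0, 1]
print(f"Correlation coefficient r = {r:.3f}")   # close to +1: strong positive correlation
```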

Table 5.1 outlines the statistical tools, their brief description, and application areas of predictive analytics models.

The next chapter discusses the details of the above data-driven predictive models with applications.

Summary

Predictive analytics is about predicting future business outcomes. This phase of analytics uses a number of models that can be divided into logic-driven models and data-driven models. We discussed both types of models and the difference between the two. The key aim of this chapter was to introduce the reader to a number of tools and statistical models whose understanding is critical in understanding and applying predictive analytics models. This is background information, and we refer to these topics as prerequisites to predictive analytics. The chapter provided a brief description and application areas of the prerequisite tools: probability concepts, probability distributions, sampling and sampling distributions, correlation analysis, estimation and confidence intervals, and hypothesis testing. These topics are investigated in detail in the Appendix that accompanies this book. The appendix is available as a free download.

Appendix A–D

The appendix contains the topics that are prerequisites to data-driven predictive analytics models. The concepts discussed there are essential for applying predictive models. Appendices A–D discuss the following statistical tools and models of business analytics: the concept of probability, the role of probability distributions in decision making, sampling and sampling distributions, inference procedures for estimation and confidence intervals, and inference procedures for one- and two-population parameters (hypothesis testing).

Note: The following chapters discuss predictive analytics models—regression analysis, modeling, time series forecasting, and data mining.
