Data mining

What is it?

Data mining is often used as a buzzword or generic description for any form of large-scale information processing, but this is not very accurate. The term itself is also a misnomer, because it implies that the goal is to extract data rather than the insights that the data can yield.

More accurately, data mining is an analytic process designed to explore data, usually very large business-related data sets (also known as ‘big data’), in search of commercially relevant insights, patterns or relationships between variables that can improve results and elevate performance.

Data mining is essentially a hybrid of artificial intelligence, statistics, database systems, database research and machine learning. The actual process is the automatic or semi-automatic analysis of large data sets to extract previously unknown yet interesting patterns, anomalies or dependencies that could be exploited.

When do I use it?

The ultimate goal of data mining is prediction, so you would use data mining if you had large data sets and wanted to extract insights from that data that could help your business in the future.

Clearly, in business, being able to predict the future is helpful: it can reduce costs and assist with planning and strategy, and the insights gained from data mining could even change the direction of the business.

Insights extracted from data mining can also guide decision making and reduce risk. It is important to appreciate, however, that data mining may throw up patterns, anomalies or inter-dependencies, but it will not necessarily tell you the reason for those patterns, anomalies or inter-dependencies. Additional analysis will be required if the ‘why’ is still important to you.

What business questions is it helping me to answer?

Data mining can help the decision maker to predict the future. It can help you to answer:

  • What are the key factors that our most profitable customers have in common?
  • How could we categorise our customers in the smart watch segment?
  • What factors are common in fraudulent transactions?
  • What are the key patterns people use to navigate our website?

How do I use it?

There are three stages in data mining:

  • the initial exploration;
  • model building and validation;
  • deployment.

Stage 1: Initial exploration

First you need to prepare the data, which involves cleaning it, transforming it and selecting data subsets. If the data sets are large and have many variable fields, some preliminary feature selection will also be required to bring the number of variables down to a manageable range.

Then, depending on the nature of the analytic problem, initial exploration may range from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses that identify the most relevant variables and determine the complexity and general nature of the models that can be taken into account in the next stage.
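
As a rough illustration of stage 1, the sketch below cleans a handful of records and then runs a very simple preliminary feature selection. The field names, the sample values and the 0.5 correlation threshold are all assumptions made for this example, and correlation with the target is only one of many possible selection criteria.

```python
from math import sqrt

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"visits": 12, "basket_size": 3.0, "store_id": 7, "spend": 40.0},
    {"visits": 30, "basket_size": 5.5, "store_id": 7, "spend": 95.0},
    {"visits": None, "basket_size": 2.0, "store_id": 7, "spend": 20.0},
    {"visits": 22, "basket_size": 4.0, "store_id": 7, "spend": 70.0},
    {"visits": 5, "basket_size": 1.5, "store_id": 7, "spend": 15.0},
]
TARGET = "spend"  # the variable we would later like to predict


def clean(rows):
    """Cleaning: drop any record that has a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(rows, target, threshold=0.5):
    """Keep only the fields whose correlation with the target is strong."""
    ys = [r[target] for r in rows]
    scores = {
        name: pearson([r[name] for r in rows], ys)
        for name in rows[0] if name != target
    }
    return {name: s for name, s in scores.items() if abs(s) >= threshold}


rows = clean(raw_rows)
selected = select_features(rows, TARGET)
print(len(rows), sorted(selected))
```

Here the constant `store_id` field carries no information about spend, so it is dropped, while the two fields that move with spend are kept for the next stage.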

Stage 2: Model building and validation

Next you need to consider the various models you’ve identified in stage one and choose the best one based on predictive performance. This may sound like a simple operation, but it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve that goal, many of which are based on so-called ‘competitive evaluation of models’, i.e. applying different models to the same data set and then comparing their performance to choose the best.
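
To make the idea of competitive evaluation concrete, here is a minimal sketch that fits two candidate models to the same training data and compares them on records held back for validation. The data, the two candidate models and the use of mean squared error as the yardstick are assumptions chosen for illustration only.

```python
# Hypothetical data: (advertising spend, sales) pairs.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1),
        (6, 12.2), (7, 13.8), (8, 16.1), (9, 18.0), (10, 19.9)]

# Hold back every third record for validation; train on the rest.
holdout = data[2::3]
train = [p for p in data if p not in holdout]


def fit_mean(points):
    """Baseline model: always predict the average outcome seen in training."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Simple linear regression fitted by ordinary least squares."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mse(model, points):
    """Mean squared error of a model's predictions on the given points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: mse(fit(train), holdout) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest validation error wins
print(scores, best)
```

The important design choice is that the models are scored on data they were not fitted to, so the comparison reflects predictive performance rather than how closely each model memorised the training records.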

The most popular core techniques of predictive data mining include bagging, boosting, stacking and meta-learning. For more information on these see the ‘Statsoft’ website link at the end of this chapter.
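
Of these techniques, bagging (short for bootstrap aggregating) is perhaps the easiest to sketch: the same kind of model is fitted to many random resamples of the data and the individual predictions are averaged. The example below is a minimal illustration under assumed data, an assumed base model (a least-squares line) and an arbitrary choice of 25 resamples; it does not cover boosting, stacking or meta-learning.

```python
import random

# Hypothetical noisy observations of a roughly linear relationship.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(x, 2.0 * x + 1.0 + rng.uniform(-1.0, 1.0)) for x in range(20)]


def fit_line(sample):
    """Least-squares line; returns (intercept, slope)."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    if sxx == 0:  # degenerate resample containing a single distinct x
        return my, 0.0
    slope = sum((x - mx) * (y - my) for x, y in sample) / sxx
    return my - slope * mx, slope


def bagged_models(data, n_models=25):
    """Fit one model per bootstrap resample (sampling with replacement)."""
    return [fit_line(rng.choices(data, k=len(data))) for _ in range(n_models)]


def bagged_predict(models, x):
    """Aggregate by averaging the predictions of the individual models."""
    return sum(a + b * x for a, b in models) / len(models)


models = bagged_models(points)
prediction = bagged_predict(models, 10)
print(round(prediction, 2))
```

Averaging over resamples tends to make the combined prediction less sensitive to quirks in any single sample, which is the main reason bagging is used with less stable base models.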

Stage 3: Deployment

The final stage of data mining involves taking the model selected as the best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
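
In its simplest form, deployment means saving the chosen model and then using it to score records it has never seen. The sketch below assumes, purely for illustration, that the winning model is a straight line whose parameters can be stored as JSON; the parameter values and function names are hypothetical.

```python
import json

# Hypothetical output of stage 2: the parameters of the chosen model.
chosen_model = {"type": "linear", "intercept": 1.0, "slope": 2.0}

# Persist the model so that scoring can run separately from model building.
saved = json.dumps(chosen_model)


def load_model(text):
    """Rebuild a prediction function from the saved parameters."""
    params = json.loads(text)
    if params["type"] != "linear":
        raise ValueError("unsupported model type: " + params["type"])
    return lambda x: params["intercept"] + params["slope"] * x


def score(model, new_values):
    """Apply the deployed model to records it has never seen."""
    return [model(x) for x in new_values]


model = load_model(saved)
predictions = score(model, [11, 12, 13])
print(predictions)
```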

The best way to capitalise on data mining is to invest in one of the many data mining tools on the market.

Practical example

Data mining can throw up unusual and unexpected connections between variables that can then be exploited to improve results. As seen earlier in Chapter 3, Walmart used data mining to discover that sales of Pop-Tarts increased whenever there was a hurricane warning.

Increased sales of flashlights may have been expected, but why people suddenly felt the urge to stock up on sugary breakfast treats was not. Walmart didn’t need to know ‘why’ there was a connection, only that there was one.

By positioning the Pop-Tarts display at the front of the store whenever there was a hurricane or severe weather warning, Walmart was then able to boost sales still further.
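
A connection of this kind can be spotted with a very simple comparison of sales on days with and without a warning. The figures below are made up for illustration and are not Walmart data; the sketch only shows the shape of the calculation.

```python
# Made-up daily records: (severe-weather warning in force?, units sold).
daily_sales = [
    (False, 100), (False, 96), (True, 410), (False, 104),
    (True, 390), (False, 98), (False, 102), (True, 400),
]


def average(values):
    return sum(values) / len(values)


warning_days = [units for warned, units in daily_sales if warned]
normal_days = [units for warned, units in daily_sales if not warned]

# Uplift: how many times higher average sales are on warning days.
uplift = average(warning_days) / average(normal_days)
print(round(uplift, 1))
```

An uplift well above 1 flags a pattern worth acting on, even though, as noted above, it says nothing about why the pattern exists.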

Tips and traps

People are increasingly concerned about the data that big businesses hold on them and what those businesses are doing with it. These concerns are only going to grow, so always use data ethically and transparently. Tell your customers what you want to do with their data and make sure that the outcome delivers value to them as well as to your business.

Consider anonymising the data so that the information is not traceable to a particular person. Often the insights are not customer specific. For example, Walmart didn’t need to know who bought Pop-Tarts in a hurricane; it just needed to identify the trend to capitalise on it.
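
One simple way to approach this is to strip the identifying fields before any analysis takes place. The sketch below uses made-up transactions and assumes that `customer_id` is the only identifying field; in real data, combinations of other fields can sometimes still point to an individual, so removing direct identifiers is a starting point rather than a guarantee.

```python
from collections import Counter

# Made-up transactions; customer_id is the field that identifies a person.
transactions = [
    {"customer_id": "C001", "product": "Pop-Tarts", "weather_warning": True},
    {"customer_id": "C002", "product": "Pop-Tarts", "weather_warning": True},
    {"customer_id": "C001", "product": "Flashlight", "weather_warning": True},
    {"customer_id": "C003", "product": "Pop-Tarts", "weather_warning": False},
]
IDENTIFYING_FIELDS = {"customer_id"}  # assumption for this example


def strip_identifiers(record):
    """Return a copy of the record without the identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


anonymous = [strip_identifiers(t) for t in transactions]

# The trend is still visible: purchases per product during warnings.
trend = Counter(t["product"] for t in anonymous if t["weather_warning"])
print(trend)
```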

Never underestimate the value of the data you hold or your obligation to protect it. Data is the new currency and you need to protect the privacy of your customers internally and externally.

Further reading and references

Data mining is an advanced analytics method that is covered in more detail in many advanced statistics books and websites. See for example:
