© Stylianos Kampakis  2020
S. Kampakis, The Decision Maker's Handbook to Data Science, https://doi.org/10.1007/978-1-4842-5494-3_8

8. Problem Solving

Stylianos Kampakis, London, UK

In this chapter, we’re going to look at solving a problem from the decision maker’s point of view, as this is what this book is about. So, we aren’t going to look at how data scientists solve problems but how you will solve the problem working alongside a data scientist.

No one expects the domain expert—a.k.a. you—to be qualified in using data science. However, you need to be able to think like a data scientist when solving a problem, because it will help you improve how you define the problem, as well as making it easier for you to find the right people and to manage them.

You need to be able to pose a problem and, as previously mentioned, one highly effective approach is to turn the problem into a question. You also need to understand whether there’s any value in solving the problem. Everything takes time and money, so you need to understand whether the solution will be applicable to your business. Lastly, you need to understand who you should hire to help you and whether you need more than one person.

Understanding Whether a Problem Can Be Solved

To figure out if something can be solved, you first need to be aware of everything that can go wrong. From experience, I’ve discovered that businesses face two main issues when they want to use data science.

First, they often have the wrong expectations, so the actual solution can turn out to be either much easier or much more difficult than they think. Secondly, they don’t have the right data.

A data scientist can help you understand whether the problem can be solved at all, whether the problem is difficult or easy to solve, and how much time and resources will be required.

So, what you need to do is to find the right person—preferably someone you trust—and work alongside them.

What you should never ever do is make assumptions without consulting a data scientist! Many issues arise because people make erroneous assumptions, especially around technical matters, such as the complexity of the project or the timeframes involved. It’s better to ask if you are uncertain about something—and even when you’re not—than to make assumptions.

Quick Heuristics

We’re now going to look at a few heuristics to help you determine if data science can be applied to a specific problem.

First of all, you need to determine if the problem can be phrased as
  • A statistical modeling problem

  • A hypothesis test

  • A supervised learning problem

  • An unsupervised learning problem

Let’s take a look at some examples.

Statistical Modeling Problem

Statistical modeling should be used when you have a situation in which you are trying to determine whether variable X is important for variable Y and what the relationship between the two is. Statistical models can tell you which factors are significant and what the direction of effect is.

So, for example, you might have noticed that older people seem to like your product more. Thus, you might want to determine if age plays a role in the purchasing decision. In this case, you’ll want to hire a statistician since you will be using statistical modeling.

As mentioned before, the focus of statistics is on transparency. When you care about understanding the relationships between variables, statistics is probably the right tool.
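To make this concrete, here is a minimal sketch of the age example in Python. The data is made up for illustration, and a straight-line fit is only one of many possible statistical models. A statistician would also check whether the effect is statistically significant, which this sketch does not do.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: customer age and number of purchases per year.
ages = [22, 30, 41, 55, 63]
purchases = [1, 2, 2, 4, 5]

slope, intercept = fit_line(ages, purchases)
direction = "increase" if slope > 0 else "decrease"
print(f"Purchases {direction} by about {abs(slope):.2f} per extra year of age")
```

The sign of the slope is the direction of effect, and its size tells you how strongly age and purchases move together in this toy dataset.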

Hypothesis Testing

Hypothesis testing is the approach to turn to when you want to compare two groups, that is, when you want to figure out whether X and Y differ. A/B testing is an example of this.

Let’s say you want to roll out a new feature for your product and you don’t know if it’s a good idea. All you need to do is create two different groups and compare them with each other.

Likewise, you might want to compare two existing groups. For example, you might want to study gender, regardless of context. You might want to see if gender has an impact on sales, so you would create two groups, one of each gender, and make the comparison.

Hypothesis testing is closely related to statistical modeling, and sometimes the same problems can be answered through different hypothesis tests or models. In both this case and the previous one, you know you need a data scientist with skills in statistics.
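The A/B comparison described above can be sketched in a few lines of Python. The test used here, a two-proportion z-test, is one common choice and is my assumption rather than something the chapter prescribes. The visitor and conversion counts are invented.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: group A sees the old product, group B the new feature.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the two groups is unlikely to be due to chance alone. Deciding how small is small enough, and whether the experiment was set up properly, is a conversation to have with your data scientist.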

Supervised Learning

This approach is a little more difficult. We’ve already mentioned two scenarios, namely, classification and regression, so these are clearly situations where you can use supervised learning. You have a dataset, and you want to use some input variables to be able to predict an outcome variable Y.

A good heuristic is that you can use supervised learning in any scenario where you want to automate something a human is doing. If you have someone sitting there detecting spam or labeling images and you have a dataset, you can feed that data to a machine along with examples from the human, and the machine will attempt to mimic the decision-making process of the human.
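As a rough illustration of a machine mimicking a human, the sketch below labels a new email by copying the label of the most similar email a person has already labeled. This nearest-neighbor rule is just one simple supervised learning method, chosen here for brevity. The features and the data are hypothetical.

```python
import math


def predict_nearest(labeled_examples, features):
    """Return the label of the human-labeled example closest to `features`."""
    nearest = min(labeled_examples, key=lambda example: math.dist(example[0], features))
    return nearest[1]


# Hypothetical features per email: (number of links, number of exclamation marks).
# The labels are the decisions a human made in the past.
human_labeled = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((7, 5), "spam"),
    ((9, 8), "spam"),
]

print(predict_nearest(human_labeled, (8, 6)))
print(predict_nearest(human_labeled, (0, 1)))
```

Real systems use far more data and more capable models, but the ingredients are the same: inputs, human-provided answers, and an algorithm that generalizes from them.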

Also, supervised learning is a great option when you want to predict future values, that is, predictive analytics—predicting someone’s risk of bankruptcy in the future (i.e., their credit score), demand for products in the future, or the price of a stock.

Note that supervised learning algorithms are usually not very good at explaining how they reached a decision. Machine learning works well, but it is mainly a black box. Hence, if understanding your data is of paramount importance, you might have to perform exploratory data analysis or create some statistical models as well, such as linear or logistic regression.

Unsupervised Learning

Unsupervised learning can be useful in quite a few scenarios. First, the obvious one is that unsupervised learning is useful when you know there are groups in your data. For example, customer segmentation is such a scenario.
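Customer segmentation can be sketched with k-means, a widely used clustering algorithm (my choice for illustration, since the chapter does not name one). The customers, their two features, and the starting centroids below are all assumptions.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        for point in points
    ]
    return centroids, labels


# Hypothetical customers: (visits per month, orders per month).
customers = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, segments = k_means(customers, centroids=[(0, 0), (12, 12)])
print(segments)
```

Notice that nobody told the algorithm which customer belongs to which segment. It found the two groups from the data alone, which is what makes the approach unsupervised.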

Dimensionality reduction is another use case. If you have a very complicated dataset with a lot of variables and you aren’t sure what to do with it, then turning to dimensionality reduction techniques, such as factor analysis, is a good idea.

Besides the fact that unsupervised learning can help you understand the data better, the outputs of unsupervised learning can also be used as inputs to supervised models to improve performance.

A Few More Heuristics

If you have images or audio that need to be classified, then you will most likely turn to deep learning. Deep learning has been commoditized to a significant degree, and tech giants like Google and IBM offer it as a service.

If you have a B2C business, you will have to look into recommender systems and market-basket analysis. Recommender systems have become the norm for consumers.

Market-basket analysis (or association rule mining) represents a set of algorithms that try to identify frequent patterns in order to find associations between the features of a user and their behavior. These are patterns of the form: if someone buys X, then they are likely to buy Y.
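A rule of the form "if someone buys X, then they are likely to buy Y" is usually judged by two numbers: support (how often X and Y appear together) and confidence (how often Y appears when X does). The sketch below computes both for a single rule. The baskets are made up, and real association rule mining algorithms search over many candidate rules instead of checking just one.

```python
def rule_stats(baskets, x, y):
    """Support and confidence of the rule 'if someone buys x, they buy y'."""
    with_x = [basket for basket in baskets if x in basket]
    with_both = [basket for basket in with_x if y in basket]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_x) if with_x else 0.0
    return support, confidence


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

support, confidence = rule_stats(baskets, "bread", "butter")
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```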

If you want to develop a chatbot, there are platforms out there to help you do that, such as Dialogflow.1 It should be noted that while chatbots have come a long way, they aren’t as advanced as many people claim them to be. Some intelligence and machine learning go into them, but they still have a long way to go. However, they aren’t difficult to build.

Finally, sometimes you want to forecast future values of a time series. Forecasting time series is a special field of its own. Statistics has a subfield that deals with these types of problems, but people also turn to machine learning to answer these types of questions. So, in practice, you might need a blend of the two approaches. Because forecasting is very challenging and there are many ways to make mistakes, I usually advise people to try and see whether the same problem can be treated as classification or regression.
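One common way to treat a forecasting problem as regression is to turn the series into a table where the last few values are the inputs and the next value is the target. The sketch below shows this reshaping step only. The function name and the sales figures are hypothetical.

```python
def make_lagged_examples(series, n_lags):
    """Turn a time series into (inputs, target) pairs for a regression model."""
    inputs, targets = [], []
    for i in range(n_lags, len(series)):
        inputs.append(series[i - n_lags:i])  # the previous n_lags values
        targets.append(series[i])            # the value to predict
    return inputs, targets


# Hypothetical weekly sales.
weekly_sales = [120, 130, 125, 140, 150]
inputs, targets = make_lagged_examples(weekly_sales, n_lags=2)
for row, target in zip(inputs, targets):
    print(row, "->", target)
```

Once the data is in this shape, any regression algorithm can be tried on it, although the usual time-series pitfalls, such as testing on data that comes before the training data, still apply.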

When Heuristics Fail

The aforementioned heuristics don’t work every time. However, when they don’t work, it’s usually because of one of the following.

A Vague Project Plan

Heuristics will fail if your project plan is vague or ridiculously grandiose. You hear people saying things like they want to build an AI capable of evaluating business plans with forecasts 10 years into the future and so on or that they want their AI to be able to predict everything.

Clearly, they’ve watched one too many episodes of Star Trek and have forgotten that technology isn’t quite that advanced yet. It’s certainly not a magic bullet capable of fixing any and every problem, and this holds true for machine learning and AI too.

So, it’s much more effective to have a plan with a narrow focus, which also keeps things more manageable and realistic in terms of deliverables.

Developing Skynet to Kill a Fly

Another problem is when you try to develop an AI to get rid of a fly and it decides to use the entire world’s nuclear arsenal to do so, also known as trying to use super advanced machine learning to do something a human can do easily.

For example, maybe you want to create a chatbot that understands conversation and books your appointments. It’s a great idea, but you have to consider the time and resources it would take to develop and if the end result is something people would actually use. You can still go ahead, of course, but you need to be aware of the difficulties involved in developing this type of software.

Another fairly complicated idea is trying to automate data collection. It makes sense to try to automate data collection in different contexts, but in most cases, it’s more effective to spend a little money to crowdsource it.

I’ve seen quite a few startups fail because, although their idea clearly had value, the development costs in terms of time and money were too large for a newly founded company. They were trying to solve problems that humans can solve more easily, even if doing so is slightly more expensive.

Lack of the Right Data

Another common problem is that businesses try to hire data scientists when they don’t have the right data or the data is in very poor shape. Then the data scientists spend an inordinate amount of time trying to whip the data into shape instead of actually working toward a solution.

Other Considerations

There are a few other things you need to consider, namely, data quality, data volume in terms of number of variables, and data size.

Data can go wrong in many ways. You might have a very small sample, for example, or maybe the quality is bad, or maybe you don’t have the right variables.

Someone once asked me to examine the relationship between gender and pay on their job platform. Unfortunately, gender was a variable they didn’t collect, so we only had first and last names to work with. The solution was to attempt to guess gender from someone’s first name, but 30% of the names were unisex, so we had to scrap the data.

Their approach didn’t make sense at all. If they knew they were interested in differences between genders, why not ask people their gender? Not collecting it was downright ridiculous.

You definitely don’t want to take the same approach. So, try to follow the principles we discussed in Chapter 2 and try to think about how you will be using your data before you start collecting it. Failing to plan is planning to fail. That’s why it’s good to have a data strategy from day 0.2

What Problem Do You Really Need to Solve?

Sometimes the problem you think you are solving is not the one you need to solve. Let’s say you want to predict the price of Bitcoin to make a profit. This is a very difficult problem—I’ve done a lot of work on it so I should know.

However, if your goal is to profit from Bitcoin, this is actually a different problem. Maybe you’re only interested in whether the price is climbing or falling. Or maybe you want to use reinforcement learning to create an algorithmic trading bot to do it for you.

The important bit here is that the actual problem you are trying to solve will influence the final success of your model! Good data scientists know how to pick their battles wisely, so as to maximize success. We saw an example earlier where the data acquisition considerations interact with the business plan. The same holds here.

Always make sure to discuss these issues with the data scientist. I’ve seen months being wasted, because the client said they needed to do one thing (e.g., regression), but in reality the problem would have been solved more easily if the data scientist had tried to solve a different problem (e.g., turn the regression problem into a three-class classification problem).
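Turning a regression problem into a three-class classification problem often amounts to bucketing the number you were trying to predict. The sketch below does this for price changes. The 1% threshold and the class names are assumptions picked for illustration.

```python
def to_three_classes(changes, threshold=0.01):
    """Map numeric price changes to 'up', 'down', or 'flat'."""
    labels = []
    for change in changes:
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


# Hypothetical daily price changes, as fractions (0.05 means +5%).
daily_changes = [0.05, -0.03, 0.002, 0.012]
print(to_three_classes(daily_changes))
```

The model now has to get only the direction right, not the exact value, which is frequently all the business decision requires.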

EXAMPLE: THERE ARE MANY WAYS TO REACH THE TOP

A proverb says “there are many ways to the top of the mountain, but the view from the top is always the same.” Nowhere is this truer than in data science. As explained in this chapter, sometimes the problem you think you have is not the problem you need to solve. It’s not uncommon for a business to confuse what they want with what they need.

I once had a client who asked me if I knew how to code a recurrent neural network. I answered yes, but I was curious to see why he was so specific about it. It turned out he wanted to do image recognition. The thing with image recognition and deep learning is that, first, you are most likely to use convolutional neural networks. Second, there are many architectures published by big companies, such as Google, and by research groups, and you might be better off using one of these.

Finally, he wanted the image recognition in order to use it in a recommender system. The truth is that for this particular use case, he could have asked the users to provide tags about images. Hence, image recognition was not an essential part of the problem, which was about suggesting objects to users with similar properties to the ones they liked in the past.

Being too specific when you are not in possession of the full facts can only cause confusion. There are other cases where the same thing can happen. I already gave an example in finance. Another example relates to medical applications.

Let’s say that you are interested in sports injuries and you are interested in predicting when a player will get injured. There are different ways to approach the problem:
  • Regression: How many days will someone play until they get injured?

  • Classification: Break time down into three or four categories (e.g., 1 week, 1 month, etc.) and treat this as a classification problem.

  • Survival modeling: Similar to regression, but different models and assumptions are used in this case.

The performance of each of these models will be different, and they might interact in different ways with your problem. For example, classification has the worst resolution (1 week can be a lengthy interval in this case) but is easier to solve. Survival modeling is better suited than regression in this case and has sounder assumptions behind it, but there are far fewer survival models than regression models.
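One reason survival modeling fits the injury problem well is that it can use players who have not been injured yet, whereas plain regression has no natural target value for them. The sketch below uses the Kaplan-Meier estimator, a standard survival technique that I am choosing for illustration and that the chapter does not name. The player data is invented.

```python
def kaplan_meier(days_played, was_injured):
    """Estimate the probability of staying injury-free past each injury time.

    Players with was_injured=False are still healthy at the end of the
    observation period (censored); they count as 'at risk' until then.
    """
    data = sorted(zip(days_played, was_injured))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        day = data[i][0]
        injuries = 0
        leaving = 0
        # Group all players whose observation ends on the same day.
        while i < len(data) and data[i][0] == day:
            if data[i][1]:
                injuries += 1
            leaving += 1
            i += 1
        if injuries:
            survival *= 1 - injuries / at_risk
            curve.append((day, survival))
        at_risk -= leaving
    return curve


# Hypothetical players: days observed, and whether the period ended with an injury.
days_played = [5, 10, 10, 20]
was_injured = [True, False, True, True]

for day, probability in kaplan_meier(days_played, was_injured):
    print(f"day {day}: {probability:.2f} chance of still being injury-free")
```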

As always, there is no free lunch. Every choice has trade-offs, which is why it is important for a data scientist to be aware of how the results from the models can be used to solve the actual problem.
