© Stylianos Kampakis  2020
S. Kampakis, The Decision Maker's Handbook to Data Science, https://doi.org/10.1007/978-1-4842-5494-3_5

5. Thinking like a Data Scientist (Without Being One)

Stylianos Kampakis
London, UK

In this chapter, we’re going to talk about how to think like a data scientist without being one. It’s probably the most exciting and interesting chapter of the book, but also the most important, because it translates very technical terms and ideas into plain English. It offers a clear view into the technical world of data science without all the jargon, so you can understand what’s happening.

The Data Science Process

So, if I ask the question

What is data science about?

I’ll receive different answers because, no surprise, different people have different opinions. For example, the boss might think that data science is actually a magical money machine that can print money whenever he or she wants it to.

To customers, data science might make you seem like a wizard or a telepath or psychic because you can tell so much about them and solve so many of their problems.

Software engineers think that data science is just importing some stuff from libraries and running a few functions, because of course no one is quite as awesome as they are.

In reality, though, data science is like an orchestra made up of different elements, including infrastructure, software, data sources, and statistics.

From a pragmatic point of view, there are five steps of a data science life cycle:
  1. Collecting data

  2. Organizing data

  3. Analyzing data

  4. Interpreting data

  5. Communicating findings from data

We already covered collecting and organizing data in the previous chapters, so in this chapter we’re going to look at steps 3–5, in enough depth for you to understand how it all works.

Keep in mind, though, that data scientists are often called in only for steps 3 to 5, because the collection and organization have already been done. Now, that doesn’t mean they have always been done properly, but they have been done.

Defining the Data Science Process

The data science process has four steps, namely:
  • Step 1—Defining the problem

  • Step 2—Choosing the right data

  • Step 3—Solving the problem

  • Step 4—Creating value through actionable insights

The process involves two main actors, namely, the domain expert and the data scientist.

The domain expert is the person who owns the problem. It could be the business owner who hired the data scientist, but it could also be the head of a department that has to work with the data scientist. In an academic setting, it could be a researcher.

Essentially, the domain expert is the person who is responsible for solving the problem and who has to report either to themselves (as a business owner or entrepreneur) or to upper management. It’s also the person who knows all the specifics of the domain.

Thus, the first step, namely, defining the problem, lies with the domain expert. Subsequently, choosing the right data is the responsibility of both actors, while the third step falls to the data scientist. Finally, creating value involves both actors working together.

So, let’s take a closer look at all of these.

Step 1: Defining the Problem

The domain expert is in charge of defining the problem because he or she is the one who understands the domain better than anyone else. The issue is that domain experts often don’t know data science, which can create problems.

And that’s exactly what this section is about. It’s about understanding why you need data science, how you can use it, and under what circumstances it becomes relevant.

If you have a data scientist in your company who has been working with you for a long time, they likely have a good knowledge of the domain. This means they can define their own problems.

However, in most cases, it’s the domain expert who lays out the problem. They might want to forecast sales for the month, or they might say that the recommender system needs to be improved, for example.

Once the problem has been defined, it’s time to move on to the next step.

Step 2: Choosing the Right Data

The domain expert needs to understand data collection and management, which we discussed in the previous section. He or she is responsible for making sure the principles of data collection and management we covered there are followed, so that the data is as clean as possible, the standards are documented, and the data is easy to use.

The data scientist, on the other hand, needs to understand the domain and any peculiarities it might have. We touched on this in the data management section too.

Each domain has its own peculiarities. In some cases, the data can be very noisy, as with sensor data. In others, it might be inaccurate, as with data collected through polls and surveys, where people might not be completely honest in their answers. The data scientist needs to know these things.

Step 3: Solving the Problem

The domain expert’s main contribution to this step is ensuring they hire the right person, which we will discuss in the next section of the book. They’re also responsible for ensuring that the company has the right culture, which we will cover in the last chapter of the book.

The actual solving of the problem is the data scientist’s job, and there is little that the domain expert can do other than provide support.

Step 4: Creating Value Through Actionable Insights

This step is the responsibility of both the data scientist and the domain expert. It requires good collaboration between the two parties, because the domain expert needs to know how to work with the data scientist and to understand what the results of the work mean.

By now, the data scientist also needs to understand some of the issues of the business and of the particular field.

At this point, it really pays off if the person who is the domain expert, such as the business owner, the manager, or the decision maker, has a good understanding of data science.

It also helps build a data science culture, which we’ll cover in more depth in the final section. The right culture can make it easier for the data scientist to learn the peculiarities of the field and the issues of the business. This leads to a data scientist who can take a more creative approach to developing solutions that help the business.

Solving a Problem Using the Data Science Process

We’re going to look at an example. It’s a very simple issue, but it is the best way to show how a problem can be solved using the data science process.

Step 1: Defining the Problem

So, the first step is to define the problem. Take something you find interesting at work. You could even decide to focus on something that bothers you.

I’m going to choose a trivial example, but one that most people can relate to, namely, meetings that consistently start late.

The first thing to do is to frame the problem. A great trick is to turn the statement of the problem into a question and write it down. So, in our case, it would be

Meetings always seem to start late. Is that really true?

You want to take this approach because while most data scientists like concrete terms, most non-data scientists tend to think in terms that are vaguer. However, models, be they machine learning, statistical, or any other type, have a narrow focus and can answer one, two, or three specific questions. So, when you write down the problem as a question, it helps the data scientist develop a better model to reach the solution.

Step 2: Choosing the Right Data

This step will have two parts, namely, Part A where you think about the data and Part B where you collect the data.

Part A: Think About the Data

At this point, you need to consider what data you will require to answer the question you framed in Step 1. And then you have to decide how you will collect the relevant data.

You start by writing down all the relevant definitions and thinking about the protocol you will use to collect the data. This is when the problems start, and you realize that even for simple problems there might be a wide range of different issues.

So, in our example, we want to study meetings that start late. To do this, though, we need to define when the meeting actually starts. It might seem simple at first, but if you really think about it, the definition isn’t all that clear.

So, does the meeting start when someone says it’s time to get started? Or at the time it was scheduled for in the calendar? Or when the actual work of the meeting starts, after all the chitchat and enquiring about the boss’s kids, dog, cat, hamster, and crocodile? Or is the small talk that takes place in the first few minutes part of the meeting too? Some might think that bit is a waste of time, while others might find it important for relationship building.

It might seem pretty trivial, but all these things are important because things can go wrong. The main risk is that you use the wrong definitions and they prevent you from solving the problem.

For example, meetings might consistently start late because of small talk. However, when you wrote down the definition, you didn’t factor in this small talk and, therefore, recorded a different time for the start of the meeting. So, you can’t solve the problem because the cause is completely missing from the data. As a result, you have to go back and collect data again, thereby wasting time, energy, and resources.
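One way to guard against picking the wrong definition is to record every candidate start time separately and decide later which one counts. The following is a minimal Python sketch of that idea, not a prescribed method; the class name `MeetingRecord`, its field names, and the sample times are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MeetingRecord:
    """One observed meeting, with each candidate 'start' stored separately."""
    scheduled_start: datetime   # time shown in the calendar
    called_to_order: datetime   # when someone says it's time to get started
    work_start: datetime        # when the actual work begins, after small talk

    def delay_minutes(self, definition: str = "work_start") -> float:
        """Minutes between the scheduled time and the chosen definition of 'start'."""
        actual = getattr(self, definition)
        return (actual - self.scheduled_start).total_seconds() / 60


# Invented example: scheduled for 10:00, called to order at 10:04, work begins at 10:15.
meeting = MeetingRecord(
    scheduled_start=datetime(2020, 1, 6, 10, 0),
    called_to_order=datetime(2020, 1, 6, 10, 4),
    work_start=datetime(2020, 1, 6, 10, 15),
)

# The small talk is the gap between the two competing definitions.
small_talk_minutes = (
    meeting.delay_minutes("work_start") - meeting.delay_minutes("called_to_order")
)
print(small_talk_minutes)
```

Because all three times are kept, the small talk stays visible in the data whichever definition you end up choosing.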

Part B: Collect the Data

Now that you’ve defined your data, it’s time to move on to the collection part of the project. The first thing to consider is that you need data you can trust, which can be more difficult than it sounds.

Data can suffer from a wide range of issues such as missing values, erroneous entries, definition issues, and much more. And these are pretty frequent problems. We covered this in the previous section as well as how to prevent or fix these problems.
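To make two of these issues concrete, here is a rough Python sketch that flags missing values and implausible entries in a handful of meeting records. The record layout, the function name `find_issues`, and the two-hour plausibility threshold are assumptions made up for this example; real checks would depend on your own data.

```python
from datetime import datetime, timedelta

# Assumed threshold: a start more than 2 hours away from schedule is treated as a likely typo.
MAX_PLAUSIBLE_DELAY = timedelta(hours=2)


def find_issues(row):
    """Return a list of human-readable problems found in one record."""
    issues = []
    scheduled, actual = row.get("scheduled"), row.get("actual")
    if scheduled is None or actual is None:
        issues.append("missing value")
        return issues
    delay = actual - scheduled
    if delay < -MAX_PLAUSIBLE_DELAY or delay > MAX_PLAUSIBLE_DELAY:
        issues.append("implausible start time (possible typo)")
    return issues


# Invented records: one clean, one with a missing value, one with a suspicious entry.
rows = [
    {"scheduled": datetime(2020, 1, 6, 10, 0), "actual": datetime(2020, 1, 6, 10, 12)},
    {"scheduled": datetime(2020, 1, 6, 14, 0), "actual": None},
    {"scheduled": datetime(2020, 1, 7, 9, 0), "actual": datetime(2020, 1, 7, 19, 5)},
]

report = {index: find_issues(row) for index, row in enumerate(rows)}
print(report)
```

A check like this does not fix anything by itself, but it tells you which records to look at before anyone builds a model on them.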

Due to these problems, it’s a good idea to be flexible and adaptable. For example, it could be a good idea to modify your definition and the data collection protocol as you go along if things aren’t progressing as they should.

So, let’s come back to our meeting example. Let’s say your initial definition doesn’t include the “small talk” portion of the meeting. However, you realize it’s extremely relevant since it delays the start of the actual meeting by approximately 30 minutes. Well, you can’t solve the problem if the data doesn’t show it exists, so you go back and change both the definition and how you collect the data.

What is vital, though, is that there is complete transparency and that everything is documented: what you changed, how you changed it, when you changed it, and so on. This is important because you could otherwise end up with a single variable name that means completely different things in different parts of the data.

For example, let’s say you don’t document your decision to include small talk as part of the meeting. The time you record as the start time changes, but no one knows about it because you didn’t document the change.

Then, you comb through the data with the data scientist, who can’t figure out why nothing works properly.

The reason is, of course, that the variable referring to the start time has two different definitions depending on when it was recorded during the project, namely, before or after you made the change.

Effectively, you’re dealing with two variables and not one, which can seriously mess up any statistical model.

This is just one example of how things can go wrong when you don’t document everything you are doing. Many more things can occur, leading to disastrous results.
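The start-time example can be sketched in a few lines of Python. Assume, purely for illustration, that each observation carries a label saying which definition was in force when it was recorded; the labels `v1` and `v2`, the change log, and the delay values are all invented.

```python
from statistics import mean

# Hypothetical change log: which definition of "start" each version refers to.
DEFINITION_LOG = {
    "v1": "start = moment someone calls the meeting to order (small talk excluded)",
    "v2": "start = moment the actual work begins (small talk included)",
}

# Invented observations: (definition version, minutes late).
observations = [
    ("v1", 3), ("v1", 5), ("v1", 4),
    ("v2", 28), ("v2", 33), ("v2", 32),
]


def mean_delay_by_version(rows):
    """Average delay per definition version, so the two definitions are never mixed."""
    grouped = {}
    for version, minutes_late in rows:
        grouped.setdefault(version, []).append(minutes_late)
    return {version: mean(values) for version, values in grouped.items()}


by_version = mean_delay_by_version(observations)

# Pooling everything treats two different variables as one and describes neither.
pooled = mean(minutes_late for _, minutes_late in observations)

print(by_version, pooled)
```

In this toy data the pooled average sits far from both groups, which is what an undocumented change looks like from the data scientist’s side. With the version label recorded, the two definitions can be analyzed separately or reconciled.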

As you aren’t a data science expert, there’s no way for you to foresee what problems might arise with the data, which is why you want to be fully transparent regarding what’s going on and to document every tiny little thing that you do.

Remember, you always want to make the data scientist’s life as easy as possible. It will take them less time to get up to speed, and it will enable them to develop a more effective solution, helping you save time and money while greatly improving efficiency and productivity.

Step 3: Solving the Problem

In this phase, there’s not a lot you can do as the domain expert since it’s mainly the data scientist’s job. However, let’s take a quick look at what this step usually entails anyway just so you can get an idea of what’s going on.

Usually, most data scientists like to start by performing exploratory data analysis. They use summary metrics, graphs, and so on to gain a better understanding of the problem. In some cases, they might even present some early findings to the stakeholders.

So, in our example, a data scientist might say something like:

“Over a two-week period, 10% of the meetings started on time. On average, meetings started 12 minutes late.”

Or, he or she might discover that meetings usually started late when they covered topics X and Y, or when certain people attended, like the manager who loves to tell everyone in minute detail what he did over the weekend and whom no one can seem to stop.
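Early findings like these come from very simple calculations. The sketch below uses invented numbers, chosen only so that the summary resembles the kind of statement quoted earlier; the topic labels and delays are not real data.

```python
from statistics import mean

# Invented records: (topic, minutes late); 0 means the meeting started on time.
meetings = [
    ("X", 20), ("X", 18), ("X", 20), ("Y", 15), ("Y", 12),
    ("Z", 0), ("Z", 5), ("Z", 8), ("Z", 10), ("Y", 12),
]

delays = [minutes_late for _, minutes_late in meetings]
on_time_share = sum(1 for minutes_late in delays if minutes_late == 0) / len(delays)
average_delay = mean(delays)

# Break the delays down by topic to see whether some topics start later than others.
delays_by_topic = {}
for topic, minutes_late in meetings:
    delays_by_topic.setdefault(topic, []).append(minutes_late)
mean_delay_by_topic = {topic: mean(values) for topic, values in delays_by_topic.items()}

print(f"{on_time_share:.0%} of the meetings started on time.")
print(f"On average, meetings started {average_delay:.0f} minutes late.")
print(mean_delay_by_topic)
```

Nothing here is a model yet. It is just counting and averaging, which is usually where exploratory analysis begins.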

The data scientist will then build a model to find a solution, but models are mainly used to answer specific questions. You can have a primary question, which in our example could be

Are people usually late for meetings?

But you can also have a number of secondary ones. In our example, they could be along the lines of
  • Does the subject being covered affect how long a meeting runs?

  • Do the people in attendance have an effect on how late the meeting is?

  • Are late meetings counterbalanced by meetings that finish early?

  • What is the overall meeting time? Can you describe this in terms of a distribution?

Of course, all these questions are related, which means the same data can often answer more than one of them. Keep in mind, though, that when you frame your problem as a question, you are helping the data scientist. He or she will be better able to understand how to convert a vague problem statement into a model that can be used to solve the problem.
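Some of the secondary questions can be explored with the same kind of simple arithmetic. The Python sketch below is one possible reading of three of them, under stated assumptions: every meeting is scheduled for 60 minutes, a negative overrun means the meeting finished early, and all the records are invented.

```python
from statistics import mean, median

SCHEDULED_MINUTES = 60  # assumed length of every meeting in this toy data set

# Invented records: (talkative_manager_present, start_delay_min, end_overrun_min).
records = [
    (True, 20, 15), (True, 15, 10), (True, 18, 20), (True, 12, 0),
    (False, 5, 0), (False, 0, -5), (False, 8, -10), (False, 10, 5),
]


def mean_delay(rows, present):
    """Average start delay for meetings with or without the given attendee."""
    return mean(delay for flag, delay, _ in rows if flag == present)


# Do the people in attendance have an effect on how late the meeting is?
delay_with = mean_delay(records, True)
delay_without = mean_delay(records, False)

# Are overruns balanced by early finishes? A value near 0 would suggest so.
mean_overrun = mean(overrun for _, _, overrun in records)

# Actual time spent in each meeting, as a first look at the distribution.
durations = sorted(SCHEDULED_MINUTES - delay + overrun for _, delay, overrun in records)
typical_duration = median(durations)

print(delay_with, delay_without, mean_overrun, typical_duration)
```

A difference between the two averages is only a hint, not proof of cause. Deciding whether it is meaningful with so few meetings is part of what the data scientist’s model is for.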

Step 4: Creating Value Through Actionable Insights

By this point, the data scientist has built one or more models to answer the aforementioned questions, so we can move on to the final step, namely, to extract actionable insights from the models.

Essentially, the models are useless if you don’t act on them. The first step to taking action is to understand the impact of the results. A good way to do that is to translate the findings into monetary terms.

For example, if the time period studied is typical, then each employee loses an hour per day to late meetings, which costs the company $X per year.
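The arithmetic behind a figure like this is straightforward. Here is a hedged sketch: the function name and every input except the one hour per day are invented, so the result is an illustration of the calculation and not an estimate for any real company.

```python
def annual_cost_of_late_meetings(hours_lost_per_day, employees, hourly_cost, working_days):
    """Yearly cost = hours lost per employee per day x headcount x hourly cost x working days."""
    return hours_lost_per_day * employees * hourly_cost * working_days


cost = annual_cost_of_late_meetings(
    hours_lost_per_day=1,   # from the example: one hour per employee per day
    employees=20,           # invented headcount
    hourly_cost=40,         # invented cost per employee hour, in dollars
    working_days=230,       # invented number of working days per year
)
print(f"Estimated annual cost: ${cost:,}")
```

Plugging in your own headcount and hourly cost turns "meetings start late" into a number that upper management can weigh against the cost of fixing the problem.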

Another approach to understanding the impact of the results is by identifying specific individuals or culprits behind the problem.

So, in our example, we could say that person X is responsible for late meetings, so what can we do to solve the issue? Could we help them manage their time better? Would that help fix the problem?

Based on the understanding you gain from the results, you can then take action.

It’s also a good idea to go beyond the results, though. This cycle of modeling, translating the results, and then taking action can also be a very good learning experience for you. When you go through this process, other questions pop up. This is the point when some of the secondary questions mentioned in Step 3 arise.

It makes sense to build models for these secondary questions too, since the data is already there and the data scientist has gained experience with it. It pays to make the most out of the data.
