Chapter 9
IN THIS CHAPTER
Putting data science problems and data into perspective
Defining and using feature creation to your benefit
Working with arrays
Previous chapters have all been preparatory in nature. You have discovered how to perform essential data science tasks using Python. In addition, you spent time working with the various tools that Python provides to make data science tasks easier. All this information is essential, but it doesn’t help you see the big picture — where all the pieces go. This chapter shows you how to employ the techniques you discovered in previous chapters to solve real data science problems.
The chapter begins by looking at the aspects you normally have to consider when trying to solve a data science problem. You can’t just jump in and start performing an analysis; you must understand the problem first, as well as consider the resources (in the form of data, algorithms, and computational resources) available to solve it. Putting the problem into a context, a setting of a sort, helps you understand the problem and define how the data relates to that problem. The context is essential because, as with language, context alters the meaning of both the problem and its associated data. For example, when you say, “I have a red rose” to your significant other, the sentence has one connotation. If you say the same sentence to a fellow gardener, the connotation is different. The red rose is a sort of data, and the person you’re speaking to is the context. There is no meaning in saying, “I have a red rose” unless you know the context in which the statement is made. Likewise, data has no meaning; it doesn’t answer any question until you know the context in which the data is used. Saying “I have data” merely raises the question, “What does the data mean?”
In the end, you’ll need one or more datasets. Two-dimensional datatables (datasets) consist of cases (the rows) and features (the columns). You can also refer to features as variables when using statistical terminology. The features you decide to use for any given dataset determine the kinds of analysis you can perform, the ways in which you can manipulate the data, and ultimately the sorts of results you obtain. Determining what sorts of features you can create from source data, and how you must transform the data to ensure that it works for the analysis you want to perform, is an essential part of developing a data science solution.
After you get a picture of what your problem is, the resources you have to solve it, and the inputs you need to work with to solve it, you’re ready to perform some actual work. The last section of this chapter shows you how to perform simple tasks efficiently. You can usually perform tasks using more than one methodology, but when working with big data, the fastest routes are better. By working with arrays and matrices to perform specific tasks, you’ll notice that certain operations can take a long time unless you leverage some computational tricks. Using computational tricks is one of the most basic forms of manipulation you perform, but knowing about them from the beginning is essential. Applying these techniques paves the road to later chapters when you start to look at the magic that data science can truly accomplish in helping you see more in the data you have than is nominally apparent.
Putting your problem in the correct context is an essential part of developing a data science solution for any given problem and associated data. Data science is definitively an applied science, and abstract, manual approaches may not work all that well in your specific situation. Running a Hadoop cluster or building a deep neural network may sound cool in front of colleagues and make you feel that you’re doing great data science projects, but they may not provide what you need to solve your problem. Putting the problem in the correct context isn’t just a matter of deliberating whether to use a certain algorithm or whether you must transform the data in a certain way — it’s the art of critically examining the problem and the available resources and creating an environment in which to solve the problem and obtain a desired solution.
When working through a data science problem, you need to start by considering your goal and the resources you have available for achieving that goal. Those resources include the data itself and computational resources such as available memory, CPUs, and disk space. In the real world, no one hands you ready-made data and tells you to perform a particular analysis on it. Most of the time, you face completely new problems, and you have to build your solution from scratch. During your first evaluation of a data science problem, consider whether you can quantify the problem and measure your results (Lord Kelvin’s famous remarks on measurement apply here; see https://zapatopi.net/kelvin/quotes/). If you can measure performance, you can determine the impact of your work and even make a monetary estimation. Stakeholders will be delighted to know that you’ve figured out what to do and what benefits your data science project will bring about.

Data science is a complex system of knowledge at the intersection of computer science, math, statistics, and business. Very few people can know everything about it, and, if someone has already faced the same problems or dilemmas that you face, reinventing the wheel makes little sense. Now that you have contextualized your project, you know what you’re looking for, and you can search for it in different ways:
The documentation for the core scientific Python packages — NumPy (https://docs.scipy.org/doc/numpy/user/), SciPy (https://docs.scipy.org/doc/), pandas (https://pandas.pydata.org/pandas-docs/version/0.23.2/), and especially Scikit-learn (https://scikit-learn.org/stable/user_guide.html) — is detailed, both in-line and online, with plenty of data science–related examples.

Q&A communities such as Quora (https://www.quora.com/), Stack Overflow (https://stackoverflow.com/), and Cross Validated (https://stats.stackexchange.com/) can provide you with plenty of answers to similar problems.

Academic search engines, such as Google Scholar at https://scholar.google.it/ or Microsoft Academic Search at https://academic.microsoft.com/, can point you to a series of scientific papers that tell you about preparing the data or detail the kinds of algorithms that work better for a particular problem.

At some point, you have everything you think you need to solve the problem. Of course, it’s a mistake to assume at this point that the solutions you create can actually solve the problem. You have a hypothesis rather than a solution, because you have to demonstrate the efficacy of the potential solution in a scientific way. In order to form and test a hypothesis, you must train a model using a training dataset and then test it using an entirely different dataset. Later chapters in the book spend a great deal of time helping you through the process of training and testing the algorithms used to perform analysis, so don’t worry too much if you don’t understand this aspect of the process right now.
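The train-then-test idea can be sketched with nothing more than NumPy: shuffle the case indices and hold out a portion of the data for testing. The dataset and the 70/30 split below are arbitrary illustrative choices, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 cases, 2 features (made-up data)

# Shuffle the case indices, then hold out 30% of the cases for testing
idx = rng.permutation(len(X))
cut = int(len(X) * 0.7)
X_train, X_test = X[idx[:cut]], X[idx[cut:]]

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

You would train the model only on `X_train` and measure its performance only on `X_test`, so that the evaluation reflects data the model has never seen.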
After you have some idea of the problem and its solution, you know the inputs required to make the algorithm work. Unfortunately, your data probably appears in multiple forms, you get it from multiple sources, and some data is missing entirely. Moreover, the developers of the features that existing data sources provide may have devised them for purposes different from yours (such as accounting or marketing), and you have to transform them so that you can use your algorithm at its fullest power. To make the algorithm work, you must prepare the data. This means checking for missing data, creating new features as needed, and possibly manipulating the dataset to get it into a form that your algorithm can actually use to make a prediction.
Features have to do with the columns in your dataset. Of course, you need to determine what those columns should contain. They might not end up looking precisely like the data in the original data source. The original data source may present the data in a form that leads to inaccurate analysis or even prevents you from getting a desired outcome because it’s not completely suited to your algorithm or your objectives. For example, the data may contain too much information redundancy inside multiple variables, which is a problem called multivariate correlation. The task of making the columns work in the best manner for data analysis purposes is feature creation (also called feature engineering). The following sections help you understand feature creation and why it’s important. (Future chapters provide all sorts of examples of how you actually employ feature creation to perform analysis.)
Feature creation may seem a bit like magic or weird science to some people, but it really does have a firm basis in math. The task is to take existing data and transform it into something that you can work with to perform an analysis. For example, numeric data could appear as strings in the original data source. To perform an analysis, you must convert the string data to numeric values in many cases. The immediate goal of feature creation is to achieve better performance from the algorithms used to accomplish the analysis than you can when using the original data.
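As a minimal sketch of such a conversion with NumPy (the values here are made up for illustration), string data becomes numeric with a single cast:

```python
import numpy as np

# Numeric data that arrived as strings from the original source
raw = np.array(['2.5', '7.0', '3.25', '4.0'])

# Convert the strings to numeric values before performing any analysis
values = raw.astype(np.float64)

print(values.mean())  # 4.1875
```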
In many cases, the transformation is less than straightforward. You may have to combine values in some way or perform math operations on them. The information you can access may appear in all sorts of forms, and the transformation process lets you work with the data in new ways so that you can see patterns in it. For example, consider this popular Kaggle competition: https://www.kaggle.com/c/march-machine-learning-mania-2015
. The goal is to use all sorts of statistics to determine who will win the NCAA Basketball Tournament. Imagine trying to derive disparate measures from public information on a match, such as the geographic location the team will travel to or the unavailability of key players, and you can begin to grasp the need to create features in a dataset.
Data often comes in a form that doesn’t work at all for an algorithm. Consider a simple real-life situation in which you need to determine whether one person can lift a board at a lumber yard. You receive two datatables. The first contains the height, width, thickness, and wood types of boards. The second contains a list of wood types and the amount they weigh per board foot (a piece of wood 12" x 12" x 1"). Not every wood type comes in every size, and some shipments come unmarked, so you don’t actually know what type of wood you’re working with. The goal is to create a prediction so that the company knows how many people to send to work with the shipments.
In this case, you create a two-dimensional dataset by combining variables. The resulting dataset contains only two features. The first feature contains just the length of the boards. It’s reasonable to expect a single person to carry a board that is up to ten feet long, but you want two people carrying any board longer than that. The second feature is the weight of the board. A board that is 10 feet long, 12 inches wide, and 2 inches thick contains 20 board feet. If the board is made of ponderosa pine (with a board foot rating, BFR, of 2.67), the overall weight of the board is 53.4 pounds — one person could probably lift it. However, when the board is made of hickory (with a BFR of 4.25), the overall weight is now 85 pounds. Unless you have the Hulk working for you, you really do need two people lifting that board, even though the board is short enough for one person to lift.
Getting the first feature for your dataset is easy. All you need is the lengths of each of the boards that you stock. However, the second feature requires that you combine variables from both tables:
Length (feet) * Width (feet) * Thickness (inches) * BFR
The resulting dataset will contain the weight for each length of each kind of wood you stock. Having this information means that you can create a model that predicts whether a particular task will require one, two, or even three people to perform.
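As a quick illustrative sketch, the weight feature follows directly from the formula above. The helper function and the BFR lookup table are invented for this example, using the two BFR values quoted in the chapter:

```python
# Board-foot ratings from the second datatable (values from the chapter)
bfr = {'ponderosa pine': 2.67, 'hickory': 4.25}

def board_weight(length_ft, width_ft, thickness_in, wood):
    """Weight = Length (feet) * Width (feet) * Thickness (inches) * BFR."""
    return length_ft * width_ft * thickness_in * bfr[wood]

# The 10-foot, 12-inch-wide, 2-inch-thick boards from the example
print(round(board_weight(10, 1, 2, 'ponderosa pine'), 1))  # 53.4
print(round(board_weight(10, 1, 2, 'hickory'), 1))         # 85.0
```

Applying the function across every stocked length and wood type yields the second feature of the combined dataset.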
In order to perform some types of analysis, you need to break numeric values into classes. For example, you might have a dataset that includes entries for people from ages 0 to 80. To derive statistics that work in this case (such as running the Naïve Bayes algorithm), you might want to view the variable as a series of levels in ten-year increments. The process of breaking the dataset up into these ten-year increments is binning. Each bin is a numeric category that you can use.
Binning may improve the accuracy of predictive models by reducing noise or by helping model nonlinearity. In addition, it allows easy identification of outliers (values outside the expected range) and invalid or missing values of numerical variables.
Binning works exclusively with single numeric features. Discretization is a more complex process, in which you place combinations of values from different features in a bucket — limiting the number of states in any given bucket. In contrast to binning, discretization works with both numeric and string values. It’s a more generalized method of creating categories. For example, you can obtain a discretization as a byproduct of cluster analysis.
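Binning into ten-year increments can be sketched with NumPy's np.digitize(), which assigns each value the index of the bin it falls into (the ages below are invented for illustration):

```python
import numpy as np

ages = np.array([3, 18, 25, 42, 67, 80])

# Bin edges at every ten years: values below 10 fall in bin 0,
# 10-19 in bin 1, 20-29 in bin 2, and so on
bins = np.arange(10, 81, 10)
labels = np.digitize(ages, bins)

print(labels)  # [0 1 2 4 6 8]
```

Each label is now a numeric category you can feed to an algorithm such as Naïve Bayes in place of the raw age.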
Indicator variables are features that can take on a value of 0 or 1. Another name for indicator variables is dummy variables. No matter what you call them, these variables serve an important purpose in making data easier to work with. For example, if you want to create a dataset in which individuals under 25 are treated one way and individuals 25 and over are treated another, you could replace the age feature with an indicator variable that contains a 0 when the individual is under 25 or a 1 when the individual is 25 and older.
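The age example translates into a one-line NumPy comparison (the ages are invented for illustration):

```python
import numpy as np

ages = np.array([18, 22, 25, 31, 47])

# 0 when the individual is under 25, 1 when the individual is 25 or older
over_25 = (ages >= 25).astype(int)

print(over_25)  # [0 0 1 1 1]
```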
A distribution is an arrangement of the values of a variable that shows the frequency at which various values occur. After you know how the values are distributed, you can begin to understand the data better. All sorts of distributions exist (see a gallery of distributions at https://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
), and most algorithms can easily deal with them. However, you must match the algorithm to the distribution.
When working with distributions, you might find that the distribution of values is skewed in some way and that, because of the skewed values, any algorithm applied to the set of values produces output that simply won’t match your expectations. Transforming a distribution means to apply some sort of function to the values in order to achieve specific objectives, such as fixing the data skew, so that the output of your algorithm is closer to what you expected. In addition, transformation helps make the distribution friendlier, such as when you transform a dataset to appear as a normal distribution. Transformations that you should always try on your numeric features are
Logarithm np.log(x) and exponential np.exp(x)
Inverse 1/x, square root np.sqrt(x), and cube root x**(1.0/3.0)
Polynomial expansions x**2, x**3, and so on

A basic form of data manipulation is to place the data in an array or matrix and then use standard math-based techniques to modify its form. Using this approach puts the data in a convenient form for performing other operations at the level of every single observation, such as iterations, because those operations can leverage your computer architecture and some highly optimized numerical linear algebra routines present in CPUs. These routines are callable from every operating system. The larger the data and the computations, the more time you can save. In addition, using these techniques spares you from writing long and complex Python code. The following sections describe how to work with arrays for data science purposes.
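The distribution transformations listed earlier each amount to a one-line NumPy expression; here is an illustrative sketch on made-up, right-skewed values:

```python
import numpy as np

# A right-skewed set of values (invented for illustration)
x = np.array([1.0, 2.0, 4.0, 8.0, 64.0])

print(np.log(x))         # logarithm compresses large values
print(np.sqrt(x))        # square root is a milder compression
print(1 / x)             # inverse reverses the ordering of magnitudes
print(x ** (1.0 / 3.0))  # cube root, another mild compression
```

After each transformation, you would inspect the resulting distribution (for example, with a histogram) to see which one brings it closest to the shape your algorithm expects.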
Your computer provides you with powerful routines for calculations, and you can use them when your data is in the right format. NumPy’s ndarray
is a multidimensional data storage structure that you can use as a two-dimensional datatable. In fact, you can use it as a cube, or even a hypercube, when there are more than three dimensions.
Using ndarray
makes computations easy and fast. The following example creates a dataset of three observations with seven features for each observation. In this case, the example obtains the maximum value for each observation and subtracts the minimum value from it to obtain the range of values for each observation.
import numpy as np
dataset = np.array([[2, 4, 6, 8, 3, 2, 5],
[7, 5, 3, 1, 6, 8, 0],
[1, 3, 2, 1, 0, 0, 8]])
print(np.max(dataset, axis=1) - np.min(dataset, axis=1))
The print statement obtains the maximum value from each observation using np.max()
and then subtracts the minimum value, obtained using np.min(), from it
. The maximum value in each observation is [8 8 8]
. The minimum value for each observation is [2 0 0]
. As a result, you get the following output:
[6 8 8]
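As a usage note, NumPy also offers np.ptp() (“peak to peak”), which computes the same max-minus-min range in a single call:

```python
import numpy as np

dataset = np.array([[2, 4, 6, 8, 3, 2, 5],
                    [7, 5, 3, 1, 6, 8, 0],
                    [1, 3, 2, 1, 0, 0, 8]])

# np.ptp computes np.max - np.min along the given axis in one step
print(np.ptp(dataset, axis=1))  # [6 8 8]
```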
Most operations and functions from NumPy that you apply to arrays leverage vectorization, so they’re fast and efficient — much more efficient than any other solution or handmade code. Even the simplest operations such as additions or divisions can take advantage of vectorization.
For instance, many times, the form of the data in your dataset won’t quite match the form you need. A list of numbers could represent percentages as whole numbers when you really need them as fractional values. In this case, you can usually perform some type of simple math to solve the problem, as shown here:
import numpy as np
a = np.array([15.0, 20.0, 22.0, 75.0, 40.0, 35.0])
a = a*.01
print(a)
The example creates an array, fills it with whole number percentages, and then uses 0.01 as a multiplier to create fractional percentages. You can then multiply these fractional values against other numbers to determine how the percentage affects that number. The output from this example is
[0.15 0.2 0.22 0.75 0.4 0.35]
The most efficient vectorization operations are matrix manipulations in which you add and multiply multiple values against other multiple values. NumPy makes performing multiplication of a vector by a matrix easy, which is handy if you have to estimate a value for each observation as a weighted summation of the features. Here’s an example of this technique:
import numpy as np
a = np.array([2, 4, 6, 8])
b = np.array([[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7]])
c = np.dot(a, b)
print(c)
Notice that the array
formatted as a vector must appear before the array
formatted as a matrix in the multiplication; reversing the order multiplies the matrix’s rows, rather than its columns, by the vector, producing a different result (or an error when the shapes don’t align). The example outputs these values:
[60 80 100 120]
To obtain the values shown, you take the dot product of the vector with each column of the matrix in turn. For example, the first value in the output comes from the first column: 2 * 1 + 4 * 2 + 6 * 3 + 8 * 4, which equals 60.
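You can verify this interpretation directly; the sketch below also shows the @ operator, which performs the same multiplication as np.dot() here:

```python
import numpy as np

a = np.array([2, 4, 6, 8])
b = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6],
              [4, 5, 6, 7]])

# Each output element is the vector dotted with one column of the matrix
print(np.sum(a * b[:, 0]))  # 60, the first output value

# The @ operator gives the same result as np.dot(a, b)
print(a @ b)
```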
You can also multiply one matrix against another. In this case, the output is the result of multiplying rows in the first matrix against columns in the second matrix. Here is an example of how you multiply one NumPy matrix against another:
import numpy as np
a = np.array([[2, 4, 6, 8],
[1, 3, 5, 7]])
b = np.array([[1, 2],
[2, 3],
[3, 4],
[4, 5]])
c = np.dot(a, b)
print(c)
In this case, you end up with a 2 x 2 matrix as output. Here are the values you should see when you run the application:
[[60 80]
[50 66]]
Each row in the first matrix is multiplied by each column of the second matrix. For example, to get the value 50 shown in row 2, column 1 of the output, you match up the values in row 2 of matrix a with column 1 of matrix b, like this: 1 * 1 + 3 * 2 + 5 * 3 + 7 * 4.