Chapter 7
IN THIS CHAPTER
Working with NumPy and pandas
Working with symbolic variables
Considering the effect of dates
Fixing missing data
Slicing, combining, and modifying data elements
The characteristics, content, type, and other elements that define your data in its entirety is the data shape. The shape of your data determines the kinds of tasks you can perform with it. In order to make your data amenable to certain types of analysis, you must shape it into a different form. Think of the data as clay and you as the potter, because that’s the sort of relationship that exists. However, instead of using your hands to shape the data, you rely on functions and algorithms to perform the task. This chapter helps you understand the tools you have available to shape data and the ramifications of shaping it.
Also in this chapter, you consider the problems associated with shaping. For example, you need to know what to do when data is missing from a dataset. It’s important to shape the data correctly or you end up with an analysis that simply doesn’t make sense. Likewise, some data types, such as dates, can present problems. Again, you need to tread carefully to ensure that you get the desired result so that the dataset becomes more useful and amenable to analysis of various sorts.
There is no question that you need NumPy at all times because the pandas library is actually built on top of NumPy. Even so, you still need to choose between NumPy and pandas when performing specific tasks. You need the low-level functionality of NumPy to perform some tasks, but pandas makes things so much easier that you want to use it as often as possible. The following sections describe when to use each library in more detail.
Developers built pandas on top of NumPy. As a result, every task you perform using pandas also goes through NumPy. To obtain the benefits of pandas, you pay a performance penalty that some testers say is 100 times slower than NumPy for a similar task (see http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Given that computer hardware can make up for a lot of performance differences today, the speed issue may not be a concern at times, but when speed is essential, NumPy is always the better choice.
You use pandas to make writing code easier and faster. Because pandas does a lot of the work for you, you could make a case for saying that using pandas also reduces the potential for coding errors. The essential consideration, though, is that the pandas library provides rich time-series functionality, data alignment, NA-friendly statistics, groupby, merge, and join methods. Normally, you need to code these features when using NumPy, which means you keep reinventing the wheel.
As the book progresses, you discover just how useful pandas can be in performing such tasks as binning (a data preprocessing technique designed to reduce the effect of observational errors) and working with a dataframe (a two-dimensional labeled data structure with columns that can potentially contain different data types) so that you can calculate statistics on it. For example, in Chapter 9, you discover how to perform both discretization and binning. Chapter 13 shows actual binning examples, such as obtaining a frequency for each categorical variable of a dataset. In fact, many of the examples in Chapter 13 don't work without binning. In other words, don't worry too much right now about knowing precisely what binning is or why you need to use it — examples later in the book discuss the topic in detail. All you really need to know is that pandas does make your work considerably easier.
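As a small taste of what's to come, the pandas cut() function performs simple binning. The following sketch, which uses made-up ages and bin edges chosen purely for illustration, groups continuous values into labeled categories:

```python
import pandas as pd

# Hypothetical ages to bin into three illustrative groups
ages = pd.Series([12, 25, 33, 47, 51, 68])

# cut() assigns each value to the bin whose edges contain it
bins = pd.cut(ages, bins=[0, 18, 40, 100],
              labels=['minor', 'adult', 'senior'])

# Count how many observations landed in each bin
print(bins.value_counts().sort_index())
```

The counts per bin give you an immediate frequency summary of a continuous variable, which is exactly the sort of task that would require manual coding in plain NumPy.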
When it comes to data, no one really knows what a large database contains. Yes, everyone has seen bits and pieces of it, but when you consider the size of some databases, viewing it all would be physically impossible. Because you don’t know what’s in there, you can’t be sure that your analysis will actually work as desired and provide valid results. In short, you must validate your data before you use it to ensure that the data is at least close to what you expect it to be. This means performing tasks such as removing duplicate records before you use the data for any sort of analysis (duplicates would unfairly weight the results).
Figuring out what your data contains is important because checking data by hand is sometimes simply impossible due to the number of observations and variables. In addition, hand verifying the content is time consuming, error prone, and, most important, really boring. Finding duplicates is important because you end up using some observations more than once, which unfairly weights the results of any analysis.
As a data scientist, you want your data to enthrall you, so it’s time to get it to talk to you — not figuratively, of course, but through the wonders of pandas, as shown in the following example:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

search = pd.DataFrame.duplicated(df)

print(df)
print()
print(search[search == True])
This example shows how to find duplicate rows. It relies on a modified version of the XMLData.xml file, XMLData2.xml, which contains a simple repeated row in it. A real data file contains thousands (or more) of records and possibly hundreds of repeats, but this simple example does the job. The example begins by reading the data file into memory using the same technique you explored in Chapter 6. It then places the data into a DataFrame.
At this point, your data is corrupted because it contains a duplicate row. However, you can get rid of the duplicated row by searching for it. The first task is to create a search object containing a list of duplicated rows by calling pd.DataFrame.duplicated(). The duplicated rows contain a True next to their row number.

Of course, now you have an unordered list of rows that are and aren't duplicated. The easiest way to determine which rows are duplicated is to create an index in which you use search == True as the expression. Following is the output you see from this example. Notice that row 3 is duplicated in the DataFrame output and that row 3 is also called out in the search results:
Number String Boolean
0 1 First True
1 2 Second False
2 3 Third True
3 3 Third True
3 True
dtype: bool
To get a clean dataset, you want to remove the duplicates from it. Fortunately, you don’t have to write any weird code to get the job done — pandas does it for you, as shown in the following example:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

print(df.drop_duplicates())
As with the previous example, you begin by creating a DataFrame that contains the duplicate record. To remove the errant record, all you need to do is call drop_duplicates(). Here's the result you get:
Number String Boolean
0 1 First True
1 2 Second False
2 3 Third True
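The drop_duplicates() method also accepts optional parameters that fine-tune its behavior. The following sketch, which uses a hand-built DataFrame that mimics the XML data rather than reading the file, shows the keep parameter, which determines which copy of a duplicate survives, and the subset parameter, which restricts the comparison to particular columns:

```python
import pandas as pd

# A hand-built stand-in for the XMLData2.xml content
df = pd.DataFrame({'Number': [1, 2, 3, 3],
                   'String': ['First', 'Second', 'Third', 'Third'],
                   'Boolean': [True, False, True, True]})

# keep='last' retains the final occurrence instead of the first
print(df.drop_duplicates(keep='last'))
print()

# subset considers only the listed columns when detecting duplicates
print(df.drop_duplicates(subset=['Number']))
```

Restricting the check with subset is handy when a record counts as a duplicate whenever a key field repeats, even if other fields differ.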
You need to know about your dataset — that is, how it looks statically. A data map is an overview of the dataset that you use to spot potential problems in your data.
Checking for these problems goes into a data plan, which is a list of tasks you have to perform to ensure the integrity of your data. The following example shows a data map, A, with two datasets, B and C:
import pandas as pd

pd.set_option('display.width', 55)

df = pd.DataFrame({'A': [0,0,0,0,0,1,1],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

a_group_desc = df.groupby('A').describe()
print(a_group_desc)
In this case, the data map uses 0s for the first series and 1s for the second series. The groupby() function places the datasets, B and C, into groups. To determine whether the data map is viable, you obtain statistics using describe(). What you end up with is a dataset B, series 0 and 1, and dataset C, series 0 and 1, as shown in the following output:
B
count mean std min 25% 50% 75% max
A
0 5.0 3.0 1.581139 1.0 2.00 3.0 4.00 5.0
1 2.0 3.5 2.121320 2.0 2.75 3.5 4.25 5.0
C
count mean std min 25% 50% 75% max
A
0 5.0 2.8 1.788854 1.0 1.00 3.0 4.00 5.0
1 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
These statistics tell you about the two dataset series. The breakup of the two datasets using specific cases is the data plan. As you can see, the statistics tell you that this data plan may not be viable because some statistics are relatively far apart.
Although the unstacked data is relatively easy to read and compare, you may prefer a more compact presentation. In this case, you can stack the data using the following code:
stacked = a_group_desc.stack()
print(stacked)
Using stack() creates a new presentation. Here's the output shown in a compact form:
B C
A
0 count 5.000000 5.000000
mean 3.000000 2.800000
std 1.581139 1.788854
min 1.000000 1.000000
25% 2.000000 1.000000
50% 3.000000 3.000000
75% 4.000000 4.000000
max 5.000000 5.000000
1 count 2.000000 2.000000
mean 3.500000 2.500000
std 2.121320 0.707107
min 2.000000 2.000000
25% 2.750000 2.250000
50% 3.500000 2.500000
75% 4.250000 2.750000
max 5.000000 3.000000
Of course, you may not want all the data that describe() provides. Perhaps you really just want to see the number of items in each series and their mean. Here's how you reduce the size of the information output:
print(a_group_desc.loc[:,(slice(None),['count','mean']),])
Using loc lets you obtain specific columns. Here's the final output from the example showing just the information you absolutely need to make a decision:
B C
count mean count mean
A
0 5.0 3.0 5.0 2.8
1 2.0 3.5 2.0 2.5
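If the slice(None) syntax looks cryptic, pandas also provides pd.IndexSlice, which expresses the same column selection in a slightly more readable form. The following sketch rebuilds the data map from this section and selects the same columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [0,0,0,0,0,1,1],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})
a_group_desc = df.groupby('A').describe()

# pd.IndexSlice spells out the MultiIndex column selection
idx = pd.IndexSlice
print(a_group_desc.loc[:, idx[:, ['count', 'mean']]])
```

Both spellings produce the same output; which one you prefer is largely a matter of taste.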
In data science, a categorical variable is one that has a specific value from a limited selection of values. The number of values is usually fixed. Many developers will know categorical variables by the moniker enumerations. Each of the potential values that a categorical variable can assume is a level.
To understand how categorical variables work, say that you have a variable expressing the color of an object, such as a car, and that the user can select blue, red, or green. To express the car’s color in a way that computers can represent and effectively compute, an application assigns each color a numeric value, so blue is 1, red is 2, and green is 3. Normally when you print each color, you see the value rather than the color.
If you use pandas.DataFrame (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html), you can still see the symbolic value (blue, red, and green), even though the computer stores it as a numeric value. Sometimes you need to rename and combine these named values to create new symbols. Symbolic variables are just a convenient way of representing and storing qualitative data.
When using categorical variables for machine learning, it's important to consider the algorithm used to manipulate the variables. Some algorithms, such as trees and ensembles of trees, can work directly with the numeric variables behind the symbols. Other algorithms, such as linear and logistic regression and SVM, require that you encode the categorical values into binary variables. For example, if you have three levels for a color variable (blue, red, and green), you have to create three binary variables, one for each level, where the variable matching the observed color holds a 1 and the other two hold a 0.
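One convenient way to create these binary variables is the pandas get_dummies() function. The following minimal sketch, using a made-up series of car colors, produces one binary column per level:

```python
import pandas as pd

# A made-up series of observed colors
colors = pd.Series(['Blue', 'Red', 'Green', 'Blue'])

# get_dummies() creates one binary column per level
dummies = pd.get_dummies(colors)
print(dummies)
```

Each row contains a single 1 (in the column for the observed color) and 0s elsewhere, which is exactly the encoding that linear models and SVMs expect.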
Categorical variables have a specific number of values, which makes them incredibly valuable in performing a number of data science tasks. For example, imagine trying to find values that are out of range in a huge dataset. In this example, you see one method for creating a categorical variable and then using it to check whether some data falls within the specified limits:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
        categories=car_colors, ordered=False))

find_entries = pd.isnull(car_data)

print(car_colors)
print()
print(car_data)
print()
print(find_entries[find_entries == True])
The example begins by creating a categorical variable, car_colors. The variable contains the values Blue, Red, and Green as colors that are acceptable for a car. Notice that you must specify a dtype property value of category.
The next step is to create another series. This one uses a list of actual car colors, named car_data, as input. Not all the car colors match the predefined acceptable values. When this problem occurs, pandas outputs Not a Number (NaN) instead of the car color.
Of course, you could search the list manually for the nonconforming cars, but the easiest method is to have pandas do the work for you. In this case, you ask pandas which entries are null using isnull() and place them in find_entries. You can then output just those entries that are actually null. Here's the output you see from the example:
0 Blue
1 Red
2 Green
dtype: category
Categories (3, object): [Blue, Green, Red]
0 NaN
1 Green
2 Red
3 Blue
4 NaN
dtype: category
Categories (3, object): [Blue, Green, Red]
0 True
4 True
dtype: bool
Looking at the list of car_data outputs, you can see that entries 0 and 4 equal NaN. The output from find_entries verifies this fact for you. If this were a large dataset, you could quickly locate and correct errant entries in the dataset before performing an analysis on it.
There are times when the naming of the categories you use is inconvenient or otherwise wrong for a particular need. Fortunately, you can rename the categories as needed using the technique shown in the following example.
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Blue', 'Red'],
        categories=car_colors, ordered=False))

car_colors.cat.categories = ["Purple", "Yellow", "Mauve"]
car_data.cat.categories = car_colors

print(car_data)
All you really need to do is set the cat.categories property to a new value, as shown. Here is the output from this example:
0 Purple
1 Yellow
2 Mauve
3 Purple
4 Mauve
dtype: category
Categories (3, object): [Purple, Yellow, Mauve]
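Note that newer versions of pandas deprecate assigning directly to the cat.categories property. The rename_categories() method accomplishes the same renaming and returns a new series instead of modifying the original in place, as this sketch shows:

```python
import pandas as pd

car_data = pd.Series(
    pd.Categorical(['Blue', 'Green', 'Red', 'Blue', 'Red'],
                   categories=['Blue', 'Red', 'Green']))

# rename_categories() maps old names to new ones and returns a new series
renamed = car_data.cat.rename_categories(
    {'Blue': 'Purple', 'Red': 'Mauve', 'Green': 'Yellow'})
print(renamed)
```

Using a dictionary makes the mapping from old to new names explicit, which avoids the subtle ordering mistakes that positional assignment invites.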
A particular categorical level might be too small to offer significant data for analysis. Perhaps there are only a few of the values, which may not be enough to create a statistical difference. In this case, combining several small categories might offer better analysis results. The following example shows how to combine categories:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Green', 'Red', 'Green'],
        categories=car_colors, ordered=False))

car_data = car_data.cat.set_categories(
    ["Blue", "Red", "Green", "Blue_Red"])
print(car_data.loc[car_data.isin(['Red'])])

car_data.loc[car_data.isin(['Red'])] = 'Blue_Red'
car_data.loc[car_data.isin(['Blue'])] = 'Blue_Red'

car_data = car_data.cat.set_categories(
    ["Green", "Blue_Red"])

print()
print(car_data)
What this example shows you is that there is only one Blue item and only two Red items, but there are three Green items, which places Green in the majority. Combining Blue and Red together is a two-step process. First, you add the Blue_Red category to car_data. Then you change the Red and Blue entries to Blue_Red, which creates the combined category. As a final step, you can remove the unneeded categories.
However, before you can change the Red entries to Blue_Red entries, you must find them. This is where a combination of calls to isin(), which locates the Red entries, and loc[], which obtains their index, provides precisely what you need. The first print statement shows the result of using this combination. Here's the output from this example:
2 Red
4 Red
dtype: category
Categories (4, object): [Blue, Red, Green, Blue_Red]
0 Blue_Red
1 Green
2 Blue_Red
3 Green
4 Blue_Red
5 Green
dtype: category
Categories (2, object): [Green, Blue_Red]
Notice that there are now three Blue_Red entries and three Green entries. The Blue and Red categories are no longer in use. The result is that the levels are now combined as expected.
Dates can present problems in data. For one thing, dates are stored as numeric values. However, the precise value of the number depends on the representation for the particular platform and could even depend on the users' preferences. For example, Excel users can choose to start dates in 1900 or 1904 (https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). The numeric encoding for each is different, so the same date can have two numeric values depending on the starting date.
In addition to problems of representation, you also need to consider how to work with time values. Creating a time value format that represents a value the user can understand is hard. For example, you might need to use Greenwich Mean Time (GMT) in some situations but a local time zone in others. Transforming between various times is also problematic. With this in mind, the following sections provide you with details on dealing with time issues.
Obtaining the correct date and time representation can make performing analysis a lot easier. For example, you often have to change the representation to obtain a correct sorting of values. Python provides two common methods of formatting date and time. The first technique is to call str(), which simply turns a datetime value into a string without any formatting. The strftime() function requires more work because you must define how you want the datetime value to appear after conversion. When using strftime(), you must provide a string containing special directives that define the formatting. You can find a listing of these directives at http://strftime.org/.
Now that you have some idea of how time and date conversions work, it's time to see an example. The following example creates a datetime object and then converts it into a string using two different approaches:
import datetime as dt
now = dt.datetime.now()
print(str(now))
print(now.strftime('%a, %d %B %Y'))
In this case, you can see that using str() is the easiest approach. However, as shown by the following output, it may not provide the output you need. Using strftime() is infinitely more flexible:
2018-09-21 11:39:49.698891
Fri, 21 September 2018
Time zones and differences in local time can cause all sorts of problems when performing analysis. For that matter, some types of calculations simply require a time shift in order to get the right results. No matter what the reason, you may need to transform one time into another time at some point. The following examples show some techniques you can employ to perform the task:
import datetime as dt
now = dt.datetime.now()
timevalue = now + dt.timedelta(hours=2)
print(now.strftime('%H:%M:%S'))
print(timevalue.strftime('%H:%M:%S'))
print(timevalue - now)
The timedelta() function makes the time transformation straightforward. You can use any of these parameter names with timedelta() to change a time and date value:
days
seconds
microseconds
milliseconds
minutes
hours
weeks
You can also manipulate time by performing addition or subtraction on time values. You can even subtract two time values to determine the difference between them. Here’s the output from this example:
11:42:22
13:42:22
2:00:00
Note that now is the local time, timevalue is the same moment shifted two hours ahead (as if it were in a time zone two hours later), and there is a two-hour difference between the two times. You can perform all sorts of transformations using these techniques to ensure that your analysis always shows precisely the time-oriented values you need.
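When the shift really does represent a time zone difference, you can make the offset explicit by attaching time zone information to the datetime value. The following sketch, which uses an arbitrary sample date purely for illustration, converts a UTC time to a zone two hours ahead:

```python
import datetime as dt

# Build an aware datetime in UTC (sample date chosen for illustration)
utc_time = dt.datetime(2018, 9, 21, 11, 39, tzinfo=dt.timezone.utc)

# Define a fixed-offset zone two hours ahead of UTC
offset_zone = dt.timezone(dt.timedelta(hours=2))

# astimezone() re-expresses the same instant in the new zone
local_time = utc_time.astimezone(offset_zone)

print(utc_time.strftime('%H:%M %Z'))
print(local_time.strftime('%H:%M %Z'))
print(local_time - utc_time)  # same instant, so the difference is 0:00:00
```

Unlike adding a raw timedelta, an aware datetime remembers its offset, so subtracting the two values correctly reports no difference: the wall-clock reading changes, but the moment in time does not.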
Sometimes the data you receive is missing information in specific fields. For example, a customer record might be missing an age. If enough records are missing entries, any analysis you perform will be skewed and the results of the analysis weighted in an unpredictable manner. Having a strategy for dealing with missing data is important. The following sections give you some ideas on how to work through these issues and produce better results.
Finding missing data in your dataset is essential to avoid getting incorrect results from your analysis. The following code shows how you can obtain a listing of missing values without too much effort:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.NaN, 5, 6, None])
print(s.isnull())
print()
print(s[s.isnull()])
A dataset can represent missing data in several ways. In this example, you see missing data represented as np.NaN (NumPy Not a Number) and the Python None value.
Use the isnull() method to detect the missing values. The output shows True when the value is missing. By adding an index into the dataset, you obtain just the entries that are missing. The example shows the following output:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool
3 NaN
6 NaN
dtype: float64
After you figure out that your dataset is missing information, you need to consider what to do about it. The three possibilities are to ignore the issue, fill in the missing items, or remove (drop) the missing entries from the dataset. Ignoring the problem could lead to all sorts of problems for your analysis, so it’s the option you use least often. The following example shows one technique for filling in missing data or dropping the errant entries from the dataset:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.NaN, 5, 6, None])
print(s.fillna(int(s.mean())))
print()
print(s.dropna())
The two methods of interest are fillna(), which fills in the missing entries, and dropna(), which drops the missing entries. When using fillna(), you must provide a value to use for the missing data. This example uses the mean of all the values, but you could choose a number of other approaches. Here's the output from this example:
0 1.0
1 2.0
2 3.0
3 3.0
4 5.0
5 6.0
6 3.0
dtype: float64
0 1.0
1 2.0
2 3.0
4 5.0
5 6.0
dtype: float64
The previous section hints at the process of imputing missing data (ascribing characteristics based on how the data is used). The technique you use depends on the sort of data you’re working with. For example, when working with a tree ensemble (you can find discussions of trees in the “Performing Hierarchical Clustering” section of Chapter 15 and the “Starting with a Plain Decision Tree” section of Chapter 20), you may simply replace missing values with a –1 and rely on the imputer (a transformer algorithm used to complete missing values) to define the best possible value for the missing data. The following example shows a technique you can use to impute missing data values:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

s = [[1, 2, 3, np.NaN, 5, 6, None]]

imp = Imputer(missing_values='NaN',
              strategy='mean', axis=0)

imp.fit([[1, 2, 3, 4, 5, 6, 7]])
x = pd.Series(imp.transform(s).tolist()[0])
print(x)
In this example, s is missing some values. The code creates an Imputer to replace these missing values. The missing_values parameter defines what to look for, which is NaN. You set the axis parameter to 0 to impute along columns and 1 to impute along rows. The strategy parameter defines how to replace the missing values using one of these options. (You can discover more about the Imputer parameters at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html.)

mean: Replaces the values by using the mean along the axis
median: Replaces the values by using the median along the axis
most_frequent: Replaces the values by using the most frequent value along the axis

Before you can impute anything, you must provide statistics for the Imputer to use by calling fit(). The code then calls transform() on s to fill in the missing values. In this case, the example needs to display the result as a series. To create a series, you must convert the Imputer output to a list and use the resulting list as input to Series(). Here's the result of the process with the missing values filled in:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: float64
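Note that recent versions of scikit-learn remove the Imputer class in favor of SimpleImputer in the sklearn.impute module, which always imputes along columns and therefore has no axis parameter. Here's a sketch of a similar task using the newer class (here the statistics come from the data itself via fit_transform(), so the mean of the available values, 3.4, fills the gaps):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# SimpleImputer works column-wise, so reshape the data into one column
s = np.array([1, 2, 3, np.nan, 5, 6, np.nan]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform() learns the column mean and fills the missing entries
x = pd.Series(imp.fit_transform(s).ravel())
print(x)
```

The strategy options mean, median, and most_frequent carry over from the old class, and SimpleImputer adds a constant strategy paired with a fill_value parameter.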
You may not need to work with all the data in a dataset. In fact, looking at just one particular column, such as age, or a set of rows with a significant amount of information, might be beneficial. You perform two steps to obtain just the data you need for a particular task: first you select the rows (cases) of interest, and then you select the columns (features) within those rows. The act of slicing and dicing data gives you a subset of the data suitable for analysis. The following sections describe various ways to obtain specific pieces of data to meet particular needs.
Slicing can occur in multiple ways when working with data, but the technique of interest in this section is to slice data from a row of 2-D or 3-D data. A 2-D array may contain temperatures (x axis) over a specific time frame (y axis). Slicing a row would mean seeing the temperatures at a specific time. In some cases, you might associate rows with cases in a dataset.
A 3-D array might include an axis for place (x axis), product (y axis), and time (z axis) so that you can see sales for items over time. Perhaps you want to track whether sales of an item are increasing, and specifically where they are increasing. Slicing a row would mean seeing all the sales for one specific product for all locations at any time. The following example demonstrates how to perform this task:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])
x[1]
In this case, the example builds a 3-D array. It then slices row 1 of that array to produce the following output:
array([[11, 12, 13],
[14, 15, 16],
[17, 18, 19]])
Using the examples from the previous section, slicing columns would obtain data at a 90-degree angle from rows. In other words, when working with the 2-D array, you would want to see the times at which specific temperatures occurred. Likewise, you might want to see the sales of all products for a specific location at any time when working with the 3-D array. In some cases, you might associate columns with features in a dataset. The following example demonstrates how to perform this task using the same array as in the previous section:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])
x[:,1]
Note that the indexing now occurs at two levels. The first index refers to the row. Using the colon (:) for the row means to use all the rows. The second index refers to a column. In this case, the output will contain column 1. Here’s the output you see:
array([[ 4, 5, 6],
[14, 15, 16],
[24, 25, 26]])
The act of dicing a dataset means to perform both row and column slicing such that you end up with a data wedge. For example, when working with the 3-D array, you might want to see the sales of a specific product in a specific location at any time. The following example demonstrates how to perform this task using the same array as in the previous two sections:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])

print(x[1,1])
print(x[:,1,1])
print(x[1,:,1])
print()
print(x[1:2, 1:2])
This example dices the array in four different ways. First, you get row 1, column 1. Of course, what you may actually want is column 1, z axis 1. If that’s not quite right, you could always request row 1, z axis 1 instead. Then again, you may want rows 1 and 2 of columns 1 and 2. Here’s the output of all four requests:
[14 15 16]
[ 5 15 25]
[12 15 18]
[[[14 15 16]]]
Data used for data science purposes seldom comes in a neat package. You may need to work with multiple databases in various locations — each of which has its own data format. It’s impossible to perform analysis on such disparate sources of information with any accuracy. To make the data useful, you must create a single dataset (by concatenating, or combining, the data from various sources).
Part of the process is to ensure that each field you create for the combined dataset has the same characteristics. For example, an age field in one database might appear as a string, but another database could use an integer for the same field. For the fields to work together, they must appear as the same type of information.
The following sections help you understand the process involved in concatenating and transforming data from various sources to create a single dataset. After you have a single dataset from these sources, you can begin to perform tasks such as analysis on the data. Of course, the trick is to create a single dataset that truly represents the data in all those disparate datasets — modifying the data would result in skewed results.
You often find a need to combine datasets in various ways or even to add new information for the sake of analysis purposes. The result is a combined dataset that includes either new cases or variables. The following example shows techniques for performing both tasks:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})
df1 = pd.DataFrame({'A': [4],
                    'B': [4],
                    'C': [4]})

df = df.append(df1)
df = df.reset_index(drop=True)
print(df)

df.loc[df.last_valid_index() + 1] = [5, 5, 5]
print()
print(df)

df2 = pd.DataFrame({'D': [1, 2, 3, 4, 5]})

df = df.join(df2)
print()
print(df)
The easiest way to add more data to an existing DataFrame is to rely on the append() method. You can also use the concat() method (a technique shown in Chapter 13). In this case, the three cases found in df are added to the single case found in df1. To ensure that the data is appended as anticipated, the columns in df and df1 must match. When you append two DataFrame objects in this manner, the new DataFrame contains the old index values. Use the reset_index() method to create a new index to make accessing cases easier.
You can also add another case to an existing DataFrame by creating the new case directly. Any time you add a new entry at a position that is one greater than the last_valid_index(), you get a new case as a result.
Sometimes you need to add a new variable (column) to the DataFrame. In this case, you rely on join() to perform the task. The resulting DataFrame will match cases with the same index value, so indexing is important. In addition, unless you want blank values, the number of cases in both DataFrame objects must match. Here's the output from this example:
A B C
0 2 1 5
1 3 2 3
2 1 3 4
3 4 4 4
A B C
0 2 1 5
1 3 2 3
2 1 3 4
3 4 4 4
4 5 5 5
A B C D
0 2 1 5 1
1 3 2 3 2
2 1 3 4 3
3 4 4 4 4
4 5 5 5 5
At some point, you may need to remove cases or variables from a dataset because they aren't required for your analysis. In both cases, you rely on the drop() method to perform the task. The difference in removing cases or variables is in how you describe what to remove, as shown in the following example:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})

df = df.drop(df.index[[1]])
print(df)

df = df.drop('B', 1)
print()
print(df)
The example begins by removing a case from df. Notice how the code relies on an index to describe what to remove. You can remove just one case (as shown), ranges of cases, or individual cases separated by commas. The main concern is to ensure that you have the correct index numbers for the cases you want to remove.
Removing a column is different. This example shows how to remove a column using a column name. You can also remove a column by using an index. In both cases, you must specify an axis as part of the removal process (normally 1). Here’s the output from this example:
A B C
0 2 1 5
2 1 3 4
A C
0 2 5
2 1 4
Sorting and shuffling are two ends of the same goal — to manage data order. In the first case, you put the data into order, while in the second, you remove any systematic patterning from the order. In general, you don’t sort datasets for the purpose of analysis because doing so can cause you to get incorrect results. However, you might want to sort data for presentation purposes. The following example shows both sorting and shuffling:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,1,2,3,3,5,4],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

df = df.sort_values(by=['A', 'B'], ascending=[True, True])
df = df.reset_index(drop=True)
print(df)

index = df.index.tolist()
np.random.shuffle(index)
df = df.loc[df.index[index]]
df = df.reset_index(drop=True)

print()
print(df)
It turns out that sorting the data is a bit easier than shuffling it. To sort the data, you use the sort_values() method and define which columns to use for indexing purposes. You can also determine whether the index is in ascending or descending order. Make sure to always call reset_index() when you're done so that the index appears in order for analysis or other purposes.
To shuffle the data, you first acquire the current index using df.index.tolist() and place it in index. A call to random.shuffle() creates a new order for the index. You then apply the new order to df using loc[]. As always, you call reset_index() to finalize the new order. Here's the output from this example (but note that the second output may not match your output because it has been shuffled):
A B C
0 1 2 3
1 2 1 5
2 2 3 4
3 3 4 1
4 3 5 1
5 4 5 3
6 5 2 2
A B C
0 2 1 5
1 2 3 4
2 3 4 1
3 1 2 3
4 3 5 1
5 4 5 3
6 5 2 2
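As an alternative to shuffling the index yourself, pandas provides the sample() method. The following sketch shuffles the same DataFrame in a single call; the random_state parameter makes the shuffle repeatable:

```python
import pandas as pd

df = pd.DataFrame({'A': [2,1,2,3,3,5,4],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

# frac=1 returns all rows in a random order; reset_index tidies the index
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Because sample() returns a new DataFrame, the original stays untouched, which is convenient when you want both the ordered and shuffled versions around.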
Aggregation is the process of combining or grouping data together into a set, bag, or list. The data may or may not be alike. However, in most cases, an aggregation function combines several rows together statistically using algorithms such as average, count, maximum, median, minimum, mode, or sum. There are several reasons to aggregate data.
The most important use of data aggregation is to promote anonymity in order to meet legal or other concerns. Sometimes even data that should be anonymous turns out to provide identification of an individual using the proper analysis techniques. For example, researchers have found that it's possible to identify individuals based on just three credit card purchases (see https://www.computerworld.com/article/2877935/how-three-small-credit-card-transactions-could-reveal-your-identity.html for details). Here's an example that shows how to perform aggregation tasks:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Map': [0,0,0,1,1,2,2],
                   'Values': [1,2,3,5,4,2,5]})

df['S'] = df.groupby('Map')['Values'].transform(np.sum)
df['M'] = df.groupby('Map')['Values'].transform(np.mean)
df['V'] = df.groupby('Map')['Values'].transform(np.var)
print(df)
In this case, you have two initial features for this DataFrame. The values in Map define which elements in Values belong together. For example, when calculating a sum for Map index 0, you use the Values 1, 2, and 3.
To perform the aggregation, you must first call groupby() to group the Map values. You then index into Values and rely on transform() to create the aggregated data using one of several algorithms found in NumPy, such as np.sum. Here are the results of this calculation:
Map Values S M V
0 0 1 6 2.0 1.0
1 0 2 6 2.0 1.0
2 0 3 6 2.0 1.0
3 1 5 9 4.5 0.5
4 1 4 9 4.5 0.5
5 2 2 7 3.5 4.5
6 2 5 7 3.5 4.5
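Whereas transform() keeps one row per original observation, the related agg() method collapses each group down to a single row. The following sketch computes the same three statistics in one call:

```python
import pandas as pd

df = pd.DataFrame({'Map': [0,0,0,1,1,2,2],
                   'Values': [1,2,3,5,4,2,5]})

# agg() produces one row per group instead of one row per observation
summary = df.groupby('Map')['Values'].agg(['sum', 'mean', 'var'])
print(summary)
```

Use transform() when you want the aggregated value attached to every observation (as in the example above) and agg() when you want a compact per-group summary table.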