Chapter 7
IN THIS CHAPTER
Working with NumPy and pandas
Working with symbolic variables
Considering the effect of dates
Fixing missing data
Slicing, combining, and modifying data elements
The characteristics, content, type, and other elements that define your data in its entirety is the data shape. The shape of your data determines the kinds of tasks you can perform with it. In order to make your data amenable to certain types of analysis, you must shape it into a different form. Think of the data as clay and you as the potter, because that’s the sort of relationship that exists. However, instead of using your hands to shape the data, you rely on functions and algorithms to perform the task. This chapter helps you understand the tools you have available to shape data and the ramifications of shaping it.
Also in this chapter, you consider the problems associated with shaping. For example, you need to know what to do when data is missing from a dataset. It’s important to shape the data correctly or you end up with an analysis that simply doesn’t make sense. Likewise, some data types, such as dates, can present problems. Again, you need to tread carefully to ensure that you get the desired result so that the dataset becomes more useful and amenable to analysis of various sorts.
There is no question that you need NumPy at all times because the pandas library is actually built on top of NumPy. Even so, you still need to choose between NumPy and pandas when performing specific tasks. You need the low-level functionality of NumPy to perform some tasks, but pandas makes things so much easier that you want to use it as often as possible. The following sections describe when to use each library in more detail.
Developers built pandas on top of NumPy. As a result, every task you perform using pandas also goes through NumPy. To obtain the benefits of pandas, you pay a performance penalty that some testers say is 100 times slower than NumPy for a similar task (see http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Given that computer hardware can make up for a lot of performance differences today, the speed issue may not be a concern at times, but when speed is essential, NumPy is always the better choice.
You use pandas to make writing code easier and faster. Because pandas does a lot of the work for you, you could make a case for saying that using pandas also reduces the potential for coding errors. The essential consideration, though, is that the pandas library provides rich time-series functionality, data alignment, NA-friendly statistics, groupby, merge, and join methods. Normally, you need to code these features when using NumPy, which means you keep reinventing the wheel.
As the book progresses, you discover just how useful pandas can be in performing such tasks as binning (a data preprocessing technique designed to reduce the effect of observational errors) and working with a dataframe (a two-dimensional labeled data structure with columns that can potentially contain different data types) so that you can calculate statistics on it. For example, in Chapter 9, you discover how to perform both discretization and binning. Chapter 13 shows actual binning examples, such as obtaining a frequency for each categorical variable of a dataset. In fact, many of the examples in Chapter 13 don't work without binning. In other words, don't worry too much right now about knowing precisely what binning is or why you need to use it — examples later in the book discuss the topic in detail. All you really need to know is that pandas does make your work considerably easier.
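As a small taste of what's to come, the pandas cut() function performs simple binning. The following sketch, which uses made-up ages and bin edges chosen purely for illustration, groups continuous values into labeled categories:

```python
import pandas as pd

# Hypothetical ages to bin into three illustrative groups
ages = pd.Series([12, 25, 33, 47, 51, 68])

# cut() assigns each value to the bin whose edges contain it
bins = pd.cut(ages, bins=[0, 18, 40, 100],
              labels=['minor', 'adult', 'senior'])

# Count how many observations landed in each bin
print(bins.value_counts().sort_index())
```

The counts per bin give you an immediate frequency summary of a continuous variable, which is exactly the sort of task that would require manual coding in plain NumPy.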
When it comes to data, no one really knows what a large database contains. Yes, everyone has seen bits and pieces of it, but when you consider the size of some databases, viewing it all would be physically impossible. Because you don’t know what’s in there, you can’t be sure that your analysis will actually work as desired and provide valid results. In short, you must validate your data before you use it to ensure that the data is at least close to what you expect it to be. This means performing tasks such as removing duplicate records before you use the data for any sort of analysis (duplicates would unfairly weight the results).
Figuring out what your data contains is important because checking data by hand is sometimes simply impossible due to the number of observations and variables. In addition, hand verifying the content is time consuming, error prone, and, most important, really boring. Finding duplicates is important because you end up using some observations more than once, which unfairly weights the results of any analysis.
As a data scientist, you want your data to enthrall you, so it’s time to get it to talk to you — not figuratively, of course, but through the wonders of pandas, as shown in the following example:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

search = pd.DataFrame.duplicated(df)

print(df)
print()
print(search[search == True])
This example shows how to find duplicate rows. It relies on a modified version of the XMLData.xml file, XMLData2.xml, which contains a simple repeated row in it. A real data file contains thousands (or more) of records and possibly hundreds of repeats, but this simple example does the job. The example begins by reading the data file into memory using the same technique you explored in Chapter 6. It then places the data into a DataFrame.
At this point, your data is corrupted because it contains a duplicate row. However, you can get rid of the duplicated row by searching for it. The first task is to create a search object containing a list of duplicated rows by calling pd.DataFrame.duplicated(). The duplicated rows contain a True next to their row number.

Of course, now you have an unordered list of rows that are and aren't duplicated. The easiest way to determine which rows are duplicated is to create an index in which you use search == True as the expression. Following is the output you see from this example. Notice that row 3 is duplicated in the DataFrame output and that row 3 is also called out in the search results:
Number String Boolean
0 1 First True
1 2 Second False
2 3 Third True
3 3 Third True
3 True
dtype: bool
To get a clean dataset, you want to remove the duplicates from it. Fortunately, you don’t have to write any weird code to get the job done — pandas does it for you, as shown in the following example:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

print(df.drop_duplicates())
As with the previous example, you begin by creating a DataFrame that contains the duplicate record. To remove the errant record, all you need to do is call drop_duplicates(). Here's the result you get:
Number String Boolean
0 1 First True
1 2 Second False
2 3 Third True
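The drop_duplicates() method also accepts optional parameters that fine-tune its behavior. The following sketch, which uses a hand-built DataFrame that mimics the XML data rather than reading the file, shows the keep parameter, which determines which copy of a duplicate survives, and the subset parameter, which restricts the comparison to particular columns:

```python
import pandas as pd

# A hand-built stand-in for the XMLData2.xml content
df = pd.DataFrame({'Number': [1, 2, 3, 3],
                   'String': ['First', 'Second', 'Third', 'Third'],
                   'Boolean': [True, False, True, True]})

# keep='last' retains the final occurrence instead of the first
print(df.drop_duplicates(keep='last'))
print()

# subset considers only the listed columns when detecting duplicates
print(df.drop_duplicates(subset=['Number']))
```

Restricting the check with subset is handy when a record counts as a duplicate whenever a key field repeats, even if other fields differ.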
You need to know about your dataset — that is, how it looks statically. A data map is an overview of the dataset that you use to spot potential problems in your data.
Checking for these problems goes into a data plan, which is a list of tasks you have to perform to ensure the integrity of your data. The following example shows a data map, A, with two datasets, B and C:
import pandas as pd

pd.set_option('display.width', 55)

df = pd.DataFrame({'A': [0,0,0,0,0,1,1],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

a_group_desc = df.groupby('A').describe()
print(a_group_desc)
In this case, the data map uses 0s for the first series and 1s for the second series. The groupby() function places the datasets, B and C, into groups. To determine whether the data map is viable, you obtain statistics using describe(). What you end up with is a dataset B, series 0 and 1, and dataset C, series 0 and 1, as shown in the following output:
B
count mean std min 25% 50% 75% max
A
0 5.0 3.0 1.581139 1.0 2.00 3.0 4.00 5.0
1 2.0 3.5 2.121320 2.0 2.75 3.5 4.25 5.0
C
count mean std min 25% 50% 75% max
A
0 5.0 2.8 1.788854 1.0 1.00 3.0 4.00 5.0
1 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
These statistics tell you about the two dataset series. The breakup of the two datasets using specific cases is the data plan. As you can see, the statistics tell you that this data plan may not be viable because some statistics are relatively far apart.
Although the unstacked data is relatively easy to read and compare, you may prefer a more compact presentation. In this case, you can stack the data using the following code:
stacked = a_group_desc.stack()
print(stacked)
Using stack() creates a new presentation. Here's the output shown in a compact form:
B C
A
0 count 5.000000 5.000000
mean 3.000000 2.800000
std 1.581139 1.788854
min 1.000000 1.000000
25% 2.000000 1.000000
50% 3.000000 3.000000
75% 4.000000 4.000000
max 5.000000 5.000000
1 count 2.000000 2.000000
mean 3.500000 2.500000
std 2.121320 0.707107
min 2.000000 2.000000
25% 2.750000 2.250000
50% 3.500000 2.500000
75% 4.250000 2.750000
max 5.000000 3.000000
Of course, you may not want all the data that describe() provides. Perhaps you really just want to see the number of items in each series and their mean. Here's how you reduce the size of the information output:
print(a_group_desc.loc[:,(slice(None),['count','mean']),])
Using loc lets you obtain specific columns. Here's the final output from the example showing just the information you absolutely need to make a decision:
B C
count mean count mean
A
0 5.0 3.0 5.0 2.8
1 2.0 3.5 2.0 2.5
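If the slice(None) syntax looks cryptic, pandas also provides pd.IndexSlice, which expresses the same column selection in a slightly more readable form. The following sketch rebuilds the data map from this section and selects the same columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [0,0,0,0,0,1,1],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})
a_group_desc = df.groupby('A').describe()

# pd.IndexSlice spells out the MultiIndex column selection
idx = pd.IndexSlice
print(a_group_desc.loc[:, idx[:, ['count', 'mean']]])
```

Both spellings produce the same output; which one you prefer is largely a matter of taste.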
In data science, a categorical variable is one that has a specific value from a limited selection of values. The number of values is usually fixed. Many developers will know categorical variables by the moniker enumerations. Each of the potential values that a categorical variable can assume is a level.
To understand how categorical variables work, say that you have a variable expressing the color of an object, such as a car, and that the user can select blue, red, or green. To express the car’s color in a way that computers can represent and effectively compute, an application assigns each color a numeric value, so blue is 1, red is 2, and green is 3. Normally when you print each color, you see the value rather than the color.
If you use pandas.DataFrame (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html), you can still see the symbolic value (blue, red, and green), even though the computer stores it as a numeric value. Sometimes you need to rename and combine these named values to create new symbols. Symbolic variables are just a convenient way of representing and storing qualitative data.
When using categorical variables for machine learning, it's important to consider the algorithm used to manipulate the variables. Some algorithms, such as trees and ensembles of trees, can work directly with the numeric variables behind the symbols. Other algorithms, such as linear and logistic regression and SVM, require that you encode the categorical values into binary variables. For example, if you have three levels for a color variable (blue, red, and green), you have to create three binary variables, one for each level, where the variable matching the observed color holds a 1 and the other two hold a 0.
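One convenient way to create these binary variables is the pandas get_dummies() function. The following minimal sketch, using a made-up series of car colors, produces one binary column per level:

```python
import pandas as pd

# A made-up series of observed colors
colors = pd.Series(['Blue', 'Red', 'Green', 'Blue'])

# get_dummies() creates one binary column per level
dummies = pd.get_dummies(colors)
print(dummies)
```

Each row contains a single 1 (in the column for the observed color) and 0s elsewhere, which is exactly the encoding that linear models and SVMs expect.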
Categorical variables have a specific number of values, which makes them incredibly valuable in performing a number of data science tasks. For example, imagine trying to find values that are out of range in a huge dataset. In this example, you see one method for creating a categorical variable and then using it to check whether some data falls within the specified limits:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
        categories=car_colors, ordered=False))

find_entries = pd.isnull(car_data)

print(car_colors)
print()
print(car_data)
print()
print(find_entries[find_entries == True])
The example begins by creating a categorical variable, car_colors. The variable contains the values Blue, Red, and Green as colors that are acceptable for a car. Notice that you must specify a dtype property value of category.
The next step is to create another series. This one uses a list of actual car colors, named car_data, as input. Not all the car colors match the predefined acceptable values. When this problem occurs, pandas outputs Not a Number (NaN) instead of the car color.
Of course, you could search the list manually for the nonconforming cars, but the easiest method is to have pandas do the work for you. In this case, you ask pandas which entries are null using isnull() and place them in find_entries. You can then output just those entries that are actually null. Here's the output you see from the example:
0 Blue
1 Red
2 Green
dtype: category
Categories (3, object): [Blue, Green, Red]
0 NaN
1 Green
2 Red
3 Blue
4 NaN
dtype: category
Categories (3, object): [Blue, Green, Red]
0 True
4 True
dtype: bool
Looking at the list of car_data outputs, you can see that entries 0 and 4 equal NaN. The output from find_entries verifies this fact for you. If this were a large dataset, you could quickly locate and correct errant entries in the dataset before performing an analysis on it.
There are times when the naming of the categories you use is inconvenient or otherwise wrong for a particular need. Fortunately, you can rename the categories as needed using the technique shown in the following example.
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Blue', 'Red'],
        categories=car_colors, ordered=False))

car_colors.cat.categories = ["Purple", "Yellow", "Mauve"]
car_data.cat.categories = car_colors

print(car_data)
All you really need to do is set the cat.categories property to a new value, as shown. Here is the output from this example:
0 Purple
1 Yellow
2 Mauve
3 Purple
4 Mauve
dtype: category
Categories (3, object): [Purple, Yellow, Mauve]
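Note that newer versions of pandas deprecate assigning directly to the cat.categories property. The rename_categories() method accomplishes the same renaming and returns a new series instead of modifying the original in place, as this sketch shows:

```python
import pandas as pd

car_data = pd.Series(
    pd.Categorical(['Blue', 'Green', 'Red', 'Blue', 'Red'],
                   categories=['Blue', 'Red', 'Green']))

# rename_categories() maps old names to new ones and returns a new series
renamed = car_data.cat.rename_categories(
    {'Blue': 'Purple', 'Red': 'Mauve', 'Green': 'Yellow'})
print(renamed)
```

Using a dictionary makes the mapping from old to new names explicit, which avoids the subtle ordering mistakes that positional assignment invites.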
A particular categorical level might be too small to offer significant data for analysis. Perhaps there are only a few of the values, which may not be enough to create a statistical difference. In this case, combining several small categories might offer better analysis results. The following example shows how to combine categories:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'],
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Green', 'Red', 'Green'],
        categories=car_colors, ordered=False))

car_data = car_data.cat.set_categories(
    ["Blue", "Red", "Green", "Blue_Red"])
print(car_data.loc[car_data.isin(['Red'])])

car_data.loc[car_data.isin(['Red'])] = 'Blue_Red'
car_data.loc[car_data.isin(['Blue'])] = 'Blue_Red'

car_data = car_data.cat.set_categories(
    ["Green", "Blue_Red"])

print()
print(car_data)
What this example shows you is that there is only one Blue item and only two Red items, but there are three Green items, which places Green in the majority. Combining Blue and Red together is a two-step process. First, you add the Blue_Red category to car_data. Then you change the Red and Blue entries to Blue_Red, which creates the combined category. As a final step, you can remove the unneeded categories.
However, before you can change the Red entries to Blue_Red entries, you must find them. This is where a combination of calls to isin(), which locates the Red entries, and loc[], which obtains their index, provides precisely what you need. The first print statement shows the result of using this combination. Here's the output from this example:
2 Red
4 Red
dtype: category
Categories (4, object): [Blue, Red, Green, Blue_Red]
0 Blue_Red
1 Green
2 Blue_Red
3 Green
4 Blue_Red
5 Green
dtype: category
Categories (2, object): [Green, Blue_Red]
Notice that there are now three Blue_Red entries and three Green entries. The Blue and Red categories are no longer in use. The result is that the levels are now combined as expected.
Dates can present problems in data. For one thing, dates are stored as numeric values. However, the precise value of the number depends on the representation for the particular platform and could even depend on the users' preferences. For example, Excel users can choose to start dates in 1900 or 1904 (https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). The numeric encoding for each is different, so the same date can have two numeric values depending on the starting date.
In addition to problems of representation, you also need to consider how to work with time values. Creating a time value format that represents a value the user can understand is hard. For example, you might need to use Greenwich Mean Time (GMT) in some situations but a local time zone in others. Transforming between various times is also problematic. With this in mind, the following sections provide you with details on dealing with time issues.
Obtaining the correct date and time representation can make performing analysis a lot easier. For example, you often have to change the representation to obtain a correct sorting of values. Python provides two common methods of formatting date and time. The first technique is to call str(), which simply turns a datetime value into a string without any formatting. The strftime() function requires more work because you must define how you want the datetime value to appear after conversion. When using strftime(), you must provide a string containing special directives that define the formatting. You can find a listing of these directives at http://strftime.org/.
Now that you have some idea of how time and date conversions work, it's time to see an example. The following example creates a datetime object and then converts it into a string using two different approaches:
import datetime as dt
now = dt.datetime.now()
print(str(now))
print(now.strftime('%a, %d %B %Y'))
In this case, you can see that using str() is the easiest approach. However, as shown by the following output, it may not provide the output you need. Using strftime() is infinitely more flexible:
2018-09-21 11:39:49.698891
Fri, 21 September 2018
Time zones and differences in local time can cause all sorts of problems when performing analysis. For that matter, some types of calculations simply require a time shift in order to get the right results. No matter what the reason, you may need to transform one time into another time at some point. The following examples show some techniques you can employ to perform the task:
import datetime as dt
now = dt.datetime.now()
timevalue = now + dt.timedelta(hours=2)
print(now.strftime('%H:%M:%S'))
print(timevalue.strftime('%H:%M:%S'))
print(timevalue - now)
The timedelta() function makes the time transformation straightforward. You can use any of these parameter names with timedelta() to change a time and date value:
days
seconds
microseconds
milliseconds
minutes
hours
weeks
You can also manipulate time by performing addition or subtraction on time values. You can even subtract two time values to determine the difference between them. Here’s the output from this example:
11:42:22
13:42:22
2:00:00
Note that now is the local time, timevalue is the same moment shifted two hours ahead (as if it were in a time zone two hours later), and there is a two-hour difference between the two times. You can perform all sorts of transformations using these techniques to ensure that your analysis always shows precisely the time-oriented values you need.
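When the shift really does represent a time zone difference, you can make the offset explicit by attaching time zone information to the datetime value. The following sketch, which uses an arbitrary sample date purely for illustration, converts a UTC time to a zone two hours ahead:

```python
import datetime as dt

# Build an aware datetime in UTC (sample date chosen for illustration)
utc_time = dt.datetime(2018, 9, 21, 11, 39, tzinfo=dt.timezone.utc)

# Define a fixed-offset zone two hours ahead of UTC
offset_zone = dt.timezone(dt.timedelta(hours=2))

# astimezone() re-expresses the same instant in the new zone
local_time = utc_time.astimezone(offset_zone)

print(utc_time.strftime('%H:%M %Z'))
print(local_time.strftime('%H:%M %Z'))
print(local_time - utc_time)  # same instant, so the difference is 0:00:00
```

Unlike adding a raw timedelta, an aware datetime remembers its offset, so subtracting the two values correctly reports no difference: the wall-clock reading changes, but the moment in time does not.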
Sometimes the data you receive is missing information in specific fields. For example, a customer record might be missing an age. If enough records are missing entries, any analysis you perform will be skewed and the results of the analysis weighted in an unpredictable manner. Having a strategy for dealing with missing data is important. The following sections give you some ideas on how to work through these issues and produce better results.
Finding missing data in your dataset is essential to avoid getting incorrect results from your analysis. The following code shows how you can obtain a listing of missing values without too much effort:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.NaN, 5, 6, None])
print(s.isnull())
print()
print(s[s.isnull()])
A dataset can represent missing data in several ways. In this example, you see missing data represented as np.NaN (NumPy Not a Number) and the Python None value.
Use the isnull() method to detect the missing values. The output shows True when the value is missing. By adding an index into the dataset, you obtain just the entries that are missing. The example shows the following output:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool
3 NaN
6 NaN
dtype: float64
After you figure out that your dataset is missing information, you need to consider what to do about it. The three possibilities are to ignore the issue, fill in the missing items, or remove (drop) the missing entries from the dataset. Ignoring the problem could lead to all sorts of problems for your analysis, so it’s the option you use least often. The following example shows one technique for filling in missing data or dropping the errant entries from the dataset:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.NaN, 5, 6, None])
print(s.fillna(int(s.mean())))
print()
print(s.dropna())
The two methods of interest are fillna(), which fills in the missing entries, and dropna(), which drops the missing entries. When using fillna(), you must provide a value to use for the missing data. This example uses the mean of all the values, but you could choose a number of other approaches. Here's the output from this example:
0 1.0
1 2.0
2 3.0
3 3.0
4 5.0
5 6.0
6 3.0
dtype: float64
0 1.0
1 2.0
2 3.0
4 5.0
5 6.0
dtype: float64
The previous section hints at the process of imputing missing data (ascribing characteristics based on how the data is used). The technique you use depends on the sort of data you’re working with. For example, when working with a tree ensemble (you can find discussions of trees in the “Performing Hierarchical Clustering” section of Chapter 15 and the “Starting with a Plain Decision Tree” section of Chapter 20), you may simply replace missing values with a –1 and rely on the imputer (a transformer algorithm used to complete missing values) to define the best possible value for the missing data. The following example shows a technique you can use to impute missing data values:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

s = [[1, 2, 3, np.NaN, 5, 6, None]]

imp = Imputer(missing_values='NaN',
              strategy='mean', axis=0)

imp.fit([[1, 2, 3, 4, 5, 6, 7]])
x = pd.Series(imp.transform(s).tolist()[0])
print(x)
In this example, s is missing some values. The code creates an Imputer to replace these missing values. The missing_values parameter defines what to look for, which is NaN. You set the axis parameter to 0 to impute along columns and 1 to impute along rows. The strategy parameter defines how to replace the missing values using one of these options. (You can discover more about the Imputer parameters at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html.)

mean: Replaces the values by using the mean along the axis
median: Replaces the values by using the median along the axis
most_frequent: Replaces the values by using the most frequent value along the axis

Before you can impute anything, you must provide statistics for the Imputer to use by calling fit(). The code then calls transform() on s to fill in the missing values. In this case, the example needs to display the result as a series. To create a series, you must convert the Imputer output to a list and use the resulting list as input to Series(). Here's the result of the process with the missing values filled in:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: float64
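Note that recent versions of scikit-learn remove the Imputer class in favor of SimpleImputer in the sklearn.impute module, which always imputes along columns and therefore has no axis parameter. Here's a sketch of a similar task using the newer class (here the statistics come from the data itself via fit_transform(), so the mean of the available values, 3.4, fills the gaps):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# SimpleImputer works column-wise, so reshape the data into one column
s = np.array([1, 2, 3, np.nan, 5, 6, np.nan]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform() learns the column mean and fills the missing entries
x = pd.Series(imp.fit_transform(s).ravel())
print(x)
```

The strategy options mean, median, and most_frequent carry over from the old class, and SimpleImputer adds a constant strategy paired with a fill_value parameter.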
You may not need to work with all the data in a dataset. In fact, looking at just one particular column, such as age, or a set of rows with a significant amount of information, might be beneficial. You perform two steps to obtain just the data you need for a particular task: first you select the rows (cases) of interest, and then you select the columns (features) within those rows. The act of slicing and dicing data gives you a subset of the data suitable for analysis. The following sections describe various ways to obtain specific pieces of data to meet particular needs.
Slicing can occur in multiple ways when working with data, but the technique of interest in this section is to slice data from a row of 2-D or 3-D data. A 2-D array may contain temperatures (x axis) over a specific time frame (y axis). Slicing a row would mean seeing the temperatures at a specific time. In some cases, you might associate rows with cases in a dataset.
A 3-D array might include an axis for place (x axis), product (y axis), and time (z axis) so that you can see sales for items over time. Perhaps you want to track whether sales of an item are increasing, and specifically where they are increasing. Slicing a row would mean seeing all the sales for one specific product for all locations at any time. The following example demonstrates how to perform this task:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])
x[1]
In this case, the example builds a 3-D array. It then slices row 1 of that array to produce the following output:
array([[11, 12, 13],
[14, 15, 16],
[17, 18, 19]])
Using the examples from the previous section, slicing columns would obtain data at a 90-degree angle from rows. In other words, when working with the 2-D array, you would want to see the times at which specific temperatures occurred. Likewise, you might want to see the sales of all products for a specific location at any time when working with the 3-D array. In some cases, you might associate columns with features in a dataset. The following example demonstrates how to perform this task using the same array as in the previous section:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])
x[:,1]
Note that the indexing now occurs at two levels. The first index refers to the row. Using the colon (:) for the row means to use all the rows. The second index refers to a column. In this case, the output will contain column 1. Here’s the output you see:
array([[ 4, 5, 6],
[14, 15, 16],
[24, 25, 26]])
The act of dicing a dataset means to perform both row and column slicing such that you end up with a data wedge. For example, when working with the 3-D array, you might want to see the sales of a specific product in a specific location at any time. The following example demonstrates how to perform this task using the same array as in the previous two sections:
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11,12,13], [14,15,16], [17,18,19]],
              [[21,22,23], [24,25,26], [27,28,29]]])

print(x[1,1])
print(x[:,1,1])
print(x[1,:,1])
print()
print(x[1:2, 1:2])
This example dices the array in four different ways. First, you get row 1, column 1. Of course, what you may actually want is column 1, z axis 1. If that’s not quite right, you could always request row 1, z axis 1 instead. Then again, you may want rows 1 and 2 of columns 1 and 2. Here’s the output of all four requests:
[14 15 16]
[ 5 15 25]
[12 15 18]
[[[14 15 16]]]
Data used for data science purposes seldom comes in a neat package. You may need to work with multiple databases in various locations — each of which has its own data format. It’s impossible to perform analysis on such disparate sources of information with any accuracy. To make the data useful, you must create a single dataset (by concatenating, or combining, the data from various sources).
Part of the process is to ensure that each field you create for the combined dataset has the same characteristics. For example, an age field in one database might appear as a string, but another database could use an integer for the same field. For the fields to work together, they must appear as the same type of information.
The following sections help you understand the process involved in concatenating and transforming data from various sources to create a single dataset. After you have a single dataset from these sources, you can begin to perform tasks such as analysis on the data. Of course, the trick is to create a single dataset that truly represents the data in all those disparate datasets — modifying the data would result in skewed results.
You often find a need to combine datasets in various ways or even to add new information for the sake of analysis purposes. The result is a combined dataset that includes either new cases or variables. The following example shows techniques for performing both tasks:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})
df1 = pd.DataFrame({'A': [4],
                    'B': [4],
                    'C': [4]})

df = df.append(df1)
df = df.reset_index(drop=True)
print(df)

df.loc[df.last_valid_index() + 1] = [5, 5, 5]
print()
print(df)

df2 = pd.DataFrame({'D': [1, 2, 3, 4, 5]})

df = df.join(df2)
print()
print(df)
The easiest way to add more data to an existing DataFrame is to rely on the append() method. You can also use the concat() method (a technique shown in Chapter 13). In this case, the three cases found in df are added to the single case found in df1. To ensure that the data is appended as anticipated, the columns in df and df1 must match. When you append two DataFrame objects in this manner, the new DataFrame contains the old index values. Use the reset_index() method to create a new index to make accessing cases easier.
You can also add another case to an existing DataFrame by creating the new case directly. Any time you add a new entry at a position that is one greater than the last_valid_index(), you get a new case as a result.
Sometimes you need to add a new variable (column) to the DataFrame. In this case, you rely on join() to perform the task. The resulting DataFrame will match cases with the same index value, so indexing is important. In addition, unless you want blank values, the number of cases in both DataFrame objects must match. Here's the output from this example:
A B C
0 2 1 5
1 3 2 3
2 1 3 4
3 4 4 4
A B C
0 2 1 5
1 3 2 3
2 1 3 4
3 4 4 4
4 5 5 5
A B C D
0 2 1 5 1
1 3 2 3 2
2 1 3 4 3
3 4 4 4 4
4 5 5 5 5
At some point, you may need to remove cases or variables from a dataset because they aren't required for your analysis. In both cases, you rely on the drop() method to perform the task. The difference in removing cases or variables is in how you describe what to remove, as shown in the following example:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})

df = df.drop(df.index[[1]])
print(df)

df = df.drop('B', 1)
print()
print(df)
The example begins by removing a case from df. Notice how the code relies on an index to describe what to remove. You can remove just one case (as shown), ranges of cases, or individual cases separated by commas. The main concern is to ensure that you have the correct index numbers for the cases you want to remove.
Removing a column is different. This example shows how to remove a column using a column name. You can also remove a column by using an index. In both cases, you must specify an axis as part of the removal process (normally 1). Here’s the output from this example:
A B C
0 2 1 5
2 1 3 4
A C
0 2 5
2 1 4
Sorting and shuffling are two ends of the same goal — to manage data order. In the first case, you put the data into order, while in the second, you remove any systematic patterning from the order. In general, you don’t sort datasets for the purpose of analysis because doing so can cause you to get incorrect results. However, you might want to sort data for presentation purposes. The following example shows both sorting and shuffling:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,1,2,3,3,5,4],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

df = df.sort_values(by=['A', 'B'], ascending=[True, True])
df = df.reset_index(drop=True)
print(df)

index = df.index.tolist()
np.random.shuffle(index)
df = df.loc[df.index[index]]
df = df.reset_index(drop=True)

print()
print(df)
It turns out that sorting the data is a bit easier than shuffling it. To sort the data, you use the sort_values() method and define which columns to use for indexing purposes. You can also determine whether the index is in ascending or descending order. Make sure to always call reset_index() when you're done so that the index appears in order for analysis or other purposes.
To shuffle the data, you first acquire the current index using df.index.tolist() and place it in index. A call to random.shuffle() creates a new order for the index. You then apply the new order to df using loc[]. As always, you call reset_index() to finalize the new order. Here's the output from this example (but note that the second output may not match your output because it has been shuffled):
A B C
0 1 2 3
1 2 1 5
2 2 3 4
3 3 4 1
4 3 5 1
5 4 5 3
6 5 2 2
A B C
0 2 1 5
1 2 3 4
2 3 4 1
3 1 2 3
4 3 5 1
5 4 5 3
6 5 2 2
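As an alternative to shuffling the index yourself, pandas provides the sample() method. The following sketch shuffles the same DataFrame in a single call; the random_state parameter makes the shuffle repeatable:

```python
import pandas as pd

df = pd.DataFrame({'A': [2,1,2,3,3,5,4],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

# frac=1 returns all rows in a random order; reset_index tidies the index
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Because sample() returns a new DataFrame, the original stays untouched, which is convenient when you want both the ordered and shuffled versions around.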
Aggregation is the process of combining or grouping data together into a set, bag, or list. The data may or may not be alike. However, in most cases, an aggregation function combines several rows together statistically using algorithms such as average, count, maximum, median, minimum, mode, or sum. There are several reasons to aggregate data.
The most important use of data aggregation is to promote anonymity in order to meet legal or other concerns. Sometimes even data that should be anonymous turns out to provide identification of an individual using the proper analysis techniques. For example, researchers have found that it's possible to identify individuals based on just three credit card purchases (see https://www.computerworld.com/article/2877935/how-three-small-credit-card-transactions-could-reveal-your-identity.html for details). Here's an example that shows how to perform aggregation tasks:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Map': [0,0,0,1,1,2,2],
                   'Values': [1,2,3,5,4,2,5]})

df['S'] = df.groupby('Map')['Values'].transform(np.sum)
df['M'] = df.groupby('Map')['Values'].transform(np.mean)
df['V'] = df.groupby('Map')['Values'].transform(np.var)
print(df)
In this case, you have two initial features for this DataFrame. The values in Map define which elements in Values belong together. For example, when calculating a sum for Map index 0, you use the Values 1, 2, and 3.
To perform the aggregation, you must first call groupby() to group the Map values. You then index into Values and rely on transform() to create the aggregated data using one of several algorithms found in NumPy, such as np.sum. Here are the results of this calculation:
Map Values S M V
0 0 1 6 2.0 1.0
1 0 2 6 2.0 1.0
2 0 3 6 2.0 1.0
3 1 5 9 4.5 0.5
4 1 4 9 4.5 0.5
5 2 2 7 3.5 4.5
6 2 5 7 3.5 4.5
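Whereas transform() keeps one row per original observation, the related agg() method collapses each group down to a single row. The following sketch computes the same three statistics in one call:

```python
import pandas as pd

df = pd.DataFrame({'Map': [0,0,0,1,1,2,2],
                   'Values': [1,2,3,5,4,2,5]})

# agg() produces one row per group instead of one row per observation
summary = df.groupby('Map')['Values'].agg(['sum', 'mean', 'var'])
print(summary)
```

Use transform() when you want the aggregated value attached to every observation (as in the example above) and agg() when you want a compact per-group summary table.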