This chapter will explore common preprocessing steps using this data:
>>>
X2
=
pd
.
DataFrame
(
...
{
...
"a"
:
range
(
5
),
...
"b"
:
[
-
100
,
-
50
,
0
,
200
,
1000
],
...
}
...
)
>>>
X2
a b
0 0 -100
1 1 -50
2 2 0
3 3 200
4 4 1000
Some algorithms, such as SVM, perform better when the data is standardized. Each
column should have a mean value of 0 and standard deviation of 1.
Sklearn provides a .fit_transform
method that combines both .fit
and
.transform
:
>>>
from
sklearn
import
preprocessing
>>>
std
=
preprocessing
.
StandardScaler
()
>>>
std
.
fit_transform
(
X2
)
array([[-1.41421356, -0.75995002],
[-0.70710678, -0.63737744],
[ 0. , -0.51480485],
[ 0.70710678, -0.02451452],
[ 1.41421356, 1.93664683]])
After fitting, there are various attributes we can inspect:
>>>
std
.
scale_
array([ 1.41421356, 407.92156109])
>>>
std
.
mean_
array([ 2., 210.])
>>>
std
.
var_
array([2.000e+00, 1.664e+05])
Here is a pandas version. Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you will use to predict later will need to be standardized with those same values:
>>>
X_std
=
(
X2
-
X2
.
mean
())
/
X2
.
std
()
>>>
X_std
a b
0 -1.264911 -0.679720
1 -0.632456 -0.570088
2 0.000000 -0.460455
3 0.632456 -0.021926
4 1.264911 1.732190
>>>
X_std
.
mean
()
a 4.440892e-17
b 0.000000e+00
dtype: float64
>>>
X_std
.
std
()
a 1.0
b 1.0
dtype: float64
The fastai library also implements this:
>>>
X3
=
X2
.
copy
()
>>>
from
fastai.structured
import
scale_vars
>>>
scale_vars
(
X3
,
mapper
=
None
)
>>>
X3
.
std
()
a 1.118034
b 1.118034
dtype: float64
>>>
X3
.
mean
()
a 0.000000e+00
b 4.440892e-17
dtype: float64
Scaling to range is translating data so it is between 0 and 1, inclusive. Having the data bounded may be useful. However, if you have outliers, you probably want to be careful using this:
>>>
from
sklearn
import
preprocessing
>>>
mms
=
preprocessing
.
MinMaxScaler
()
>>>
mms
.
fit
(
X2
)
>>>
mms
.
transform
(
X2
)
array([[0. , 0. ],
[0.25 , 0.04545],
[0.5 , 0.09091],
[0.75 , 0.27273],
[1. , 1. ]])
>>>
(
X2
-
X2
.
min
())
/
(
X2
.
max
()
-
X2
.
min
())
a b
0 0.00 0.000000
1 0.25 0.045455
2 0.50 0.090909
3 0.75 0.272727
4 1.00 1.000000
We can use pandas to create dummy variables from categorical data.
This is also referred to as one-hot encoding, or indicator encoding.
Dummy variables are especially useful if the data is nominal (unordered). The
get_dummies
function in pandas creates multiple columns for a categorical column, each with a
1 or 0 if the original column had that value:
>>>
X_cat
=
pd
.
DataFrame
(
...
{
...
"name"
:
[
"George"
,
"Paul"
],
...
"inst"
:
[
"Bass"
,
"Guitar"
],
...
}
...
)
>>>
X_cat
name inst
0 George Bass
1 Paul Guitar
Here is the pandas version. Note the drop_first
option
can be used to eliminate a column (one of the dummy columns
is a linear combination of the other columns):
>>>
pd
.
get_dummies
(
X_cat
,
drop_first
=
True
)
name_Paul inst_Guitar
0 0 0
1 1 1
The pyjanitor library also has the ability to split columns with the
expand_column
function:
>>>
X_cat2
=
pd
.
DataFrame
(
...
{
...
"A"
:
[
1
,
None
,
3
],
...
"names"
:
[
...
"Fred,George"
,
...
"George"
,
...
"John,Paul"
,
...
],
...
}
...
)
>>>
jn
.
expand_column
(
X_cat2
,
"names"
,
sep
=
","
)
A names Fred George John Paul
0 1.0 Fred,George 1 1 0 0
1 NaN George 0 1 0 0
2 3.0 John,Paul 0 0 1 1
If we have high cardinality nominal data, we can use label encoding. This is introduced in the next section.
An alternative to dummy variable encoding is label encoding. This will take categorical data and assign each value a number. It is useful for high cardinality data. This encoder imposes ordinality, which may or may not be desired. It can take up less space than one-hot encoding, and some (tree) algorithms can deal with this encoding.
The label encoder can only deal with one column at a time:
>>>
from
sklearn
import
preprocessing
>>>
lab
=
preprocessing
.
LabelEncoder
()
>>>
lab
.
fit_transform
(
X_cat
)
array([0,1])
If you have encoded values, applying the .inverse_transform
method decodes them:
>>>
lab
.
inverse_transform
([
1
,
1
,
0
])
array(['Guitar', 'Guitar', 'Bass'], dtype=object)
You can also use pandas to label encode. First, you convert the column to a categorical column type, and then pull out the numeric code from it.
This code will create a new series of numeric data from a pandas
series. We use the .as_ordered
method to ensure that the
category is ordered:
>>>
X_cat
.
name
.
astype
(
...
"category"
...
)
.
cat
.
as_ordered
()
.
cat
.
codes
+
1
0 1
1 2
dtype: int8
Another option for handling high cardinality categorical data is to
frequency encode it. This means replacing the name of the category
with the count it had in the training data.
We will use pandas to do this. First, we will use the pandas .value_counts
method to make a mapping (a pandas series that maps strings to counts).
With the mapping we can use the .map
method to do the encoding:
>>>
mapping
=
X_cat
.
name
.
value_counts
()
>>>
X_cat
.
name
.
map
(
mapping
)
0 1
1 1
Name: name, dtype: int64
Make sure you store the training mapping so you can encode future data with the same data.
One way to increase the accuracy of the Titanic model is to pull out titles
from the names. A quick hack to find the most common triples is to
use the Counter
class:
>>>
from
collections
import
Counter
>>>
c
=
Counter
()
>>>
def
triples
(
val
):
...
for
i
in
range
(
len
(
val
)):
...
c
[
val
[
i
:
i
+
3
]]
+=
1
>>>
df
.
name
.
apply
(
triples
)
>>>
c
.
most_common
(
10
)
[(', M', 1282),
(' Mr', 954),
('r. ', 830),
('Mr.', 757),
('s. ', 460),
('n, ', 320),
(' Mi', 283),
('iss', 261),
('ss.', 261),
('Mis', 260)]
We can see that “Mr.” and “Miss.” are very common.
Another option is to use a regular expression to pull out the capital letter followed by lowercase letters and a period:
>>>
df
.
name
.
str
.
extract
(
...
"([A-Za-z]+)."
,
expand
=
False
...
)
.
head
()
0 Miss
1 Master
2 Miss
3 Mr
4 Mrs
Name: name, dtype: object
We can use .value_counts
to see the frequency of these:
>>>
df
.
name
.
str
.
extract
(
...
"([A-Za-z]+)."
,
expand
=
False
...
)
.
value_counts
()
Mr 757
Miss 260
Mrs 197
Master 61
Dr 8
Rev 8
Col 4
Mlle 2
Ms 2
Major 2
Dona 1
Don 1
Lady 1
Countess 1
Capt 1
Sir 1
Mme 1
Jonkheer 1
Name: name, dtype: int64
A complete discussion of regular expressions is beyond the scope of this book. This expression captures a group with one or more alphabetic characters. This group will be followed by a period.
Using these manipulations and pandas, you can create dummy variables or combine columns with low counts into other categories (or drop them).
The categorical_encoding library is a set of scikit-learn transformers used to convert categorical data into numeric data. A nice feature of this library is that it supports outputting pandas DataFrames (unlike scikit-learn, which transforms them into numpy arrays).
One algorithm implemented in the library is a hash encoder. This is useful if you
don’t know how many categories you have ahead of time or if you are using a bag
of words to represent text. This will hash the categorical columns into n_components
. If you are using online learning (models that can be updated), this can be very useful:
>>>
import
category_encoders
as
ce
>>>
he
=
ce
.
HashingEncoder
(
verbose
=
1
)
>>>
he
.
fit_transform
(
X_cat
)
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7
0 0 0 0 1 0 1 0 0
1 0 2 0 0 0 0 0 0
The ordinal encoder can convert categorical columns that have order to a
single column of numbers. Here we convert the size column to ordinal
numbers. If a value is missing from the mapping dictionary, the default
value of -1
is used:
>>>
size_df
=
pd
.
DataFrame
(
...
{
...
"name"
:
[
"Fred"
,
"John"
,
"Matt"
],
...
"size"
:
[
"small"
,
"med"
,
"xxl"
],
...
}
...
)
>>>
ore
=
ce
.
OrdinalEncoder
(
...
mapping
=
[
...
{
...
"col"
:
"size"
,
...
"mapping"
:
{
...
"small"
:
1
,
...
"med"
:
2
,
...
"lg"
:
3
,
...
},
...
}
...
]
...
)
>>>
ore
.
fit_transform
(
size_df
)
name size
0 Fred 1.0
1 John 2.0
2 Matt -1.0
This reference explains many of the algorithms of the categorical_encoding library.
If you have high cardinality data (a large number of unique values) consider using one of the Bayesian encoders that output a single column per categorical column. These are TargetEncoder
, LeaveOneOutEncoder
, WOEEncoder
, JamesSteinEncoder
, and MEstimateEncoder
.
For example, to convert the Titanic survival column to a blend of posterior probability of the target and the prior probability given the title (categorical) information, use the following code:
>>>
def
get_title
(
df
):
...
return
df
.
name
.
str
.
extract
(
...
"([A-Za-z]+)."
,
expand
=
False
...
)
>>>
te
=
ce
.
TargetEncoder
(
cols
=
"Title"
)
>>>
te
.
fit_transform
(
...
df
.
assign
(
Title
=
get_title
),
df
.
survived
...
)[
"Title"
]
.
head
()
0 0.676923
1 0.508197
2 0.676923
3 0.162483
4 0.786802
Name: Title, dtype: float64
The fastai library has an add_datepart
function that will generate date
attribute columns based on a datetime column. This is useful as most
machine learning algorithms would not be able to infer this type of
signal from a numeric representation of a date:
>>>
from
fastai.tabular.transform
import
(
...
add_datepart
,
...
)
>>>
dates
=
pd
.
DataFrame
(
...
{
...
"A"
:
pd
.
to_datetime
(
...
[
"9/17/2001"
,
"Jan 1, 2002"
]
...
)
...
}
...
)
>>>
add_datepart
(
dates
,
"A"
)
>>>
dates
.
T
0 1
AYear 2001 2002
AMonth 9 1
AWeek 38 1
ADay 17 1
ADayofweek 0 1
ADayofyear 260 1
AIs_month_end False False
AIs_month_start False True
AIs_quarter_end False False
AIs_quarter_start False True
AIs_year_end False False
AIs_year_start False True
AElapsed 1000684800 1009843200
add_datepart
mutates the DataFrame, which pandas can do, but normally
doesn’t!
The fastai library used to have a function for creating a column to fill a missing value (with the median) and indicate that a value was missing. There might be some signal in knowing that a value was missing. Here is a copy of the function and an example using it:
>>>
from
pandas.api.types
import
is_numeric_dtype
>>>
def
fix_missing
(
df
,
col
,
name
,
na_dict
):
...
if
is_numeric_dtype
(
col
):
...
if
pd
.
isnull
(
col
)
.
sum
()
or
(
...
name
in
na_dict
...
):
...
df
[
name
+
"_na"
]
=
pd
.
isnull
(
col
)
...
filler
=
(
...
na_dict
[
name
]
...
if
name
in
na_dict
...
else
col
.
median
()
...
)
...
df
[
name
]
=
col
.
fillna
(
filler
)
...
na_dict
[
name
]
=
filler
...
return
na_dict
>>>
data
=
pd
.
DataFrame
({
"A"
:
[
0
,
None
,
5
,
100
]})
>>>
fix_missing
(
data
,
data
.
A
,
"A"
,
{})
{'A': 5.0}
>>>
data
A A_na
0 0.0 False
1 5.0 True
2 5.0 False
3 100.0 False
Here is a pandas version:
>>>
data
=
pd
.
DataFrame
({
"A"
:
[
0
,
None
,
5
,
100
]})
>>>
data
[
"A_na"
]
=
data
.
A
.
isnull
()
>>>
data
[
"A"
]
=
data
.
A
.
fillna
(
data
.
A
.
median
())
We can use pandas to generate new features. For the Titanic dataset,
we can add aggregate cabin data (maximum age per cabin, mean age per cabin, etc.).
To get aggregate data per
cabin and merge it back in, use the pandas .groupby
method to create the data.
Then align it back to the original data using the .merge
method:
>>>
agg
=
(
...
df
.
groupby
(
"cabin"
)
...
.
agg
(
"min,max,mean,sum"
.
split
(
","
))
...
.
reset_index
()
...
)
>>>
agg
.
columns
=
[
...
"_"
.
join
(
c
)
.
strip
(
"_"
)
...
for
c
in
agg
.
columns
.
values
...
]
>>>
agg_df
=
df
.
merge
(
agg
,
on
=
"cabin"
)
If you wanted to sum up “good” or “bad” columns, you could create a new column that is the sum of the aggregated columns (or another mathematical operation). This is somewhat of an art and also requires understanding the data.
18.117.253.163