Data types determine what can and cannot be done to a variable (i.e., column). For example, when numeric data types are added together, the result will be a sum of the values; in contrast, if strings (in Pandas they are object or string types) are added, the strings will be concatenated together.
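A minimal sketch of that difference, using two small made-up Series (the names numbers and words are illustrative and not part of the tips data):

```python
import pandas as pd

# Hypothetical toy columns, not part of the tips data
numbers = pd.Series([1, 2, 3])
words = pd.Series(["a", "b", "c"])

# Adding numeric columns sums the values element by element
numeric_sum = numbers + numbers

# Adding string columns concatenates the values element by element
string_concat = words + words

print(numeric_sum.tolist())    # [2, 4, 6]
print(string_concat.tolist())  # ['aa', 'bb', 'cc']
```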
This chapter presents a quick overview of the various data types you may encounter in Pandas, and ways to convert from one data type to another.
Recognize that each column in a data set stores a single data type
Identify what kind of data type is stored in a column
Use functions to change the type of a column
Modify categorical columns
In this chapter, we’ll use the built-in tips data set from the seaborn library.
import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")
To get a list of the data types stored in each column of our dataframe, we call the dtypes attribute (Section 1.2).
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
Table 1.1 listed the various types of data that can be stored in a Pandas column. Our data set includes data of types int64, float64, and category. The int64 and float64 types represent numeric values without and with decimal points, respectively. The number following the numeric data type represents the number of bits of information that will be stored for that particular number.
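One way to see what the bit count means is to ask numpy, the library that provides these numeric types, for the limits of a few of them. This is a small illustrative sketch; the variable names are arbitrary, and int8 is included only for comparison.

```python
import numpy as np

# Integer types: more bits means a wider range of representable values
int64_info = np.iinfo(np.int64)
int8_info = np.iinfo(np.int8)

# Floating-point types report their bit width through np.finfo
float64_info = np.finfo(np.float64)

print(int64_info.bits, int64_info.min, int64_info.max)
print(int8_info.bits, int8_info.min, int8_info.max)  # 8 -128 127
print(float64_info.bits)                             # 64
```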
The category data type represents categorical variables. It differs from the generic object data type that stores arbitrary Python objects (usually strings). We will explore these differences later in this chapter. Since the tips data set is a fully prepared and cleaned data set, variables that store strings were saved as a category.
The data type that is stored in a column will govern which kinds of functions and calculations you can perform on the data found in that column. Clearly, then, it’s important to know how to convert between data types.
This section focuses on how to convert from one data type to another. Keep in mind that you need not do all your data type conversions at once, when you first get your data. Data analytics is not a linear process, and you can choose to convert types on the fly as needed. We saw an example of this in Section 2.4.2, when we converted a date value into just the number of years.
In our tips data, the sex, smoker, day, and time variables are stored as a category. In general, it’s much easier to work with string object types when the variable is not numeric. There are performance benefits from using a category data type, however.
Some data sets may have an id column in which the id is stored as a number, but has no meaning if you perform a calculation on it (e.g., if you try to find the mean). Unique identifiers or id numbers are typically coded this way, and you may want to convert them to string object types depending on what you need.
To convert values into strings, we use the .astype() method on the column (i.e., Series).1 The .astype() method takes a parameter, dtype, which will be the new data type the column will take on. In this case, we want to convert the sex variable to a string object, str.
1. Series.astype() method documentation: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.astype.html
# convert the category sex column into a string dtype
tips['sex_str'] = tips['sex'].astype(str)
Python has built-in str, float, int, complex, and bool types. However, you can also specify any dtype from the numpy library. If we look at the dtypes now, you will see that sex_str has a dtype of object.
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
The .astype() method is a generic function that can be used to convert any column in a dataframe to another dtype.
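For instance, the id scenario described earlier could be handled as in the sketch below. The patients dataframe and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: 'id' is a label, not a quantity
patients = pd.DataFrame(
    {"id": [101, 102, 103], "weight": [70.5, 82.1, 64.3]}
)

# Pandas will happily compute this, but the result has no meaning
print(patients["id"].mean())

# Convert the identifiers to strings so they are no longer treated as numbers
patients["id"] = patients["id"].astype(str)
print(patients["id"].tolist())  # ['101', '102', '103']
```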
Recall that each column in a DataFrame is a Pandas Series object. That’s why the .astype() documentation is listed under pandas.Series.astype. The example here shows how to change the type of a DataFrame column, but if you are working with a Series object, you can use the same .astype() method to convert the Series as well.
We can provide any built-in or numpy type to the .astype() method to convert the dtype of the column. For example, if we wanted to convert the total_bill column first to a string object and then back to its original float64, we can pass str and float into .astype(), respectively.
# convert total_bill into a string
tips['total_bill'] = tips['total_bill'].astype(str)
print(tips.dtypes)
total_bill object
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
# convert it back to a float
tips['total_bill'] = tips['total_bill'].astype(float)
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
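The same method also accepts numpy types, and it works on a standalone Series. The sketch below assumes a small hand-made Series named bills (not the tips data) and converts it to a 32-bit float, passing the type once as a numpy object and once by name.

```python
import numpy as np
import pandas as pd

# Hypothetical standalone Series; floats default to float64
bills = pd.Series([16.99, 10.34, 21.01])

# Pass a numpy type object ...
bills_f32 = bills.astype(np.float32)

# ... or the equivalent type name as a string
bills_f32_by_name = bills.astype("float32")

print(bills.dtype)              # float64
print(bills_f32.dtype)          # float32
print(bills_f32_by_name.dtype)  # float32
```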
.to_numeric() Method
When converting variables into numeric values (e.g., int, float), you can also use the Pandas to_numeric() function, which handles non-numeric values better.
Since all values in a dataframe column must share the same dtype, there will be times when a numeric column contains strings as some of its values. For example, instead of the NaN value that represents a missing value in Pandas, a numeric column might use the string 'missing' or 'null' for this purpose. This would make the entire column a string object type instead of a numeric type.
Let’s subset our tips dataframe and also put in a 'missing' value in the total_bill column to illustrate how the to_numeric() function works.
# subset the tips data
tips_sub_miss = tips.head(10).copy()
# assign some 'missing' values
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'
print(tips_sub_miss)
total_bill tip sex smoker day time size sex_str
0 16.99 1.01 Female No Sun Dinner 2 Female
1 missing 1.66 Male No Sun Dinner 3 Male
2 21.01 3.50 Male No Sun Dinner 3 Male
3 missing 3.31 Male No Sun Dinner 2 Male
4 24.59 3.61 Female No Sun Dinner 4 Female
5 missing 4.71 Male No Sun Dinner 4 Male
6 8.77 2.00 Male No Sun Dinner 2 Male
7 missing 3.12 Male No Sun Dinner 4 Male
8 15.04 1.96 Male No Sun Dinner 2 Male
9 14.78 3.23 Male No Sun Dinner 2 Male
Looking at the dtypes, you will see that the total_bill column is now a string object.
print(tips_sub_miss.dtypes)
total_bill object
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
If we now try to use the .astype() method to convert the column back to a float, we will get an error: Pandas does not know how to convert 'missing' into a float.
# this will cause an error
tips_sub_miss['total_bill'].astype(float)
ValueError: could not convert string to float: 'missing'
If we use the to_numeric() function from the pandas library, we get a similar error.
# this will cause an error
pd.to_numeric(tips_sub_miss['total_bill'])
ValueError: Unable to parse string "missing" at position 1
The to_numeric() function has a parameter called errors that governs what happens when the function encounters a value that it is unable to convert to a numeric value. By default, this value is set to 'raise'; that is, if it does encounter a value it is unable to convert to a numeric value, it will 'raise' an error.
Based on the documentation:2
If ‘raise’, then invalid parsing will raise an exception
If ‘coerce’, then invalid parsing will be set as NaN
If ‘ignore’, then invalid parsing will return the input
2. to_numeric() function documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
Going out of order from the documentation, if we pass errors the 'ignore' value, nothing will change in our column. But we also do not get an error message, which may not always be the behavior we want.
tips_sub_miss["total_bill"] = pd.to_numeric(
tips_sub_miss["total_bill"], errors="ignore"
)
print(tips_sub_miss)
total_bill tip sex smoker day time size sex_str
0 16.99 1.01 Female No Sun Dinner 2 Female
1 missing 1.66 Male No Sun Dinner 3 Male
2 21.01 3.50 Male No Sun Dinner 3 Male
3 missing 3.31 Male No Sun Dinner 2 Male
4 24.59 3.61 Female No Sun Dinner 4 Female
5 missing 4.71 Male No Sun Dinner 4 Male
6 8.77 2.00 Male No Sun Dinner 2 Male
7 missing 3.12 Male No Sun Dinner 4 Male
8 15.04 1.96 Male No Sun Dinner 2 Male
9 14.78 3.23 Male No Sun Dinner 2 Male
print(tips_sub_miss.dtypes)
total_bill object
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
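Recent Pandas releases (starting with 2.2, according to the release notes) deprecate errors="ignore" and recommend catching the exception explicitly instead. A minimal sketch of that alternative follows. The helper name to_numeric_or_original is hypothetical, not a Pandas function, and the two small Series are made up for illustration.

```python
import pandas as pd


def to_numeric_or_original(values):
    """Return numeric values if parsing succeeds, else the input unchanged.

    Hypothetical helper that mimics what errors="ignore" does.
    """
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values


# Made-up columns stored as generic Python objects
with_missing = pd.Series(["16.99", "missing", "21.01"], dtype=object)
all_numeric = pd.Series(["16.99", "10.34", "21.01"], dtype=object)

unchanged = to_numeric_or_original(with_missing)
converted = to_numeric_or_original(all_numeric)

print(unchanged.dtype)  # object
print(converted.dtype)  # float64
```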
In contrast, if we pass in the 'coerce' value, we will get NaN values for the 'missing' string.
tips_sub_miss["total_bill"] = pd.to_numeric(
tips_sub_miss["total_bill"], errors="coerce"
)
print(tips_sub_miss)
total_bill tip sex smoker day time size sex_str
0 16.99 1.01 Female No Sun Dinner 2 Female
1 NaN 1.66 Male No Sun Dinner 3 Male
2 21.01 3.50 Male No Sun Dinner 3 Male
3 NaN 3.31 Male No Sun Dinner 2 Male
4 24.59 3.61 Female No Sun Dinner 4 Female
5 NaN 4.71 Male No Sun Dinner 4 Male
6 8.77 2.00 Male No Sun Dinner 2 Male
7 NaN 3.12 Male No Sun Dinner 4 Male
8 15.04 1.96 Male No Sun Dinner 2 Male
9 14.78 3.23 Male No Sun Dinner 2 Male
print(tips_sub_miss.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
This is a useful trick when you know a column must contain numeric values, but for some reason the data include non-numeric values.
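Because 'coerce' silently replaces anything it cannot parse, it can be worth checking which values were turned into NaN. The sketch below shows one possible way to do that on a small made-up Series; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical raw column with two different placeholder strings
raw = pd.Series(["16.99", "missing", "21.01", "null"], dtype=object)

# Unparseable values become NaN
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but are NaN after conversion
was_coerced = converted.isna() & raw.notna()
n_coerced = int(was_coerced.sum())
coerced_values = sorted(raw[was_coerced].unique().tolist())

print(n_coerced)       # 2
print(coerced_values)  # ['missing', 'null']
```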
Not all data values are numeric. Pandas has a category dtype that can encode categorical values.3 Here are a few use cases for categorical data:
3. Categorical data: https://pandas.pydata.org/docs/user_guide/categorical.html
It can be memory and speed efficient to store data in this manner, especially if the data set includes many repeated string values
Categorical data may be appropriate when a column of values has an order (e.g., a Likert scale)
Some Python libraries understand how to deal with categorical data (e.g., when fitting statistical models)
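The ordered use case can be sketched with pd.CategoricalDtype. The three-level scale and the responses below are invented for illustration; a real Likert scale would typically have more levels.

```python
import pandas as pd

# Hypothetical three-point scale, listed from lowest to highest
likert = pd.CategoricalDtype(
    categories=["disagree", "neutral", "agree"], ordered=True
)

responses = pd.Series(["agree", "neutral", "disagree", "agree"]).astype(likert)

# Ordered categories sort by their declared order, not alphabetically
sorted_responses = responses.sort_values().tolist()

# They also support comparisons against a category label
above_neutral = (responses > "neutral").tolist()

print(sorted_responses)  # ['disagree', 'neutral', 'agree', 'agree']
print(above_neutral)     # [True, False, False, True]
```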
To convert a column into a categorical type, we pass category into the .astype() method.
# convert the sex column into a string object first
tips['sex'] = tips['sex'].astype('str')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
7 sex_str 244 non-null object
dtypes: category(3), float64(2), int64(1), object(2)
memory usage: 10.8+ KB
None
# convert the sex column back into categorical data
tips['sex'] = tips['sex'].astype('category')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
7 sex_str 244 non-null object
dtypes: category(4), float64(2), int64(1), object(1)
memory usage: 9.3+ KB
None
You can also see the difference in memory usage from the string and category storage.
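The memory difference can also be measured directly for a single column with the .memory_usage() method. The sketch below uses a made-up column of 10,000 repeated strings rather than the tips data, so the exact byte counts will differ from the output above.

```python
import pandas as pd

# Hypothetical column with many repeated string values
as_object = pd.Series(["Male", "Female"] * 5000, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the memory used by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(category_bytes < object_bytes)  # True
```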
The API reference has a list of which operations can be performed on a categorical Series.4 The .cat. accessor is an attribute that allows you to access the category information in the Series. This list has been reproduced in Table 10.1.
4. The .cat. accessor: https://pandas.pydata.org/docs/reference/series.html#categorical-accessor
Table 10.1 Categorical Accessor Attributes and Methods
Attribute or Method | Description |
---|---|
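Series.cat.categories | The categories |
Series.cat.ordered | Whether the categories are ordered |
Series.cat.codes | Return the integer code of the category |
Series.cat.rename_categories() | Rename categories |
Series.cat.reorder_categories() | Reorder categories |
Series.cat.add_categories() | Add new categories |
Series.cat.remove_categories() | Remove categories |
Series.cat.remove_unused_categories() | Remove unused categories |
Series.cat.set_categories() | Set new categories |
Series.cat.as_ordered() | Make the category ordered |
Series.cat.as_unordered() | Make the category unordered |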
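A short sketch of a few of these accessor members in use, on a made-up sizes Series (the data and variable names are illustrative):

```python
import pandas as pd

# Hypothetical categorical Series; categories are inferred and sorted
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")

print(sizes.cat.categories.tolist())  # ['large', 'medium', 'small']
print(sizes.cat.ordered)              # False
print(sizes.cat.codes.tolist())       # [2, 0, 1, 2]

# Put the categories in a meaningful order and mark them as ordered
ordered_sizes = sizes.cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)
print(ordered_sizes.cat.codes.tolist())  # [0, 2, 1, 0]

# Rename the categories with a mapping from old to new labels
renamed = sizes.cat.rename_categories(
    {"small": "S", "medium": "M", "large": "L"}
)
print(renamed.tolist())  # ['S', 'L', 'M', 'S']
```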
This chapter covered how to convert from one data type to another. The dtypes govern which operations can and cannot be performed on a column. While this chapter is relatively short, converting types is an important skill when you are working with data and when you are using other Pandas methods.