10

Data Types

Data types determine what can and cannot be done to a variable (i.e., column). For example, when numeric data types are added together, the result will be a sum of the values; in contrast, if strings (in Pandas they are object or string types) are added, the strings will be concatenated together.
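As a minimal sketch of this difference, the snippet below uses a small made-up dataframe (the column names amount and label are illustrative, not part of any data set used in this chapter) and applies the same + operator to a numeric column and to a string column.

```python
import pandas as pd

# a small made-up dataframe: one numeric column and one string column
df = pd.DataFrame({"amount": [1.5, 2.0], "label": ["a", "b"]})

# adding a numeric column to itself sums the values element-wise
numeric_sum = df["amount"] + df["amount"]

# adding a string column to itself concatenates the strings element-wise
string_concat = df["label"] + df["label"]

print(numeric_sum.tolist())    # [3.0, 4.0]
print(string_concat.tolist())  # ['aa', 'bb']
```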

This chapter presents a quick overview of the various data types you may encounter in Pandas, and shows how to convert from one data type to another.

Learning Objectives

  • Recognize that each column in a data set stores a single data type

  • Identify what kind of data type is stored in a column

  • Use functions to change the type of a column

  • Modify categorical columns

10.1 Data Types

In this chapter, we’ll use the built-in tips data set from the seaborn library.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

To get a list of the data types stored in each column of our dataframe, we use the dtypes attribute (Section 1.2).

print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size            int64
dtype: object

Table 1.1 listed the various types of data that can be stored in a Pandas column. Our data set includes data of types int64, float64, and category. The int64 and float64 types represent numeric values without and with decimal points, respectively. The number following the type name (here, 64) is the number of bits used to store each value.
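To make the "number of bits" concrete, here is a small sketch that asks numpy (the library Pandas builds its numeric dtypes on) how much space each type uses. It is only an illustration and is not needed for the rest of the chapter.

```python
import numpy as np

# itemsize is reported in bytes; multiply by 8 to get bits
int_bits = np.dtype("int64").itemsize * 8
float_bits = np.dtype("float64").itemsize * 8
print(int_bits, float_bits)  # 64 64

# the smallest and largest whole numbers a 64-bit signed integer can hold
int_info = np.iinfo(np.int64)
print(int_info.min, int_info.max)
```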

The category data type represents categorical variables. It differs from the generic object data type that stores arbitrary Python objects (usually strings). We will explore these differences later in this chapter. Since the tips data set is a fully prepared and cleaned data set, variables that store strings were saved as a category.
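As a quick preview, the sketch below converts a made-up Series of day names (not the tips data) to a category. It shows that a category keeps one copy of each distinct value plus an integer code for every row.

```python
import pandas as pd

# made-up values; the variable names are illustrative only
days = pd.Series(["Sun", "Sat", "Sun", "Thur"])
days_cat = days.astype("category")

print(days_cat.dtype)                    # category
print(days_cat.cat.categories.tolist())  # ['Sat', 'Sun', 'Thur']
print(days_cat.cat.codes.tolist())       # [1, 0, 1, 2]
```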

10.2 Converting Types

The data type that is stored in a column will govern which kinds of functions and calculations you can perform on the data found in that column. Clearly, then, it’s important to know how to convert between data types.

This section focuses on how to convert from one data type to another. Keep in mind that you need not do all your data type conversions at once, when you first get your data. Data analytics is not a linear process, and you can choose to convert types on the fly as needed. We saw an example of this in Section 2.4.2, when we converted a date value into just the number of years.

10.2.1 Converting to String Objects

In our tips data, the sex, smoker, day, and time variables are stored as a category. In general, it’s much easier to work with string object types when a variable is not numeric, although the category data type does offer performance benefits.

Some data sets may have an id column in which the id is stored as a number, but has no meaning if you perform a calculation on it (e.g., if you try to find the mean). Unique identifiers or id numbers are typically coded this way, and you may want to convert them to string object types depending on what you need.
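The following sketch illustrates the id case with a hypothetical patients dataframe (the data and column names are invented for this example). It uses the .astype() method, which is described next.

```python
import pandas as pd

# hypothetical data: 'id' is an identifier that happens to be stored as a number
patients = pd.DataFrame({"id": [101, 102, 103], "age": [34, 51, 29]})

# the mean of an id can be computed, but it has no meaning
print(patients["id"].mean())

# convert the id column to strings so it is no longer treated as a number
patients["id"] = patients["id"].astype(str)
print(patients["id"].tolist())  # ['101', '102', '103']
```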

To convert values into strings, we use the .astype() method on the column (i.e., Series).1 The .astype() method takes a parameter, dtype, which will be the new data type the column will take on. In this case, we want to convert the sex variable to a string object, str.

1. Series.astype() method documentation: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.astype.html

# convert the category sex column into a string dtype
tips['sex_str'] = tips['sex'].astype(str)

Python has built-in str, float, int, complex, and bool types. However, you can also specify any dtype from the numpy library. If we look at the dtypes now, you will see the sex_str now has a dtype of object.

print(tips.dtypes)
total_bill       float64
tip              float64
sex             category
smoker          category
day             category
time            category
size               int64
sex_str           object
dtype: object

10.2.2 Converting to Numeric Values

The .astype() method is a generic function that can be used to convert any column in a dataframe to another dtype.

Recall that each column in a DataFrame is a Pandas Series object. That’s why the .astype() documentation is listed under pandas.Series.astype. The example here shows how to change the type of a DataFrame column, but if you are working with a Series object, you can use the same .astype() method to convert the Series as well.
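For instance, here is a minimal sketch that calls .astype() on a standalone Series (made-up values) with a built-in type and with a numpy dtype.

```python
import numpy as np
import pandas as pd

counts = pd.Series([1, 2, 3])

as_float = counts.astype(float)         # built-in float becomes float64
as_float32 = counts.astype(np.float32)  # a numpy dtype can be passed directly
as_text = counts.astype(str)            # built-in str converts values to strings

print(as_float.dtype)    # float64
print(as_float32.dtype)  # float32
print(as_text.tolist())  # ['1', '2', '3']
```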

We can provide any built-in or numpy type to the .astype() method to convert the dtype of the column. For example, if we wanted to convert the total_bill column first to a string object and then back to its original float64, we can pass str and float into .astype(), respectively.

# convert total_bill into a string
tips['total_bill'] = tips['total_bill'].astype(str)
print(tips.dtypes)
total_bill     object
tip           float64
sex          category
smoker       category
day          category
time         category
size            int64
sex_str        object
dtype: object
# convert it back to a float
tips['total_bill'] = tips['total_bill'].astype(float)
print(tips.dtypes)
total_bill        float64
tip               float64
sex              category
smoker           category
day              category
time             category
size                int64
sex_str            object
dtype: object

10.2.2.1 The to_numeric() Function

When converting variables into numeric values (e.g., int, float), you can also use the Pandas to_numeric() function, which handles non-numeric values better.

Since all the values in a dataframe column must share the same dtype, there will be times when a column that should be numeric contains strings as some of its values. For example, instead of the NaN value that represents a missing value in Pandas, a numeric column might use the string 'missing' or 'null'. This would make the entire column a string object type instead of a numeric type.

Let’s subset our tips dataframe and also put in a 'missing' value in the total_bill column to illustrate how the to_numeric() function works.

# subset the tips data
tips_sub_miss = tips.head(10).copy()
# assign some 'missing' values
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'
print(tips_sub_miss)
  total_bill  tip    sex smoker day   time size sex_str
0      16.99 1.01 Female     No Sun Dinner    2  Female
1    missing 1.66   Male     No Sun Dinner    3    Male
2      21.01 3.50   Male     No Sun Dinner    3    Male
3    missing 3.31   Male     No Sun Dinner    2    Male
4      24.59 3.61 Female     No Sun Dinner    4  Female
5    missing 4.71   Male     No Sun Dinner    4    Male
6       8.77 2.00   Male     No Sun Dinner    2    Male
7    missing 3.12   Male     No Sun Dinner    4    Male
8      15.04 1.96   Male     No Sun Dinner    2    Male
9      14.78 3.23   Male     No Sun Dinner    2    Male

Looking at the dtypes, you will see that the total_bill column is now a string object.

print(tips_sub_miss.dtypes)
total_bill         object
tip               float64
sex              category
smoker           category
day              category
time             category
size                int64
sex_str            object
dtype: object

If we now try to use the .astype() method to convert the column back to a float, we will get an error: Pandas does not know how to convert 'missing' into a float.

# this will cause an error
tips_sub_miss['total_bill'].astype(float)
ValueError: could not convert string to float: 'missing'

If we use the to_numeric() function from the pandas library, we get a similar error.

# this will cause an error
pd.to_numeric(tips_sub_miss['total_bill'])
ValueError: Unable to parse string "missing" at position 1

The to_numeric() function has a parameter called errors that governs what happens when the function encounters a value that it is unable to convert to a numeric value. By default, this value is set to 'raise'; that is, if it does encounter a value it is unable to convert to a numeric value, it will 'raise' an error.

Based on the documentation:2

2. to_numeric() function documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

  • If ‘raise’, then invalid parsing will raise an exception

  • If ‘coerce’, then invalid parsing will be set as NaN

  • If ‘ignore’, then invalid parsing will return the input

Going out of order from the documentation, if we pass errors the 'ignore' value, nothing will change in our column. But we also do not get an error message, which may not always be the behavior we want.

tips_sub_miss["total_bill"] = pd.to_numeric(
    tips_sub_miss["total_bill"], errors="ignore"
)

print(tips_sub_miss)
  total_bill  tip    sex smoker day   time size sex_str
0      16.99 1.01 Female     No Sun Dinner    2  Female
1    missing 1.66   Male     No Sun Dinner    3    Male
2      21.01 3.50   Male     No Sun Dinner    3    Male
3    missing 3.31   Male     No Sun Dinner    2    Male
4      24.59 3.61 Female     No Sun Dinner    4  Female
5    missing 4.71   Male     No Sun Dinner    4    Male
6       8.77 2.00   Male     No Sun Dinner    2    Male
7    missing 3.12   Male     No Sun Dinner    4    Male
8      15.04 1.96   Male     No Sun Dinner    2    Male
9      14.78 3.23   Male     No Sun Dinner    2    Male
print(tips_sub_miss.dtypes)
total_bill     object
tip           float64
sex          category
smoker       category
day          category
time         category
size            int64
sex_str        object
dtype: object
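Newer Pandas releases (2.2 and later) mark errors='ignore' as deprecated and suggest catching the exception explicitly instead. The sketch below is one way to get the same "leave the input unchanged" behavior. It uses a small made-up Series rather than the tips data.

```python
import pandas as pd

# made-up values; one entry cannot be parsed as a number
values = pd.Series(["16.99", "missing", "21.01"])

try:
    result = pd.to_numeric(values)
except ValueError:
    # conversion failed, so keep the original values unchanged
    result = values

print(result.tolist())  # ['16.99', 'missing', '21.01']
```

Unlike errors='ignore', this pattern makes it explicit in the code that a failed conversion is being tolerated.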

In contrast, if we pass in the 'coerce' value, we will get NaN values for the 'missing' string.

tips_sub_miss["total_bill"] = pd.to_numeric(
    tips_sub_miss["total_bill"], errors="coerce"
)
print(tips_sub_miss)
   total_bill  tip    sex smoker day   time size sex_str
0       16.99 1.01 Female     No Sun Dinner    2  Female
1         NaN 1.66   Male     No Sun Dinner    3    Male
2       21.01 3.50   Male     No Sun Dinner    3    Male
3         NaN 3.31   Male     No Sun Dinner    2    Male
4       24.59 3.61 Female     No Sun Dinner    4  Female
5         NaN 4.71   Male     No Sun Dinner    4    Male
6        8.77 2.00   Male     No Sun Dinner    2    Male
7         NaN 3.12   Male     No Sun Dinner    4    Male
8       15.04 1.96   Male     No Sun Dinner    2    Male
9       14.78 3.23   Male     No Sun Dinner    2    Male
print(tips_sub_miss.dtypes)
total_bill    float64
tip           float64
sex          category
smoker       category
day          category
time         category
size            int64
sex_str        object
dtype: object

This is a useful trick when you know a column must contain numeric values, but for some reason the data include non-numeric values.
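Because 'coerce' silently replaces anything it cannot parse, it can be worth checking which values were affected. The sketch below shows one possible way to do that on a made-up Series. The variable names are illustrative.

```python
import pandas as pd

# made-up values with two non-numeric placeholders
raw = pd.Series(["16.99", "missing", "21.01", "null"])
converted = pd.to_numeric(raw, errors="coerce")

# values that were present in the raw data but became NaN after conversion
failed = raw[converted.isna() & raw.notna()]

print(converted.dtype)   # float64
print(failed.tolist())   # ['missing', 'null']
```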

10.3 Categorical Data

Not all data values are numeric. Pandas has a category dtype that can encode categorical values.3 Here are a few use cases for categorical data:

3. Categorical data: https://pandas.pydata.org/docs/user_guide/categorical.html

  • It can be memory and speed efficient to store data in this manner, especially if the data set includes many repeated string values

  • Categorical data may be appropriate when a column of values has an order (e.g., a Likert scale)

  • Some Python libraries understand how to deal with categorical data (e.g., when fitting statistical models)

10.3.1 Convert to Category

To convert a column into a categorical type, we pass the string 'category' into the .astype() method.

# convert the sex column into a string object first
tips['sex'] = tips['sex'].astype('str')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #  Column     Non-Null Count Dtype
--- ------     -------------- -----
 0  total_bill 244 non-null   float64
 1  tip        244 non-null   float64
 2  sex        244 non-null   object
 3  smoker     244 non-null   category
 4  day        244 non-null   category
 5  time       244 non-null   category
 6  size       244 non-null   int64
 7  sex_str    244 non-null   object
dtypes: category(3), float64(2), int64(1), object(2)
memory usage: 10.8+ KB
None
# convert the sex column back into categorical data
tips['sex'] = tips['sex'].astype('category')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #  Column       Non-Null Count Dtype
--- ------       -------------- -----
 0  total_bill   244 non-null   float64
 1  tip          244 non-null   float64
 2  sex          244 non-null   category
 3  smoker       244 non-null   category
 4  day          244 non-null   category
 5  time         244 non-null   category
 6  size         244 non-null   int64
 7  sex_str      244 non-null   object
dtypes: category(4), float64(2), int64(1), object(1)
memory usage: 9.3+ KB
None

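The .info() output also shows the difference in memory usage between the two storage types: 10.8+ KB when the sex column is stored as a string object, versus 9.3+ KB when it is stored as a category.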
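To compare the memory used by a single column directly, one option is the Series .memory_usage() method. The sketch below uses a made-up column of repeated labels. The exact byte counts depend on your Pandas version, so none are shown.

```python
import pandas as pd

# made-up column with many repeated string values
labels = pd.Series(["Female", "Male"] * 5000)
as_category = labels.astype("category")

# deep=True also counts the memory used by the string values themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(category_bytes < string_bytes)  # True
```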

10.3.2 Manipulating Categorical Data

The .cat accessor is an attribute that allows you to access the category information in a Series. The API reference lists the operations that can be performed on a categorical Series;4 that list has been reproduced in Table 10.1.

4. The .cat. accessor: https://pandas.pydata.org/docs/reference/series.html#categorical-accessor

Table 10.1 Categorical Accessor Attributes and Methods

Attribute or Method                     Description
--------------------------------------  ---------------------------------------
Series.cat.categories                   The categories
Series.cat.ordered                      Whether the categories are ordered
Series.cat.codes                        Return the integer code of the category
Series.cat.rename_categories()          Rename categories
Series.cat.reorder_categories()         Reorder categories
Series.cat.add_categories()             Add new categories
Series.cat.remove_categories()          Remove categories
Series.cat.remove_unused_categories()   Remove unused categories
Series.cat.set_categories()             Set new categories
Series.cat.as_ordered()                 Make the category ordered
Series.cat.as_unordered()               Make the category unordered
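As a short sketch of how a few of these attributes and methods fit together, the example below works on a made-up Series of sizes (not the tips data). It reorders the categories, marks them as ordered, and then renames them.

```python
import pandas as pd

# made-up categorical data
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")

# by default the categories are sorted alphabetically and are unordered
print(sizes.cat.categories.tolist())  # ['large', 'medium', 'small']

# give the categories a meaningful order
ordered = sizes.cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)
print(ordered.cat.ordered)         # True
print(ordered.cat.codes.tolist())  # [0, 2, 1, 0]
print(ordered.max())               # large

# rename the categories using a mapping of old names to new names
renamed = ordered.cat.rename_categories(
    {"small": "S", "medium": "M", "large": "L"}
)
print(renamed.tolist())  # ['S', 'L', 'M', 'S']
```

Each of these methods returns a new Series, so the original sizes Series is left unchanged.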

Conclusion

This chapter covered how to convert from one data type to another. A column’s dtype governs which operations can and cannot be performed on it. While this chapter is relatively short, converting types is an important skill when you are working with data and when you are using other Pandas methods.
