There's more...

To get a better idea of how object data type columns differ from integers and floats, a single value from each one of these columns can be modified and the resulting memory usage displayed. The CURROPER and INSTNM columns are of int64 and object types, respectively:

>>> college.loc[0, 'CURROPER'] = 10000000
>>> college.loc[0, 'INSTNM'] = college.loc[0, 'INSTNM'] + 'a'
>>> college[['CURROPER', 'INSTNM']].memory_usage(deep=True)
Index           80
CURROPER     60280
INSTNM      660345

Memory usage for CURROPER remained the same since a 64-bit integer is more than enough space for the larger number. On the other hand, the memory usage for INSTNM increased by 105 bytes by just adding a single letter to one value.

Python 3 strings are Unicode, a standardized character representation intended to encode all the world's writing systems. A Unicode character can take up to 4 bytes. It seems that pandas incurs some overhead (100 bytes) the first time a string value is modified. After that, each additional character costs 5 bytes.
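The comparison can be reproduced without the college dataset. The following is a minimal sketch on a small, made-up DataFrame (the values are hypothetical, only the column names are borrowed from the recipe). The exact byte counts depend on the Python and pandas versions, so only the direction of the change is checked:

```python
import pandas as pd

# Hypothetical stand-in for the college data: one int64 and one object column.
df = pd.DataFrame({
    'CURROPER': pd.Series([1, 0, 1], dtype='int64'),
    'INSTNM': pd.Series(['Alpha College', 'Beta University', 'Gamma Institute'],
                        dtype=object),
})

before = df.memory_usage(deep=True)

# Store a much larger integer and append one letter to one string.
df.loc[0, 'CURROPER'] = 10_000_000
df.loc[0, 'INSTNM'] = df.loc[0, 'INSTNM'] + 'a'

after = df.memory_usage(deep=True)

# Per-column change in bytes: zero for the fixed-width integer column,
# positive for the object column whose string grew.
change = after - before
print(change)
```

The integer column has a fixed width of 8 bytes per value, so its footprint cannot change, whereas the object column is measured string by string.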

Not all columns can be coerced to the desired type. Take a look at the MENONLY column, which, according to the data dictionary, appears to contain only 0/1 values. The actual data type of this column upon import unexpectedly turns out to be float64. The reason is that the column contains missing values, denoted by np.nan. NumPy's integer types have no representation for missing values, so any NumPy-backed numeric column with even a single missing value must be a float. Furthermore, any column of an integer data type will automatically be coerced to a float if one of the values becomes missing:

>>> college['MENONLY'].dtype
dtype('float64')

>>> college['MENONLY'].astype(np.int8)
ValueError: Cannot convert non-finite values (NA or inf) to integer
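The coercion and the failed cast can be seen on a toy Series. This is a sketch with made-up values, not the college data. The last line uses the nullable 'Int8' extension type, which is an addition not covered in this recipe and assumes a pandas version that provides it (0.24 or later):

```python
import numpy as np
import pandas as pd

ints = pd.Series([0, 1, 0], dtype='int64')

# Reindexing to a label that does not exist introduces a missing value,
# which forces the integer column to become float64.
with_missing = ints.reindex([0, 1, 2, 3])
print(ints.dtype, with_missing.dtype)

# Casting back to a NumPy integer type fails while NaN is present.
try:
    with_missing.astype(np.int8)
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(err)

# Assumption: the nullable extension type 'Int8' is available; it stores
# the missing entry as <NA> instead of requiring a float column.
nullable = with_missing.astype('Int8')
print(nullable)
```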

Additionally, it is possible to substitute string names in place of Python objects when referring to data types. For instance, the include parameter of the describe DataFrame method accepts a list of either the formal NumPy/pandas objects or their equivalent string representations. These are available in the table at the beginning of the Selecting columns with methods recipe in Chapter 2, Essential DataFrame Operations. Each of the following produces the same result:

>>> college.describe(include=['int64', 'float64']).T
>>> college.describe(include=[np.int64, np.float64]).T
>>> college.describe(include=['int', 'float']).T
>>> college.describe(include=['number']).T
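The equivalence can be checked directly by comparing the resulting DataFrames. The sketch below uses a small hypothetical DataFrame with one int64, one float64, and one object column. The 'int'/'float' spelling is left out because its meaning can depend on the platform and NumPy version:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one column per kind of dtype.
df = pd.DataFrame({
    'visits': pd.Series([1, 2, 3], dtype='int64'),
    'rate': pd.Series([0.5, 1.5, 2.5], dtype='float64'),
    'name': pd.Series(['a', 'b', 'c'], dtype=object),
})

by_string = df.describe(include=['int64', 'float64']).T
by_object = df.describe(include=[np.int64, np.float64]).T
by_family = df.describe(include=['number']).T

# All three select the same numeric columns and yield the same summary.
print(by_string.equals(by_object), by_string.equals(by_family))
```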

These strings can be similarly used when changing types:

>>> college['MENONLY'] = college['MENONLY'].astype('float16')
>>> college['RELAFFIL'] = college['RELAFFIL'].astype('int8')
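As a quick check that the string and the NumPy object name the same type, the following sketch casts a made-up 0/1 Series both ways and compares the result and its memory footprint:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flag column stored as the default int64.
flags = pd.Series([0, 1, 1, 0], dtype='int64')

small_by_string = flags.astype('int8')
small_by_object = flags.astype(np.int8)

# Both spellings give the same dtype; each value now takes 1 byte instead of 8.
print(small_by_string.dtype == small_by_object.dtype)
print(flags.memory_usage(index=False), small_by_string.memory_usage(index=False))
```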

This equivalence between a string and the corresponding pandas or NumPy object occurs elsewhere in the pandas library and can be a source of confusion, as there are two different ways to refer to the same thing.

Lastly, it is possible to see the enormous memory difference between the minimal RangeIndex and Int64Index, which stores every row index in memory:

>>> college.index = pd.Int64Index(college.index)
>>> college.index.memory_usage() # previously was just 80
60280
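pd.Int64Index is not available in every pandas release (it was removed in pandas 2.0). The following sketch shows the same comparison in a way that should work on both older and newer versions, by building a plain Index from the materialized labels. The DataFrame is hypothetical, and the exact size reported for the RangeIndex varies by version:

```python
import pandas as pd

# Hypothetical DataFrame with the default RangeIndex.
df = pd.DataFrame({'x': range(1000)})
was_range = isinstance(df.index, pd.RangeIndex)
range_bytes = df.index.memory_usage()

# Materialize every label into an int64 array and use it as the index.
df.index = pd.Index(df.index.to_numpy())
materialized_bytes = df.index.memory_usage()

# The RangeIndex only stores start, stop, and step; the materialized
# index stores 8 bytes per row.
print(range_bytes, materialized_bytes)
```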
