How it works...

Pandas defaults integer and float data types to 64 bits regardless of the maximum necessary size for the particular DataFrame. Integers, floats, and even booleans may be coerced to a different data type with the astype method and passing it the exact type, either as a string or specific object, as done in step 4.

The RELAFFIL column is a good choice to cast to a smaller integer type as the data dictionary explains that its values must be 0/1. The memory for RELAFFIL is now an eighth of CURROPER, which remains as its former type.

The memory units displayed are in bytes and not bits. One byte is equivalent to 8 bits, so when RELAFFIL was changed to an 8-bit integer, it uses one 1 byte of memory and as there are 7,535 rows, its memory footprint is equivalent to 7,535 bytes.

Columns that are object data type, such as INSTNM, are not like the other pandas data types. For all the other pandas data types, each value in that column is the same data type. For instance, when a column has the int64 type, every individual column value is also int64. This is not true for columns that are object data type. Each individual column value can be of any type. Object data types can have a mix of strings, numerics, datetimes, or even other Python objects such as lists or tuples. For this reason, the object data type is sometimes referred to as a catch-all for a column of data that doesn't match any of the other data types. The vast majority of the time, though, object data type columns will all be strings.

Relational database management systems such as Microsoft's SQL Server or PostgreSQL have specific data types for characters such as varchar, text, or nchar that also usually specify a maximum number of characters. Pandas object data type is a much broader data type. Every value in an object column can be of any data type.

Therefore, the memory of each individual value in an object data type column is inconsistent. There is no predefined amount of memory for each value like the other data types. For pandas to extract the exact amount of memory of an object data type column, the deep parameter must be set to True in the memory_usage method.

Object columns are targets for the largest memory savings. Pandas has an additional categorical data type that is not available in NumPy. When converting to category, pandas internally creates a mapping from integers to each unique string value. Thus, each string only needs to be kept a single time in memory. As you can see, this simple change of data type reduced memory usage by 97%.

You might also have noticed that the index uses an extremely low amount of memory. If no index is specified during DataFrame creation, as is the case in this recipe, pandas defaults the index to a RangeIndex. The RangeIndex is very similar to the built-in range function. It produces values on demand and only stores the minimum amount of information needed to create an index.

Table of Contents for How it works...

Create new playlist

Sign In

Sign Up

Table of Contents for
How it works...