How it works...

After importing your dataset, a common task is to print out the first few rows of the DataFrame for manual inspection with the head method. The shape attribute returns the first piece of metadata, a tuple containing the number of rows and columns.
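As a minimal sketch of that first inspection, assuming a small made-up DataFrame in place of the recipe's dataset (the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the imported dataset
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"],
    "age": [34, 29, 41, 25, 38, 30],
    "salary": [52000.0, 48000.0, None, 39000.0, 61000.0, 45000.0],
})

first_rows = df.head()       # first five rows by default
n_rows, n_cols = df.shape    # shape is an attribute, so no parentheses

print(first_rows)
print(n_rows, n_cols)
```

Passing an integer, as in df.head(3), changes how many rows are shown.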

The primary method to get the most metadata at once is the info method. It provides each column name, the number of non-missing values, the data type of each column, and the approximate memory usage of the DataFrame. For all DataFrames, the values within a column are always of one data type. The same holds for relational databases. A DataFrame as a whole, however, might be composed of columns with different data types.
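The following sketch shows one way to look at this metadata, again on a hypothetical DataFrame. By default info prints its report; the buf parameter is used here only to capture the same report as a string.

```python
import io

import pandas as pd

# Hypothetical data: one text column, one integer column, one float column
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"],
    "age": [34, 29, 41, 25, 38, 30],
    "salary": [52000.0, 48000.0, None, 39000.0, 61000.0, 45000.0],
})

df.info()                      # prints the report to standard output

buf = io.StringIO()
df.info(buf=buf)               # same report, captured as text
report = buf.getvalue()

column_types = df.dtypes       # one data type per column
non_missing = df.count()       # non-missing values per column
print(column_types)
print(non_missing)
```

Each column reports a single data type, while the DataFrame as a whole mixes several.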

Internally, pandas stores columns of the same data type together in blocks. For a deeper dive into pandas internals, see Jeff Tratner's slides (http://bit.ly/2xHIv1g).

Step 4 and step 5 produce univariate descriptive statistics on different types of columns. The powerful describe method produces different output based on the data types provided to the include parameter. By default, describe outputs a summary for all the numeric (mostly continuous) columns and silently drops any categorical columns. You may use np.number or the string number to include both integers and floats in the summary. Technically, the data types are part of a hierarchy where number resides above integers and floats. Take a look at the following diagram to understand the NumPy data type hierarchy better:
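The hierarchy can also be probed directly in code. The sketch below, which assumes a small hypothetical DataFrame, uses np.issubdtype to check where a type sits relative to np.number and then compares the default describe output with an explicit numeric selection.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one non-numeric and two numeric columns
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"],
    "age": [34, 29, 41, 25, 38, 30],
    "salary": [52000.0, 48000.0, None, 39000.0, 61000.0, 45000.0],
})

# Integers and floats both sit below np.number; object does not
int_is_number = np.issubdtype(np.int64, np.number)
float_is_number = np.issubdtype(np.float64, np.number)
object_is_number = np.issubdtype(np.object_, np.number)

default_summary = df.describe().T                       # numeric columns only
numeric_summary = df.describe(include=[np.number]).T    # explicit equivalent

print(default_summary)
```

In both summaries the name column is absent, because it is not numeric.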

Broadly speaking, we can classify data as being either continuous or categorical. Continuous data is always numeric and can usually take on an infinite number of possible values; examples include height, weight, and salary. Categorical data represents discrete values drawn from a finite set of possibilities, such as ethnicity, employment status, and car color. Categorical data can be represented numerically or with characters.

Categorical columns are usually going to be either of type object (spelled np.object in NumPy versions before 1.24, where that alias was removed) or pd.Categorical. Step 5 ensures that both of these types are represented. In both step 4 and step 5, the output DataFrame is transposed with the T attribute. This eases readability for DataFrames with many columns.
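A sketch of a categorical summary along these lines is shown below. The DataFrame is hypothetical, and the include list uses the built-in object type and the string "category" as an assumed equivalent of the types named above, not necessarily the exact call from step 5.

```python
import pandas as pd

# Hypothetical data: one object column, one Categorical column, one numeric
df = pd.DataFrame({
    "color": pd.Series(
        ["red", "blue", "red", "green", "red", "blue"], dtype=object
    ),
    "status": pd.Categorical(
        ["employed", "unemployed", "employed",
         "employed", "retired", "employed"]
    ),
    "age": [34, 29, 41, 25, 38, 30],
})

# Summarize only the categorical-like columns, one row per column
cat_summary = df.describe(include=[object, "category"]).T
print(cat_summary)
```

After transposing, each original column becomes a row with its count, number of unique values, most frequent value (top), and that value's frequency (freq).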
