- Read in the college dataset. The columns that begin with UGDS_ hold the proportion of undergraduate students of a particular race, expressed as a value between 0 and 1. Use the filter method to select these columns:
>>> college = pd.read_csv('data/college.csv', index_col='INSTNM')
>>> college_ugds_ = college.filter(like='UGDS_')
>>> college_ugds_.head()
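The selection can be tried without the CSV file. The following is a minimal sketch on a hypothetical two-row DataFrame (the school names, values, and the CITY and SATVRMID columns are made up for illustration). It also shows that like= matches a substring anywhere in the label, whereas regex='^UGDS_' anchors the match to the start of the label:

```python
import pandas as pd

# Hypothetical toy data; only the UGDS_ prefix mirrors the college dataset
toy = pd.DataFrame(
    {
        "CITY": ["Town A", "Town B"],
        "UGDS_WHITE": [0.03, 0.59],
        "UGDS_BLACK": [0.94, 0.26],
        "SATVRMID": [424.0, 570.0],
    },
    index=pd.Index(["School A", "School B"], name="INSTNM"),
)

# like= keeps every column whose label contains the substring
ugds_like = toy.filter(like="UGDS_")

# regex= with ^ keeps only labels that start with the prefix
ugds_regex = toy.filter(regex="^UGDS_")

print(list(ugds_like.columns))
print(list(ugds_regex.columns))
```

For this toy data both calls select the same two columns; they would differ only if some label contained UGDS_ somewhere other than at the start.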
- Now that the DataFrame contains homogeneous column data, operations make sense both vertically (down each column) and horizontally (across each row). The count method returns the number of non-missing values. By default, its axis parameter is set to 0, so it counts down each column:
>>> college_ugds_.count()
UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
Because the axis parameter defaults to 0, there is no need to pass it explicitly. For the purposes of understanding, though, Step 2 is equivalent to both college_ugds_.count(axis=0) and college_ugds_.count(axis='index').
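The equivalence of the three spellings can be checked on a small example. This is a sketch on a hypothetical DataFrame with a few missing values, not the college data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.5, np.nan, 0.7],
        "UGDS_BLACK": [0.5, 0.2, np.nan],
        "UGDS_HISP": [np.nan, np.nan, 0.3],
    }
)

# All three calls count non-missing values down each column
default = toy.count()
by_zero = toy.count(axis=0)
by_name = toy.count(axis="index")

print(default)
print(default.equals(by_zero) and default.equals(by_name))
```

The result has one entry per column, and each entry is the number of rows in which that column is not missing.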
- Changing the axis parameter to 1 or 'columns' transposes the operation, so that each row of data gets a count of its non-missing values:
>>> college_ugds_.count(axis='columns').head()
INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
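A row-wise count is easier to see when some rows actually have gaps. The sketch below uses a hypothetical three-row DataFrame (school names and values are invented) and confirms that axis=1 and axis='columns' give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; each row is missing a different number of values
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.5, np.nan, 0.7],
        "UGDS_BLACK": [0.5, 0.2, np.nan],
        "UGDS_HISP": [np.nan, np.nan, 0.3],
    },
    index=pd.Index(["School A", "School B", "School C"], name="INSTNM"),
)

# One count per row: how many columns are non-missing for that school
row_counts = toy.count(axis="columns")
same_as_one = row_counts.equals(toy.count(axis=1))

print(row_counts)
print(same_as_one)
```

The result is indexed by the row labels rather than the column labels, which is why INSTNM appears above the output in the step.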
- Instead of counting non-missing values, we can sum all the values in each row. Because the values in a row are proportions of the same student body, each row should add up to 1, or very nearly so. The sum method may be used to verify this:
>>> college_ugds_.sum(axis='columns').head()
INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
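Since a total such as 0.9999 is close to but not exactly 1, an exact equality test would reject it. One way to verify the totals is to compare them with a tolerance. The following sketch uses hypothetical data and an assumed tolerance of 0.001; it also shows that sum skips missing values by default, so a row with no data sums to 0 rather than NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an exact row, a rounded row, an incomplete row,
# and a row with no data at all
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.25, 0.3333, 0.5, np.nan],
        "UGDS_BLACK": [0.25, 0.3333, 0.2, np.nan],
        "UGDS_HISP": [0.50, 0.3333, 0.1, np.nan],
    },
    index=pd.Index(["School A", "School B", "School C", "School D"], name="INSTNM"),
)

row_totals = toy.sum(axis="columns")

# The tolerance of 1e-3 is an assumption chosen to absorb rounding in the data
close_to_one = pd.Series(
    np.isclose(row_totals, 1.0, atol=1e-3), index=row_totals.index
)

print(row_totals)
print(close_to_one)
```

Rows flagged False are worth inspecting: they either contain missing values or have categories that do not cover every student.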
- To get an idea of the distribution of each column, the median method can be used:
>>> college_ugds_.median(axis='index')
UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
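A median of 0, as reported for UGDS_NHPI and UGDS_NRA, means that at least half of the non-missing values in that column are 0, since proportions cannot be negative. The sketch below illustrates this on hypothetical data and contrasts the median with the mean, which is pulled upward by a single larger value:

```python
import pandas as pd

# Hypothetical toy data: one evenly spread column, one mostly-zero column
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.1, 0.5, 0.9],
        "UGDS_NHPI": [0.0, 0.0, 0.3],
    }
)

# axis='index' computes one median per column (the default direction)
medians = toy.median(axis="index")
means = toy.mean(axis="index")

print(medians)
print(means)
```

Comparing the two statistics column by column gives a rough sense of skew: where the mean is well above the median, a few schools account for most of that group's share.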