How to do it...

  1. Load in the college dataset and execute the same operations as the previous recipe to get only the numeric columns that are of interest:
>>> college = pd.read_csv('data/college.csv', index_col='INSTNM')
>>> cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']

>>> for col in cols:
college[col] = pd.to_numeric(college[col], errors='coerce')

>>> college_n = college.select_dtypes(include=[np.number])
>>> criteria = college_n.nunique() == 2
>>> binary_cols = college_n.columns[criteria].tolist()
>>> college_n = college_n.drop(labels=binary_cols, axis='columns')
  1. Find the maximum of each column with the max method:
>>> college_n.max().head()
SATVRMID 765.0 SATMTMID 785.0 UGDS 151558.0 UGDS_WHITE 1.0 UGDS_BLACK 1.0 dtype: float64
  1. Use the eq DataFrame method to test each value with its column max. By default, the eq method aligns the columns of the column DataFrame with the labels of the passed Series index:
>>> college_n.eq(college_n.max()).head()
  1. All the rows in this DataFrame that have at least one True value must contain a column maximum. Let's use the any method to find all such rows that have at least one True value:
>>> has_row_max = college_n.eq(college_n.max()).any(axis='columns')
>>> has_row_max.head()
INSTNM Alabama A & M University False University of Alabama at Birmingham False Amridge University False University of Alabama in Huntsville False Alabama State University False dtype: bool
  1. There are only 18 columns, which means that there should only be at most 18 True values in has_row_max. Let's find out how many there actually are:
>>> college_n.shape
(7535, 18)

>>> has_row_max.sum()
401
  1. This was a bit unexpected, but it turns out that there are columns with many rows that equal the maximum value. This is common with many of the percentage columns that have a maximum of 1. idxmax returns the first occurrence of the maximum value. Let's back up a bit, remove the any method, and look at the output from step 3. Let's run the cumsum method instead to accumulate all the True values. The first and last three rows are shown:
>>> college_n.eq(college_n.max()).cumsum()
  1. Some columns have one unique maximum like SATVRMID and SATMTMID, while others like UGDS_WHITE have many. 109 schools have 100% of their undergraduates as white. If we chain the cumsum method one more time, the value 1 would only appear once in each column and it would be the first occurrence of the maximum:
>>> college_n.eq(college_n.max()).cumsum().cumsum()
  1. We can now test the equality of each value against 1 with the eq method and then use the any method to find rows that have at least one True value:
>>> has_row_max2 = college_n.eq(college_n.max()) 
.cumsum()
.cumsum()
.eq(1)
.any(axis='columns')
>>> has_row_max2.head()
INSTNM Alabama A & M University False University of Alabama at Birmingham False Amridge University False University of Alabama in Huntsville False Alabama State University False dtype: bool
  1. Test that has_row_max2 has no more True values than the number of columns:
>>> has_row_max2.sum()
16
  1. We need all the institutions where has_row_max2 is True. We can simply use boolean indexing on the Series itself:
>>> idxmax_cols = has_row_max2[has_row_max2].index
>>> idxmax_cols
Index(['Thunderbird School of Global Management', 'Southwest University of Visual Arts-Tucson',
'ABC Beauty College Inc', 'Velvatex College of Beauty Culture', 'California Institute of Technology', 'Le Cordon Bleu College of Culinary Arts-San Francisco', 'MTI Business College Inc', 'Dongguk University-Los Angeles', 'Mr Leon's School of Hair Design-Moscow', 'Haskell Indian Nations University', 'LIU Brentwood', 'Medical College of Wisconsin', 'Palau Community College', 'California University of Management and Sciences', 'Cosmopolitan Beauty and Tech School',
'University of Phoenix-Arizona'], dtype='object', name='INSTNM')
  1. All 16 of these institutions are the index of the first maximum occurrence for at least one of the columns. We can check whether they are the same as the ones found with the idxmax method:
>>> set(college_n.idxmax().unique()) == set(idxmax_cols)
True
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.88.110