How to do it...

  1. Read in the college dataset, create a separate DataFrame with STABBR as the index, and check whether the index is sorted:
>>> import pandas as pd
>>> college = pd.read_csv('data/college.csv')
>>> college2 = college.set_index('STABBR')
>>> college2.index.is_monotonic_increasing
False
  1. Sort the index from college2 and store it as another object:
>>> college3 = college2.sort_index()
>>> college3.index.is_monotonic_increasing
True
  1. Time the selection of the state of Texas (TX) from all three DataFrames:
>>> %timeit college[college['STABBR'] == 'TX']
1.43 ms ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit college2.loc['TX']
526 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit college3.loc['TX']
183 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  1. The sorted index is nearly an order of magnitude faster than boolean selection (183 µs versus 1.43 ms), and even the unsorted index is almost three times faster. Let's now turn to unique indexes. For this, we use the institution name as the index:
>>> college_unique = college.set_index('INSTNM')
>>> college_unique.index.is_unique
True
  1. Let's select Stanford University with boolean indexing: 
>>> college[college['INSTNM'] == 'Stanford University']
  1. Let's select Stanford University with index selection:
>>> college_unique.loc['Stanford University']
CITY                  Stanford
STABBR                      CA
HBCU                         0
                        ...
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, dtype: object
  1. Both selections return the same data but as different objects: boolean indexing returns a one-row DataFrame, while .loc on the unique index returns a Series. Let's time each approach:
>>> %timeit college[college['INSTNM'] == 'Stanford University']
1.3 ms ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit college_unique.loc['Stanford University']
157 µs ± 682 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
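
The comparison above can also be tried without the college file. The following is a minimal sketch, assuming a small synthetic DataFrame that reuses only the STABBR and INSTNM column names from the recipe. The school names, the UGDS values, and the bench helper are hypothetical stand-ins, and the printed timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the college dataset: synthetic rows that reuse
# the STABBR and INSTNM column names from the recipe.
rng = np.random.default_rng(0)
n = 5_000
states = np.array(['CA', 'NY', 'TX', 'FL', 'WA'])
df = pd.DataFrame({
    'INSTNM': [f'School {i}' for i in range(n)],
    'STABBR': rng.choice(states, size=n),
    'UGDS': rng.integers(100, 50_000, size=n),
})

by_state = df.set_index('STABBR')          # non-unique, unsorted index
by_state_sorted = by_state.sort_index()    # non-unique, sorted index
by_name = df.set_index('INSTNM')           # unique index

# The three state selections hold the same rows (row order may differ).
bool_tx = df[df['STABBR'] == 'TX']
loc_tx = by_state.loc['TX']
sorted_tx = by_state_sorted.loc['TX']

# Boolean indexing gives a one-row DataFrame; .loc on a unique index
# gives a Series.
row_df = df[df['INSTNM'] == 'School 42']
row_s = by_name.loc['School 42']


def bench(fn, number=200):
    """Return the average seconds per call of fn over `number` calls."""
    return timeit.timeit(fn, number=number) / number


timings = {
    'boolean mask on STABBR': bench(lambda: df[df['STABBR'] == 'TX']),
    'unsorted index .loc': bench(lambda: by_state.loc['TX']),
    'sorted index .loc': bench(lambda: by_state_sorted.loc['TX']),
    'boolean mask on INSTNM': bench(lambda: df[df['INSTNM'] == 'School 42']),
    'unique index .loc': bench(lambda: by_name.loc['School 42']),
}
for label, seconds in timings.items():
    print(f'{label:<26}{seconds * 1e6:10.1f} us')
```

Outside IPython, the standard library's timeit module stands in for the %timeit magic. On a frame this small the gaps between the approaches may be narrower than those shown in the recipe.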