Indexing Series and DataFrame objects

Retrieving data from a pd.Series, given its key, can be done intuitively by indexing the pd.Series.loc attribute:

    effective_series.loc["a"]
    # Result:
    # True

It is also possible to access an element, given its position in the underlying array, using the pd.Series.iloc attribute:

    effective_series.iloc[0]
    # Result:
    # True
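The two lookups above can be tried end to end with a small stand-in series. This is a minimal sketch: the data below is hypothetical, since the original effective_series is defined elsewhere, and only the loc and iloc calls mirror the text.

```python
import pandas as pd

# Hypothetical stand-in for `effective_series` (the real one is built elsewhere).
effective_series = pd.Series([True, True, False], index=["a", "b", "c"])

by_key = effective_series.loc["a"]       # label-based lookup
by_position = effective_series.iloc[0]   # position-based lookup

print(by_key, by_position)
```

Here both lookups reach the same element because the label "a" happens to sit at position 0.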

In pandas versions before 1.0, you can also use the pd.Series.ix attribute for mixed access (ix was deprecated in pandas 0.20 and removed in 1.0). If the key is not an integer, ix tries to match by key; otherwise, it extracts the element at the position indicated by the integer. In those versions, indexing the pd.Series directly behaves similarly. The following example demonstrates these concepts:

    effective_series.ix["a"] # By key
    effective_series.ix[0] # By position

    # Equivalent
    effective_series["a"] # By key
    effective_series[0] # By position

Note that if the index is made of integers, both ix and direct indexing fall back to key-only matching (like loc). To index by position in this scenario, iloc is your only option.
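The difference between keys and positions is easiest to see when the integer labels do not coincide with the positions. The following sketch uses a hypothetical series (int_series is not part of the original example) to show that loc matches labels while iloc matches positions:

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions.
int_series = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = int_series.loc[10]      # label 10 is the first element
by_position = int_series.iloc[0]   # position 0 is also the first element

# There is no label 0, so a label-based lookup raises KeyError.
try:
    int_series.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```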

Indexing pd.DataFrame works in a similar way. For example, you can use pd.DataFrame.loc to extract a row by key, and you can use pd.DataFrame.iloc to extract a row by position:

    df.loc["a"]
    df.iloc[0]
    # Result:
    # dia_final 70
    # dia_initial 75
    # sys_final 115
    # sys_initial 120
    # Name: a, dtype: int64

An important aspect is that the return type in this case is a pd.Series, where each column is a new key. In order to retrieve a specific row and column, you can use the following code. The loc attribute will index both row and column by key, while the iloc version will index row and column by an integer:

    df.loc["a", "sys_initial"] # is equivalent to
    df.loc["a"].loc["sys_initial"]

    df.iloc[0, 1] # is equivalent to
    df.iloc[0].iloc[1]
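As a self-contained illustration of row and cell access, the sketch below rebuilds a small DataFrame. Row "a" uses the values printed above; row "b" is invented purely for illustration, and the original df is assumed to have the same column order.

```python
import pandas as pd

# Row "a" matches the output shown above; row "b" is hypothetical.
df = pd.DataFrame(
    {
        "dia_final": [70, 82],
        "dia_initial": [75, 85],
        "sys_final": [115, 123],
        "sys_initial": [120, 126],
    },
    index=["a", "b"],
)

row = df.loc["a"]                      # a pd.Series keyed by column name
by_label = df.loc["a", "sys_initial"]  # row label and column label
by_position = df.iloc[0, 1]            # row 0, column 1 ("dia_initial")

print(type(row).__name__, by_label, by_position)
```

Passing both indices in a single call, as in df.loc["a", "sys_initial"], avoids building the intermediate row Series that the chained form creates.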

The ix attribute (removed in pandas 1.0) makes it convenient to mix label-based and position-based indexing on a pd.DataFrame. For example, retrieving the "sys_initial" column for the row at position 0 can be accomplished as follows:

    df.ix[0, "sys_initial"] 
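On pandas versions where ix is no longer available, the same mixed lookup can be expressed by converting one of the two indices so that both are labels or both are positions. The sketch below shows two possible spellings; the DataFrame is the same hypothetical stand-in as before (only row "a" comes from the text).

```python
import pandas as pd

# Hypothetical data; only row "a" is taken from the text.
df = pd.DataFrame(
    {
        "dia_final": [70, 82],
        "dia_initial": [75, 85],
        "sys_final": [115, 123],
        "sys_initial": [120, 126],
    },
    index=["a", "b"],
)

# Option 1: turn the row position into a label, then use loc.
value_loc = df.loc[df.index[0], "sys_initial"]

# Option 2: turn the column label into a position, then use iloc.
value_iloc = df.iloc[0, df.columns.get_loc("sys_initial")]

print(value_loc, value_iloc)
```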

Retrieving a column from a pd.DataFrame by name can be achieved by regular indexing or attribute access. To retrieve a column by position, you can either use iloc or use the pd.DataFrame.columns attribute to look up the name of the column:

    # Retrieve column by name
    df["sys_initial"] # Equivalent to
    df.sys_initial

    # Retrieve column by position
    df[df.columns[2]] # Equivalent to
    df.iloc[:, 2]

These indexers also support more advanced, NumPy-style indexing with boolean arrays, lists, and integer arrays.
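A brief sketch of these advanced forms follows. The DataFrame is hypothetical (rows "b" and "c" are invented); the point is only to show a boolean mask, a list of labels, and a list of integer positions used as indices.

```python
import pandas as pd

# Hypothetical data; only row "a" is taken from the text.
df = pd.DataFrame(
    {
        "dia_final": [70, 82, 100],
        "dia_initial": [75, 85, 104],
        "sys_final": [115, 123, 130],
        "sys_initial": [120, 126, 150],
    },
    index=["a", "b", "c"],
)

# Boolean indexing: keep the rows where the condition holds.
mask = df["sys_initial"] > 120
high = df.loc[mask]

# Lists of labels select several rows and columns at once.
subset = df.loc[["a", "c"], ["sys_initial", "sys_final"]]

# Lists of integer positions work the same way with iloc.
by_position = df.iloc[[0, 2], [0, 1]]

print(high.index.tolist(), subset.shape, by_position.columns.tolist())
```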

Now it's time for some performance considerations. There are some differences between an index in Pandas and a dictionary. For example, while the keys of a dictionary cannot contain duplicates, Pandas indexes can contain repeated elements. This flexibility, however, comes at a cost: if we try to access an element in a non-unique index, we may incur a substantial performance loss, because the access will be O(N), like a linear search, rather than O(1), like a dictionary.

A way to mitigate this effect is to sort the index; this will allow Pandas to use a binary search algorithm with a computational complexity of O(log(N)), which is much better. This can be accomplished using the pd.Series.sort_index method, as in the following code (the same applies for pd.DataFrame):

    import pandas as pd

    # Create a series with a duplicate (non-unique) index
    index = list(range(1000)) + list(range(1000))

    # Accessing an element of this unsorted series is an O(N) operation
    series = pd.Series(range(2000), index=index)

    # Sorting the index improves look-up scaling to O(log(N))
    series.sort_index(inplace=True)
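To check which of these cases applies to a given index, you can inspect the is_unique and is_monotonic_increasing properties of the index. The following is a minimal sketch that reuses the series from the example above and sorts into a new object (sorted_series is a name introduced here) instead of sorting in place:

```python
import pandas as pd

# Same construction as above: every label appears twice.
index = list(range(1000)) + list(range(1000))
series = pd.Series(range(2000), index=index)

print(series.index.is_unique)                # duplicates present
print(series.index.is_monotonic_increasing)  # labels are not yet sorted

# sort_index returns a new, sorted series when inplace is not set.
sorted_series = series.sort_index()
print(sorted_series.index.is_monotonic_increasing)

# A lookup on a duplicated label returns every matching element.
matches = sorted_series.loc[5]
print(sorted(matches.tolist()))
```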

The timings for the different versions are summarized in the following table:

Index type             N=10000   N=20000   N=30000   Complexity
Unique                   12.30     12.58     13.30   O(1)
Non unique              494.95    814.10   1129.95   O(N)
Non unique (sorted)     145.93    145.81    145.66   O(log(N))
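A comparison of this kind could be reproduced along the following lines. This is only a sketch under stated assumptions: the helper lookup_time, the series sizes, and the use of timeit are choices made here, not the procedure used to produce the table, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd


def lookup_time(series, key, number=200):
    """Return the average time in seconds of one `series.loc[key]` lookup."""
    return timeit.timeit(lambda: series.loc[key], number=number) / number


n = 10000  # hypothetical total number of elements
half = n // 2

unique = pd.Series(range(n), index=list(range(n)))
non_unique = pd.Series(range(n), index=list(range(half)) + list(range(half)))
non_unique_sorted = non_unique.sort_index()

timings = {
    "unique": lookup_time(unique, half - 1),
    "non unique": lookup_time(non_unique, half - 1),
    "non unique (sorted)": lookup_time(non_unique_sorted, half - 1),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.2f} microseconds per lookup")
```

Repeating the measurement for several values of n shows how each variant scales.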