9. Apply

9.1 Introduction

Learning about apply is fundamental to the data cleaning process. It also encapsulates a key concept in programming: writing functions. apply takes a function and “applies” it (i.e., runs it) across each row or column of a dataframe. If you’ve programmed before, the concept should be familiar: it is similar to writing a for loop across each row or column and calling the function on each one; apply simply handles the iteration for you. In general, this is the preferred way to apply functions across dataframes, because it is typically much faster than writing an explicit for loop in Python.
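To preview the idea, here is a small sketch (using a squaring function like the one we will write shortly) showing that apply gives the same result as looping over a column yourself:

```python
import pandas as pd

# toy column of values (the same numbers used later in this chapter)
df = pd.DataFrame({'a': [10, 20, 30]})

def my_sq(x):
    return x ** 2

# the explicit loop: call the function once per element
looped = [my_sq(v) for v in df['a']]

# apply handles the iteration for us
applied = df['a'].apply(my_sq)

print(looped)            # [100, 400, 900]
print(applied.tolist())  # [100, 400, 900]
```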

Objectives

This chapter will cover:

1. Functions

2. Applying functions across columns or rows of data

9.2 Functions

Functions are core elements of writing apply statements. There’s a lot more information about functions in Appendix O, but here’s a quick introduction.

Functions are a way to group and reuse Python code. If you are ever in a situation where you are copying/pasting code and changing a few parts of the code, then chances are the copied code can be written into a function. To create a function, we need to “define” it. The basic function skeleton looks like this:

def my_function():
    # indent 4 spaces
    # function code here
    pass  # placeholder so the empty function is valid Python

Since Pandas is used for data analysis, let’s write some more “useful” functions: one that squares a given value and another that takes two numbers and calculates their average.

def my_sq(x):
    """Squares a given value
    """
    return x ** 2

def avg_2(x, y):
    """Calculates the average of 2 numbers
    """
    return (x + y) / 2

The text within the triple quotes is a docstring. It is the text that appears when you look up the help documentation about a function. You can use such docstrings to create your own documentation for functions you write as well.

We’ve been using functions throughout this book. If we want to use functions that we’ve created ourselves, we can call them just like functions we’ve loaded from a library.

print(my_sq(4))

16

print(avg_2(10, 20))

15.0

9.3 Apply (Basics)

Now that we know how to write a function, how do we use one in Pandas? When working with dataframes, it’s more likely that you will want to use a function across rows or columns of your data.

Here’s a mock dataframe of two columns.

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40

We can apply our functions over a Series (i.e., an individual column or row).

For didactic purposes, let’s use the function we wrote to square the 'a' column. In this overly simplified example, we could have directly squared the column.

print(df['a'] ** 2)

0   100
1   400
2   900
Name: a, dtype: int64

Of course, that would not allow us to use a function we wrote ourselves.

9.3.1 Apply Over a Series

In our example, if we subset a single column or row, the type of the object we get back is a Pandas Series.

# get the first column
print(type(df['a']))

<class 'pandas.core.series.Series'>

# get the first row
print(type(df.iloc[0]))

<class 'pandas.core.series.Series'>

The Series has a method called apply.1 To use the apply method, we pass the function we want to use across each element in the Series.

For example, if we wanted to square each value in column a, we can do the following:

# apply our square function on the 'a' column
sq = df['a'].apply(my_sq)
print(sq)

0   100
1   400
2   900
Name: a, dtype: int64

Note that we do not need the round brackets, (), when we pass the function into apply. Let’s build on this example by writing a function that takes two parameters: the first will be a value, and the second will be the exponent to raise the value to. So far, our my_sq function has “hard-coded” the exponent, 2.

def my_exp(x, e):
    return x ** e

Now if we want to use our function, we have to provide two parameters to it.

cb = my_exp(2, 3)
print(cb)

8

However, if we want to apply the function on our series, we will need to pass in the second parameter. To do this, we simply pass in the second argument as a keyword argument to apply:

ex = df['a'].apply(my_exp, e=2)
print(ex)

0    100
1    400
2    900
Name: a, dtype: int64

ex = df['a'].apply(my_exp, e=3)
print(ex)

0     1000
1     8000
2    27000
Name: a, dtype: int64

1. Series apply documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply
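Keyword arguments are not the only option; Series.apply also accepts an args tuple for extra positional arguments, which gives the same result:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

def my_exp(x, e):
    return x ** e

# arguments after the first are supplied positionally via args
ex = df['a'].apply(my_exp, args=(3,))
print(ex.tolist())  # [1000, 8000, 27000]
```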

9.3.2 Apply Over a DataFrame

Now that we’ve seen how to apply functions over a one-dimensional Series, let’s see how the syntax changes when we are working with DataFrames. Here is the example DataFrame from earlier:

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40

DataFrames typically have at least two dimensions. Thus, when we apply a function over a dataframe, we first need to specify which axis to apply the function over—for example, column-by-column or row-by-row.

Let’s first write a function that takes a single value and prints out the given value.

def print_me(x):
    print(x)

Let’s apply this function on our dataframe. The syntax is similar to using the apply method on a Series, but this time we need to specify whether we want the function to be applied column-wise or row-wise.

If we want the function to work column-wise, we can pass the axis=0 parameter into apply. If we want the function to work row-wise, we can pass the axis=1 parameter into apply.

9.3.2.1 Column-wise Operations

Use the axis=0 parameter (the default value) in apply when working with functions in a column-wise manner.

df.apply(print_me, axis=0)

0    10
1    20
2    30
Name: a, dtype: int64
0    20
1    30
2    40
Name: b, dtype: int64
a    None
b    None
dtype: object

Compare this output to the following:

print(df['a'])

0   10
1   20
2   30
Name: a, dtype: int64

print(df['b'])

0   20
1   30
2   40
Name: b, dtype: int64

You can see that the outputs are exactly the same. When you apply a function across a DataFrame (in this case, column-wise with axis=0), the entire axis (e.g., column) is passed into the first argument of the function. To illustrate this further, let’s write a function that calculates the mean (average) of three numbers (each column in our data set contains three values).

def avg_3(x, y, z):
    return (x + y + z) / 3

If we try to apply this function across our columns, we get an error.

# will cause an error
print(df.apply(avg_3))

Traceback (most recent call last):
  File "<ipython-input-1-5ebf32ddae32>", line 2, in <module>
    print(df.apply(avg_3))
TypeError: ("avg_3() missing 2 required positional arguments: 'y' and
'z'", 'occurred at index a')

From the (last line of the) error message, you can see that the function takes three arguments, but we failed to pass in the y and z (i.e., the second and third) arguments. Again, when we use apply, the entire column is passed into the first argument. For this function to work with the apply method, we will have to rewrite parts of it.

def avg_3_apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3

print(df.apply(avg_3_apply))

a   20.0
b   30.0
dtype: float64
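For something as common as a mean, we would not need the rewrite in practice; pandas Series already have a vectorized mean method, so this sketch gives the same column averages:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

# each column arrives as a Series, so .mean() works directly
avgs = df.apply(lambda col: col.mean())
print(avgs)

# equivalently, without apply at all
print(df.mean())
```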

9.3.2.2 Row-wise Operations

Row-wise operations work just like column-wise operations. The part that differs is the axis. We will now use axis=1 in the apply method. Thus, instead of the entire column being passed into the first argument of the function, the entire row is used as the first argument.

Since our example dataframe has two columns and three rows, the avg_3_apply function we just wrote will not work for row-wise operations.

# will cause an error
print(df.apply(avg_3_apply, axis=1))

Traceback (most recent call last):
  File "/home/dchen/anaconda3/envs/book36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2477, in get_value
     tz=getattr(series.dtype, 'tz', None))
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-1-8e6ba41f3975>", line 2, in <module>
    print(df.apply(avg_3_apply, axis=1))
IndexError: ('index out of bounds', 'occurred at index 0')

The main issue here is the 'index out of bounds'. We passed the row of data in as the first argument, but in our function we begin indexing out of range (i.e., we have only two values in each row, but we tried to get index 2, which means the third element, and it does not exist). If we wanted to calculate our averages row-wise, we would have to write a new function.

def avg_2_apply(row):
    x = row[0]
    y = row[1]
    return (x + y) / 2

print(df.apply(avg_2_apply, axis=1))

0    15.0
1    25.0
2    35.0
dtype: float64
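A version that is not tied to a specific number of columns uses the row’s own mean method; this sketch gives the same row averages:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

# each row arrives as a Series, so .mean() handles any number of columns
row_avgs = df.apply(lambda row: row.mean(), axis=1)
print(row_avgs.tolist())  # [15.0, 25.0, 35.0]
```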

9.4 Apply (More Advanced)

The previous examples used a small toy data set to illustrate how apply works. We saw that you can write and test a function that takes as many inputs as you need, then convert it into a function that takes a single parameter (the entire row or the entire column) and subsets the components within the function body. Section 9.5 shows another way to get an existing function to work with apply, but for now, let’s use a more realistic example.

The seaborn library has a built-in titanic data set. It contains data about whether an individual survived the sinking of the Titanic.

import seaborn as sns

titanic = sns.load_dataset("titanic")

As we would with any new data set, let’s look at some basic characteristics by using info.

print(titanic.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived           891 non-null int64
pclass             891 non-null int64
sex                891 non-null object
age                714 non-null float64
sibsp              891 non-null int64
parch              891 non-null int64
fare               891 non-null float64
embarked           889 non-null object
class              891 non-null category
who                891 non-null object
adult_male         891 non-null bool
deck               203 non-null category
embark_town        889 non-null object
alive              891 non-null object
alone              891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
None

This data set has 891 rows and 15 columns. Most columns have a value in every cell, but out of the 891 rows, age has only 714 complete cases, and deck has only 203. One way we can use apply is to calculate how many null or NaN values there are in our data, as well as the percentage of complete cases across each column or each row. Let’s write a few functions.

1. Number of missing values

# we'll use the numpy sum function
import numpy as np

def count_missing(vec):
    """Counts the number of missing values in a vector
    """
    # get a vector of True/False values
    # depending whether the value is missing
    null_vec = pd.isnull(vec)

    # take the sum of the null_vec
    # since null values do not contribute to the sum
    null_count = np.sum(null_vec)

    # return the number of missing values in the vector
    return null_count

2. Proportion of missing values

def prop_missing(vec):
    """Percentage of missing values in a vector
    """
    # numerator: number of missing values
    # we can use the count_missing function we just wrote!
    num = count_missing(vec)

    # denominator: total number of values in the vector
    # we also need to count the missing values
    dem = vec.size

    # return the proportion/percentage of missing
    return num / dem

3. Proportion of complete values

def prop_complete(vec):
    """Percentage of nonmissing values in a vector
    """

    # we can utilize the percent_missing function we just wrote
    # by subtracting its value from 1
    return 1 - prop_missing(vec)

The beauty of many (if not all) of the functions from numpy and Pandas is that they work on vectors. Unlike our original functions, which calculated the mean of exactly two or three values, we can pass an arbitrary number of items into pd.isnull or np.sum and they will calculate the corresponding value. These “vectorized” functions (Section 9.5) work across a “vector” and can handle an arbitrary amount of information.
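To see that vectorization in isolation, here is a small sketch showing pd.isnull and np.sum working on vectors of different lengths without any changes:

```python
import numpy as np
import pandas as pd

short_vec = pd.Series([1, np.nan])
long_vec = pd.Series([1, np.nan, 3, np.nan, 5])

# the same two calls work regardless of how long the vector is
print(np.sum(pd.isnull(short_vec)))  # 1
print(np.sum(pd.isnull(long_vec)))   # 2
```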

9.4.1 Column-wise Operations

Let’s use our newly created functions on each column of our data.

cmis_col = titanic.apply(count_missing)

pmis_col = titanic.apply(prop_missing)

pcom_col = titanic.apply(prop_complete)

print(cmis_col)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

print(pmis_col)

survived       0.000000
pclass         0.000000
sex            0.000000
age            0.198653
sibsp          0.000000
parch          0.000000
fare           0.000000
embarked       0.002245
class          0.000000
who            0.000000
adult_male     0.000000
deck           0.772166
embark_town    0.002245
alive          0.000000
alone          0.000000
dtype: float64

print(pcom_col)

survived       1.000000
pclass         1.000000
sex            1.000000
age            0.801347
sibsp          1.000000
parch          1.000000
fare           1.000000
embarked       0.997755
class          1.000000
who            1.000000
adult_male     1.000000
deck           0.227834
embark_town    0.997755
alive          1.000000
alone          1.000000
dtype: float64

What can we do with this information? Since we have counts of missing values, we can determine whether a column is a viable option for use in an analysis. For example, there are only two missing values in the embark_town column. We can easily check those rows to see if these values are missing randomly, or if there is a special reason for them to be missing.

print(titanic.loc[pd.isnull(titanic.embark_town), :])

    survived  pclass     sex   age  sibsp  parch  fare embarked  
61         1       1  female  38.0      0      0  80.0      NaN
829        1       1  female  62.0      0      0  80.0      NaN

     class    who  adult_male deck  embark_town alive  alone
61   First  woman       False    B          NaN   yes   True
829  First  woman       False    B          NaN   yes   True

Another observation is that the deck variable has 688 (77.2%) of its values missing. Barring further investigation, it’s safe to say this is a variable we would not use for an analysis.
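If we did decide to drop sparse columns like deck, pandas can do so directly with dropna and its thresh parameter (the minimum number of non-missing values a column must have); a sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'full': [1, 2, 3, 4],
                   'sparse': [np.nan, np.nan, np.nan, 4]})

# keep only columns with at least 3 non-missing values
trimmed = df.dropna(axis='columns', thresh=3)
print(trimmed.columns.tolist())  # ['full']
```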

9.4.2 Row-wise Operations

Since our functions are vectorized, we can apply them across the rows of our data without changing them.

cmis_row = titanic.apply(count_missing, axis=1)

pmis_row = titanic.apply(prop_missing, axis=1)

pcom_row = titanic.apply(prop_complete, axis=1)

print(cmis_row.head())

0   1
1   0
2   1
3   0
4   1
dtype: int64

print(pmis_row.head())

0   0.066667
1   0.000000
2   0.066667
3   0.000000
4   0.066667
dtype: float64

print(pcom_row.head())

0   0.933333
1   1.000000
2   0.933333
3   1.000000
4   0.933333
dtype: float64

One thing we can do with this analysis is to see if we have any rows in our data that have multiple missing values.

print(cmis_row.value_counts())

1   549
0   182
2   160
dtype: int64
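As an aside, the same per-row counts can be computed without apply by chaining vectorized methods, which is usually faster on large data; a sketch on a small frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, np.nan, 30],
                   'b': [20, 30, np.nan]})

# isnull gives a boolean frame; summing across axis=1 counts per row
num_missing = df.isnull().sum(axis=1)
print(num_missing.tolist())  # [0, 1, 1]
```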

Since we are using apply in a row-wise manner, we can actually create a new column containing these values.

titanic['num_missing'] = titanic.apply(count_missing, axis=1)

print(titanic.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S

   class    who  adult_male deck  embark_town alive  alone
0  Third    man        True  NaN  Southampton    no  False
1  First  woman       False    C    Cherbourg   yes  False
2  Third  woman       False  NaN  Southampton   yes   True
3  First  woman       False    C  Southampton   yes  False
4  Third    man        True  NaN  Southampton    no   True

   num_missing
0            1
1            0
2            1
3            0
4            1

We can then look at the rows with multiple missing values. Since there are too many rows with multiple values to print in this book, let’s randomly sample the results.

print(titanic.loc[titanic.num_missing > 1, :].sample(10))

    survived  pclass     sex  age  sibsp  parch     fare embarked  
470        0       3    male  NaN      0      0   7.2500        S
468        0       3    male  NaN      0      0   7.7250        Q
464        0       3    male  NaN      0      0   8.0500        S
65         1       3    male  NaN      1      1  15.2458        C
330        1       3  female  NaN      2      0  23.2500        Q
109        1       3  female  NaN      1      0  24.1500        Q
121        0       3    male  NaN      0      0   8.0500        S
639        0       3    male  NaN      1      0  16.1000        S
48         0       3    male  NaN      2      0  21.6792        C
837        0       3    male  NaN      0      0   8.0500        S

     class    who  adult_male  deck  embark_town  alive  alone
470  Third    man        True   NaN  Southampton     no   True
468  Third    man        True   NaN   Queenstown     no   True
464  Third    man        True   NaN  Southampton     no   True
65   Third    man        True   NaN    Cherbourg    yes  False
330  Third  woman       False   NaN   Queenstown    yes  False
109  Third  woman       False   NaN   Queenstown    yes  False
121  Third    man        True   NaN  Southampton     no   True
639  Third    man        True   NaN  Southampton     no  False
48   Third    man        True   NaN    Cherbourg     no  False
837  Third    man        True   NaN  Southampton     no   True

     num_missing
470            2
468            2
464            2
65             2
330            2
109            2
121            2
639            2
48             2
837            2

9.5 Vectorized Functions

When we use apply, we are able to make a function work on a column-by-column or row-by-row basis. However, in Section 9.3, we had to rewrite our function when we wanted to apply it because the entire column or row was passed into the first parameter of the function. However, there might be times when it is not feasible to rewrite a function in this way. We can leverage the vectorize function and decorator to vectorize any function. Vectorizing your code can also lead to performance gains (see Section 17.2.1).

Here’s our toy dataframe:

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40

And here’s our average function, which we can apply on a row-by-row basis:

def avg_2(x, y):
    return (x + y) / 2

For a vectorized function, we’d like to be able to pass in a vector of values for x and a vector of values for y, and have the result be the average of the given x and y values, in the same order. In other words, we want to be able to write avg_2(df['a'], df['b']) and get [15, 25, 35] as a result.

print(avg_2(df['a'], df['b']))

0   15.0
1   25.0
2   35.0
dtype: float64

This approach works because the actual calculations within our function are inherently vectorized. That is, if we add two numeric columns together, Pandas (and numpy) will automatically perform element-wise addition. Likewise, when we divide by a scalar, it will broadcast the scalar and divide each element by it.
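A quick sketch of that element-wise arithmetic and scalar broadcasting:

```python
import pandas as pd

a = pd.Series([10, 20, 30])
b = pd.Series([20, 30, 40])

# adding two Series works element by element
print((a + b).tolist())        # [30, 50, 70]

# the scalar 2 is broadcast across every element
print(((a + b) / 2).tolist())  # [15.0, 25.0, 35.0]
```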

Let’s change our function and perform a non-vectorizable calculation.

import numpy as np

def avg_2_mod(x, y):
    """Calculate the average, unless x is 20
    """
    if x == 20:
        return np.NaN
    else:
        return (x + y) / 2

If we run this function, it will cause an error.

# will cause an error
print(avg_2_mod(df['a'], df['b']))

Traceback (most recent call last):
   File "<ipython-input-1-cb2743ef2888>", line 2, in <module>
    print(avg_2_mod(df['a'], df['b']))
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().

However, if we give it individual numbers, instead of a vector, it will work as expected.

print(avg_2_mod(10, 20))

15.0

print(avg_2_mod(20, 30))

nan
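As an aside, this particular condition can also be handled without apply or vectorize by using np.where, which evaluates the comparison element-wise; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

# np.where(condition, value_if_true, value_if_false) is element-wise
result = np.where(df['a'] == 20, np.nan, (df['a'] + df['b']) / 2)
print(result)  # [15. nan 35.]
```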

9.5.1 Using numpy

We want to change our function so that when it is given a vector of values, it will perform the calculations in an element-wise manner. We can do this by using the vectorize function from numpy. We pass np.vectorize the function we want to vectorize, to create a new function.

# np.vectorize actually creates a new function
avg_2_mod_vec = np.vectorize(avg_2_mod)
print(avg_2_mod_vec(df['a'], df['b']))

[ 15. nan 35.]

This method works well if you do not have the source code for an existing function. However, if you are writing your own function, you can use a Python decorator to “automatically” vectorize the function without having to create a new function. Decorators are “functions” that take another function as input, and modify how that function’s output behaves.
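If decorators are new to you, here is a minimal sketch of the mechanics, using a made-up decorator that doubles whatever the wrapped function returns; the @ syntax is shorthand for reassigning the function name:

```python
def doubled(func):
    """A toy decorator: wrap func so its result is multiplied by 2"""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs) * 2
    return wrapper

@doubled
def add(x, y):
    return x + y

# equivalent to writing: add = doubled(add)
print(add(3, 4))  # 14
```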

# to use the vectorize decorator
# we use the @ symbol before our function definition
@np.vectorize
def v_avg_2_mod(x, y):
    """Calculate the average, unless x is 20
    Same as before, but we are using the vectorize decorator
    """
    if x == 20:
        return np.NaN
    else:
        return (x + y) / 2

# we can then directly use the vectorized function
# without having to create a new function
print(v_avg_2_mod(df['a'], df['b']))

[ 15. nan 35.]

9.5.2 Using numba

The numba library (https://numba.pydata.org/) is designed to optimize Python code, especially mathematical calculations performed on arrays. Just like numpy, it has a vectorize decorator.

import numba

@numba.vectorize
def v_avg_2_numba(x, y):
    """Calculate the average, unless x is 20
    Using the numba decorator.
    """
    # we now have to add type information to our function
    if int(x) == 20:
        return np.NaN
    else:
        return (x + y) / 2

However, numba does not understand Pandas objects.

print(v_avg_2_numba(df['a'], df['b']))

Traceback (most recent call last):
  File "<ipython-input-1-b03c5b533ae5>", line 2, in <module>
    print(v_avg_2_numba(df['a'], df['b']))
ValueError: cannot determine Numba type of <class
'pandas.core.series.Series'>

We actually have to pass in the numpy array representation of our data (Appendix R).

# passing in the numpy array
print(v_avg_2_numba(df['a'].values, df['b'].values))

[ 15. nan 35.]
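Note that newer versions of pandas also provide a to_numpy method, which does the same thing and is now the documented way to get the underlying array:

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# .values and .to_numpy() both return a numpy array
print(s.to_numpy())  # [10 20 30]
```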

9.6 Lambda Functions

Sometimes the function used in the apply method is simple enough that there is no need to create a separate function.

docs = pd.read_csv('../data/doctors.csv', header=None)

We can write a regular expression pattern that extracts the name (a word, some whitespace, then another word) from each row, and assign those values to a new column in our data. We could write our function and apply it as we have done in the past.

import regex

p = regex.compile(r'\w+\s+\w+')

def get_name(s):
    return p.match(s).group()

docs['name_func'] = docs[0].apply(get_name)
print(docs)

                               0              name_func
0     William Hartnell (1963-66)       William Hartnell
1    Patrick Troughton (1966-69)      Patrick Troughton
2          Jon Pertwee (1970 74)            Jon Pertwee
3            Tom Baker (1974-81)              Tom Baker
4        Peter Davison (1982-84)          Peter Davison
5          Colin Baker (1984-86)            Colin Baker
6      Sylvester McCoy (1987-89)        Sylvester McCoy
7             Paul McGann (1996)            Paul McGann
8   Christopher Eccleston (2005)  Christopher Eccleston
9        David Tennant (2005-10)          David Tennant
10          Matt Smith (2010-13)             Matt Smith
11     Peter Capaldi (2014-2017)          Peter Capaldi
12        Jodie Whittaker (2017)        Jodie Whittaker

You can see that the actual function is a simple one-liner. Usually when this happens, people opt to write the one-liner directly in the apply method, using what is called a lambda function. We can perform the same operation as shown earlier in the following manner.

docs['name_lamb'] = docs[0].apply(lambda x: p.match(x).group())
print(docs)

                               0              name_func  
0     William Hartnell (1963-66)       William Hartnell
1    Patrick Troughton (1966-69)      Patrick Troughton
2          Jon Pertwee (1970 74)            Jon Pertwee
3            Tom Baker (1974-81)              Tom Baker
4        Peter Davison (1982-84)          Peter Davison
5          Colin Baker (1984-86)            Colin Baker
6      Sylvester McCoy (1987-89)        Sylvester McCoy
7             Paul McGann (1996)            Paul McGann
8   Christopher Eccleston (2005)  Christopher Eccleston
9        David Tennant (2005-10)          David Tennant
10          Matt Smith (2010-13)             Matt Smith
11     Peter Capaldi (2014-2017)          Peter Capaldi
12        Jodie Whittaker (2017)        Jodie Whittaker

                name_lamb
0        William Hartnell
1       Patrick Troughton
2             Jon Pertwee
3               Tom Baker
4           Peter Davison
5             Colin Baker
6         Sylvester McCoy
7             Paul McGann
8   Christopher Eccleston
9           David Tennant
10             Matt Smith
11          Peter Capaldi
12        Jodie Whittaker

To write the lambda function, we use the lambda keyword. Since apply passes each element (or, when applied over a DataFrame, the entire column or row) into the first argument of our function, our lambda function takes only one parameter, x. We can then write our function directly, without having to define it first. The calculated result is automatically returned.

Although you can write complex multiple-line lambda functions, typically people will use the lambda function approach when small one-liner calculations are needed. The code can become hard to read if the lambda function tries to do too much at once.

9.7 Conclusion

This chapter covered an important concept—namely, creating functions that can be used on our data. Not all data cleaning steps or manipulations can be done using built-in functions. There will be (many) times when you will have to write your own custom functions to process and analyze data.
