Transforming Data

Another part of tidying data involves transforming existing data into another presentation. This may be needed for the following reasons:

  • Values are not in the correct units
  • Values are qualitative and need to be converted to appropriate numeric values
  • There is extraneous data that either wastes memory and processing time, or can affect results simply by being included

To address these situations, we can take one or more of the following actions:

  • Map values to other values using a table lookup process
  • Explicitly replace certain values with other values (or even another type of data)
  • Apply methods to transform the values based on an algorithm
  • Simply remove extraneous columns and rows

We have already seen how to delete rows and columns with several techniques, so we will not reiterate those here. We will cover the facilities provided by pandas for mapping, replacing, and applying functions to transform data based upon its content.

Mapping

One of the basic tasks in data transformation is mapping one set of values to another. pandas provides a generic ability to map values using a lookup table (via a Python dictionary or a pandas Series) with the .map() method. This method performs the mapping by matching each value of the outer Series (the one .map() is called on) against the index labels of the inner Series (the one passed as the argument), and returning a new Series with the index labels of the outer Series but the values from the inner Series:

In [46]:
   # create two Series objects to demonstrate mapping
   x = pd.Series({"one": 1, "two": 2, "three": 3})
   y = pd.Series({1: "a", 2: "b", 3: "c"})
   x

Out[46]:
   one      1
   three    3
   two      2
   dtype: int64

In [47]:
   y

Out[47]:
   1    a
   2    b
   3    c
   dtype: object

In [48]:
   # map values in x to values in y 
   x.map(y)

Out[48]:
   one      a
   three    c
   two      b
   dtype: object
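The lookup table does not have to be a Series. As a quick sketch (not from the book's examples), .map() accepts a plain Python dictionary, and also a function that is called with each value in turn:

```python
import pandas as pd

x = pd.Series({"one": 1, "two": 2, "three": 3})

# mapping via a plain Python dictionary works the same way as a Series
mapped = x.map({1: "a", 2: "b", 3: "c"})
print(mapped["one"])     # a

# a function can also be passed; it receives each value, not the label
doubled = x.map(lambda v: v * 2)
print(doubled["three"])  # 6
```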

As with other alignment operations, if pandas does not find a match between a value of the outer Series and an index label of the inner Series, it will fill the result with NaN. To demonstrate this, the following code omits the 3 key from the inner (lookup) Series, causing the alignment to fail for that record, and the result is that a NaN value is introduced:

In [49]:
   # three in x will not align / map to a value in y
   x = pd.Series({"one": 1, "two": 2, "three": 3})
   y = pd.Series({1: "a", 2: "b"})
   x.map(y)

Out[49]:
   one        a
   three    NaN
   two        b
   dtype: object
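If a NaN for unmatched values is not what you want, one simple pattern (a sketch, not covered in the text above) is to chain a call to .fillna() after the map to supply a default:

```python
import pandas as pd

x = pd.Series({"one": 1, "two": 2, "three": 3})
y = pd.Series({1: "a", 2: "b"})  # no mapping for 3

# unmapped values become NaN; supply a default afterwards
result = x.map(y).fillna("unknown")
print(result["three"])  # unknown
```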

Replacing values

We previously saw how the .fillna() method can be used to replace the NaN values with a value of your own decision. The .fillna() method can actually be thought of as an implementation of the .map() method that maps a single value, NaN, to a specific value.

Even more generically, the .fillna() method itself can be considered a specialization of a more general replacement that is provided by the .replace() method, which provides more flexibility by being able to replace any value (not just NaN) with another value.
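To make this relationship concrete, the following sketch shows that replacing NaN via .replace() produces the same result as .fillna():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# these two calls produce the same result: NaN is replaced with 0
via_fillna = s.fillna(0)
via_replace = s.replace(np.nan, 0)
print(via_fillna.equals(via_replace))  # True
```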

The most basic use of the .replace() method replaces an individual value with another:

In [50]:
   # create a Series to demonstrate replace
   s = pd.Series([0., 1., 2., 3., 2., 4.])
   s

Out[50]:
   0    0
   1    1
   2    2
   3    3
   4    2
   5    4
   dtype: float64

In [51]:
   # replace all occurrences of the value 2 with the value 5
   s.replace(2, 5)

Out[51]:
   0    0
   1    1
   2    5
   3    3
   4    5
   5    4
   dtype: float64

It is also possible to replace multiple items at once, specifying their substitute values, by passing two lists:

In [52]:
   # replace all items with new values
   s.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

Out[52]:
   0    4
   1    3
   2    2
   3    1
   4    2
   5    0
   dtype: float64

Replacement can also be performed by specifying a dictionary for lookup (a variant of the map process in the previous section):

In [53]:
   # replace using entries in a dictionary
   s.replace({0: 10, 1: 100})

Out[53]:
   0     10
   1    100
   2      2
   3      3
   4      2
   5      4
   dtype: float64

When using .replace() on a DataFrame, it is possible to specify a different value to match in each column. This is performed by passing a Python dictionary to the .replace() method, where the keys of the dictionary are the names of the columns in which replacement is to occur and the values of the dictionary are the values you want to replace. The second parameter to the method is the value substituted wherever a match is found.

The following code demonstrates by creating a DataFrame object and then replacing specific values in each of the columns with 100:

In [54]:
   # DataFrame with two columns
   df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
   df

Out[54]:
      a  b
   0  0  5
   1  1  6
   2  2  7
   3  3  8
   4  4  9

In [55]:
   # specify different replacement values for each column
   df.replace({'a': 1, 'b': 8}, 100)

Out[55]:
        a    b
   0    0    5
   1  100    6
   2    2    7
   3    3  100
   4    4    9

Replacing specific values in each of the columns is very convenient, as it provides a shorthand for what otherwise would require coding a loop through all the columns.
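As a rough illustration of that shorthand, the dictionary form is approximately equivalent to the following explicit loop over the columns (a sketch for comparison only):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

# the one-line dictionary form
shorthand = df.replace({'a': 1, 'b': 8}, 100)

# roughly equivalent explicit loop, column by column
longhand = df.copy()
for col, old in {'a': 1, 'b': 8}.items():
    longhand[col] = longhand[col].replace(old, 100)

print(shorthand.equals(longhand))  # True
```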

It is also possible to replace specified values as though they were missing, filling them from neighboring data. The following code demonstrates by replacing the values 1, 2, and 3 (wherever they occur) with the most recent preceding value that is not itself being replaced, which here is the 10 at index position 0:

In [56]:
   # demonstrate replacement with pad method
   # set first item to 10, to have a distinct replacement value
   s[0] = 10
   s

Out[56]:
   0    10
   1     1
   2     2
   3     3
   4     2
   5     4
   dtype: float64

In [57]:
   # replace the values 1, 2 and 3, filling each from the most
   # recent preceding value that is not being replaced (10)
   s.replace([1, 2, 3], method='pad')

Out[57]:
   0    10
   1    10
   2    10
   3    10
   4    10
   5     4
   dtype: float64

Applying functions to transform data

In situations where a direct mapping or substitution will not suffice, it is possible to apply a function to the data to perform an algorithm on the data. pandas provides the ability to apply functions to individual items, entire columns, or entire rows, providing incredible flexibility in transformation.

Functions can be applied using the conveniently named .apply() method, which, given a Python function, will iteratively call the function, passing in each value of a Series, or each Series representing a DataFrame column, or each Series representing a DataFrame row. Which of these is used depends on whether the object is a Series or a DataFrame, and for a DataFrame, upon which axis is specified.

To begin demonstrations, the following code applies a lambda function to each item of a Series:

In [58]:
   # demonstrate applying a function to every item of a Series
   s = pd.Series(np.arange(0, 5))
   s.apply(lambda v: v * 2)

Out[58]:
   0    0
   1    2
   2    4
   3    6
   4    8
   dtype: int64

When applying a function to the items in a Series, only the value of each item is passed to the function; the index label is not.

When a function is applied to a DataFrame, the default is to apply the method to each column. pandas will iterate through all columns passing each as a Series to your function. The result will be a Series object with index labels matching column names and with the result of the function applied to the column:

In [59]:
   # demonstrate applying a sum on each column
   df = pd.DataFrame(np.arange(12).reshape(4, 3), 
                     columns=['a', 'b', 'c'])
   df

Out[59]:
      a   b   c
   0  0   1   2
   1  3   4   5
   2  6   7   8
   3  9  10  11

In [60]:
   # calculate the sum of the items in each column
   df.apply(lambda col: col.sum())

Out[60]:
   a    18
   b    22
   c    26
   dtype: int64

Application of the function can be switched to the values from each row by specifying axis=1:

In [61]:
   # calculate the sum of items in each row
   df.apply(lambda row: row.sum(), axis=1)

Out[61]:
   0     3
   1    12
   2    21
   3    30
   dtype: int64
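For a simple aggregation like this, note that pandas built-ins produce the same result without a per-row Python function call, which is generally faster. A quick sketch of the equivalence:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

# per-row apply versus the vectorized built-in
via_apply = df.apply(lambda row: row.sum(), axis=1)
via_builtin = df.sum(axis=1)
print(via_apply.equals(via_builtin))  # True
```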

A common practice is to take the result of an apply operation and add it as a new column of the DataFrame. This is convenient as you can add onto the DataFrame the result of one or more successive calculations, providing yourself with progressive representations of the derivation of results through every step of the process.

The following code demonstrates this process. The first step will multiply column a by column b and create a new column named interim. The second step will add those values and column c, and create a result column with those values:

In [62]:
   # create a new column 'interim' with a * b
   df['interim'] = df.apply(lambda r: r.a * r.b, axis=1)
   df

Out[62]:
      a   b   c  interim
   0  0   1   2        0
   1  3   4   5       12
   2  6   7   8       42
   3  9  10  11       90

In [63]:
   # and now a 'result' column with 'interim' + 'c'
   df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
   df

Out[63]:
      a   b   c  interim  result
   0  0   1   2        0       2
   1  3   4   5       12      17
   2  6   7   8       42      50
   3  9  10  11       90     101

If you would like to change the values in an existing column, simply assign the result to that column. The following code changes the values of the 'a' column to be the sum of columns a, b, and c:

In [64]:
   # replace column a with the sum of columns a, b and c
   df.a = df.a + df.b + df.c
   df

Out[64]:
       a   b   c  interim  result
   0   3   1   2        0       2
   1  12   4   5       12      17
   2  21   7   8       42      50
   3  30  10  11       90     101

As a matter of practice, replacing a column with completely new values is not the best way to do things and often leads to situations of temporary insanity trying to debug problems caused by data that is lost. In pandas, it is a common practice to just add new rows or columns (or totally new objects), and if memory or performance becomes a problem later on, do the optimizations as required.

Another point to note is that a pandas DataFrame is not a spreadsheet, where cells are assigned formulas and are recalculated when the cells they reference change. If you desire this behavior, you will need to re-execute the formulas whenever the dependent data changes. On the flip side, this is more efficient than a spreadsheet, as every little change does not cause a cascade of operations to occur.
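The following small sketch (not from the book's running example) makes that point concrete: a derived column is not recalculated when its source data changes, until you re-execute the calculation yourself:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
df['total'] = df.a + df.b     # derived column, computed once
df.loc[0, 'a'] = 100          # change a source value...
print(df.loc[0, 'total'])     # ...still 11: no automatic recalculation

df['total'] = df.a + df.b     # re-execute the "formula" explicitly
print(df.loc[0, 'total'])     # now 110
```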

The .apply() method will always apply the provided function to all of the items, rows, or columns. If you want to apply the function to only a subset of these, first perform a Boolean selection to filter out the items you do not want to process.

To demonstrate this, the following code creates a DataFrame of values and inserts one NaN value into the second row. It then applies a function to only the rows where all values are not NaN:

In [65]:
   # create a 3x5 DataFrame
   # only second row has a NaN
   df = pd.DataFrame(np.arange(0, 15).reshape(3,5))
   df.loc[1, 2] = np.nan
   df

Out[65]:
       0   1   2   3   4
   0   0   1   2   3   4
   1   5   6 NaN   8   9
   2  10  11  12  13  14

In [66]:
   # demonstrate applying a function to only rows having
   # a count of 0 NaN values
   df.dropna().apply(lambda x: x.sum(), axis=1)

Out[66]:
   0    10
   2    60
   dtype: float64

The last (but not least) method for applying functions, shown in the next example, is the .applymap() method of the DataFrame. Whereas .apply() is always passed an entire row or column, .applymap() applies a function to every individual item in the DataFrame, one by one.

Here is a practical example of applying the .applymap() method to every item in a DataFrame, specifically to format each value to a specified number of decimal places:

In [67]:
   # use applymap to format all items of the DataFrame
   df.applymap(lambda x: '%.2f' % x)

Out[67]:
          0      1      2      3      4
   0   0.00   1.00   2.00   3.00   4.00
   1   5.00   6.00    nan   8.00   9.00
   2  10.00  11.00  12.00  13.00  14.00