4

Beginning Data Analysis

Introduction

It is important to consider the steps that you, as an analyst, take when you first encounter a dataset after importing it into your workspace as a DataFrame. Is there a set of tasks that you usually undertake to examine the data? Are you aware of all the possible data types? This chapter begins by covering the tasks you might want to undertake when first encountering a new dataset. The chapter proceeds by answering common questions about things that are not that simple to do in pandas.

Developing a data analysis routine

Although there is no standard approach when beginning a data analysis, it is typically a good idea to develop a routine for yourself when first examining a dataset. Similar to everyday routines that we have for waking up, showering, going to work, eating, and so on, a data analysis routine helps you to quickly get acquainted with a new dataset. This routine can manifest itself as a dynamic checklist of tasks that evolves as your familiarity with pandas and data analysis expands.

Exploratory Data Analysis (EDA) is a term used to describe the process of analyzing datasets. Typically, it does not involve model creation, but rather summarizing the characteristics of the data and visualizing them. This is not a new idea; it was promoted by John Tukey in his book Exploratory Data Analysis in 1977.

Many of these same processes are still applicable and useful for understanding a dataset. Indeed, they can also help with creating machine learning models later.

This recipe covers a small but fundamental part of EDA: the collection of metadata and descriptive statistics in a routine and systematic way. It outlines a standard set of tasks that can be undertaken when first importing any dataset as a pandas DataFrame. This recipe may help form the basis of the routine that you can implement when first examining a dataset.

Metadata describes the dataset or, more aptly, data about the data. Examples of metadata include the number of columns/rows, column names, data types of each column, the source of the dataset, the date of collection, the acceptable values for different columns, and so on. Univariate descriptive statistics are summary statistics about variables (columns) of the dataset, independent of all other variables.

How to do it…

First, some metadata on the college dataset will be collected, followed by basic summary statistics of each column:

  1. Read in the dataset, and view a sample of rows with the .sample method:
    >>> import pandas as pd
    >>> import numpy as np
    >>> college = pd.read_csv("data/college.csv")
    >>> college.sample(random_state=42)
               INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    3649  Career P...  San Antonio  ...        20700           14977       
    
  2. Get the dimensions of the DataFrame with the .shape attribute:
    >>> college.shape
    (7535, 27)
    
  3. List the data type of each column, the number of non-missing values, and memory usage with the .info method:
    >>> college.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7535 entries, 0 to 7534
    Data columns (total 27 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   INSTNM              7535 non-null   object
     1   CITY                7535 non-null   object
     2   STABBR              7535 non-null   object
     3   HBCU                7164 non-null   float64
     4   MENONLY             7164 non-null   float64
     5   WOMENONLY           7164 non-null   float64
     6   RELAFFIL            7535 non-null   int64
     7   SATVRMID            1185 non-null   float64
     8   SATMTMID            1196 non-null   float64
     9   DISTANCEONLY        7164 non-null   float64
     10  UGDS                6874 non-null   float64
     11  UGDS_WHITE          6874 non-null   float64
     12  UGDS_BLACK          6874 non-null   float64
     13  UGDS_HISP           6874 non-null   float64
     14  UGDS_ASIAN          6874 non-null   float64
     15  UGDS_AIAN           6874 non-null   float64
     16  UGDS_NHPI           6874 non-null   float64
     17  UGDS_2MOR           6874 non-null   float64
     18  UGDS_NRA            6874 non-null   float64
     19  UGDS_UNKN           6874 non-null   float64
     20  PPTUG_EF            6853 non-null   float64
     21  CURROPER            7535 non-null   int64
     22  PCTPELL             6849 non-null   float64
     23  PCTFLOAN            6849 non-null   float64
     24  UG25ABV             6718 non-null   float64
     25  MD_EARN_WNE_P10     6413 non-null   object
     26  GRAD_DEBT_MDN_SUPP  7503 non-null   object
    dtypes: float64(20), int64(2), object(5)
    memory usage: 1.6+ MB
    
  4. Get summary statistics for the numerical columns and transpose the DataFrame for more readable output:
    >>> college.describe(include=[np.number]).T
                count        mean  ...         75%    max
    HBCU       7164.0    0.014238  ...    0.000000    1.0
    MENONLY    7164.0    0.009213  ...    0.000000    1.0
    WOMENONLY  7164.0    0.005304  ...    0.000000    1.0
    RELAFFIL   7535.0    0.190975  ...    0.000000    1.0
    SATVRMID   1185.0  522.819409  ...  555.000000  765.0
    ...           ...         ...  ...         ...    ...
    PPTUG_EF   6853.0    0.226639  ...    0.376900    1.0
    CURROPER   7535.0    0.923291  ...    1.000000    1.0
    PCTPELL    6849.0    0.530643  ...    0.712900    1.0
    PCTFLOAN   6849.0    0.522211  ...    0.745000    1.0
    UG25ABV    6718.0    0.410021  ...    0.572275    1.0
    
  5. Get summary statistics for the object (string) and category columns:
    >>> college.describe(include=["object", "category"]).T
                 count unique          top  freq
    INSTNM        7535   7535  Academy ...     1
    CITY          7535   2514     New York    87
    STABBR        7535     59           CA   773
    MD_EARN_W...  6413    598  PrivacyS...   822
    GRAD_DEBT...  7503   2038  PrivacyS...  1510
    

How it works…

After importing your dataset, a common task is to print out a sample of rows of the DataFrame for manual inspection with the .sample method. The .shape attribute returns some metadata: a tuple containing the number of rows and columns.

The .info method returns more metadata at once: each column name, the number of non-missing values, the data type of each column, and the approximate memory usage of the DataFrame. Usually, a column in pandas has a single type (however, it is possible for a column to hold mixed types, in which case it is reported as object). DataFrames, as a whole, might be composed of columns with different data types.

Steps 4 and 5 produce descriptive statistics for different types of columns. By default, .describe outputs a summary of all the numeric columns and silently drops any non-numeric columns. You can pass other options to the include parameter to get counts and frequencies for columns with non-numeric data types. Technically, the data types are part of a hierarchy in which np.number sits above the integer and float types.
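
As a quick sanity check of that hierarchy, the following sketch (reusing the college DataFrame loaded above; output omitted) confirms that selecting by the generic "number" type picks up both the int64 and the float64 columns:

>>> numeric = college.select_dtypes(include="number")
>>> numeric.dtypes.value_counts()      # both int64 and float64 columns are selected
>>> college.describe(include="all").T  # "all" summarizes every column, numeric or not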

We can classify data as being either continuous or categorical. Continuous data is always numeric and can usually take on an infinite number of possibilities, such as height, weight, and salary. Categorical data represent discrete values that take on a finite number of possibilities, such as ethnicity, employment status, and car color. Categorical data can be represented numerically or with characters.

Categorical columns are usually going to be of either the object or the category (pd.Categorical) type. Step 5 ensures that both of these types are represented. In both step 4 and step 5, the output DataFrame is transposed with the .T property. This may ease readability for DataFrames with many columns, as it typically allows more data to fit on the screen without scrolling.

There's more…

It is possible to specify the exact quantiles returned from the .describe method when used with numeric columns:

>>> college.describe(
...     include=[np.number],
...     percentiles=[
...         0.01,
...         0.05,
...         0.10,
...         0.25,
...         0.5,
...         0.75,
...         0.9,
...         0.95,
...         0.99,
...     ],
... ).T
            count        mean  ...         99%    max
HBCU       7164.0    0.014238  ...    1.000000    1.0
MENONLY    7164.0    0.009213  ...    0.000000    1.0
WOMENONLY  7164.0    0.005304  ...    0.000000    1.0
RELAFFIL   7535.0    0.190975  ...    1.000000    1.0
SATVRMID   1185.0  522.819409  ...  730.000000  765.0
...           ...         ...  ...         ...    ...
PPTUG_EF   6853.0    0.226639  ...    0.946724    1.0
CURROPER   7535.0    0.923291  ...    1.000000    1.0
PCTPELL    6849.0    0.530643  ...    0.993908    1.0
PCTFLOAN   6849.0    0.522211  ...    0.986368    1.0
UG25ABV    6718.0    0.410021  ...    0.917383    1.0

Data dictionaries

A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.

A data dictionary for the college dataset is provided in the following college_data_dictionary.csv file:

>>> pd.read_csv("data/college_data_dictionary.csv")
    column_name  description
0        INSTNM  Institut...
1          CITY  City Loc...
2        STABBR  State Ab...
3          HBCU  Historic...
4       MENONLY  0/1 Men ...
..          ...          ...
22      PCTPELL  Percent ...
23     PCTFLOAN  Percent ...
24      UG25ABV  Percent ...
25  MD_EARN_...  Median E...
26  GRAD_DEB...  Median d...

As you can see, it is immensely helpful in deciphering the abbreviated column names. DataFrames are not the best place to store data dictionaries. A platform such as Excel or Google Sheets, with the ability to easily edit values and append columns, is a better choice. Alternatively, they can be described in a Markdown cell in Jupyter. A data dictionary is one of the first things that you can share as an analyst with collaborators.
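
If you do not have a dictionary to start from, one option is to generate a skeleton from the DataFrame's own metadata and fill in the descriptions by hand in a spreadsheet. This is a sketch rather than part of the dataset's files, and the output filename is made up:

>>> skeleton = (
...     pd.DataFrame(
...         {
...             "dtype": college.dtypes.astype(str),
...             "non_null": college.count(),
...             "unique": college.nunique(),
...         }
...     )
...     .rename_axis("column_name")
...     .reset_index()
... )
>>> # hypothetical filename; add a "description" column by hand afterwards
>>> skeleton.to_csv("college_dictionary_skeleton.csv", index=False)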

It will often be the case that the dataset you are working with originated from a database whose administrators you will have to contact to get more information. Databases have formal representations of their data, called schemas. If possible, investigate your dataset with a Subject Matter Expert (SME): someone who has expert knowledge of the data.

Reducing memory by changing data types

pandas has precise technical definitions for many data types. However, when you load data from type-less formats such as CSV, pandas has to infer the type.

This recipe changes the data type of one of the object columns from the college dataset to the special pandas categorical data type to drastically reduce its memory usage.

How to do it…

  1. After reading in our college dataset, we select a few columns of different data types that will clearly show how much memory may be saved:
    >>> college = pd.read_csv("data/college.csv")
    >>> different_cols = [
    ...     "RELAFFIL",
    ...     "SATMTMID",
    ...     "CURROPER",
    ...     "INSTNM",
    ...     "STABBR",
    ... ]
    >>> col2 = college.loc[:, different_cols]
    >>> col2.head()
       RELAFFIL  SATMTMID  ...       INSTNM STABBR
    0         0     420.0  ...  Alabama ...     AL
    1         0     565.0  ...  Universi...     AL
    2         1       NaN  ...  Amridge ...     AL
    3         0     590.0  ...  Universi...     AL
    4         0     430.0  ...  Alabama ...     AL
    
  2. Inspect the data types of each column:
    >>> col2.dtypes
    RELAFFIL      int64
    SATMTMID    float64
    CURROPER      int64
    INSTNM       object
    STABBR       object
    dtype: object
    
  3. Find the memory usage of each column with the .memory_usage method:
    >>> original_mem = col2.memory_usage(deep=True)
    >>> original_mem
    Index          128
    RELAFFIL     60280
    SATMTMID     60280
    CURROPER     60280
    INSTNM      660240
    STABBR      444565
    dtype: int64
    
  4. There is no need to use 64 bits for the RELAFFIL column as it contains only 0 or 1. Let's convert this column to an 8-bit (1 byte) integer with the .astype method:
    >>> col2["RELAFFIL"] = col2["RELAFFIL"].astype(np.int8)
    
  5. Use the .dtypes attribute to confirm the data type change:
    >>> col2.dtypes
    RELAFFIL       int8
    SATMTMID    float64
    CURROPER      int64
    INSTNM       object
    STABBR       object
    dtype: object     
    
  6. Find the memory usage of each column again and note the large reduction:
    >>> col2.memory_usage(deep=True)
    Index          128
    RELAFFIL      7535
    SATMTMID     60280
    CURROPER     60280
    INSTNM      660240
    STABBR      444565
    dtype: int64
    
  7. To save even more memory, you will want to consider changing object data types to categorical if they have a reasonably low cardinality (number of unique values). Let's first check the number of unique values for both the object columns:
    >>> col2.select_dtypes(include=["object"]).nunique()
    INSTNM    7535
    STABBR      59
    dtype: int64
    
  8. The STABBR column is a good candidate to convert to categorical as less than one percent of its values are unique:
    >>> col2["STABBR"] = col2["STABBR"].astype("category")
    >>> col2.dtypes
    RELAFFIL        int8
    SATMTMID     float64
    CURROPER       int64
    INSTNM        object
    STABBR      category
    dtype: object
    
  9. Compute the memory usage again:
    >>> new_mem = col2.memory_usage(deep=True)
    >>> new_mem
    Index          128
    RELAFFIL      7535
    SATMTMID     60280
    CURROPER     60280
    INSTNM      660699
    STABBR       13576
    dtype: int64   
    
  10. Finally, let's compare the original memory usage with our updated memory usage. The RELAFFIL column is, as expected, an eighth of its original size, while the STABBR column has shrunk to just three percent of its original size:
    >>> new_mem / original_mem
    Index       1.000000
    RELAFFIL    0.125000
    SATMTMID    1.000000
    CURROPER    1.000000
    INSTNM      1.000695
    STABBR      0.030538
    dtype: float64
    

How it works…

pandas defaults integer and float data types to 64 bits, regardless of the maximum size necessary for the particular DataFrame. Integers, floats, and even Booleans may be coerced to a different data type with the .astype method by passing it the desired type, either as a string or as a specific type object, as done in step 4.

The RELAFFIL column is a good choice to cast to a smaller integer type, as the data dictionary explains that its values must be 0 or 1. The memory for RELAFFIL is now an eighth of that used by CURROPER, which remains a 64-bit integer.

Columns that have an object data type, such as INSTNM, are not like the other pandas data types. For all the other pandas data types, each value in that column is the same data type. For instance, when a column has the int64 type, every column value is also int64. This is not true for columns that have the object data type. Each column value can be of any type. They can have a mix of strings, numerics, datetimes, or even other Python objects such as lists or tuples. For this reason, the object data type is sometimes referred to as a catch-all for a column of data that doesn't match any of the other data types. The vast majority of the time, though, object data type columns will all be strings.

Therefore, the memory of each value in an object data type column is inconsistent. There is no predefined amount of memory for each value like the other data types. For pandas to extract the exact amount of memory of an object data type column, the deep parameter must be set to True in the .memory_usage method.

Object columns are targets for the largest memory savings. pandas has an additional categorical data type that is not available in NumPy. When converting to category, pandas internally creates a mapping from integers to each unique string value, so each unique string needs to be stored only once in memory. As you can see, this change of data type reduced memory usage by 97%.
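
You can peek at that mapping through the .cat accessor; this is a minimal sketch assuming col2["STABBR"] has already been converted to category, as in step 8 (output omitted):

>>> col2["STABBR"].cat.categories[:5]  # each unique state abbreviation, stored once
>>> col2["STABBR"].cat.codes.head()    # small integers that point back into the categories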

You might also have noticed that the index uses an extremely low amount of memory. If no index is specified during DataFrame creation, as is the case in this recipe, pandas defaults the index to a RangeIndex. The RangeIndex is very similar to the built-in range function. It produces values on demand and only stores the minimum amount of information needed to create an index.

There's more…

To get a better idea of how object data type columns differ from integers and floats, a single value from each one of these columns can be modified and the resulting memory usage displayed. The CURROPER and INSTNM columns are of int64 and object types, respectively:

>>> college.loc[0, "CURROPER"] = 10000000
>>> college.loc[0, "INSTNM"] = (
...     college.loc[0, "INSTNM"] + "a"
... )
>>> college[["CURROPER", "INSTNM"]].memory_usage(deep=True)
Index           80
CURROPER     60280
INSTNM      660804
dtype: int64
    

Memory usage for CURROPER remained the same since a 64-bit integer is more than enough space for the larger number. On the other hand, the memory usage for INSTNM increased by 105 bytes by just adding a single letter to one value.

Python 3 stores strings as Unicode, a standardized character representation intended to encode all of the world's writing systems. How much memory a string takes depends on how Python was built; on this machine, it uses up to 4 bytes per character. pandas also incurs some overhead (around 100 bytes) when a string value is first modified; after that, each additional character adds about 5 bytes.
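
You can inspect the per-string cost on your own machine with the standard library; the exact figures depend on the Python build, so treat this as a sketch (output omitted):

>>> import sys
>>> sys.getsizeof("")         # fixed overhead of an empty Python str
>>> sys.getsizeof("a")        # overhead plus one character
>>> sys.getsizeof("a" * 100)  # longer strings pay per-character storage on top of the overhead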

Not all columns can be coerced to the desired type. Take a look at the MENONLY column, which, from the data dictionary, appears to contain only 0s or 1s. The actual data type of this column upon import unexpectedly turns out to be float64. The reason for this is that there happen to be missing values, denoted by np.nan. There is no integer representation for missing values for the int64 type (note that the Int64 type found in pandas 0.24+ does support missing values, but it is not used by default). Any numeric column with even a single missing value will be turned into a float column. Furthermore, any column of an integer data type will automatically be coerced to a float if one of the values becomes missing:

>>> college["MENONLY"].dtype
dtype('float64')
>>> college["MENONLY"].astype(np.int8)
Traceback (most recent call last):
  ...
ValueError: Cannot convert non-finite values (NA or inf) to integer
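
If you do want an integer representation that tolerates missing values, the nullable Int64 extension type mentioned above accepts them; this is a sketch that may require a reasonably recent pandas version:

>>> menonly = college["MENONLY"].astype("Int64")  # capital "I": the nullable integer extension type
>>> menonly.isna().sum()  # the missing values are preserved rather than raising an error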

Additionally, it is possible to substitute string names in place of Python objects when referring to data types. For instance, when using the include parameter in the .describe DataFrame method, it is possible to pass a list of either the NumPy or pandas objects or their equivalent string representation. For instance, each of the following produces the same result:

>>> college.describe(include=['int64', 'float64']).T
>>> college.describe(include=[np.int64, np.float64]).T
>>> college.describe(include=['int', 'float']).T
>>> college.describe(include=['number']).T

The type strings can also be used in combination with the .astype method:

>>> college.assign(
...     MENONLY=college["MENONLY"].astype("float16"),
...     RELAFFIL=college["RELAFFIL"].astype("int8"),
... )
           INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
0     Alabama ...       Normal  ...        30300           33888       
1     Universi...   Birmingham  ...        39700         21941.5       
2     Amridge ...   Montgomery  ...        40100           23370       
3     Universi...   Huntsville  ...        45500           24097       
4     Alabama ...   Montgomery  ...        26600         33118.5       
...           ...          ...  ...          ...             ...       
7530  SAE Inst...   Emeryville  ...          NaN            9500       
7531  Rasmusse...  Overland...  ...          NaN           21163       
7532  National...  Highland...  ...          NaN            6333       
7533  Bay Area...     San Jose  ...          NaN     PrivacyS...       
7534  Excel Le...  San Antonio  ...          NaN           12125       
    

Lastly, it is possible to see the enormous memory difference between the minimal RangeIndex and Int64Index, which stores every row index in memory:

>>> college.index = pd.Int64Index(college.index)
>>> college.index.memory_usage()  # previously was just 80
60280

Selecting the smallest of the largest

This recipe can be used to create catchy news headlines such as Out of the Top 100 Universities, These 5 have the Lowest Tuition, or From the Top 50 Cities to Live, these 10 are the Most Affordable.

During analysis, it is possible that you will first need to find a grouping of data that contains the top n values in a single column and, from this subset, find the bottom m values based on a different column.

In this recipe, we find the five lowest budget movies from the top 100 scoring movies by taking advantage of the convenience methods: .nlargest and .nsmallest.

How to do it…

  1. Read in the movie dataset, and select the columns: movie_title, imdb_score, and budget:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie2 = movie[["movie_title", "imdb_score", "budget"]]
    >>> movie2.head()
       movie_title  imdb_score       budget
    0       Avatar         7.9  237000000.0
    1  Pirates ...         7.1  300000000.0
    2      Spectre         6.8  245000000.0
    3  The Dark...         8.5  250000000.0
    4  Star War...         7.1          NaN
    
  2. Use the .nlargest method to select the top 100 movies by imdb_score:
    >>> movie2.nlargest(100, "imdb_score").head()
                       movie_title  imdb_score      budget
    2725          Towering Inferno         9.5         NaN
    1920  The Shawshank Redemption         9.3  25000000.0
    3402             The Godfather         9.2   6000000.0
    2779                   Dekalog         9.1         NaN
    4312      Kickboxer: Vengeance         9.1  17000000.0
    
  3. Chain the .nsmallest method to return the five lowest budget films among those with a top 100 score:
    >>> (
    ...     movie2.nlargest(100, "imdb_score").nsmallest(
    ...         5, "budget"
    ...     )
    ... )
                   movie_title  imdb_score    budget
    4804        Butterfly Girl         8.7  180000.0
    4801    Children of Heaven         8.5  180000.0
    4706          12 Angry Men         8.9  350000.0
    4550          A Separation         8.4  500000.0
    4636  The Other Dream Team         8.4  500000.0
    

How it works…

The first parameter of the .nlargest method, n, must be an integer and selects the number of rows to be returned. The second parameter, columns, takes a column name as a string. Step 2 returns the 100 highest-scoring movies. We could have saved this intermediate result as its own variable but instead, we chain the .nsmallest method to it in step 3, which returns exactly five rows, sorted by budget.

There's more…

It is possible to pass a list of column names to the columns parameter of the .nlargest and .nsmallest methods. The additional columns are only used to break ties when duplicate values share the nth-ranked spot in the first column of the list.
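
A brief sketch of that tie-breaking behavior, reusing movie2 from the first step of this recipe (output omitted):

>>> movie2.nlargest(100, ["imdb_score", "budget"]).head()  # equal scores are then ranked by budget, largest first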

Selecting the largest of each group by sorting

One of the most basic and common operations to perform during data analysis is to select rows containing the largest value of some column within a group. For instance, this would be like finding the highest-rated film of each year or the highest-grossing film by content rating. To accomplish this task, we sort by the grouping column and by the column used to rank each member of the group, and then extract the highest member of each group.

In this recipe, we will find the highest-rated film of each year.

How to do it…

  1. Read in the movie dataset and slim it down to just the three columns we care about: movie_title, title_year, and imdb_score:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie[["movie_title", "title_year", "imdb_score"]]
                                         movie_title  ...
    0                                         Avatar  ...
    1       Pirates of the Caribbean: At World's End  ...
    2                                        Spectre  ...
    3                          The Dark Knight Rises  ...
    4     Star Wars: Episode VII - The Force Awakens  ...
    ...                                          ...  ...
    4911                     Signed Sealed Delivered  ...
    4912                               The Following  ...
    4913                        A Plague So Pleasant  ...
    4914                            Shanghai Calling  ...
    4915                           My Date with Drew  ...
    
  2. Use the .sort_values method to sort the DataFrame by title_year. The default behavior sorts from the smallest to the largest; passing ascending=False would invert this. Here, we pass ascending=True explicitly to make the default visible:
    >>> (
    ...     movie[
    ...         ["movie_title", "title_year", "imdb_score"]
    ...     ].sort_values("title_year", ascending=True)
    ... )
                                               movie_title  ...
    4695  Intolerance: Love's Struggle Throughout the Ages  ...
    4833                    Over the Hill to the Poorhouse  ...
    4767                                    The Big Parade  ...
    2694                                        Metropolis  ...
    4697                               The Broadway Melody  ...
    ...                                                ...  ...
    4683                                            Heroes  ...
    4688                                       Home Movies  ...
    4704                                        Revolution  ...
    4752                                      Happy Valley  ...
    4912                                     The Following  ...
    
  3. Notice how only the year was sorted. To sort multiple columns at once, use a list. Let's look at how to sort both year and score:
    >>> (
    ...     movie[
    ...          ["movie_title", "title_year", "imdb_score"]
    ...     ].sort_values(
    ...         ["title_year", "imdb_score"], ascending=False
    ...     )
    ... )
                          movie_title  title_year  imdb_score
    4312         Kickboxer: Vengeance      2016.0         9.1
    4277  A Beginner's Guide to Snuff      2016.0         8.7
    3798                      Airlift      2016.0         8.5
    27     Captain America: Civil War      2016.0         8.2
    98            Godzilla Resurgence      2016.0         8.2
    ...                           ...         ...         ...
    1391                    Rush Hour         NaN         5.8
    4031                     Creature         NaN         5.0
    2165              Meet the Browns         NaN         3.5
    3246   The Bold and the Beautiful         NaN         3.5
    2119                 The Bachelor         NaN         2.9
    
  4. Now, we use the .drop_duplicates method to keep only the first row of every year:
    >>> (
    ...     movie[["movie_title", "title_year", "imdb_score"]]
    ...     .sort_values(
    ...         ["title_year", "imdb_score"], ascending=False
    ...     )
    ...     .drop_duplicates(subset="title_year")
    ... )
          movie_title  title_year  imdb_score
    4312  Kickboxe...      2016.0         9.1
    3745  Running ...      2015.0         8.6
    4369  Queen of...      2014.0         8.7
    3935  Batman: ...      2013.0         8.4
    3     The Dark...      2012.0         8.5
    ...           ...         ...         ...
    2694   Metropolis      1927.0         8.3
    4767  The Big ...      1925.0         8.3
    4833  Over the...      1920.0         4.8
    4695  Intolera...      1916.0         8.0
    2725  Towering...         NaN         9.5
    

How it works…

This example shows how I use chaining to build up and test a sequence of pandas operations. In step 1, we slim the dataset down to concentrate on only the columns of importance. This recipe would work the same with the entire DataFrame. Step 2 shows how to sort a DataFrame by a single column, which is not exactly what we wanted. Step 3 sorts multiple columns at the same time: it first sorts all of title_year and then, within each value of title_year, sorts by imdb_score (both in descending order here).

The default behavior of the .drop_duplicates method is to keep the first occurrence of each unique row, which would not drop any rows as each row is unique. However, the subset parameter alters it to only consider the column (or list of columns) given to it. In this example, only one row for each year will be returned. As we sorted by year and score in the last step, the highest-scoring movie for each year is what we get.

There's more…

As in most things pandas, there is more than one way to do this. If you find yourself comfortable with grouping operations, you can use the .groupby method to do this as well:

>>> (
...     movie[["movie_title", "title_year", "imdb_score"]]
...     .groupby("title_year", as_index=False)
...     .apply(
...         lambda df: df.sort_values(
...             "imdb_score", ascending=False
...         ).head(1)
...     )
...     .droplevel(0)
...     .sort_values("title_year", ascending=False)
... )
         movie_title  title_year  imdb_score
90 4312  Kickboxe...      2016.0         9.1
89 3745  Running ...      2015.0         8.6
88 4369  Queen of...      2014.0         8.7
87 3935  Batman: ...      2013.0         8.4
86 3     The Dark...      2012.0         8.5
...              ...         ...         ...
4  4555  Pandora'...      1929.0         8.0
3  2694   Metropolis      1927.0         8.3
2  4767  The Big ...      1925.0         8.3
1  4833  Over the...      1920.0         4.8
0  4695  Intolera...      1916.0         8.0

It is possible to sort one column in ascending order while simultaneously sorting another column in descending order. To accomplish this, pass in a list of Booleans to the ascending parameter that corresponds to how you would like each column sorted. The following sorts title_year and content_rating in descending order and budget in ascending order. It then finds the lowest budget film for each year and content rating group:

>>> (
...     movie[
...         [
...             "movie_title",
...             "title_year",
...             "content_rating",
...             "budget",
...         ]
...     ]
...     .sort_values(
...         ["title_year", "content_rating", "budget"],
...         ascending=[False, False, True],
...     )
...     .drop_duplicates(
...         subset=["title_year", "content_rating"]
...     )
... )
      movie_title  title_year content_rating      budget
4026    Compadres      2016.0            R     3000000.0
4658  Fight to...      2016.0        PG-13      150000.0
4661   Rodeo Girl      2016.0           PG      500000.0
3252  The Wailing      2016.0    Not Rated           NaN
4659  Alleluia...      2016.0          NaN      500000.0
...           ...         ...          ...           ...
2558   Lilyhammer         NaN        TV-MA    34000000.0
807   Sabrina,...         NaN         TV-G     3000000.0
848   Stargate...         NaN        TV-14     1400000.0
2436       Carlos         NaN    Not Rated           NaN
2119  The Bach...         NaN          NaN     3000000.0

By default, .drop_duplicates keeps the very first appearance of a value, but this behavior may be modified by passing keep='last' to select the last row of each group or keep=False to drop all duplicates entirely.
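
A short sketch of those alternatives on the same sorted frame (output omitted); because budget is sorted ascending within each group, keep="last" returns the highest budget film per year and content rating, while keep=False discards every group that has more than one film:

>>> sorted_movies = movie[
...     ["movie_title", "title_year", "content_rating", "budget"]
... ].sort_values(
...     ["title_year", "content_rating", "budget"],
...     ascending=[False, False, True],
... )
>>> sorted_movies.drop_duplicates(
...     subset=["title_year", "content_rating"], keep="last"
... )
>>> sorted_movies.drop_duplicates(
...     subset=["title_year", "content_rating"], keep=False
... )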

Replicating nlargest with sort_values

The previous two recipes work similarly by sorting values in slightly different manners. Finding the top n values of a column of data is equivalent to sorting the entire column in descending order and taking the first n values. pandas has many operations that are capable of doing this in a variety of ways.

In this recipe, we will replicate the Selecting the smallest of the largest recipe with the .sort_values method and explore the differences between the two.

How to do it…

  1. Let's recreate the result from the final step of the Selecting the smallest of the largest recipe:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> (
    ...     movie[["movie_title", "imdb_score", "budget"]]
    ...     .nlargest(100, "imdb_score")
    ...     .nsmallest(5, "budget")
    ... )
                   movie_title  imdb_score    budget
    4804        Butterfly Girl         8.7  180000.0
    4801    Children of Heaven         8.5  180000.0
    4706          12 Angry Men         8.9  350000.0
    4550          A Separation         8.4  500000.0
    4636  The Other Dream Team         8.4  500000.0
    
  2. Use .sort_values to replicate the first part of the expression and grab the first 100 rows with the .head method:
    >>> (
    ...     movie[["movie_title", "imdb_score", "budget"]]
    ...     .sort_values("imdb_score", ascending=False)
    ...     .head(100)
    ... )
          movie_title  imdb_score      budget
    2725  Towering...         9.5         NaN
    1920  The Shaw...         9.3  25000000.0
    3402  The Godf...         9.2   6000000.0
    2779      Dekalog         9.1         NaN
    4312  Kickboxe...         9.1  17000000.0
    ...           ...         ...         ...
    3799  Anne of ...         8.4         NaN
    3777  Requiem ...         8.4   4500000.0
    3935  Batman: ...         8.4   3500000.0
    4636  The Othe...         8.4    500000.0
    2455       Aliens         8.4  18500000.0
    
  3. Now that we have the top 100 scoring movies, we can use .sort_values with .head again to grab the lowest five by budget:
    >>> (
    ...     movie[["movie_title", "imdb_score", "budget"]]
    ...     .sort_values("imdb_score", ascending=False)
    ...     .head(100)
    ...     .sort_values("budget")
    ...     .head(5)
    ... )
                        movie_title  imdb_score    budget
    4815  A Charlie Brown Christmas         8.4  150000.0
    4801         Children of Heaven         8.5  180000.0
    4804             Butterfly Girl         8.7  180000.0
    4706               12 Angry Men         8.9  350000.0
    4636       The Other Dream Team         8.4  500000.0
    

How it works…

The .sort_values method can nearly replicate .nlargest by chaining the .head method after the operation, as seen in step 2. Step 3 replicates .nsmallest by chaining another .sort_values method and completes the query by taking just the first five rows with the .head method.

Take a look at the output from the first DataFrame from step 1 and compare it with the output from step 3. Are they the same? No! What happened? To understand why the two results are not equivalent, let's look at the tail of the intermediate steps of each recipe:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .nlargest(100, "imdb_score")
...     .tail()
... )
                movie_title  imdb_score     budget
4023                 Oldboy         8.4  3000000.0
4163  To Kill a Mockingbird         8.4  2000000.0
4395         Reservoir Dogs         8.4  1200000.0
4550           A Separation         8.4   500000.0
4636   The Other Dream Team         8.4   500000.0
>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)
...     .tail()
... )
      movie_title  imdb_score      budget
3799  Anne of ...         8.4         NaN
3777  Requiem ...         8.4   4500000.0
3935  Batman: ...         8.4   3500000.0
4636  The Othe...         8.4    500000.0
2455       Aliens         8.4  18500000.0

The issue arises because more than 100 movies exist with a rating of at least 8.4. Each of the methods, .nlargest and .sort_values, breaks ties differently, which results in a slightly different 100-row DataFrame. If you pass kind='mergesort' to the .sort_values method, you will get the same result as .nlargest.
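
That variant is sketched below; mergesort is a stable sort, so tied scores keep their original row order, which matches how .nlargest (with its default keep='first') selects them. It is worth verifying on your own data:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False, kind="mergesort")  # stable sort
...     .head(100)
...     .tail()
... )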

Calculating a trailing stop order price

There are many strategies to trade stocks. One basic type of trade that many investors employ is the stop order. A stop order is an order placed by an investor to buy or sell a stock that executes whenever the market price reaches a certain point. Stop orders are useful to both prevent huge losses and protect gains.

For this recipe, we will only be examining stop orders used to sell currently owned stocks. In a typical stop order, the price does not change throughout the lifetime of the order. For instance, if you purchased a stock for $100 per share, you might want to set a stop order at $90 per share to limit your downside to 10%.

A more advanced strategy would be to continually modify the sale price of the stop order to track the value of the stock if it increases in value. This is called a trailing stop order. Concretely, if the same $100 stock increases to $120, then a trailing stop order 10% below the current market value would move the sale price to $108.

The trailing stop order never moves down and is always tied to the maximum value since the time of purchase. If the stock fell from $120 to $110, the stop order would still remain at $108. It would only increase if the price moved above $120.
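
Before fetching real prices, the arithmetic can be illustrated with a made-up price series (the numbers below are invented and mirror the $100/$120/$110 example above):

>>> import pandas as pd
>>> prices = pd.Series([100, 105, 120, 110, 115, 125.0])
>>> prices.cummax() * 0.9  # 90.0, 94.5, 108.0, 108.0, 108.0, 112.5 -- never moves down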

This recipe requires the use of the third-party package pandas-datareader, which fetches stock market prices online. It does not come pre-installed with pandas. To install this package, use the command line and run conda install pandas-datareader or pip install pandas-datareader. You may need to install the requests_cache library as well.

This recipe determines the trailing stop order price given an initial purchase price for any stock.

How to do it…

  1. To get started, we will work with Tesla Motors (TSLA) stock and presume a purchase on the first trading day of 2017:
    >>> import datetime
    >>> import pandas_datareader.data as web
    >>> import requests_cache
    >>> session = requests_cache.CachedSession(
    ...     cache_name="cache",
    ...     backend="sqlite",
    ...     expire_after=datetime.timedelta(days=90),
    ... )
    >>> tsla = web.DataReader(
    ...     "tsla",
    ...     data_source="yahoo",
    ...     start="2017-1-1",
    ...     session=session,
    ... )
    >>> tsla.head(8)
                      High         Low  ...    Volume   Adj Close
    Date                                ...
    2017-01-03  220.330002  210.960007  ...   5923300  216.990005
    2017-01-04  228.000000  214.309998  ...  11213500  226.990005
    2017-01-05  227.479996  221.949997  ...   5911700  226.750000
    2017-01-06  230.309998  225.449997  ...   5527900  229.009995
    2017-01-09  231.919998  228.000000  ...   3979500  231.279999
    2017-01-10  232.000000  226.889999  ...   3660000  229.869995
    2017-01-11  229.979996  226.679993  ...   3650800  229.729996
    2017-01-12  230.699997  225.580002  ...   3790200  229.589996
    
  2. For simplicity, we will work with the closing price of each trading day:
    >>> tsla_close = tsla["Close"]
    
  3. Use the .cummax method to track the highest closing price until the current date:
    >>> tsla_cummax = tsla_close.cummax()
    >>> tsla_cummax.head()
    Date
    2017-01-03    216.990005
    2017-01-04    226.990005
    2017-01-05    226.990005
    2017-01-06    229.009995
    2017-01-09    231.279999
    Name: Close, dtype: float64
    
  4. To limit the downside to 10%, we multiply the result by 0.9. This creates the trailing stop order. We will chain all of the steps together:
    >>> (tsla["Close"].cummax().mul(0.9).head())
    Date
    2017-01-03    195.291005
    2017-01-04    204.291005
    2017-01-05    204.291005
    2017-01-06    206.108995
    2017-01-09    208.151999
    Name: Close, dtype: float64
    

How it works…

The .cummax method works by retaining the maximum value encountered up to and including the current value. Multiplying this series by 0.9, or whatever cushion you would like to use, creates the trailing stop order. In this particular example, TSLA increased in value, and thus, its trailing stop has also increased.

There's more…

This recipe gives just a taste of how pandas may be used to trade securities, and it stops short of calculating the return if and when the stop order triggers.

A very similar strategy may be used during a weight-loss program: you can set a warning any time your weight strays too far above its minimum. pandas provides the .cummin method to track the minimum value. If you keep track of your daily weight in a Series, the following code computes a trailing threshold 5% above your lowest recorded weight to date:

weight.cummin() * 1.05
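
As a minimal, self-contained sketch (the daily weights below are made up):

>>> import pandas as pd
>>> weight = pd.Series([185, 183, 184, 181, 182, 180.0])  # made-up daily weigh-ins
>>> weight.cummin() * 1.05  # warning threshold: 5% above the lowest weight recorded so far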