Appendix

1. Introduction to Data Wrangling with Python

Activity 1.01: Handling Lists

Solution:

These are the steps to complete this activity:

  1. Import the random library:

    import random

  2. Use the randint method from the random library to create 100 random numbers:

    random_number_list = [random.randint(0, 100)

                          for x in range(0, 100)]

  3. Print random_number_list:

    random_number_list

    The sample output is as follows:

    Figure 1.20: Section of output for random_number_list


    Note

    The output is susceptible to change since we are generating random numbers.

  4. Create a list_with_divisible_by_3 list from random_number_list, which will contain only numbers that are divisible by 3:

    list_with_divisible_by_3 = [a for a in

                                random_number_list if a % 3 == 0]

    list_with_divisible_by_3

    The sample output is as follows:

    Figure 1.21: Section of the output for random_number_list divisible by 3


  5. Use the len function to measure the length of the first list and the second list, and store them in two different variables, length_of_random_list and length_of_3_divisible_list. Calculate the difference in length in a variable called difference:

    length_of_random_list = len(random_number_list)

    length_of_3_divisible_list = len(list_with_divisible_by_3)

    difference = length_of_random_list - length_of_3_divisible_list

    difference

    The sample output is as follows:

    71

  6. Combine the tasks we have performed so far and add a for loop to it. Run the loop 10 times and add the values of the difference variables to a list:

    NUMBER_OF_EXPERIMENTS = 10

    difference_list = []

    for i in range(0, NUMBER_OF_EXPERIMENTS):

        random_number_list = [random.randint(0, 100)

                              for x in range(0, 100)]

        list_with_divisible_by_3 = [a for a in random_number_list

                                    if a % 3 == 0]

        

        length_of_random_list = len(random_number_list)

        length_of_3_divisible_list = len(list_with_divisible_by_3)

        difference = (length_of_random_list

                      - length_of_3_divisible_list)

        difference_list.append(difference)

    difference_list

    The sample output is as follows:

    [64, 61, 67, 60, 73, 66, 66, 75, 70, 61]

  7. Then, calculate the arithmetic mean (common average) for the differences in the lengths that you have:

    avg_diff = sum(difference_list) / float(len(difference_list))

    avg_diff

    The sample output is as follows:

    66.3
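    As a quick sanity check on that figure: of the 101 possible values from 0 to 100, 34 are divisible by 3, so each experiment keeps roughly 100 * 34 / 101, which is about 33.7 numbers, and the expected difference is therefore about 66.3. A minimal sketch of that calculation:

    expected_kept = 100 * 34 / 101

    expected_difference = 100 - expected_kept

    round(expected_difference, 1)   # approximately 66.3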

    Note

    The output is susceptible to change since we have used random numbers.

    To access the source code for this specific section, please refer to https://packt.live/30VMjt3.

    You can also run this example online at https://packt.live/3eh0JIb.

With this, we have successfully completed our first activity. Let's move on to the next section, where we will discuss another type of data structure – sets.

Activity 1.02: Analyzing a Multiline String and Generating the Unique Word Count

Solution:

These are the steps to complete this activity:

  1. Open a new Jupyter Notebook, create a string called multiline_text, and copy the text present in the first chapter of Pride and Prejudice.

    Note

    Part of the first chapter of Pride and Prejudice by Jane Austen has been made available on this book's GitHub repository at https://packt.live/2N6ZGP6.

    Use Ctrl + A to select the entire text, then Ctrl + C to copy it, and use Ctrl + V to paste the text into the multiline_text string in your notebook:

    Figure 1.22: Initializing the multiline_text string


  2. Find the type of the string using the type function:

    type(multiline_text)

    The output is as follows:

    str

  3. Now, find the length of the string using the len function:

    len(multiline_text)

    The output is as follows:

    1228

  4. Use string methods to get rid of all the newline characters (\n or \r). Remove them by replacing them with an empty string:

    multiline_text = multiline_text.replace('\n', "").replace('\r', "")

  5. Then, we will print and check the output:

    multiline_text

    The output is as follows:

    Figure 1.23: The multiline_text string after removing the new lines


  6. Remove the special characters and punctuation:

    # remove special chars, punctuation etc.

    cleaned_multiline_text = ""

    for char in multiline_text:

        if char == " ":

            cleaned_multiline_text += char

        elif char.isalnum():  # using the isalnum() method of strings.

            cleaned_multiline_text += char

        else:

            cleaned_multiline_text += " "

  7. Check the content of cleaned_multiline_text:

    cleaned_multiline_text

    The output is as follows:

    Figure 1.24: The cleaned_multiline_text string


  8. Generate a list of all the words from the cleaned string using the following command:

    list_of_words = cleaned_multiline_text.split()

    list_of_words

    The section of the output is shown below:

    Figure 1.25: The section of output displaying the list_of_words


  9. Find the number of words:

    len(list_of_words)

    The output is 236.

  10. Create a list from the list you just created, which includes only unique words:

    unique_words_as_dict = dict.fromkeys(list_of_words)

    len(list(unique_words_as_dict.keys()))

    The output is 135.
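    An equivalent way to get the unique count is to build a set; a minimal sketch using list_of_words from the previous step:

    unique_words = set(list_of_words)

    len(unique_words)   # should also give 135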

  11. Count the number of times each of the unique words appeared in the cleaned text:

    for word in list_of_words:

        if unique_words_as_dict[word] is None:

            unique_words_as_dict[word] = 1

        else:

            unique_words_as_dict[word] += 1

    unique_words_as_dict

    The section of the output is shown below:

    Figure 1.26: Section of output showing unique_words_as_dict


    You just created, step by step, a unique word counter using all the neat tricks that you've learned about in this chapter.
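    For comparison, the standard library's collections.Counter builds the same word counts in a single call, and its most_common method previews the ranking we compute in the next step; a minimal sketch assuming list_of_words from step 8:

    from collections import Counter

    word_counts = Counter(list_of_words)

    word_counts.most_common(25)   # top 25 (word, count) pairs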

  12. Find the top 25 words from unique_words_as_dict:

    top_words = sorted(unique_words_as_dict.items(),

                       key=lambda key_val_tuple: key_val_tuple[1],

                       reverse=True)

    top_words[:25]

    The output (partially shown) is as follows:

    Figure 1.27: Top 25 unique words from multiline_text


Note

To access the source code for this specific section, please refer to https://packt.live/2ASNIWL.

You can also run this example online at https://packt.live/3dcIKkz.

2. Advanced Operations on Built-In Data Structures

Activity 2.01: Permutation, Iterator, Lambda, and List

Solution:

These are the detailed steps to solve this activity:

  1. Look up the definition of permutations and dropwhile from itertools. There is a way to look up the definition of a function inside Jupyter itself. Just type the function name, followed by ?, and press Shift + Enter:

    from itertools import permutations, dropwhile

    permutations?

    dropwhile?

    You will see a long list of definitions after each ?. We will skip it here.
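    If you are working outside Jupyter, the built-in help function prints the same docstrings; a quick equivalent:

    from itertools import permutations, dropwhile

    help(permutations)

    help(dropwhile)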

  2. Write an expression to generate all the possible arrangements of the three digits 0, 1, and 2:

    permutations(range(3))

    The output (which will vary in your case) is as follows:

    <itertools.permutations at 0x7f6c6c077af0>

  3. Loop over the iterator expression you generated before. Use the print method to print each element returned by the iterator. Use assert and isinstance to make sure that the elements are tuples:

    for number_tuple in permutations(range(3)):

        print(number_tuple)

        assert isinstance(number_tuple, tuple)

    The output is as follows:

    (0, 1, 2)

    (0, 2, 1)

    (1, 0, 2)

    (1, 2, 0)

    (2, 0, 1)

    (2, 1, 0)

  4. Write the loop again, but this time use dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [1, 2]. Also, cast the output of dropwhile to a list.

    An extra task can be to check the actual type that dropwhile returns without casting:

    for number_tuple in permutations(range(3)):

        print(list(dropwhile(lambda x: x <= 0, number_tuple)))

    The output is as follows:

    [1, 2]

    [2, 1]

    [1, 0, 2]

    [1, 2, 0]

    [2, 0, 1]

    [2, 1, 0]
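    For the extra task mentioned in this step, you can inspect the type that dropwhile returns before casting; a small sketch (the exact repr may differ):

    from itertools import dropwhile

    result = dropwhile(lambda x: x <= 0, (0, 1, 2))

    print(type(result))   # <class 'itertools.dropwhile'>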

  5. Write all the logic you wrote before, but this time write a separate function where you will be passing the list generated from dropwhile; the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require you to treat the incoming list as a stack in the function and generate the number there:

    import math

    def convert_to_number(number_stack):

        final_number = 0

        for i in range(0, len(number_stack)):

            final_number += (number_stack.pop()

                             * (math.pow(10, i)))

        return final_number

    for number_tuple in permutations(range(3)):

        number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))

        print(convert_to_number(number_stack))

    The output is as follows:

    12.0

    21.0

    102.0

    120.0

    201.0

    210.0

    Note

    To access the source code for this specific section, please refer to https://packt.live/37Gk9DT.

    You can also run this example online at https://packt.live/3hEWt7f.

Activity 2.02: Designing Your Own CSV Parser

Solution:

These are the detailed steps to solve this activity:

  1. Import zip_longest from itertools:

    from itertools import zip_longest

  2. Define the return_dict_from_csv_line function so that it takes a header and a line, zips them together with a fillvalue of None, and turns the result into a dictionary:

    def return_dict_from_csv_line(header, line):

        # Zip them

        zipped_line = zip_longest(header, line, fillvalue=None)

        # Use dict comprehension to generate the final dict

        ret_dict = {kv[0]: kv[1] for kv in zipped_line}

        return ret_dict
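    A quick, hypothetical check of the fillvalue behavior (the column names here are made up): passing a row that is shorter than the header pads the missing value with None:

    return_dict_from_csv_line(['Region', 'Units', 'Price'], ['Asia', '10'])

    # -> {'Region': 'Asia', 'Units': '10', 'Price': None}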

  3. Open the accompanying sales_record.csv file using r mode inside a with block. First, check that it is opened, read the first line, and use string methods to generate a list of all the column names as follows:

    with open("../datasets/sales_record.csv", "r") as fd:

    Note

    Don't forget to change the path (highlighted) based on where you have stored the csv file.

  4. When you read each line, pass it to the function along with the list of headers. The job of the function is to construct a dictionary out of these two and fill in the key:value pairs. Keep in mind that a missing value should result in None:

        first_line = fd.readline()

        header = first_line.replace(" ", "").split(",")

        for i, line in enumerate(fd):

            line = line.replace(" ", "").split(",")

            d = return_dict_from_csv_line(header, line)

            print(d)

            if i > 10:

                break

    The output (partially shown) is as follows:

    Figure 2.15: Section of output


Note

To access the source code for this specific section, please refer to https://packt.live/37FlVVK.

You can also run this example online at https://packt.live/2YepGyb.

3. Introduction to NumPy, Pandas, and Matplotlib

Activity 3.01: Generating Statistics from a CSV File

Solution:

These are the steps to complete this activity:

  1. Load the necessary libraries:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

  2. Read in the Boston Housing dataset (given as a .csv file) from the local directory:

    df=pd.read_csv("../datasets/Boston_housing.csv")

    Note

    Don't forget to change the path of the dataset (highlighted) based on where it is saved on your system.

  3. Check the first 10 records:

    df.head(10)

    The output is as follows:

    Figure 3.30: Output displaying the first 10 records


  4. Find the total number of records:

    df.shape

    The output is as follows:

    (506, 14)

  5. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT:

    df1=df[['CRIM','ZN','INDUS',

            'RM','AGE','DIS','RAD',

            'TAX','PTRATIO','PRICE']]

  6. Check the last 7 records of the new DataFrame you just created:

    df1.tail(7)

    The output is as follows:

    Figure 3.31: Last seven records of the DataFrame


  7. Plot histograms of all the variables (columns) in the new DataFrame by using a for loop:

    for c in df1.columns:

        plt.title("Plot of "+c,fontsize=15)

        plt.hist(df1[c],bins=20)

        plt.show()

    The output is as follows:

    Figure 3.32: Partial plot of all variables using a for loop


    Note

    To take a look at all the plots, head over to the following link: https://packt.live/2AGb95F.

    Crime rate could be an indicator of house price (people don't want to live in high-crime areas). Having multiple charts together can make it easy to analyze a variety of variables at once. In the preceding group of charts, we can see several distinct spikes in the data: INDUS, TAX, and RAD. With further exploratory analysis, we can find out more; for example, we might want to plot one variable against another.
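    Incidentally, pandas can draw histograms for every column in a single call; a one-line sketch that produces a similar grid of plots:

    df1.hist(bins=20, figsize=(15, 10))

    plt.show()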

  8. Create a scatter plot of crime rate versus price:

    plt.scatter(df1['CRIM'], df1['PRICE'])

    plt.show()

    The output is as follows:

    Figure 3.33: Scatter plot of crime rate versus price


    We can understand this relationship better if we plot log10(crime) versus price.

  9. Create a plot of log10(crime) versus price:

    plt.scatter(np.log10(df1['CRIM']),df1['PRICE'], c='red')

    plt.title("Crime rate (Log) vs. Price plot", fontsize=18)

    plt.xlabel("Log of Crime rate",fontsize=15)

    plt.ylabel("Price",fontsize=15)

    plt.grid(True)

    plt.show()

    The output is as follows:

    Figure 3.34: Scatter plot of crime rate (Log) versus price


  10. Calculate the mean rooms per dwelling:

    df1['RM'].mean()

    The output is as follows:

    6.284634387351788

  11. Calculate the median age:

    df1['AGE'].median()

    The output is as follows:

    77.5

  12. Calculate the average (mean) distances to five Boston employment centers:

    df1['DIS'].mean()

    The output is as follows:

    3.795042687747034

  13. Find the houses that are priced below 20 (that is, below $20,000):

    low_price=df1['PRICE']<20

    print(low_price)

    The output is as follows:

    Figure 3.35: Output of low_price


    This creates a Boolean series in which True counts as 1 and False counts as 0. If you take the average of this array, you will know what fraction of its values are True.
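    To see why taking the mean works, consider a tiny sketch with made-up values:

    import numpy as np

    flags = np.array([True, False, True, True])

    flags.mean()   # 0.75, i.e., 75% of the entries are True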

  14. Calculate the mean of this array:

    # That many houses are priced below 20,000.

    # So that is the answer.

    low_price.mean()

    The output is:

    0.4150197628458498

  15. Calculate the percentage of houses with a low price (< $20,000):

    # You can convert that into percentage

    # Do this by multiplying with 100

    pcnt=low_price.mean()*100

    print(" Percentage of house with <20,000 price is: ", pcnt)

    The output is as follows:

    Percentage of house with <20,000 price is: 41.50197628458498

    Note

    To access the source code for this specific section, please refer to https://packt.live/2AGb95F.

    You can also run this example online at https://packt.live/2YT3Hfg.

4. A Deep Dive into Data Wrangling with Python

Activity 4.01: Working with the Adult Income Dataset (UCI)

Solution:

These are the steps to complete this activity:

  1. Load the necessary libraries:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

  2. Read in the Adult Income Dataset (given as a .csv file) from the local directory and check the first five records:

    df = pd.read_csv("../datasets/adult_income_data.csv")

    df.head()

    Note

    The highlighted path must be changed based on the location of the file on your system.

    The output is as follows:

    Figure 4.76: DataFrame displaying the first five records from the .csv file


  3. Create a script that will read the names text file line by line and extract, from each line, the part before the colon, which is the column name:

    names = []

    with open('../datasets/adult_income_names.txt','r') as f:

        for line in f:

            f.readline()

            var=line.split(":")[0]

            names.append(var)

    names

    Note

    The highlighted path must be changed based on the location of the file on your system.

    The output is as follows:

    Figure 4.77: Names of the columns in the database


  4. Add a name of Income for the response variable (last column) to the dataset by using the append command:

    names.append('Income')

  5. Read the new file again using the following command:

    df = pd.read_csv("../datasets/adult_income_data.csv", names=names)

    df.head()

    Note

    The highlighted path must be changed based on the location of the file on your system.

    The output is as follows:

    Figure 4.78: DataFrame with the income column added


  6. Use the describe command to get the statistical summary of the dataset:

    df.describe()

    The output is as follows:

    Figure 4.79: Statistical summary of the dataset


    Note that only a small number of columns are included. Many variables in the dataset have multiple factors or classes.
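    By default, describe summarizes only the numeric columns. If you also want the categorical columns included, a one-line variation on the same df is:

    df.describe(include='all')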

  7. Make a list of all the variables in the classes by using the following command:

    # Make a list of all variables with classes

    vars_class = ['workclass','education','marital-status',

                  'occupation','relationship','sex','native-country']

  8. Create a loop to count and print them by using the following command:

    for v in vars_class:

        classes=df[v].unique()

        num_classes = df[v].nunique()

        print("There are {} classes in the '{}' column. "

              "They are: {}".format(num_classes,v,classes))

        print("-"*100)

    The output (partially shown) is as follows:

    Figure 4.80: Output of different factors or classes


  9. Find the missing values by using the following command:

    df.isnull().sum()

    The output is as follows:

    Figure 4.81: Finding the missing values


  10. Create a DataFrame with only age, education, and occupation by using subsetting:

    df_subset = df[['age','education', 'occupation']]

    df_subset.head()

    The output is as follows:

    Figure 4.82: Subset of the DataFrame


  11. Plot a histogram of age with a bin size of 20:

    df_subset['age'].hist(bins=20)

    The output is as follows:

    Figure 4.83: Histogram of age with a bin size of 20


  12. Plot box plots for age grouped by education (use a long figure size of 25x10 and make the x ticks font size 15):

    df_subset.boxplot(column='age',by='education',figsize=(25,10))

    plt.xticks(fontsize=15)

    plt.xlabel("Education",fontsize=20)

    plt.show()

    The output is as follows:

    Figure 4.84: Box plot of age grouped by education


    Before doing any further operations, we need to use the apply method we learned about in this chapter. It turns out that when reading the dataset from the CSV file, all the strings came with a whitespace character in front of them. So, we need to remove that whitespace from all the strings.

  13. Create a function to strip the whitespace characters:

    def strip_whitespace(s):

        return s.strip()

  14. Use the apply method to apply this function to all the columns with string values, create a new column, copy the values from this new column to the old column, and drop the new column. This is the preferred method so that you don't accidentally delete valuable data. Most of the time, you will need to create a new column with a desired operation and then copy it back to the old column if necessary. Ignore any warning messages that are printed:

    # Education column

    df_subset['education_stripped'] = (df['education']

                                       .apply(strip_whitespace))

    df_subset['education'] = df_subset['education_stripped']

    df_subset.drop(labels = ['education_stripped'],

                   axis=1,inplace=True)

    # Occupation column

    df_subset['occupation_stripped'] = (df['occupation']

                                        .apply(strip_whitespace))

    df_subset['occupation'] = df_subset['occupation_stripped']

    df_subset.drop(labels = ['occupation_stripped'],

                   axis=1,inplace=True)

    This is the sample warning message, which you should ignore:

    Figure 4.85: Warning message to be ignored
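    As an aside, pandas can do the same stripping without a custom function through the str accessor; a minimal sketch on the same two columns:

    df_subset['education'] = df_subset['education'].str.strip()

    df_subset['occupation'] = df_subset['occupation'].str.strip()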


  15. Find the number of people who are aged between 30 and 50 (inclusive) by using the following command:

    # Conditional clauses and join them by & (AND)

    df_filtered=df_subset[(df_subset['age']>=30)

                          & (df_subset['age']<=50)]

  16. Check the contents of the new dataset:

    df_filtered.head()

    The output is as follows:

    Figure 4.86: Contents of the new DataFrame


  17. Find the shape of the filtered DataFrame and specify the index of the tuple as 0 to return the first element:

    answer_1=df_filtered.shape[0]

    answer_1

    The output is as follows:

    16390

  18. Print the number of people aged between 30 and 50 using the following command:

    print("There are {} people of age between 30 and 50 "

          "in this dataset.".format(answer_1))

    The output is as follows:

    There are 16390 people of age between 30 and 50 in this dataset.

  19. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has its largest share of the workforce above the 75th percentile:

    df_subset.groupby('occupation').describe()['age']

    The output is as follows:

    Figure 4.87: DataFrame with data grouped by age and education


    The code returns 79 rows × 1 columns.

  20. Use subset and groupBy to find the outliers:

    occupation_stats=df_subset.groupby('occupation').describe()['age']

  21. Plot the values on a bar chart:

    plt.figure(figsize=(15,8))

    plt.barh(y=occupation_stats.index,

             width=occupation_stats['count'])

    plt.yticks(fontsize=13)

    plt.show()

    The output is as follows:

    Figure 4.88: Bar chart displaying occupation statistics


    Is there a particular occupation group with very low representation? Perhaps we should remove that data because, with so few records, the group won't be useful in any analysis. Just by looking at Figure 4.88, you should be able to see that the Armed-Forces group has only 9 data points. But how can we detect this? By plotting the count column in a bar chart. Note how the y argument of the barh function is the index of the DataFrame, which holds the summary statistics of the occupation groups, and how the Armed-Forces group has almost no data. This activity teaches you that, sometimes, an outlier is not just a single value but can be a whole group: the data in the group is fine, but the group is too small to be useful for any analysis, so it can be treated as an outlier here. Always use your business knowledge and engineering judgment for this kind of outlier detection and for deciding how to process such groups. A sketch of dropping such a group programmatically follows; after that, we will practice merging two datasets using a common key.
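    The cutoff of 20 rows below is an arbitrary assumption; occupation_stats is the summary DataFrame computed in step 20:

    rare_groups = occupation_stats[occupation_stats['count'] < 20].index

    df_trimmed = df_subset[~df_subset['occupation'].isin(rare_groups)]

    df_trimmed['occupation'].nunique()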

  22. Suppose you are given two datasets where the common key is occupation. First, create two such disjoint datasets by taking random samples from the full dataset and then try merging. Include at least two other columns, along with the common key column for each dataset. Notice how the resulting dataset, after merging, may have more data points than either of the two starting datasets if your common key is not unique:

    df_1 = (df[['age','workclass','occupation']]

            .sample(5,random_state=101))

    df_1.head()

    The output is as follows:

    Figure 4.89: Output after merging the common keys


    The second dataset is as follows:

    df_2 = df[['education','occupation']].sample(5,random_state=101)

    df_2.head()

    The output is as follows:

    Figure 4.90: Output after merging the common keys


  23. Merge the two datasets together:

    df_merged = pd.merge(df_1,df_2,on='occupation',

                         how='inner').drop_duplicates()

    df_merged

    The output is as follows:

    Figure 4.91: Output of distinct occupation values


Note

To access the source code for this specific section, please refer to https://packt.live/37IamwR.

You can also run this example online at https://packt.live/2YhuF1j.

5. Getting Comfortable with Different Kinds of Data Sources

Activity 5.01: Reading Tabular Data from a Web Page and Creating DataFrames

Solution:

These are the steps to complete this activity:

  1. Import BeautifulSoup and load the data by using the following command:

    from bs4 import BeautifulSoup

    import pandas as pd

  2. Open the Wikipedia file by using the following command:

    fd = open("../datasets/List of countries by GDP (nominal) "

              "- Wikipedia.htm", "r", encoding = "utf-8")

    soup = BeautifulSoup(fd)

    fd.close()

    Note

    Don't forget to change the path of the dataset (highlighted) based on its location on your system

  3. Find all the tables on the page and count them by using the following command:

    all_tables = soup.find_all("table")

    print("Total number of tables are {} ".format(len(all_tables)))

    There are nine tables in total.

  4. Find the right table using the class attribute by using the following command:

    data_table = soup.find("table", {"class": '"wikitable"|}'})

    print(type(data_table))

    The output is as follows:

    <class 'bs4.element.Tag'>

  5. Separate the source and the actual data by using the following command:

    sources = data_table.tbody.findAll('tr', recursive=False)[0]

    sources_list = [td for td in sources.findAll('td')]

    print(len(sources_list))

    The output is as follows:

    3

  6. Use the findAll function to find the data from the body tag of data_table using the following command:

    data = (data_table.tbody.findAll('tr', recursive=False)[1]

            .findAll('td', recursive=False))

  7. Use the findAll function to find the data from the data_table td tag by using the following command:

    data_tables = []

    for td in data:

        data_tables.append(td.findAll('table'))

  8. Find the length of data_tables by using the following command:

    len(data_tables)

    The output is as follows:

    3

  9. Check how to get the source names by using the following command:

    source_names = [source.findAll('a')[0].getText()

                    for source in sources_list]

    print(source_names)

    The output is as follows:

    ['International Monetary Fund', 'World Bank', 'United Nations']

  10. Separate the header and data for the first source:

    header1 = [th.getText().strip() for th in

               data_tables[0][0].findAll('thead')[0].findAll('th')]

    header1

    The output is as follows:

    ['Rank', 'Country', 'GDP(US$MM)']

  11. Find the rows from data_tables using findAll:

    rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]

  12. Find the data from rows1 using the strip function for each td tag:

    data_rows1 = [[td.get_text().strip() for td in

                   tr.findAll('td')] for tr in rows1]

  13. Find the DataFrame:

    df1 = pd.DataFrame(data_rows1, columns=header1)

    df1.head()

    The output is as follows:

    Figure 5.35: DataFrame created from the web page


  14. Do the same for the other two sources by using the following command:

    header2 = [th.getText().strip() for th in data_tables[1][0]

               .findAll('thead')[0].findAll('th')]

    header2

    The output is as follows:

    ['Rank', 'Country', 'GDP(US$MM)']

  15. Find the rows from data_tables using findAll by using the following command:

    rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')

  16. Define find_right_text using the strip function by using the following command:

    def find_right_text(i, td):

        if i == 0:

            return td.getText().strip()

        elif i == 1:

            return td.getText().strip()

        else:

            index = td.text.find("♠")

            return td.text[index+1:].strip()

  17. Find the rows from data_rows using find_right_text by using the following command:

    data_rows2 = [[find_right_text(i, td) for i, td in

                   enumerate(tr.findAll('td'))] for tr in rows2]

  18. Calculate the df2 DataFrame by using the following command:

    df2 = pd.DataFrame(data_rows2, columns=header2)

    df2.head()

    The output is as follows:

    Figure 5.36: Output of the DataFrame


  19. Now, perform the same operations for the third DataFrame by using the following command:

    header3 = [th.getText().strip() for th in data_tables[2][0]

               .findAll('thead')[0].findAll('th')]

    header3

    The output is as follows:

    ['Rank', 'Country', 'GDP(US$MM)']

  20. Find the rows from data_tables using findAll by using the following command:

    rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')

  21. Find the rows from data_rows3 by using find_right_text:

    data_rows3 = [[find_right_text(i, td) for i, td in

                   enumerate(tr.findAll('td'))] for tr in rows3]

  22. Calculate the df3 DataFrame by using the following command:

    df3 = pd.DataFrame(data_rows3, columns=header3)

    df3.head()

    The output is as follows:

    Figure 5.37: The third DataFrame
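    As an aside, pandas can often parse simple HTML tables directly with read_html (it needs lxml or html5lib installed); a minimal sketch against the same saved page, although the tables it finds may not line up exactly with the manual approach above:

    tables = pd.read_html("../datasets/List of countries by GDP (nominal) "

                          "- Wikipedia.htm")

    len(tables)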


Note

To access the source code for this specific section, please refer to https://packt.live/2NaCrDB.

You can also run this example online at https://packt.live/2YRAukP.

6. Learning the Hidden Secrets of Data Wrangling

Activity 6.01: Handling Outliers and Missing Data

Solution:

The steps to completing this activity are as follows:

Note

The dataset to be used for this activity can be found at https://packt.live/2YajrLJ.

  1. Load the data:

    import pandas as pd

    import numpy as np

    import matplotlib.pyplot as plt

    %matplotlib inline

  2. Read the .csv file:

    df = pd.read_csv("../datasets/visit_data.csv")

    Note

    Don't forget to change the path (highlighted) based on where the CSV file is saved on your system.

  3. Print the data from the DataFrame:

    df.head()

    The output is as follows:

    Figure 6.11: The contents of the CSV file


    As we can see, there is data where some values are missing, and if we examine this, we will see some outliers.

  4. Check for duplicates by using the following command:

    print("First name is duplicated - {}"

          .format(any(df.first_name.duplicated())))

    print("Last name is duplicated - {}"

          .format(any(df.last_name.duplicated())))

    print("Email is duplicated - {}"

          .format(any(df.email.duplicated())))

    The output is as follows:

    First name is duplicated - True

    Last name is duplicated - True

    Email is duplicated - False

    There are duplicates in both the first and last names, which is normal. However, as we can see, there are no duplicates in email. That's good.
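    If you want counts rather than a yes/no answer, duplicated can be summed; a small sketch on the same DataFrame:

    print(df.first_name.duplicated().sum())   # number of repeated first names

    print(df.email.duplicated().sum())        # expected to be 0 here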

  5. Check whether any essential column contains NaN:

    """

    Notice that we have different ways to

    format boolean values for the % operator

    """

    print("The column Email contains NaN - %r " %

          df.email.isnull().values.any())

    print("The column IP Address contains NaN - %s " %

          df.ip_address.isnull().values.any())

    print("The column Visit contains NaN - %s " %

          df.visit.isnull().values.any())

    The output is as follows:

    The column Email contains NaN - False

    The column IP Address contains NaN - False

    The column Visit contains NaN - True

    The Visit column contains some NaN values. Given that the final task at hand will probably be predicting the number of visits, we cannot do anything with rows that do not have that information. They are a type of outlier. Let's get rid of them.

  6. Get rid of the outliers:

    """

    There are various ways to do this. This is just one way. We encourage you to explore other ways. But before that we need to store the previous size of the data set and we will compare it with the new size

    """

    size_prev = df.shape

    df = df[np.isfinite(df['visit'])]

    #This is an inplace operation.

    # After this operation the original DataFrame is lost.

    size_after = df.shape
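    An equivalent, arguably more idiomatic, pandas approach is dropna; a minimal sketch that could replace the np.isfinite filter above:

    df = df.dropna(subset=['visit'])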

  7. Report the size difference:

    # Notice how parameterized format is used.

    # Then, the indexing is working inside the quote marks

    print("The size of previous data was - {prev[0]} rows and "

          "the size of the new one is - {after[0]} rows"

          .format(prev=size_prev, after=size_after))

    The output is as follows:

    The size of previous data was - 1000 rows and the size of the new one is - 974 rows

  8. Plot a box plot to find whether the data has outliers:

    plt.boxplot(df.visit, notch=True)

    The box plot is as follows:

    Figure 6.12: Box plot using the data


    As we can see, we have data in this column in the interval (0, 3000). However, the main concentration of the data is between ~700 and ~2300.
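    You can back up that visual impression numerically; a quick sketch of the quartiles:

    df['visit'].quantile([0.25, 0.5, 0.75])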

  9. Get rid of values above 2900 and below 100, as these are outliers for us:

    df1 = df[(df['visit'] <= 2900) & (df['visit'] >= 100)]

    # Notice the powerful & operator

    """

    Here we abuse the fact that the

    number of variables can be greater

    than the number of replacement targets

    """

    print("After getting rid of outliers the new size of the data "

          "is - {}".format(*df1.shape))

    The output is as follows:

    After getting rid of outliers the new size of the data is - 923

    Note

    To access the source code for this specific section, please refer to https://packt.live/2AFcSbn.

    You can also run this example online at https://packt.live/3fAL9qY

7. Advanced Web Scraping and Data Gathering

Activity 7.01: Extracting the Top 100 e-books from Gutenberg

Solution:

These are the steps to complete this activity:

  1. Import the necessary libraries, including regex and BeautifulSoup:

    import urllib.request, urllib.parse, urllib.error

    import requests

    from bs4 import BeautifulSoup

    import ssl

    import re

  2. Read the HTML from the URL:

    top100url = 'https://www.gutenberg.org/browse/scores/top'

    response = requests.get(top100url)

  3. Write a small function to check the status of the web request:

    def status_check(r):

        if r.status_code==200:

            print("Success!")

            return 1

        else:

            print("Failed!")

            return -1

  4. Check the status of response:

    status_check(response)

    The output is as follows:

    Success!

    1

  5. Decode the response and pass it on to BeautifulSoup for HTML parsing:

    contents = response.content.decode(response.encoding)

    soup = BeautifulSoup(contents, 'html.parser')

  6. Find all the href tags and store them in the list of links.

    # Empty list to hold all the http links in the HTML page

    lst_links=[]

    # Find all href tags and store them in the list of links

    for link in soup.find_all('a'):

        #print(link.get('href'))

        lst_links.append(link.get('href'))

  7. Check what the list looks like – print the first 30 elements:

    lst_links[:30]

    The output (partially shown) is as follows:

    ['/wiki/Main_Page',

    '/catalog/',

    '/ebooks/',

    '/browse/recent/last1',

    '/browse/scores/top',

    '/wiki/Gutenberg:Offline_Catalogs',

    '/catalog/world/mybookmarks',

    '/wiki/Main_Page',

    'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',

    '/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation',

    'http://www.ibiblio.org',

    'http://www.pgdp.net/',

    'pretty-pictures',

    '#books-last1',

    '#authors-last1',

    '#books-last7',

    '#authors-last7',

    '#books-last30',

    '#authors-last30',

    '/ebooks/1342',

    '/ebooks/84',

    '/ebooks/1080',

    '/ebooks/46',

    '/ebooks/219',

    '/ebooks/2542',

    '/ebooks/98',

    '/ebooks/345',

    '/ebooks/2701',

    '/ebooks/844',

    '/ebooks/11']

  8. Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 books. Initialize the empty list to hold the file numbers:

    booknum=[]

    Numbers 19 to 118 in the original list of links have the top 100 eBooks' numbers.

  9. Loop over the appropriate range and use a regex to find the numeric digits in the link (href) string. Use the findall() method:

    for i in range(19,119):

        link=lst_links[i]

        link=link.strip()

        """

        Regular expression to find the numeric digits in the link (href) string

        """

        n=re.findall('[0-9]+',link)

        if len(n)==1:

            # Append the filenumber casted as integer

            booknum.append(int(n[0]))

  10. Print the file numbers:

    print(" The file numbers for the top 100 ebooks",

          "on Gutenberg are shown below "+"-"*70)

    print(booknum)

    The output is as follows:

    The file numbers for the top 100 ebooks on Gutenberg are shown below

    ----------------------------------------------------------------------

    [1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200,

    43, 16328, 76, 74, 1952, 6130, 2591, 1661, 41, 174, 23, 1260,

    1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158,

    44881, 1322, 205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901,

    28054, 829, 33, 2814, 4300, 100, 55, 160, 1404, 786, 58267, 3600,

    19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825,

    779, 58262, 203, 730, 20203, 35, 1250, 45, 161, 30360, 7370,

    58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270, 521,

    58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]

    Note

    Since the list of top 100 books is frequently updated, the output you get will vary.

    What does the soup object's text look like?

  11. Use the .text method and print only the first 2,000 characters (do not print the whole thing as it is too long).

    You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature:

    print(soup.text[:2000])

    The output is as follows:

    if (top != self) {

             top.location.replace (http://www.gutenberg.org);

             alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');

             }

        Top 100 - Project Gutenberg

    Online Book Catalog

    Book  Search

    -- Recent  Books

    -- Top  100

    -- Offline Catalogs

    -- My Bookmarks

    Main Page

    Pretty Pictures

    Top 100 EBooks yesterday —

      Top 100 Authors yesterday —

      Top 100 EBooks last 7 days —

      Top 100 Authors last 7 days —

      Top 100 EBooks last 30 days —

      Top 100 Authors last 30 days

    Top 100 EBooks yesterday

    Pride and Prejudice by Jane Austen (1826)

    Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1367)

    A Modest Proposal by Jonathan Swift (1020)

    A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)

    Heart of Darkness by Joseph Conrad (887)

    Et dukkehjem. English by Henrik Ibsen (761)

    A Tale of Two Cities by Charles Dickens (741)

    Dracula by Bram Stoker (732)

    Moby Dick; Or, The Whale by Herman Melville (651)

    The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)

    Alice's Adventures in Wonderland by Lewis Carrol

  12. Search the extracted text (using regex) from the soup object to find the names of the top 100 eBooks (yesterday's ranking):

    lst_titles_temp=[]

  13. Create a starting index. It should point at the text Top 100 EBooks yesterday. Use the splitlines method of soup.text, which splits the text of the soup object into lines:

    start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')

    Note

    Since the list of top 100 books is frequently updated, the output you get will vary.

  14. Run a for loop 100 times to add the strings of the next 100 lines to this temporary list. Hint: use the splitlines method:

    for i in range(100):

        lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])

  15. Use regex to extract only text from the name strings and append them to an empty list. Use match and span to find the indices and use them:

    lst_titles=[]

    for i in range(100):

        id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()

        lst_titles.append(lst_titles_temp[i][id1:id2])

  16. Print the list of titles:

    for l in lst_titles:

        print(l)

    The partial output is as follows:

    Pride and Prejudice by Jane Austen

    Frankenstein

    A Modest Proposal by Jonathan Swift

    A Christmas Carol in Prose

    Heart of Darkness by Joseph Conrad

    Et dukkehjem

    A Tale of Two Cities by Charles Dickens

    Dracula by Bram Stoker

    Moby Dick

    The Importance of Being Earnest

    Alice

    Metamorphosis by Franz Kafka

    The Strange Case of Dr

    Beowulf

    The Russian Army and the Japanese War

    Calculus Made Easy by Silvanus P

    Beyond Good and Evil by Friedrich Wilhelm Nietzsche

    An Occurrence at Owl Creek Bridge by Ambrose Bierce

    Don Quixote by Miguel de Cervantes Saavedra

    Blue Jackets by Edward Greey

    The Life and Adventures of Robinson Crusoe by Daniel Defoe

    The Waterloo Campaign

    The War of the Worlds by H

    Democracy in America

    Songs of Innocence

    The Confessions of St

    Modern French Masters by Marie Van Vorst

    Persuasion by Jane Austen

    The Works of Edgar Allan Poe

    The Fall of the House of Usher by Edgar Allan Poe

    The Masque of the Red Death by Edgar Allan Poe

    The Lady with the Dog and Other Stories by Anton Pavlovich Chekhov

    Note

    Since the list of top 100 books is frequently updated, the output you get will vary.

Here, we have seen how we can use web scraping and parsing using a mix of BeautifulSoup and regex to find information from very untidy and vast source data. These are some essential steps that you will have to perform on a daily basis when you are dealing with data wrangling.

Note

To access the source code for this specific section, please refer to https://packt.live/2BltmFo.

You can also run this example online at https://packt.live/37FdLwD.

Activity 7.02: Building Your Own Movie Database by Reading an API

Solution:

Note

Before you begin, ensure that you modify the APIkeys.json file and add your secret API key there. Link to the file: https://packt.live/2CmIpze.

These are the steps to complete this activity:

  1. Import urllib.request, urllib.parse, urllib.error, and json:

    import urllib.request, urllib.parse, urllib.error

    import json

  2. Load the secret API key (you have to get one from the OMDb website and use that; it has a daily limit of 1,000 requests) from a JSON file stored in the same folder into a variable, by using json.load():

    Note

    The following cell will not be executed in the solution notebook because the author cannot give out their private API key.

    The students/users will need to obtain a key and store it in a JSON file. We are calling this file APIkeys.json.

  3. Open the APIkeys.json file by using the following command:

    with open('APIkeys.json') as f:

        keys = json.load(f)

        omdbapi = keys['OMDBapi']

    The final URL to be passed should look like this: http://www.omdbapi.com/?t=movie_name&apikey=secretapikey.
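    The movie-name portion of that query string is what urllib.parse.urlencode will build for us later; a tiny illustration:

    import urllib.parse

    urllib.parse.urlencode({'t': 'Titanic'})   # 't=Titanic'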

  4. Assign the OMDb portal (http://www.omdbapi.com/?) as a string to a variable called serviceurl by using the following command:

    serviceurl = 'http://www.omdbapi.com/?'

  5. Create a variable called apikey with the last portion of the URL (&apikey=secretapikey), where secretapikey is your own API key. The movie name portion is t=movie_name, and it will be addressed later:

    apikey = '&apikey='+omdbapi

  6. Write a utility function called print_json to print the movie data from a JSON file (which we will get from the portal). Here are the keys of a JSON file: 'Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language','Country', 'Awards', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', and 'imdbID':

    def print_json(json_data):

        list_keys = ['Title', 'Year', 'Rated', 'Released',

                     'Runtime', 'Genre', 'Director', 'Writer',

                     'Actors', 'Plot', 'Language', 'Country',

                     'Awards', 'Ratings','Metascore', 'imdbRating',

                     'imdbVotes', 'imdbID']

        print("-"*50)

        for k in list_keys:

            if k in list(json_data.keys()):

                print(f"{k}: {json_data[k]}")

        print("-"*50)

  7. Write a utility function to download a poster of the movie based on the information from the JSON dataset and save it in your local folder. Use the os module. The poster URL is stored in the JSON key Poster. You may want to split the poster filename and extract the file extension only. Let's say that the extension is .jpg. We can later join this extension to the movie name to create a filename such as movie.jpg. Use the Python open function to open a file and write the poster data, and close the file after you're done. This function doesn't need to return anything; it just saves the poster data as an image file:

    def save_poster(json_data):

        import os

        title = json_data['Title']

        poster_url = json_data['Poster']

        """

        Splits the poster url by '.' and

        picks up the last string as file extension

        """

        poster_file_extension=poster_url.split('.')[-1]

        # Reads the image file from web

        poster_data = urllib.request.urlopen(poster_url).read()

        # Build the path to a 'Posters' subfolder of the current working directory

        savelocation = os.getcwd() + os.sep + 'Posters' + os.sep

        """

        Creates new directory if the directory does not exist.

        Otherwise, just use the existing path.

        """

        if not os.path.isdir(savelocation):

            os.mkdir(savelocation)

        filename = savelocation + str(title) + '.' + poster_file_extension

        f=open(filename,'wb')

        f.write(poster_data)

        f.close()

  8. Write a utility function called search_movie to search for a movie by its name, print the downloaded JSON data (use the print_json function for this), and save the movie poster in the local folder (use the save_poster function for this). Use a try-except block for this, that is, try to connect to the web portal. If successful, proceed, but if not (that is, if an exception is raised), then just print an error message. Use the previously created variables, serviceurl and apikey. You have to pass a dictionary with a key, t, and the movie name as the corresponding value to the urllib.parse.urlencode function and then add the serviceurl and apikey variables to the output of the function to construct the full URL. This URL will be used to access the data. The JSON data has a key called Response. If it is True, that means that the read was successful. Check this before processing the data. If it was not successful, then print the JSON key Error, which will contain the appropriate error message that's returned by the movie database:

    def search_movie(title):

        try:

            url = (serviceurl

                   + urllib.parse.urlencode({'t':str(title)})+apikey)

            print(f'Retrieving the data of "{title}" now... ')

            print(url)

            uh = urllib.request.urlopen(url)

            data = uh.read()

            json_data=json.loads(data)

            if json_data['Response']=='True':

                print_json(json_data)

                """

                Asks user whether to download the poster of the movie

                """

                if json_data['Poster']!='N/A':

                    save_poster(json_data)

            else:

                print("Error encountered: ", json_data['Error'])

        except urllib.error.URLError as e:

            print(f"ERROR: {e.reason}")

  9. Test the search_movie function by entering Titanic:

    search_movie("Titanic")

    The following is the retrieved data for Titanic:

    http://www.omdbapi.com/?t=Titanic&apikey=<your api key>

    --------------------------------------------------

    Title: Titanic

    Year: 1997

    Rated: PG-13

    Released: 19 Dec 1997

    Runtime: 194 min

    Genre: Drama, Romance

    Director: James Cameron

    Writer: James Cameron

    Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates

    Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.

    Language: English, Swedish

    Country: USA

    Awards: Won 11 Oscars. Another 111 wins & 77 nominations.

    Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]

    Metascore: 75

    imdbRating: 7.8

    imdbVotes: 913,780

    imdbID: tt0120338

    --------------------------------------------------

  10. Test the search_movie function by entering Random_error (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly):

    search_movie("Random_error")

    Retrieve the data of Random_error:

    Retrieving the data of "Random_error" now...

    http://www.omdbapi.com/?t=Random_error&apikey=<your api key>

    Error encountered: Movie not found!

    Note

    In the last two steps, we've not shown the private API key (highlighted) for security reasons.

    To access the source code for this specific section, please refer to https://packt.live/3hLJvoy.

    You can also run this example online at https://packt.live/3efkDTZ.

8. RDBMS and SQL

Activity 8.01: Retrieving Data Accurately from Databases

Solution:

These are the steps to complete this activity:

  1. Connect to the supplied petsdb database:

    import sqlite3

    conn = sqlite3.connect("petsdb")

  2. Write a function to check whether the connection has been successful:

    # a tiny function to make sure the connection is successful

    def is_opened(conn):

        try:

            conn.execute("SELECT * FROM persons LIMIT 1")

            return True

        except sqlite3.ProgrammingError as e:

            print("Connection closed {}".format(e))

            return False

    print(is_opened(conn))

    The output is as follows:

    True

  3. Close the connection:

    conn.close()

  4. Check whether the connection is open or closed:

    print(is_opened(conn))

    The output is as follows:

    Connection closed Cannot operate on a closed database.

    False

  5. Connect to the petsdb database:

    conn = sqlite3.connect("petsdb")

    c = conn.cursor()

  6. Find out the different age groups in the persons table. Execute the following command:

    for ppl, age in c.execute("SELECT count(*), age FROM persons "

                              "GROUP BY age"):

        print("We have {} people aged {}".format(ppl, age))

    The output is as follows:

    Figure 8.13: Section of output grouped by age


  7. To find out which age group has the maximum number of people, execute the following command:

    for ppl, age in c.execute("SELECT count(*), age FROM persons "

                              "GROUP BY age ORDER BY count(*) DESC"):

        print("The highest number of people is {} and "

              "came from {} age group".format(ppl, age))

        break

    The output is as follows:

    The highest number of people is 5 and came from 73 age group

  8. To find out how many people do not have a full name (the last name is blank/null), execute the following command:

    res = c.execute("SELECT count(*) FROM persons "

                    "WHERE last_name IS null")

    for row in res:

        print(row)

    The output is as follows:

    (60,)

  9. To find out how many people have more than one pet, execute the following command:

    res = c.execute("SELECT count(*) FROM "

                    "(SELECT count(owner_id) FROM pets "

                    "GROUP BY owner_id HAVING count(owner_id) >1)")

    for row in res:

        print("{} people have more than one pets".format(row[0]))

    The output is as follows:

    43 People have more than one pets

  10. To find out how many pets have received treatment, execute the following command:

    res = c.execute("SELECT count(*) FROM pets "

                    "WHERE treatment_done=1")

    for row in res:

        print(row)

    The output is as follows:

    (36,)

  11. To find out how many pets have received treatment and the type of pet is known, execute the following command:

    res = c.execute("SELECT count(*) FROM pets "

                    "WHERE treatment_done=1 AND pet_type IS NOT null")

    for row in res:

        print(row)

    The output is as follows:

    (16,)

  12. To find out how many pets are from the city called east port, execute the following command:

    res = c.execute("SELECT count(*) FROM pets "

                    "JOIN persons ON pets.owner_id = persons.id "

                    "WHERE persons.city='east port'")

    for row in res:

        print(row)

    The output is as follows:

    (49,)

  13. To find out how many pets are from the city called east port and who received treatment, execute the following command:

    res = c.execute("SELECT count(*) FROM pets "

                    "JOIN persons ON pets.owner_id = persons.id "

                    "WHERE persons.city='east port' "

                    "AND pets.treatment_done=1")

    for row in res:

        print(row)

    The output is as follows:

    (11,)
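    If you prefer to work with the results in pandas, the same join can be loaded into a DataFrame; a minimal sketch assuming the conn connection from above is still open:

    import pandas as pd

    query = ("SELECT pets.*, persons.city FROM pets "

             "JOIN persons ON pets.owner_id = persons.id "

             "WHERE persons.city='east port' AND pets.treatment_done=1")

    df_pets = pd.read_sql_query(query, conn)

    df_pets.head()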

    Note

    To access the source code for this specific section, please refer to https://packt.live/3derN9D.

    You can also run this example online at https://packt.live/2ASWYKi.

9. Applications in Business Use Cases and Conclusion of the Course

Activity 9.01: Data Wrangling Task – Fixing UN Data

Solution:

These are the steps to complete this activity:

  1. Import the required libraries:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

    import warnings

    warnings.filterwarnings('ignore')

  2. Save the URL of the dataset and pass it directly to the pandas read_csv method to create a DataFrame:

    education_data_link = ("http://data.un.org/_Docs/SYB/CSV/"

                           "SYB61_T07_Education.csv")

    df1 = pd.read_csv(education_data_link)

  3. Print the data in the DataFrame:

    df1.head()

    The output (partially shown) is as follows:

    Figure 9.7: Partial DataFrame from the UN data

    Figure 9.7: Partial DataFrame from the UN data

  4. As the first row does not contain useful information, use the skiprows parameter to remove the first row:

    df1 = pd.read_csv(education_data_link,skiprows=1)

  5. Print the data in the DataFrame:

    df1.head()

    The output is as follows:

    Figure 9.8: Partial DataFrame after removing the first row

    Figure 9.8: Partial DataFrame after removing the first row

  6. Drop the Region/Country/Area and Source columns as they will not be very helpful:

    df2 = df1.drop(['Region/Country/Area','Source'],axis=1)

  7. Assign the following names as the columns of the DataFrame: ['Region/Country/Area','Year','Data','Enrollments (Thousands)','Footnotes']:

    df2.columns=['Region/Country/Area','Year','Data',

                 'Enrollments (Thousands)','Footnotes']

  8. Print the data in the DataFrame:

    df2.head()

    The output is as follows:

    Figure 9.9: DataFrame after dropping the Region/Country/Area and Source columns

    Figure 9.9: DataFrame after dropping the Region/Country/Area and Source columns

  9. Check how many unique values the Footnotes column contains:

    df2['Footnotes'].unique()

    The output is as follows:

    Figure 9.10: Unique values of the Footnotes column

    Figure 9.10: Unique values of the Footnotes column

  10. Check the data type of the Enrollments (Thousands) column; we will need to convert it into a numeric type for further processing:

    type(df2['Enrollments (Thousands)'][0])

    The output is as follows:

    str

  11. Create a utility function to convert the strings in the Enrollments (Thousands) column into floating-point numbers:

    def to_numeric(val):

        """

        Converts a given string (with one or more commas) to a numeric value

        """

        if ',' not in str(val):

            result = float(val)

        else:

            val=str(val)

            val=''.join(str(val).split(','))

            result=float(val)

        return result
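
    For reference, the same conversion can be done without a custom function by stripping the commas with pandas string methods and then using pd.to_numeric. This is an optional sketch that should be equivalent to the to_numeric helper defined above:

    # Remove the thousands separators and let pandas parse the result as numbers
    df2['Enrollments (Thousands)'] = pd.to_numeric(
        df2['Enrollments (Thousands)'].astype(str).str.replace(',', ''),
        errors='coerce')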

  12. Use the apply method to apply this function to the Enrollments (Thousands) column data:

    df2['Enrollments (Thousands)'] = (df2['Enrollments (Thousands)']

                                      .apply(to_numeric))

  13. Print the unique types of data in the Data column:

    df2['Data'].unique()

    The output is as follows:

    Figure 9.11: Unique values in a column

    Figure 9.11: Unique values in a column

  14. Create three DataFrames by filtering and selecting them from the original DataFrame:

    df_primary: Only students enrolled in primary education (thousands)

    df_secondary: Only students enrolled in secondary education (thousands)

    df_tertiary: Only students enrolled in tertiary education (thousands):

    df_primary = df2[df2['Data']=='Students enrolled in primary '

                                  'education (thousands)']

    df_secondary = df2[df2['Data']=='Students enrolled in secondary '

                                    'education (thousands)']

    df_tertiary = df2[df2['Data']=='Students enrolled in tertiary '

                                   'education (thousands)']

  15. Select the primary enrollment data for a low-income country (India) and a high-income country (the USA) so that we can compare them using bar charts:

    primary_enrollment_india = df_primary[df_primary

                               ['Region/Country/Area']=='India']

    primary_enrollment_USA = df_primary[df_primary

                             ['Region/Country/Area']

                             =='United States of America']

  16. Print the primary_enrollment_india data:

    primary_enrollment_india

    The output is as follows:

    Figure 9.12: Data for enrollment in primary education in India

    Figure 9.12: Data for enrollment in primary education in India

  17. Print the primary_enrollment_USA data:

    primary_enrollment_USA

    The output is as follows:

    Figure 9.13: Data for enrollment in primary education in USA

    Figure 9.13: Data for enrollment in primary education in USA

  18. Plot the data for India:

    plt.figure(figsize=(8,4))

    plt.bar(primary_enrollment_india['Year'],

    primary_enrollment_india['Enrollments (Thousands)'])

    plt.title("Enrollment in primary education in India "

              "(in thousands)",fontsize=16)

    plt.grid(True)

    plt.xticks(fontsize=14)

    plt.yticks(fontsize=14)

    plt.xlabel("Year", fontsize=15)

    plt.show()

    The output is as follows:

    Figure 9.14: Bar plot for enrollment in primary education in India

    Figure 9.14: Bar plot for enrollment in primary education in India

  19. Plot the data for the USA:

    plt.figure(figsize=(8,4))

    plt.bar(primary_enrollment_USA['Year'],

    primary_enrollment_USA['Enrollments (Thousands)'])

    plt.title("Enrollment in primary education in the "

              "United States of America (in thousands)",fontsize=16)

    plt.grid(True)

    plt.xticks(fontsize=14)

    plt.yticks(fontsize=14)

    plt.xlabel("Year", fontsize=15)

    plt.show()

    The output is as follows:

    Figure 9.15: Bar plot for enrollment in primary education in the USA

    Figure 9.15: Bar plot for enrollment in primary education in the USA

    As we can see, we have missing data. Now is the time to use pandas methods to do the data imputation. But to do that, we need to create a DataFrame with missing values inserted into it – that is, we need to append another DataFrame with missing values to the current DataFrame.
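
    Before hard-coding the missing years in the next step, you can also compute them from the data itself. A small optional sketch, assuming we care about the 2003–2016 range used in this activity and that the Year column holds integers:

    # Years we expect to see, minus the years actually present in the data
    expected_years = set(range(2003, 2017))
    present_years = set(primary_enrollment_india['Year'])
    sorted(expected_years - present_years)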

  20. Find the missing years:

    missing_years = ([y for y in range(2004, 2010)]

                     + [y for y in range(2011, 2014)])

  21. Print the value in the missing_years variable:

    missing_years

    The output is as follows:

    [2004, 2005, 2006, 2007, 2008, 2009, 2011, 2012, 2013]

  22. Create a dictionary for the missing rows, using np.nan for the unknown values. Note that there are nine missing years, so each value in the dictionary must be a list that repeats its entry nine times:

    dict_missing = {'Region/Country/Area': ['India']*9,

                    'Year': missing_years,

                    'Data': ['Students enrolled in primary education (thousands)']*9,

                    'Enrollments (Thousands)': [np.nan]*9,

                    'Footnotes': [np.nan]*9}

  23. Create a DataFrame of missing values (from the preceding dictionary) that we can append:

    df_missing = pd.DataFrame(data=dict_missing)

  24. Append the new DataFrames to the previously existing ones:

    primary_enrollment_india = primary_enrollment_india.append(

                                   df_missing, ignore_index=True,

                                   sort=True)
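
    Note

    DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. If the preceding call fails on your pandas version, an equivalent sketch using pd.concat is:

    primary_enrollment_india = pd.concat([primary_enrollment_india, df_missing],
                                         ignore_index=True, sort=True)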

  25. Print the data in primary_enrollment_india:

    primary_enrollment_india

    The output is as follows:

    Figure 9.16: Partial data for enrollment in primary education in India after appending the data

  26. Sort by year and reset the indices using reset_index. Use inplace=True to execute the changes on the DataFrame itself:

    primary_enrollment_india.sort_values(by='Year',inplace=True)

    primary_enrollment_india.reset_index(inplace=True,drop=True)

  27. Print the data in primary_enrollment_india:

    primary_enrollment_india

    The output is as follows:

    Figure 9.17: Partial Data for enrollment in primary education in India after sorting the data

    Figure 9.17: Partial Data for enrollment in primary education in India after sorting the data

  28. Use the interpolate method for linear interpolation. It fills all the NaN values with linearly interpolated values. Check out this link for more details about this method: http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.interpolate.html:

    primary_enrollment_india.interpolate(inplace=True)

  29. Print the data in primary_enrollment_india:

    primary_enrollment_india

    The output is as follows:

    Figure 9.18: Data for enrollment in primary education in India after interpolating the data

    Figure 9.18: Data for enrollment in primary education in India after interpolating the data

  30. Plot the data:

    plt.figure(figsize=(8,4))

    plt.bar(primary_enrollment_india['Year'],

            primary_enrollment_india['Enrollments (Thousands)'])

    plt.title("Enrollment in primary education in India "

              "(in thousands)", fontsize=16)

    plt.grid(True)

    plt.xticks(fontsize=14)

    plt.yticks(fontsize=14)

    plt.xlabel("Year", fontsize=15)

    plt.show()

    The output is as follows:

    Figure 9.19: Bar plot for enrollment in primary education in India

    Figure 9.19: Bar plot for enrollment in primary education in India

  31. Repeat the same steps for the USA:

    missing_years = ([2004] + [y for y in range(2006, 2010)]

                     + [y for y in range(2011, 2014)] + [2016])

  32. Print the value in missing_years.

    missing_years

    The output is as follows:

    [2004, 2006, 2007, 2008, 2009, 2011, 2012, 2013, 2016]

  33. Create dict_missing, as follows:

    dict_missing = {'Region/Country/Area': ['United States of America']*9,

                    'Year': missing_years,

                    'Data': ['Students enrolled in primary education (thousands)']*9,

                    'Enrollments (Thousands)': [np.nan]*9,

                    'Footnotes': [np.nan]*9}

  34. Create the df_missing DataFrame, as follows:

    df_missing = pd.DataFrame(data=dict_missing)

  35. Append this to the primary_enrollment_USA variable, as follows:

    primary_enrollment_USA = primary_enrollment_USA.append(

                                 df_missing, ignore_index=True,

                                 sort=True)

  36. Sort the values in the primary_enrollment_USA variable, as follows:

    primary_enrollment_USA.sort_values(by='Year',inplace=True)

  37. Reset the index of the primary_enrollment_USA variable, as follows:

    primary_enrollment_USA.reset_index(inplace=True,drop=True)

  38. Interpolate the primary_enrollment_USA variable, as follows:

    primary_enrollment_USA.interpolate(inplace=True)

  39. Print the primary_enrollment_USA variable:

    primary_enrollment_USA

    The output is as follows:

    Figure 9.20: Data for enrollment in primary education in the USA after all operations have been completed

    Figure 9.20: Data for enrollment in primary education in the USA after all operations have been completed

  40. Still, the first value is unfilled. We can use the limit and limit_direction parameters of the interpolate method to fill it in. How did we know this? By searching on Google and reading the relevant Stack Overflow answers. Always search for a solution to your problem, check what has already been done, and try to adapt it:

    primary_enrollment_USA.interpolate(method='linear',

                                       limit_direction='backward',

                                       limit=1)

    The output is as follows:

    Figure 9.21: Data for enrollment in primary education in the USA after limiting the data

    Figure 9.21: Data for enrollment in primary education in the USA after limiting the data
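
    Note that interpolate returns a new DataFrame here rather than modifying primary_enrollment_USA in place. If you want the first value to actually be filled in primary_enrollment_USA, assign the result back (a minimal sketch):

    primary_enrollment_USA = primary_enrollment_USA.interpolate(
        method='linear', limit_direction='backward', limit=1)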

  41. Print the data in primary_enrollment_USA:

    primary_enrollment_USA

    The output is as follows:

    Figure 9.22: Data for enrollment in primary education in the USA

    Figure 9.22: Data for enrollment in primary education in the USA

  42. Plot the data:

    plt.figure(figsize=(8,4))

    plt.bar(primary_enrollment_USA['Year'],

            primary_enrollment_USA['Enrollments (Thousands)'])

    plt.title("Enrollment in primary education in the "

              "United States of America (in thousands)",fontsize=16)

    plt.grid(True)

    plt.xticks(fontsize=14)

    plt.yticks(fontsize=14)

    plt.xlabel("Year", fontsize=15)

    plt.show()

    The output is as follows:

    Figure 9.23: Bar plot for enrollment in primary education in the USA

Figure 9.23: Bar plot for enrollment in primary education in the USA

Note

To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.

You can also run this example online at https://packt.live/3fQ0PXJ.

Activity 9.02: Data Wrangling Task – Cleaning GDP Data

Solution:

These are the steps to complete this activity:

  1. Import the required libraries:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

    import warnings

    warnings.filterwarnings('ignore')

  2. GDP data for India: We will try to read the GDP data for India from a CSV file downloaded from a World Bank portal. It is given to you and is also hosted on this book's GitHub repository at https://packt.live/2AMoeu6. However, the pandas read_csv method will throw an error if we try to read it normally. Let's look, step by step, at how we can read useful information from it:

    df3=pd.read_csv("../datasets/India_World_Bank_Info.csv")

    Note

    Throughout this activity, don't forget to change the path of the dataset to match its location on your system.

    The output (partially shown) is as follows:

    ---------------------------------------------------------------------------

    ParserError Traceback (most recent call last)

    <ipython-input-45-9239cae67df7> in <module>()

    ...

    ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3

    We can try the error_bad_lines=False option in this kind of situation (see the note after the next step for newer pandas versions).

  3. Read the India World Bank Information .csv file.

    df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",

                    error_bad_lines=False)

    The output (partially shown) will be:

    Figure 9.24: Partial output of the warnings.

    Figure 9.24: Partial output of the warnings.
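
    Note

    In pandas 1.3 and later, error_bad_lines is deprecated in favor of the on_bad_lines parameter. If your pandas version warns about this, the following sketch is equivalent:

    df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",
                    on_bad_lines='skip')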

  4. Then, let's take a look at the contents of the DataFrame.

    df3.head(10)

    The output is as follows:

    Figure 9.25: DataFrame from the India World Bank Information

    Figure 9.25: DataFrame from the India World Bank Information

    Note

    At times, some rows may be missing from the output because the parser skipped lines in which it found three fields instead of the expected one.

  5. Clearly, the delimiter in this file is a tab (\t):

    df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",

                    error_bad_lines=False,delimiter='\t')

    df3.head(10)

    The output is as follows:

    Figure 9.26: Partial output of the DataFrame from the India World Bank Information after using a delimiter

    Figure 9.26: Partial output of the DataFrame from the India World Bank Information after using a delimiter

  6. Use the skiprows parameter to skip the first four rows:

    df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",

                    error_bad_lines=False,delimiter='\t',

                    skiprows=4)

    df3.head(10)

    The output is as follows:

    Figure 9.27: Partial output of the DataFrame from the India World Bank Information after using skiprows

    Closely examine the dataset. In this file, the columns are the yearly data and the rows are the various types of information. Upon examining the file in Excel, we find that the Indicator Name column holds the name of each data series; the one we are interested in is GDP per capita (current US$). Let's filter the dataset for this indicator and transpose it (interchange the rows and columns) so that it is in a format similar to our earlier education dataset:

    df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T

    df4.head(10)

    The output is as follows:

    Figure 9.28: DataFrame focusing on GDP per capita

    Figure 9.28: DataFrame focusing on GDP per capita

  7. The transposed DataFrame has the years in its index rather than in a column, so let's use reset_index again:

    df4.reset_index(inplace=True)

    df4.head(10)

    The output is as follows:

    Figure 9.29: DataFrame from the India World Bank Information using reset_index

    Figure 9.29: DataFrame from the India World Bank Information using reset_index

  8. The first three rows aren't useful. We can redefine the DataFrame without them. Then, we re-index again:

    df4.drop([0,1,2],inplace=True)

    df4.reset_index(inplace=True,drop=True)

    df4.head(10)

    The output is as follows:

    Figure 9.30: DataFrame from the India World Bank Information after dropping and resetting the index

    Figure 9.30: DataFrame from the India World Bank Information after dropping and resetting the index

  9. Let's rename the columns properly (this is necessary for merging, which we will look at shortly):

    df4.columns=['Year','GDP']

    df4.head(10)

    The output is as follows:

    Figure 9.31: DataFrame focusing on Year and GDP

    Figure 9.31: DataFrame focusing on Year and GDP

  10. It looks like we have GDP data from 1960 onward. However, we are only interested in 2003 – 2016. Let's examine the last 20 rows:

    df4.tail(20)

    The output is as follows:

    Figure 9.32: DataFrame from the India World Bank Information

    Figure 9.32: DataFrame from the India World Bank Information

  11. So, we should be good with rows 43-56. Let's create a DataFrame called df_gdp:

    df_gdp=df4.iloc[[i for i in range(43,57)]]

    df_gdp

    The output is as follows:

    Figure 9.33: DataFrame from the India World Bank Information

    Figure 9.33: DataFrame from the India World Bank Information
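
    Hard-coding row positions 43-56 works for this particular file, but a sketch that selects rows by year value is less brittle (this assumes the Year column currently holds year strings such as '2003'):

    # Coerce the Year strings to numbers; non-year rows become NaN and are filtered out
    years = pd.to_numeric(df4['Year'], errors='coerce')
    df_gdp = df4[(years >= 2003) & (years <= 2016)]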

  12. We need to reset the index again (for merging):

    df_gdp.reset_index(inplace=True,drop=True)

    df_gdp

    The output is as follows:

    Figure 9.34: DataFrame from the India World Bank Information

    Figure 9.34: DataFrame from the India World Bank Information

  13. The Year column in this DataFrame is not of the int type, so it will cause problems when merging with the education DataFrame:

    df_gdp['Year']

    The output is as follows:

    Figure 9.35: DataFrame focusing on year

    Figure 9.35: DataFrame focusing on year

  14. Use the apply method with Python's built-in int function. Ignore any warnings that are thrown:

    df_gdp['Year']=df_gdp['Year'].apply(int)
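
    An equivalent one-liner uses astype, which performs the same conversion on the whole column at once (a sketch, assuming every value in Year is a clean digit string):

    df_gdp['Year'] = df_gdp['Year'].astype(int)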

    Note

    To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.

    You can also run this example online at https://packt.live/3fQ0PXJ.

Activity 9.03: Data Wrangling Task – Merging UN Data and GDP Data

Solution:

These are the steps to complete this activity:

  1. Now, merge the two DataFrames, that is, primary_enrollment_india and df_gdp, on the Year column:

    primary_enrollment_with_gdp = primary_enrollment_india.merge(

        df_gdp, on='Year')

    primary_enrollment_with_gdp

    The output is as follows:

    Figure 9.36: Merged data

    Figure 9.36: Merged data

  2. Now, we can drop the Data, Footnotes, and Region/Country/Area columns:

    primary_enrollment_with_gdp.drop(['Data','Footnotes',

                                      'Region/Country/Area'],

                                      axis=1,inplace=True)

    primary_enrollment_with_gdp

    The output is as follows:

    Figure 9.37: Merged data after dropping the Data, Footnotes, and Region/Country/Area columns

  3. Rearrange the columns for proper viewing and presentation to a data scientist:

    primary_enrollment_with_gdp = primary_enrollment_with_gdp[

        ['Year', 'Enrollments (Thousands)', 'GDP']]

    primary_enrollment_with_gdp

    The output is as follows:

    Figure 9.38: Merged data after rearranging the columns

    Figure 9.38: Merged data after rearranging the columns

  4. Plot the data:

    plt.figure(figsize=(8,5))

    plt.title("India's GDP per capita vs primary education "

              "enrollment",fontsize=16)

    plt.scatter(primary_enrollment_with_gdp['GDP'],

                primary_enrollment_with_gdp['Enrollments (Thousands)'],

                edgecolor='k',color='orange',s=200)

    plt.xlabel("GDP per capita (US $)",fontsize=15)

    plt.ylabel("Primary enrollment (thousands)", fontsize=15)

    plt.xticks(fontsize=14)

    plt.yticks(fontsize=14)

    plt.grid(True)

    plt.show()

    The output is as follows:

    Figure 9.39: Scatter plot of merged data

Figure 9.39: Scatter plot of merged data

Note

To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.

You can also run this example online at https://packt.live/3fQ0PXJ.

Activity 9.04: Data Wrangling Task – Connecting the New Data to the Database

Solution:

These are the steps to complete this activity:

  1. Connect to a database and start writing values in it. We start by importing the sqlite3 module of Python and then use the connect function to connect to a database. Designate Year as the PRIMARY KEY of this table:

    import sqlite3

    with sqlite3.connect("Education_GDP.db") as conn:

        cursor = conn.cursor()

        cursor.execute("CREATE TABLE IF NOT EXISTS

                       education_gdp(Year INT, Enrollment

                       FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
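
    To confirm that the table was created, you can query SQLite's sqlite_master catalog. This is an optional check, not part of the original solution:

    with sqlite3.connect("Education_GDP.db") as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
        print(cursor.fetchall())   # should include ('education_gdp',)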

  2. Run a loop over the dataset rows one by one and insert them into the table:

    with sqlite3.connect("Education_GDP.db") as conn:

        cursor = conn.cursor()

        for i in range(14):

            year = int(primary_enrollment_with_gdp.iloc[i]['Year'])

            enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']

            gdp = primary_enrollment_with_gdp.iloc[i]['GDP']

            #print(year,enrollment,gdp)

            cursor.execute("INSERT INTO education_gdp "

                           "(Year, Enrollment, GDP) "

                           "VALUES (?,?,?)", (year, enrollment, gdp))

If we look at the current folder, we should see a file called Education_GDP.db. If we examine it using a database viewer program, we will see that the data has been transferred there; we can also verify this directly from Python, as shown in the sketch below.
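
A minimal sketch of that check, reading a few rows back with a SELECT query:

    with sqlite3.connect("Education_GDP.db") as conn:
        cursor = conn.cursor()
        for row in cursor.execute("SELECT * FROM education_gdp LIMIT 5"):
            print(row)

pandas can also load the whole table in one call with pd.read_sql_query("SELECT * FROM education_gdp", conn), if you prefer working with a DataFrame.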

Note

To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.

You can also run this example online at https://packt.live/3fQ0PXJ.
