Solution:
These are the steps to complete this activity:
import random
random_number_list = [random.randint(0, 100)
for x in range(0, 100)]
random_number_list
The sample output is as follows:
Note
The output is susceptible to change since we are generating random numbers.
list_with_divisible_by_3 = [a for a in
random_number_list if a % 3 == 0]
list_with_divisible_by_3
The sample output is as follows:
length_of_random_list = len(random_number_list)
length_of_3_divisible_list = len(list_with_divisible_by_3)
difference = length_of_random_list - length_of_3_divisible_list
difference
The sample output is as follows:
71
NUMBER_OF_EXPERIMENTS = 10
difference_list = []
for i in range(0, NUMBER_OF_EXPERIMENTS):
    random_number_list = [random.randint(0, 100)
                          for x in range(0, 100)]
    list_with_divisible_by_3 = [a for a in random_number_list
                                if a % 3 == 0]
    length_of_random_list = len(random_number_list)
    length_of_3_divisible_list = len(list_with_divisible_by_3)
    difference = (length_of_random_list
                  - length_of_3_divisible_list)
    difference_list.append(difference)
difference_list
The sample output is as follows:
[64, 61, 67, 60, 73, 66, 66, 75, 70, 61]
avg_diff = sum(difference_list) / float(len(difference_list))
avg_diff
The sample output is as follows:
66.3
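As a sanity check (not part of the original solution), the average difference can also be derived analytically: 34 of the 101 integers in [0, 100] are multiples of 3, so each of the 100 draws avoids a multiple of 3 with probability 67/101, which agrees with the simulated average above.

```python
# Analytical sanity check for the simulated average difference.
# 34 of the 101 integers in [0, 100] are divisible by 3.
divisible_by_3 = sum(1 for n in range(101) if n % 3 == 0)

# Expected number of draws (out of 100) NOT divisible by 3
expected_difference = 100 * (101 - divisible_by_3) / 101
print(divisible_by_3)                 # 34
print(round(expected_difference, 1))  # 66.3
```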
Note
The output is susceptible to change since we have used random numbers.
To access the source code for this specific section, please refer to https://packt.live/30VMjt3.
You can also run this example online at https://packt.live/3eh0JIb.
With this, we have successfully completed our first activity. Let's move on to the next section, where we will discuss another type of data structure – sets.
Solution:
These are the steps to complete this activity:
Note
Part of the first chapter of Pride and Prejudice by Jane Austen has been made available on this book's GitHub repository at https://packt.live/2N6ZGP6.
Use Ctrl + A to select the entire text, Ctrl + C to copy it, and Ctrl + V to paste the copied text into a string variable named multiline_text:
type(multiline_text)
The output is as follows:
str
len(multiline_text)
The output is as follows:
1228
multiline_text = multiline_text.replace('\n', "")
multiline_text
The output is as follows:
# remove special chars, punctuation etc.
cleaned_multiline_text = ""
for char in multiline_text:
    if char == " ":
        cleaned_multiline_text += char
    elif char.isalnum():  # using the isalnum() method of strings.
        cleaned_multiline_text += char
    else:
        cleaned_multiline_text += " "
cleaned_multiline_text
The output is as follows:
list_of_words = cleaned_multiline_text.split()
list_of_words
A section of the output is shown below:
len(list_of_words)
The output is 236.
unique_words_as_dict = dict.fromkeys(list_of_words)
len(list(unique_words_as_dict.keys()))
The output is 135.
for word in list_of_words:
    if unique_words_as_dict[word] is None:
        unique_words_as_dict[word] = 1
    else:
        unique_words_as_dict[word] += 1
unique_words_as_dict
A section of the output is shown below:
You just created, step by step, a unique word counter using all the neat tricks that you've learned about in this chapter.
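As an aside (not part of the book's solution), the same counts can be produced in one line with collections.Counter; the sketch below demonstrates the idea on a small sample list rather than the full text:

```python
from collections import Counter

# Counter replaces the manual dict.fromkeys + loop pattern;
# shown on a small sample list for illustration.
sample_words = ["it", "is", "a", "truth", "it", "is"]
word_counts = Counter(sample_words)
print(word_counts["it"])           # 2
print(word_counts.most_common(1))  # [('it', 2)]
```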
top_words = sorted(unique_words_as_dict.items(),
key=lambda key_val_tuple: key_val_tuple[1],
reverse=True)
top_words[:25]
The output (partially shown) is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/2ASNIWL.
You can also run this example online at https://packt.live/3dcIKkz.
Solution:
These are the detailed steps to solve this activity:
from itertools import permutations, dropwhile
permutations?
dropwhile?
You will see a long help description (the docstring) after each ?. We will skip it here.
permutations(range(3))
The output (which will vary in your case) is as follows:
<itertools.permutations at 0x7f6c6c077af0>
for number_tuple in permutations(range(3)):
    print(number_tuple)
    assert isinstance(number_tuple, tuple)
The output is as follows:
(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)
An extra task can be to check the actual type that dropwhile returns without casting:
for number_tuple in permutations(range(3)):
    print(list(dropwhile(lambda x: x <= 0, number_tuple)))
The output is as follows:
[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]
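For the extra task mentioned above, the uncast return value can be inspected directly; dropwhile hands back a lazy iterator rather than a list:

```python
from itertools import dropwhile

# Without the list() cast, dropwhile returns a lazy iterator
result = dropwhile(lambda x: x <= 0, (0, 1, 2))
print(type(result))  # <class 'itertools.dropwhile'>
print(list(result))  # [1, 2] once the iterator is consumed
```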
import math
def convert_to_number(number_stack):
    final_number = 0
    for i in range(0, len(number_stack)):
        final_number += (number_stack.pop()
                         * (math.pow(10, i)))
    return final_number

for number_tuple in permutations(range(3)):
    number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))
    print(convert_to_number(number_stack))
The output is as follows:
12.0
21.0
102.0
120.0
201.0
210.0
Note
To access the source code for this specific section, please refer to https://packt.live/37Gk9DT.
You can also run this example online at https://packt.live/3hEWt7f.
Solution:
These are the detailed steps to solve this activity:
from itertools import zip_longest
def return_dict_from_csv_line(header, line):
    # Zip them
    zipped_line = zip_longest(header, line, fillvalue=None)
    # Use dict comprehension to generate the final dict
    ret_dict = {kv[0]: kv[1] for kv in zipped_line}
    return ret_dict
Note
Don't forget to change the path (highlighted) based on where you have stored the CSV file.
with open("../datasets/sales_record.csv", "r") as fd:
    first_line = fd.readline()
    header = first_line.replace(" ", "").split(",")
    for i, line in enumerate(fd):
        line = line.replace(" ", "").split(",")
        d = return_dict_from_csv_line(header, line)
        print(d)
        if i > 10:
            break
The output (partially shown) is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/37FlVVK.
You can also run this example online at https://packt.live/2YepGyb.
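The core trick in return_dict_from_csv_line is easiest to see in isolation: zip_longest pads the shorter sequence with fillvalue, so a short CSV line still yields a key for every header (the header and line values below are made up for illustration):

```python
from itertools import zip_longest

# zip_longest pads the shorter sequence, so every header gets a
# key even when a line has fewer fields (illustrative values).
header = ["Region", "Country", "Units"]
line = ["Asia", "Japan"]
row = {kv[0]: kv[1] for kv in zip_longest(header, line, fillvalue=None)}
print(row)  # {'Region': 'Asia', 'Country': 'Japan', 'Units': None}
```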
Solution:
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("../datasets/Boston_housing.csv")
Note
Don't forget to change the path of the dataset (highlighted) based on where it is saved on your system.
df.head(10)
The output is as follows:
df.shape
The output is as follows:
(506, 14)
df1=df[['CRIM','ZN','INDUS',
'RM','AGE','DIS','RAD',
'TAX','PTRATIO','PRICE']]
df1.tail(7)
The output is as follows:
for c in df1.columns:
    plt.title("Plot of " + c, fontsize=15)
    plt.hist(df1[c], bins=20)
    plt.show()
The output is as follows:
Note
To take a look at all the plots, head over to the following link: https://packt.live/2AGb95F.
Crime rate could be an indicator of house price (people don't want to live in high-crime areas). In some cases, having multiple charts together allows for the easy analysis of a variety of variables. In the preceding group of charts, we can see several unique spikes in the data: INDUS, TAX, and RAD. With further exploratory analysis, we can find out more. We might want to plot one variable against another after looking at the preceding group of charts.
plt.scatter(df1['CRIM'], df1['PRICE'])
plt.show()
The output is as follows:
We can understand this relationship better if we plot log10(crime) versus price.
plt.scatter(np.log10(df1['CRIM']),df1['PRICE'], c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()
The output is as follows:
df1['RM'].mean()
The output is as follows:
6.284634387351788
df1['AGE'].median()
The output is as follows:
77.5
df1['DIS'].mean()
The output is as follows:
3.795042687747034
low_price=df1['PRICE']<20
print(low_price)
The output is as follows:
This creates a Boolean array. True is treated as 1 and False as 0, so if you take the average of this NumPy array, you will know what fraction of the values are True.
# That many houses are priced below 20,000.
# So that is the answer.
low_price.mean()
The output is:
0.4150197628458498
# You can convert that into percentage
# Do this by multiplying with 100
pcnt=low_price.mean()*100
print(" Percentage of house with <20,000 price is: ", pcnt)
The output is as follows:
Percentage of house with <20,000 price is: 41.50197628458498
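The Boolean-mean trick used in the last few steps can be verified on a tiny array:

```python
import numpy as np

# True counts as 1 and False as 0, so the mean of a Boolean
# array equals the fraction of True values.
mask = np.array([True, False, True, False])
print(mask.sum())   # 2
print(mask.mean())  # 0.5
```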
Note
To access the source code for this specific section, please refer to https://packt.live/2AGb95F.
You can also run this example online at https://packt.live/2YT3Hfg.
Solution:
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("../datasets/adult_income_data.csv")
df.head()
Note
The highlighted path must be changed based on the location of the file on your system.
The output is as follows:
names = []
with open('../datasets/adult_income_names.txt','r') as f:
    for line in f:
        f.readline()
        var = line.split(":")[0]
        names.append(var)
names
Note
The highlighted path must be changed based on the location of the file on your system.
The output is as follows:
names.append('Income')
df = pd.read_csv("../datasets/adult_income_data.csv", names=names)
df.head()
Note
The highlighted path must be changed based on the location of the file on your system.
The output is as follows:
df.describe()
The output is as follows:
Note that only a small number of columns are included. Many variables in the dataset have multiple factors or classes.
# Make a list of all variables with classes
vars_class = ['workclass','education','marital-status',
'occupation','relationship','sex','native-country']
for v in vars_class:
    classes = df[v].unique()
    num_classes = df[v].nunique()
    print("There are {} classes in the '{}' column. "
          "They are: {}".format(num_classes, v, classes))
    print("-"*100)
The output (partially shown) is as follows:
df.isnull().sum()
The output is as follows:
df_subset = df[['age','education', 'occupation']]
df_subset.head()
The output is as follows:
df_subset['age'].hist(bins=20)
The output is as follows:
df_subset.boxplot(column='age',by='education',figsize=(25,10))
plt.xticks(fontsize=15)
plt.xlabel("Education",fontsize=20)
plt.show()
The output is as follows:
Before doing any further operations, we need to use the apply method we learned about in this chapter. It turns out that when reading the dataset from the CSV file, all the strings came with a whitespace character in front of them. So, we need to remove that whitespace from all the strings.
def strip_whitespace(s):
    return s.strip()

# Education column
df_subset['education_stripped'] = (df['education']
                                   .apply(strip_whitespace))
df_subset['education'] = df_subset['education_stripped']
df_subset.drop(labels=['education_stripped'],
               axis=1, inplace=True)
# Occupation column
df_subset['occupation_stripped'] = (df['occupation']
                                    .apply(strip_whitespace))
df_subset['occupation'] = df_subset['occupation_stripped']
df_subset.drop(labels=['occupation_stripped'],
               axis=1, inplace=True)
Running this code may display a sample warning message about setting a value on a copy of a slice; you can safely ignore it.
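For reference, pandas can also strip whitespace vectorially through the .str accessor, which avoids the helper function and the temporary columns; this is an alternative sketch, not the book's approach:

```python
import pandas as pd

# Vectorized alternative to the strip_whitespace/apply pattern:
# .str.strip() cleans every value in the column at once.
df_demo = pd.DataFrame({"education": [" Bachelors", " HS-grad"]})
df_demo["education"] = df_demo["education"].str.strip()
print(df_demo["education"].tolist())  # ['Bachelors', 'HS-grad']
```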
# Conditional clauses and join them by & (AND)
df_filtered=df_subset[(df_subset['age']>=30)
& (df_subset['age']<=50)]
df_filtered.head()
The output is as follows:
answer_1=df_filtered.shape[0]
answer_1
The output is as follows:
16390
print("There are {} people of age between 30 and 50 "
"in this dataset.".format(answer_1))
The output is as follows:
There are 16390 people of age between 30 and 50 in this dataset.
df_subset.groupby('occupation').describe()['age']
The output is as follows:
The code returns 79 rows × 1 columns.
occupation_stats=df_subset.groupby('occupation').describe()['age']
plt.figure(figsize=(15,8))
plt.barh(y=occupation_stats.index,
width=occupation_stats['count'])
plt.yticks(fontsize=13)
plt.show()
The output is as follows:
Is there a particular occupation group that has very low representation? Perhaps we should remove those pieces of data because, with very little data, the group won't be useful in analysis. Just by looking at Figure 4.89, you should be able to see that the Armed-Forces group has only a count of 9, that is, 9 data points. But how can we detect this? By plotting the count column in a bar chart. Note how the first argument to the barh function is the index of the DataFrame, which is the summary stats of the occupation groups. We can see that the Armed-Forces group has almost no data.

This activity teaches you that, sometimes, the outlier is not just a value, but can be a whole group. The data of this group is fine, but it is too small to be useful for any analysis, so it can be treated as an outlier in this case. But always use your business knowledge and engineering judgment for such outlier detection and for deciding how to process them.

We will now practice merging two datasets using a common key.
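One way to act on this observation (a sketch with toy data, not part of the book's solution) is to drop any group whose count falls below a chosen threshold:

```python
import pandas as pd

# Drop occupation groups that are too small to analyze.
# Toy data; the threshold of 10 is an arbitrary choice.
df_toy = pd.DataFrame({
    "occupation": ["Sales"] * 12 + ["Armed-Forces"] * 3,
    "age": list(range(30, 42)) + [40, 41, 42],
})
counts = df_toy["occupation"].value_counts()
keep = counts[counts >= 10].index
df_kept = df_toy[df_toy["occupation"].isin(keep)]
print(df_kept["occupation"].unique())  # ['Sales']
```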
df_1 = (df[['age','workclass','occupation']]
        .sample(5, random_state=101))
df_1.head()
The output is as follows:
The second dataset is as follows:
df_2 = df[['education','occupation']].sample(5,random_state=101)
df_2.head()
The output is as follows:
df_merged = pd.merge(df_1,df_2,on='occupation',
how='inner').drop_duplicates()
df_merged
The output is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/37IamwR.
You can also run this example online at https://packt.live/2YhuF1j.
Solution:
These are the steps to complete this activity:
from bs4 import BeautifulSoup
import pandas as pd
fd = open("../datasets/List of countries by GDP (nominal) "
"- Wikipedia.htm", "r", encoding = "utf-8")
soup = BeautifulSoup(fd, 'html.parser')
fd.close()
Note
Don't forget to change the path of the dataset (highlighted) based on its location on your system.
all_tables = soup.find_all("table")
print("Total number of tables are {} ".format(len(all_tables)))
There are nine tables in total.
data_table = soup.find("table", {"class": '"wikitable"|}'})
print(type(data_table))
The output is as follows:
<class 'bs4.element.Tag'>
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
The output is as follows:
3
data = (data_table.tbody.findAll('tr', recursive=False)[1]
        .findAll('td', recursive=False))
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
len(data_tables)
The output is as follows:
3
source_names = [source.findAll('a')[0].getText()
for source in sources_list]
print(source_names)
The output is as follows:
['International Monetary Fund', 'World Bank', 'United Nations']
header1 = [th.getText().strip() for th in
data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]
data_rows1 = [[td.get_text().strip() for td in
tr.findAll('td')] for tr in rows1]
df1 = pd.DataFrame(data_rows1, columns=header1)
df1.head()
The output is as follows:
header2 = [th.getText().strip() for th in data_tables[1][0]
.findAll('thead')[0].findAll('th')]
header2
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')
def find_right_text(i, td):
    if i == 0:
        return td.getText().strip()
    elif i == 1:
        return td.getText().strip()
    else:
        index = td.text.find("♠")
        return td.text[index+1:].strip()
data_rows2 = [[find_right_text(i, td) for i, td in
enumerate(tr.findAll('td'))] for tr in rows2]
df2 = pd.DataFrame(data_rows2, columns=header2)
df2.head()
The output is as follows:
header3 = [th.getText().strip() for th in data_tables[2][0]
.findAll('thead')[0].findAll('th')]
header3
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')
data_rows3 = [[find_right_text(i, td) for i, td in
               enumerate(tr.findAll('td'))] for tr in rows3]
df3 = pd.DataFrame(data_rows3, columns=header3)
df3.head()
The output is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/2NaCrDB.
You can also run this example online at https://packt.live/2YRAukP.
Solution:
These are the steps to complete this activity:
Note
The dataset to be used for this activity can be found at https://packt.live/2YajrLJ.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("../datasets/visit_data.csv")
Note
Don't forget to change the path (highlighted) based on where the CSV file is saved on your system.
df.head()
The output is as follows:
As we can see, there is data where some values are missing, and if we examine this, we will see some outliers.
print("First name is duplicated - {}"
.format(any(df.first_name.duplicated())))
print("Last name is duplicated - {}"
.format(any(df.last_name.duplicated())))
print("Email is duplicated - {}"
.format(any(df.email.duplicated())))
The output is as follows:
First name is duplicated - True
Last name is duplicated - True
Email is duplicated - False
There are duplicates in both the first and last names, which is normal. However, as we can see, there are no duplicates in email. That's good.
"""
Notice that we have different ways to
format boolean values for the % operator
"""
print("The column Email contains NaN - %r " %
df.email.isnull().values.any())
print("The column IP Address contains NaN - %s " %
df.ip_address.isnull().values.any())
print("The column Visit contains NaN - %s " %
df.visit.isnull().values.any())
The output is as follows:
The column Email contains NaN - False
The column IP Address contains NaN - False
The column Visit contains NaN - True
The Visit column contains some NaN values. Given that the final task at hand will probably be predicting the number of visits, we cannot do anything with rows that do not have that information. They are a type of outlier. Let's get rid of them.
"""
There are various ways to do this. This is just one way. We encourage you to explore other ways. But before that we need to store the previous size of the data set and we will compare it with the new size
"""
size_prev = df.shape
df = df[np.isfinite(df['visit'])]
# Note: this reassigns df, so the original
# DataFrame is lost after this operation.
size_after = df.shape
# Notice how parameterized format is used.
# Then, the indexing is working inside the quote marks
print("The size of previous data was - {prev[0]} rows and "
"the size of the new one is - {after[0]} rows"
.format(prev=size_prev, after=size_after))
The output is as follows:
The size of previous data was - 1000 rows and the size of the new one is - 974 rows
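For reference, pandas offers a one-step equivalent through dropna with a subset argument; the sketch below uses a toy frame rather than the activity's CSV:

```python
import numpy as np
import pandas as pd

# One-step equivalent of df[np.isfinite(df['visit'])]
# for removing NaN rows (toy frame for illustration).
df_demo = pd.DataFrame({"visit": [100.0, np.nan, 250.0, np.nan]})
df_clean = df_demo.dropna(subset=["visit"])
print(df_clean.shape[0])  # 2 rows survive
```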
plt.boxplot(df.visit, notch=True)
The box plot is as follows:
As we can see, we have data in this column in the interval (0, 3000). However, the main concentration of the data is between ~700 and ~2300.
df1 = df[(df['visit'] <= 2900) & (df['visit'] >= 100)]
# Notice the powerful & operator
"""
Here we abuse the fact the
number of variable can be greater
than the number of replacement targets
"""
print("After getting rid of outliers the new size of the data "
"is - {}".format(*df1.shape))
The output is as follows:
After getting rid of outliers the new size of the data is - 923
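The cutoffs of 100 and 2900 were read off the box plot by eye; a more data-driven variant (an alternative, not the book's method) derives them from quantiles. The toy series below illustrates the pattern:

```python
import pandas as pd

# Quantile-based alternative to hand-picked outlier cutoffs
# (toy series for illustration).
s = pd.Series([5, 100, 700, 1500, 2300, 2900, 5000])
low, high = s.quantile(0.05), s.quantile(0.95)
filtered = s[(s >= low) & (s <= high)]
print(len(filtered))  # 5 of the 7 values survive
```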
Note
To access the source code for this specific section, please refer to https://packt.live/2AFcSbn.
You can also run this example online at https://packt.live/3fAL9qY.
Solution:
These are the steps to complete this activity:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)
def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1
status_check(response)
The output is as follows:
Success!
1
contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')
# Empty list to hold all the http links in the HTML page
lst_links=[]
# Find all href tags and store them in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))
lst_links[:30]
The output (partially shown) is as follows:
['/wiki/Main_Page',
'/catalog/',
'/ebooks/',
'/browse/recent/last1',
'/browse/scores/top',
'/wiki/Gutenberg:Offline_Catalogs',
'/catalog/world/mybookmarks',
'/wiki/Main_Page',
'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
'/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation',
'http://www.ibiblio.org',
'http://www.pgdp.net/',
'pretty-pictures',
'#books-last1',
'#authors-last1',
'#books-last7',
'#authors-last7',
'#books-last30',
'#authors-last30',
'/ebooks/1342',
'/ebooks/84',
'/ebooks/1080',
'/ebooks/46',
'/ebooks/219',
'/ebooks/2542',
'/ebooks/98',
'/ebooks/345',
'/ebooks/2701',
'/ebooks/844',
'/ebooks/11']
booknum=[]
Numbers 19 to 118 in the original list of links have the top 100 eBooks' numbers.
for i in range(19, 119):
    link = lst_links[i]
    link = link.strip()
    """
    Regular expression to find the numeric digits
    in the link (href) string
    """
    n = re.findall('[0-9]+', link)
    if len(n) == 1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))
print(" The file numbers for the top 100 ebooks",
"on Gutenberg are shown below "+"-"*70)
print(booknum)
The output is as follows:
The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200,
43, 16328, 76, 74, 1952, 6130, 2591, 1661, 41, 174, 23, 1260,
1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158,
44881, 1322, 205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901,
28054, 829, 33, 2814, 4300, 100, 55, 160, 1404, 786, 58267, 3600,
19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825,
779, 58262, 203, 730, 20203, 35, 1250, 45, 161, 30360, 7370,
58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270, 521,
58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]
Note
Since the list of top 100 books is frequently updated, the output you get will vary.
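The regular expression used above can be checked on a single link taken from the listing:

```python
import re

# '[0-9]+' extracts runs of digits; an /ebooks/ link carries
# exactly one such run, the ebook's file number.
link = "/ebooks/1342"
n = re.findall("[0-9]+", link)
print(n)          # ['1342']
print(int(n[0]))  # 1342
```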
What does the soup object's text look like?
You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature:
print(soup.text[:2000])
The output is as follows:
if (top != self) {
top.location.replace (http://www.gutenberg.org);
alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');
}
Top 100 - Project Gutenberg
Online Book Catalog
Book Search
-- Recent Books
-- Top 100
-- Offline Catalogs
-- My Bookmarks
Main Page
…
Pretty Pictures
Top 100 EBooks yesterday —
Top 100 Authors yesterday —
Top 100 EBooks last 7 days —
Top 100 Authors last 7 days —
Top 100 EBooks last 30 days —
Top 100 Authors last 30 days
Top 100 EBooks yesterday
Pride and Prejudice by Jane Austen (1826)
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1367)
A Modest Proposal by Jonathan Swift (1020)
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)
Heart of Darkness by Joseph Conrad (887)
Et dukkehjem. English by Henrik Ibsen (761)
A Tale of Two Cities by Charles Dickens (741)
Dracula by Bram Stoker (732)
Moby Dick; Or, The Whale by Herman Melville (651)
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)
Alice's Adventures in Wonderland by Lewis Carrol
lst_titles_temp=[]
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')
Note
Since the list of top 100 books is frequently updated, the output you get will vary.
for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])
lst_titles = []
for i in range(100):
    id1, id2 = re.match('^[a-zA-Z ]*', lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
for l in lst_titles:
    print(l)
The partial output is as follows:
Pride and Prejudice by Jane Austen
Frankenstein
A Modest Proposal by Jonathan Swift
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad
Et dukkehjem
A Tale of Two Cities by Charles Dickens
Dracula by Bram Stoker
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka
The Strange Case of Dr
Beowulf
…
The Russian Army and the Japanese War
Calculus Made Easy by Silvanus P
Beyond Good and Evil by Friedrich Wilhelm Nietzsche
An Occurrence at Owl Creek Bridge by Ambrose Bierce
Don Quixote by Miguel de Cervantes Saavedra
Blue Jackets by Edward Greey
The Life and Adventures of Robinson Crusoe by Daniel Defoe
The Waterloo Campaign
The War of the Worlds by H
Democracy in America
Songs of Innocence
The Confessions of St
Modern French Masters by Marie Van Vorst
Persuasion by Jane Austen
The Works of Edgar Allan Poe
The Fall of the House of Usher by Edgar Allan Poe
The Masque of the Red Death by Edgar Allan Poe
The Lady with the Dog and Other Stories by Anton Pavlovich Chekhov
Note
Since the list of top 100 books is frequently updated, the output you get will vary.
Here, we have seen how we can use web scraping and parsing using a mix of BeautifulSoup and regex to find information from very untidy and vast source data. These are some essential steps that you will have to perform on a daily basis when you are dealing with data wrangling.
Note
To access the source code for this specific section, please refer to https://packt.live/2BltmFo.
You can also run this example online at https://packt.live/37FdLwD.
Solution:
Note
Before you begin, ensure that you modify the APIkeys.json file and add your secret API key there. Link to the file: https://packt.live/2CmIpze.
These are the steps to complete this activity:
import urllib.request, urllib.parse, urllib.error
import json
Note
The following cell will not be executed in the solution notebook because the author cannot give out their private API key.
The students/users will need to obtain a key and store it in a JSON file. We are calling this file APIkeys.json.
with open('APIkeys.json') as f:
    keys = json.load(f)
omdbapi = keys['OMDBapi']
The final URL to be passed should look like this: http://www.omdbapi.com/?t=movie_name&apikey=secretapikey.
serviceurl = 'http://www.omdbapi.com/?'
apikey = '&apikey='+omdbapi
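The query-string assembly can be previewed in isolation; urlencode percent-encodes the title safely ('secretapikey' below is a placeholder, not a real key):

```python
import urllib.parse

# urlencode builds and percent-encodes the query string;
# 'secretapikey' is a placeholder value.
serviceurl = "http://www.omdbapi.com/?"
apikey = "&apikey=" + "secretapikey"
url = serviceurl + urllib.parse.urlencode({"t": "Titanic"}) + apikey
print(url)  # http://www.omdbapi.com/?t=Titanic&apikey=secretapikey
```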
def print_json(json_data):
    list_keys = ['Title', 'Year', 'Rated', 'Released',
                 'Runtime', 'Genre', 'Director', 'Writer',
                 'Actors', 'Plot', 'Language', 'Country',
                 'Awards', 'Ratings', 'Metascore', 'imdbRating',
                 'imdbVotes', 'imdbID']
    print("-"*50)
    for k in list_keys:
        if k in list(json_data.keys()):
            print(f"{k}: {json_data[k]}")
    print("-"*50)
def save_poster(json_data):
    import os
    title = json_data['Title']
    poster_url = json_data['Poster']
    """
    Splits the poster url by '.' and
    picks up the last string as file extension
    """
    poster_file_extension = poster_url.split('.')[-1]
    # Reads the image file from the web
    poster_data = urllib.request.urlopen(poster_url).read()
    savelocation = os.path.join(os.getcwd(), 'Posters')
    """
    Creates a new directory if the directory does not exist.
    Otherwise, just use the existing path.
    """
    if not os.path.isdir(savelocation):
        os.mkdir(savelocation)
    filename = os.path.join(savelocation,
                            str(title) + '.' + poster_file_extension)
    f = open(filename, 'wb')
    f.write(poster_data)
    f.close()
def search_movie(title):
    try:
        url = (serviceurl
               + urllib.parse.urlencode({'t': str(title)}) + apikey)
        print(f'Retrieving the data of "{title}" now... ')
        print(url)
        uh = urllib.request.urlopen(url)
        data = uh.read()
        json_data = json.loads(data)
        if json_data['Response'] == 'True':
            print_json(json_data)
            # Download the movie poster if one is available
            if json_data['Poster'] != 'N/A':
                save_poster(json_data)
        else:
            print("Error encountered: ", json_data['Error'])
    except urllib.error.URLError as e:
        print(f"ERROR: {e.reason}")
search_movie("Titanic")
The following is the retrieved data for Titanic:
http://www.omdbapi.com/?t=Titanic&apikey=<your api key>
--------------------------------------------------
Title: Titanic
Year: 1997
Rated: PG-13
Released: 19 Dec 1997
Runtime: 194 min
Genre: Drama, Romance
Director: James Cameron
Writer: James Cameron
Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates
Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.
Language: English, Swedish
Country: USA
Awards: Won 11 Oscars. Another 111 wins & 77 nominations.
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]
Metascore: 75
imdbRating: 7.8
imdbVotes: 913,780
imdbID: tt0120338
--------------------------------------------------
search_movie("Random_error")
Retrieve the data of Random_error:
Retrieving the data of "Random_error" now...
http://www.omdbapi.com/?t=Random_error&apikey=<your api key>
Error encountered: Movie not found!
Note
In the last two steps, we've not shown the private API key (highlighted) for security reasons.
To access the source code for this specific section, please refer to https://packt.live/3hLJvoy.
You can also run this example online at https://packt.live/3efkDTZ.
Solution:
These are the steps to complete this activity:
import sqlite3
conn = sqlite3.connect("petsdb")
# a tiny function to make sure the connection is successful
def is_opened(conn):
    try:
        conn.execute("SELECT * FROM persons LIMIT 1")
        return True
    except sqlite3.ProgrammingError as e:
        print("Connection closed {}".format(e))
        return False
print(is_opened(conn))
The output is as follows:
True
conn.close()
print(is_opened(conn))
The output is as follows:
Connection closed Cannot operate on a closed database.
False
conn = sqlite3.connect("petsdb")
c = conn.cursor()
for ppl, age in c.execute("SELECT count(*), age "
                          "FROM persons GROUP BY age"):
    print("We have {} people aged {}".format(ppl, age))
The output is as follows:
for ppl, age in c.execute("SELECT count(*), age FROM persons "
                          "GROUP BY age ORDER BY count(*) DESC"):
    print("The highest number of people is {} and "
          "came from {} age group".format(ppl, age))
    break
The output is as follows:
The highest number of people is 5 and came from 73 age group
res = c.execute("SELECT count(*) FROM persons "
                "WHERE last_name IS null")
for row in res:
    print(row)
The output is as follows:
(60,)
res = c.execute("SELECT count(*) FROM "
                "(SELECT count(owner_id) FROM pets "
                "GROUP BY owner_id HAVING count(owner_id) > 1)")
for row in res:
    print("{} people have more than one pet".format(row[0]))
The output is as follows:
43 people have more than one pet
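The nested GROUP BY ... HAVING pattern can be exercised against a small in-memory database (a toy schema mirroring pets.owner_id, not the petsdb file):

```python
import sqlite3

# Toy in-memory table: owners 1 and 3 have more than one pet.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE pets (owner_id INTEGER)")
conn_demo.executemany("INSERT INTO pets VALUES (?)",
                      [(1,), (1,), (2,), (3,), (3,), (3,)])
res = conn_demo.execute(
    "SELECT count(*) FROM (SELECT count(owner_id) FROM pets "
    "GROUP BY owner_id HAVING count(owner_id) > 1)")
owners_with_multiple = res.fetchone()[0]
print(owners_with_multiple)  # 2
conn_demo.close()
```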
res = c.execute("SELECT count(*) FROM pets "
                "WHERE treatment_done=1")
for row in res:
    print(row)
The output is as follows:
(36,)
res = c.execute("SELECT count(*) FROM pets "
                "WHERE treatment_done=1 AND pet_type IS NOT null")
for row in res:
    print(row)
The output is as follows:
(16,)
res = c.execute("SELECT count(*) FROM pets "
                "JOIN persons ON pets.owner_id = persons.id "
                "WHERE persons.city='east port'")
for row in res:
    print(row)
The output is as follows:
(49,)
res = c.execute("SELECT count(*) FROM pets "
                "JOIN persons ON pets.owner_id = persons.id "
                "WHERE persons.city='east port' "
                "AND pets.treatment_done=1")
for row in res:
    print(row)
The output is as follows:
(11,)
Note
To access the source code for this specific section, please refer to https://packt.live/3derN9D.
You can also run this example online at https://packt.live/2ASWYKi.
Solution:
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
education_data_link = ("http://data.un.org/_Docs/SYB/CSV/"
                       "SYB61_T07_Education.csv")
df1 = pd.read_csv(education_data_link)
df1.head()
The output (partially shown) is as follows:
df1 = pd.read_csv(education_data_link,skiprows=1)
df1.head()
The output is as follows:
df2 = df1.drop(['Region/Country/Area','Source'],axis=1)
df2.columns=['Region/Country/Area','Year','Data',
'Enrollments (Thousands)','Footnotes']
df2.head()
The output is as follows:
df2['Footnotes'].unique()
The output is as follows:
type(df2['Enrollments (Thousands)'][0])
The output is as follows:
str
def to_numeric(val):
    """
    Converts a given string (with one or more commas) to a numeric value
    """
    if ',' not in str(val):
        result = float(val)
    else:
        val = str(val)
        val = ''.join(str(val).split(','))
        result = float(val)
    return result
df2['Enrollments (Thousands)'] = (df2['Enrollments (Thousands)']
                                  .apply(to_numeric))
df2['Data'].unique()
The output is as follows:
Create three filtered DataFrames:
df_primary: Only students enrolled in primary education (thousands)
df_secondary: Only students enrolled in secondary education (thousands)
df_tertiary: Only students enrolled in tertiary education (thousands):
df_primary = df2[df2['Data']=='Students enrolled in primary '
'education (thousands)']
df_secondary = df2[df2['Data']=='Students enrolled in secondary '
'education (thousands)']
df_tertiary = df2[df2['Data']=='Students enrolled in tertiary '
'education (thousands)']
primary_enrollment_india = df_primary[
    df_primary['Region/Country/Area']=='India']
primary_enrollment_USA = df_primary[
    df_primary['Region/Country/Area']=='United States of America']
primary_enrollment_india
The output is as follows:
primary_enrollment_USA
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],
primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in India "
"(in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],
primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in the "
"United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
As we can see, we have missing data. Now is the time to use pandas methods to do the data imputation. But to do that, we need to create a DataFrame with missing values inserted into it – that is, we need to append another DataFrame with missing values to the current DataFrame.
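The append-then-interpolate idea can be sketched in miniature before applying it to the real data. This is an illustrative example with made-up years and enrollment figures; note that in newer versions of pandas, DataFrame.append is removed and pd.concat is the replacement:

```python
import numpy as np
import pandas as pd

# Two known years with a gap in between
df = pd.DataFrame({"Year": [2003, 2010], "Enrollment": [100.0, 170.0]})

# Rows for the missing years, with NaN as the placeholder value
missing = pd.DataFrame({"Year": [2004, 2005],
                        "Enrollment": [np.nan, np.nan]})

# pd.concat replaces the deprecated DataFrame.append
df = pd.concat([df, missing], ignore_index=True)
df.sort_values(by="Year", inplace=True)
df.reset_index(drop=True, inplace=True)

# Linear interpolation fills the NaN values between the known points
df.interpolate(inplace=True)
```

After interpolation, the Enrollment column contains no NaN values and rises smoothly between the two known endpoints.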
missing_years = [y for y in range(2004, 2010)] + [y for y in range(2011, 2014)]
missing_years
The output is as follows:
[2004, 2005, 2006, 2007, 2008, 2009, 2011, 2012, 2013]
dict_missing = {'Region/Country/Area': ['India']*9,
                'Year': missing_years,
                'Data': ['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)': [np.nan]*9,
                'Footnotes': [np.nan]*9}
df_missing = pd.DataFrame(data=dict_missing)
primary_enrollment_india = primary_enrollment_india.append(
    df_missing, ignore_index=True, sort=True)
primary_enrollment_india
The output is as follows:
primary_enrollment_india.sort_values(by='Year',inplace=True)
primary_enrollment_india.reset_index(inplace=True,drop=True)
primary_enrollment_india
The output is as follows:
primary_enrollment_india.interpolate(inplace=True)
primary_enrollment_india
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],
primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in India "
"(in thousands)", fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
missing_years = [2004] + [y for y in range(2006, 2010)] \
                + [y for y in range(2011, 2014)] + [2016]
missing_years
The output is as follows:
[2004, 2006, 2007, 2008, 2009, 2011, 2012, 2013, 2016]
dict_missing = {'Region/Country/Area': ['United States of America']*9,
                'Year': missing_years,
                'Data': ['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)': [np.nan]*9,
                'Footnotes': [np.nan]*9}
df_missing = pd.DataFrame(data=dict_missing)
primary_enrollment_USA = primary_enrollment_USA.append(
    df_missing, ignore_index=True, sort=True)
primary_enrollment_USA.sort_values(by='Year',inplace=True)
primary_enrollment_USA.reset_index(inplace=True,drop=True)
primary_enrollment_USA.interpolate(inplace=True)
primary_enrollment_USA
The output is as follows:
primary_enrollment_USA.interpolate(method='linear',
limit_direction='backward',
limit=1)
The output is as follows:
primary_enrollment_USA
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],
primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in the "
"United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.
You can also run this example online at https://packt.live/3fQ0PXJ.
Solution:
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df3=pd.read_csv("../datasets/India_World_Bank_Info.csv")
Note
Throughout this activity, don't forget to change the path of the dataset (highlighted) to match its location on your system.
The output (partially shown) is as follows:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-45-9239cae67df7> in <module>()
...
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
We can try using the error_bad_lines=False option in this kind of situation.
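The effect of skipping malformed lines can be demonstrated on a small in-memory CSV. Note that in pandas 1.3 and later, error_bad_lines is deprecated in favor of the on_bad_lines parameter, which this sketch uses:

```python
import io
import pandas as pd

# A malformed CSV: the third line has three fields instead of two
bad_csv = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" drops the offending line instead of raising ParserError
df = pd.read_csv(io.StringIO(bad_csv), on_bad_lines="skip")
```

The resulting DataFrame keeps the two well-formed rows and silently drops the line with the extra field.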
df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",
error_bad_lines=False)
The output (partially shown) will be:
df3.head(10)
The output is as follows:
Note
The output may not look as expected because some lines contain three fields instead of the expected one.
df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",
error_bad_lines=False,delimiter=' ')
df3.head(10)
The output is as follows:
df3=pd.read_csv("../datasets/India_World_Bank_Info.csv",
error_bad_lines=False,delimiter=' ',
skiprows=4)
df3.head(10)
The output is as follows:
Closely examine the dataset. In this file, the columns are the yearly data and the rows are the various types of information. Upon examining the file with Excel, we find that the Indicator Name column holds the name of each type of information, such as GDP per capita (current US$). We filter the dataset down to the indicator we are interested in and also transpose it (interchanging the rows and columns) to put it in a format similar to our previous education dataset:
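The filter-then-transpose step can be illustrated on a tiny stand-in DataFrame (the column names mirror the World Bank file, but the values are invented):

```python
import pandas as pd

# Years as columns, indicators as rows, mimicking the World Bank layout
df = pd.DataFrame({
    "Indicator Name": ["GDP per capita (current US$)", "Population"],
    "2000": [450.0, 1.05e9],
    "2001": [460.0, 1.07e9],
})

# Keep only the GDP indicator row, then transpose so years become the index
gdp = df[df["Indicator Name"] == "GDP per capita (current US$)"].T
```

After the transpose, each year is a row of the single remaining column, which matches the shape of the education dataset used earlier.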
df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T
df4.head(10)
The output is as follows:
df4.reset_index(inplace=True)
df4.head(10)
The output is as follows:
df4.drop([0,1,2],inplace=True)
df4.reset_index(inplace=True,drop=True)
df4.head(10)
The output is as follows:
df4.columns=['Year','GDP']
df4.head(10)
The output is as follows:
df4.tail(20)
The output is as follows:
df_gdp=df4.iloc[[i for i in range(43,57)]]
df_gdp
The output is as follows:
df_gdp.reset_index(inplace=True,drop=True)
df_gdp
The output is as follows:
df_gdp['Year']
The output is as follows:
df_gdp['Year']=df_gdp['Year'].apply(int)
Note
To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.
You can also run this example online at https://packt.live/3fQ0PXJ.
These are the steps to complete this activity:
primary_enrollment_with_gdp = primary_enrollment_india.merge(df_gdp, on='Year')
primary_enrollment_with_gdp
The output is as follows:
primary_enrollment_with_gdp.drop(['Data','Footnotes',
'Region/Country/Area'],
axis=1,inplace=True)
primary_enrollment_with_gdp
The output is as follows:
primary_enrollment_with_gdp = primary_enrollment_with_gdp[
    ['Year', 'Enrollments (Thousands)', 'GDP']]
primary_enrollment_with_gdp
The output is as follows:
plt.figure(figsize=(8,5))
plt.title("India's GDP per capita vs primary education "
"enrollment",fontsize=16)
plt.scatter(primary_enrollment_with_gdp['GDP'],
primary_enrollment_with_gdp['Enrollments (Thousands)'],
edgecolor='k',color='orange',s=200)
plt.xlabel("GDP per capita (US $)",fontsize=15)
plt.ylabel("Primary enrollment (thousands)", fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.grid(True)
plt.show()
The output is as follows:
Note
To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.
You can also run this example online at https://packt.live/3fQ0PXJ.
Solution:
These are the steps to complete this activity:
import sqlite3
with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS education_gdp"
                   "(Year INT, Enrollment FLOAT, GDP FLOAT, "
                   "PRIMARY KEY (Year))")
with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    for i in range(14):
        year = int(primary_enrollment_with_gdp.iloc[i]['Year'])
        enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']
        gdp = primary_enrollment_with_gdp.iloc[i]['GDP']
        #print(year, enrollment, gdp)
        cursor.execute("INSERT INTO education_gdp (Year, Enrollment, GDP) "
                       "VALUES (?,?,?)", (year, enrollment, gdp))
If we look at the current folder, we should see a file called Education_GDP.db, and if we can examine that using a database viewer program, we will see that the data has been transferred there.
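Instead of an external database viewer, the table can also be verified programmatically by reading it back. This self-contained sketch uses an in-memory database standing in for Education_GDP.db, with made-up sample rows:

```python
import sqlite3

# An in-memory database stands in for Education_GDP.db; the two rows are
# illustrative values, not the activity's real figures
with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE education_gdp"
                   "(Year INT, Enrollment FLOAT, GDP FLOAT, "
                   "PRIMARY KEY (Year))")
    cursor.executemany("INSERT INTO education_gdp (Year, Enrollment, GDP) "
                       "VALUES (?,?,?)",
                       [(2003, 125569.0, 546.7), (2004, 130000.0, 624.4)])
    # Read the table back to confirm the rows landed
    cursor.execute("SELECT count(*) FROM education_gdp")
    row_count = cursor.fetchone()[0]
```

Running the same SELECT count(*) against Education_GDP.db after the insert loop should report all fourteen rows.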
Note
To access the source code for this specific section, please refer to https://packt.live/3fyIqy8.
You can also run this example online at https://packt.live/3fQ0PXJ.