In this example, we will be extracting data from American Hockey League (AHL) Playoff results, which are available from http://www.flyershistory.com/cgi-bin/ml-poffs.cgi:
The preceding URL contains the Playoff results for the AHL. The page presents this information in tabular format. The relevant portion of the page source is shown in the following screenshot:
Analyzing the source format reveals that the <td> elements containing the desired values carry no attributes that could be used to identify particular table cells. In this case, we can target a <td> cell by its position using jQuery-style pseudo-selectors, such as td:eq(0) or td:eq(1), which pyquery supports.
Since we are using pyquery for this example, we will use the eq() method, which accepts an index and returns the element at that position. For example, tr.find('td').eq(1).text() searches the chosen PyQuery object, tr, for <td> elements, picks the one at index 1, and returns its text.
Here, we are interested in collecting data for the columns that are listed in keys:
keys = ['year','month','day','game_date','team1', 'team1_score', 'team2', 'team2_score', 'game_status']
Now, let's import pyquery and re. A regular expression will be used to split apart the date obtained from the page source:
from pyquery import PyQuery as pq
import re
sourceUrl = 'http://www.flyershistory.com/cgi-bin/ml-poffs.cgi'
dataSet = list()
keys = ['year','month','day','game_date','team1', 'team1_score', 'team2', 'team2_score', 'game_status']
def read_url(url):
    """Read the given URL and return a PyQuery object for the page content"""
    pageSource = pq(url)
    return pageSource

if __name__ == '__main__':
    page = read_url(sourceUrl)
Here, read_url() accepts one argument, the link to the page, and returns a PyQuery object of the page source, pageSource. PyQuery fetches the page source for the provided URL automatically. The page source can also be obtained using other libraries, such as urllib, urllib3, requests, or lxml, and then passed to pq() to create the PyQuery object:
tableRows = page.find("h1:contains('AHL Playoff Results') + table tr")
print(" Total rows found :", len(tableRows))
tableRows is a PyQuery object that will be used to traverse the <tr> elements inside the <table> located immediately after the <h1> containing the text AHL Playoff Results; it is obtained using the find() function. As the following output shows, a total of 463 <tr> elements exist, but the actual number of records obtained may be lower, depending on how many rows contain <td> cells with actual data:
Total rows found : 463
Let's do some more processing. Each <tr> element is an item of tableRows and is traversed with the help of the items() method; the exact <td> cells are located by index, and the data they contain is retrieved:
for tr in tableRows.items():
    # a few <tr> rows contain only a single <td>; the condition below skips them
    team1 = tr.find('td').eq(1).text()
    if team1 != '':
        game_date = tr.find('td').eq(0).text()
        dates = re.search(r'(.*)-(.*)-(.*)', game_date)
        team1_score = tr.find('td').eq(2).text()
        team2 = tr.find('td').eq(4).text()
        team2_score = tr.find('td').eq(5).text()
        # game status should be either 'W' or 'L'; if it is not in
        # column 6, read it from column 7
        game_status = tr.find('td').eq(6).text()
        if not re.match(r'[WL]', game_status):
            game_status = tr.find('td').eq(7).text()
        # breaking the date down into year, month, and day
        year = dates.group(3)
        month = dates.group(2)
        day = dates.group(1)
        # expanding two-digit years into exact four-digit values
        if len(year) == 2 and int(year) >= 68:
            year = '19' + year
        elif len(year) == 2 and int(year) < 68:
            year = '20' + year
        else:
            pass  # already a four-digit year
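The date-splitting and year-expansion steps above can be exercised on their own. In the following sketch, the helper name split_game_date is mine and not part of the original script:

```python
import re

def split_game_date(game_date):
    """Split a 'D-Mon-YY' or 'D-Mon-YYYY' date into (year, month, day),
    expanding two-digit years: 68-99 -> 19xx, 00-67 -> 20xx (the AHL
    results start in 1968, so the cut-off at 68 is safe here)."""
    dates = re.search(r'(.*)-(.*)-(.*)', game_date)
    day, month, year = dates.group(1), dates.group(2), dates.group(3)
    if len(year) == 2 and int(year) >= 68:
        year = '19' + year
    elif len(year) == 2 and int(year) < 68:
        year = '20' + year
    return year, month, day

print(split_game_date('3-Apr-68'))    # ('1968', 'Apr', '3')
print(split_game_date('9-May-2008'))  # ('2008', 'May', '9')
```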
So far, the desired data has been collected from the targeted <td> cells, and the year has been formatted; regular expressions have been applied to both game_date and game_status. Finally, the collected values are appended as a list to dataSet:
        # appending the individual record to dataSet
        dataSet.append([year, month, day, game_date, team1, team1_score,
                        team2, team2_score, game_status])

print(" Total Game Status, found :", len(dataSet))
print(dataSet)
The output, showing the total record count and dataSet, is as follows:
Total Game Status, found : 341
[['1968', 'Apr', '3', '3-Apr-68', 'Buff', '2', 'Que', '4', 'W'],
['1968', 'Apr', '5', '5-Apr-68', 'Buff', '1', 'Que', '3', 'W'],
['1968', 'Apr', '9', '9-Apr-68', 'Que', '7', 'Buff', '10', 'L'],
['1968', 'Apr', '10', '10-Apr-68', 'Que', '4', 'Buff', '7', 'L'],
['1968', 'Apr', '12', '12-Apr-68', 'Buff', '1', 'Que', '3', 'W'],
.................
['2008', 'May', '9', '9-May-2008', 'Phantoms', '3', 'Wilkes-Barre', '1', 'L'],
['2009', 'Apr', '16', '16-Apr-09', 'Phantoms', '2', 'Hershey', '4', 'L'],
['2009', 'Apr', '18', '18-Apr-09', 'Phantoms', '2', 'Hershey', '6', 'L'],
['2009', 'Apr', '22', '22-Apr-09', 'Hershey', '2', 'Phantoms', '3', 'L'],
['2009', 'Apr', '24', '24-Apr-09', 'Hershey', '0', 'Phantoms', '1', 'L']]
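As a possible follow-up, dataSet, being a list of lists, can be written straight to CSV with the stdlib csv module. The filename ahl_playoffs.csv and the sample rows below are assumptions for illustration, not part of the original script:

```python
import csv

# Illustrative rows in the same shape as dataSet; the filename
# 'ahl_playoffs.csv' is an assumption, not from the original code.
keys = ['year', 'month', 'day', 'game_date', 'team1', 'team1_score',
        'team2', 'team2_score', 'game_status']
dataSet = [
    ['1968', 'Apr', '3', '3-Apr-68', 'Buff', '2', 'Que', '4', 'W'],
    ['2009', 'Apr', '24', '24-Apr-09', 'Hershey', '0', 'Phantoms', '1', 'L'],
]

with open('ahl_playoffs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(keys)      # header row taken from keys
    writer.writerows(dataSet)  # one CSV line per collected record
```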