Example 3 – extracting AHL Playoff results

In this example, we will be extracting data from American Hockey League (AHL) Playoff results, which are available from http://www.flyershistory.com/cgi-bin/ml-poffs.cgi:

 AHL Playoff results

The preceding URL contains the Playoff results for the AHL. This page presents information about the results in tabular format. The portion of the page source that shows relevant information is shown in the following screenshot:

Page source from http://www.flyershistory.com/cgi-bin/ml-poffs.cgi
The preceding screenshot shows the top and bottom parts of the tabular information from the source URL and reveals two different formats of <tr> in the page source: the number of <td> elements per <tr> differs, with some rows carrying extra information.

With the source format analyzed, it's also necessary to point out that the <td> elements containing the desired values have no attributes that could be used to identify particular table cells. In this case, we can target the position of a <td>, or cell with data, by using CSS pseudo selectors such as td:eq(0) or td:eq(1).

For more information on CSS selectors, please visit Chapter 3, Using LXML, XPath, and CSS Selectors, the Introduction to XPath and CSS selector section, in the CSS Selectors and Pseudo Selectors sub-section.

Since we will be using pyquery for this example, we will use the eq() method, which accepts an index and returns the element at that position. For example, tr.find('td').eq(1).text() searches the chosen PyQuery object, tr, for the td elements, that is, <td>, selects the one with index 1, and returns its text.

Here, we are interested in collecting data for the columns that are listed in keys:

keys = ['year','month','day','game_date','team1', 'team1_score', 'team2', 'team2_score', 'game_status']

Now, let's write the code, importing pyquery and re. Regex will be used to split the date obtained from the page source:

from pyquery import PyQuery as pq
import re

sourceUrl = 'http://www.flyershistory.com/cgi-bin/ml-poffs.cgi'
dataSet = list()
keys = ['year','month','day','game_date','team1', 'team1_score', 'team2', 'team2_score', 'game_status']

def read_url(url):
    """Read the given URL and return a PyQuery object for the page content."""
    pageSource = pq(url)
    return pageSource

if __name__ == '__main__':
    page = read_url(sourceUrl)

 

Here, read_url() accepts one argument, the link to the page, and returns a PyQuery object of the page source, pageSource. PyQuery automatically fetches the page source for the provided URL. The page source can also be obtained using other libraries, such as urllib, urllib3, requests, or lxml, and then passed to create a PyQuery object:

tableRows = page.find("h1:contains('AHL Playoff Results') + table tr")
print(" Total rows found :", len(tableRows))

tableRows is a PyQuery object that will be used to traverse the <tr> elements inside the <table> located immediately after the <h1> containing the text AHL Playoff Results; it is obtained using the find() function. As we can see in the following output, a total of 463 <tr> elements exist, but the actual number of records obtained might be lower, depending on how many rows contain <td> elements with actual data:

Total rows found : 463

Let's do some more processing. Each <tr> or tr element is an item of tableRows and is traversed with the help of the items() method; the exact <td> or td elements are then located by their index and the data they contain is retrieved:

for tr in tableRows.items():
    # a few <tr> contain a single <td>; these are skipped via the condition below
    team1 = tr.find('td').eq(1).text()

    if team1 != '':
        game_date = tr.find('td').eq(0).text()
        dates = re.search(r'(.*)-(.*)-(.*)', game_date)
        team1_score = tr.find('td').eq(2).text()
        team2 = tr.find('td').eq(4).text()
        team2_score = tr.find('td').eq(5).text()

        # check that the game status is either 'W' or 'L'
        game_status = tr.find('td').eq(6).text()
        if not re.match(r'[WL]', game_status):
            game_status = tr.find('td').eq(7).text()

        # breaking the date down into year, month, and day
        year = dates.group(3)
        month = dates.group(2)
        day = dates.group(1)

        # preparing the exact year value
        if len(year) == 2 and int(year) >= 68:
            year = '19' + year
        elif len(year) == 2 and int(year) < 68:
            year = '20' + year

So far, the desired data from the targeted <td> elements has been collected, and the year has been formatted where necessary. Regex has also been applied in the code, to process game_date and game_status. Finally, the collected values are appended as a list to dataSet:

        # appending the individual record to dataSet
        dataSet.append([year, month, day, game_date, team1, team1_score, team2, team2_score, game_status])

print(" Total Game Status, found :", len(dataSet))
print(dataSet)

The output regarding the total record count and dataSet is as follows:

Total Game Status, found : 341

[['1968', 'Apr', '3', '3-Apr-68', 'Buff', '2', 'Que', '4', 'W'],
['1968', 'Apr', '5', '5-Apr-68', 'Buff', '1', 'Que', '3', 'W'],
['1968', 'Apr', '9', '9-Apr-68', 'Que', '7', 'Buff', '10', 'L'],
['1968', 'Apr', '10', '10-Apr-68', 'Que', '4', 'Buff', '7', 'L'],
['1968', 'Apr', '12', '12-Apr-68', 'Buff', '1', 'Que', '3', 'W'],
.................
['2008', 'May', '9', '9-May-2008', 'Phantoms', '3', 'Wilkes-Barre', '1', 'L'],
['2009', 'Apr', '16', '16-Apr-09', 'Phantoms', '2', 'Hershey', '4', 'L'],
['2009', 'Apr', '18', '18-Apr-09', 'Phantoms', '2', 'Hershey', '6', 'L'],
['2009', 'Apr', '22', '22-Apr-09', 'Hershey', '2', 'Phantoms', '3', 'L'],
['2009', 'Apr', '24', '24-Apr-09', 'Hershey', '0', 'Phantoms', '1', 'L']]
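The date-splitting and year-expansion logic used above can also be exercised in isolation, as in the following stdlib-only sketch (split_game_date is a hypothetical helper name, not part of the original script):

```python
import re

def split_game_date(game_date):
    """Split a 'day-month-year' date and expand two-digit years.

    Mirrors the logic above: two-digit years from 68 upward become
    19xx, anything below 68 becomes 20xx.
    """
    dates = re.search(r'(.*)-(.*)-(.*)', game_date)
    day, month, year = dates.group(1), dates.group(2), dates.group(3)
    if len(year) == 2 and int(year) >= 68:
        year = '19' + year
    elif len(year) == 2 and int(year) < 68:
        year = '20' + year
    return [year, month, day]

print(split_game_date('3-Apr-68'))    # ['1968', 'Apr', '3']
print(split_game_date('16-Apr-09'))   # ['2009', 'Apr', '16']
print(split_game_date('9-May-2008'))  # ['2008', 'May', '9']
```

Note that the 68 cutoff matches this dataset, whose records start with the 1968 playoffs; four-digit years such as 2008 pass through unchanged.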