Together we have been through a long process of learning Python data visualization development, handling data, and creating charts using the pygal library. Now it's time to put these skills to work. Our goal for this chapter is to use our knowledge to create a chart with dynamic data from the Web. In this chapter, we will cover the following topics:
We are going to start with the RSS feed from Packt Publishing and create a chart using data from that feed. The chart will show how many articles are posted in a given month. At this point, we are familiar with parsing XML retrieved from a location on the Web over HTTP, so we will then create our own dataset from this feed.
To accomplish this, we will need to perform the following tasks:
As with any programming task, the key to both ease of writing and code maintainability is breaking the job up into smaller tasks. For this, we will break up the work into Python modules.
Our first task is to pull the RSS feed from the Packt Publishing website into our Python code. For this, we can reuse our Python HTTP and XML parser example from Chapter 6, Importing Dynamic Data, but instead of grabbing the text of each title node, we will grab the date from pubDate. The pubDate element is the RSS standard convention for indicating the publication date and time of an item in an XML-based RSS feed.
Let's modify our code from Chapter 6, Importing Dynamic Data, and grab the pubDate element. We will create a new Python script file and call it LoadRssFeed.py. Then, we will enter this code in our editor of choice:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #List of post dates.
    news_post_date = root.findall("channel//pubDate")

    '''Iterate in all our searched elements and print the inner text for each.'''
    for date in news_post_date:
        print(date.text)

    #Finally, close our open network.
    response.close()

except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)
Notice that rather than finding all titles, we are now finding pubDate elements using the path channel//pubDate. We also renamed our list of dates to news_post_date, which helps clarify our code. Let's run the code and see our results:
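To try the same lookup without a network connection, here is a minimal sketch that runs the channel//pubDate path against a small inline feed. It is written for Python 3, and the SAMPLE_RSS document and the date_strings name are made up for illustration; they are not taken from the real Packt feed.

```python
from xml.etree import ElementTree

# Hypothetical sample feed: a made-up stand-in for the real RSS document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <pubDate>Thu, 05 Jun 2014 09:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Wed, 28 May 2014 14:05:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

# Parse the string; the returned root element is <rss>.
root = ElementTree.fromstring(SAMPLE_RSS)

# Find every pubDate element anywhere below the channel element.
news_post_date = root.findall("channel//pubDate")

# Collect the inner text of each element and print it.
date_strings = [date.text for date in news_post_date]
for date_string in date_strings:
    print(date_string)
```

Because the path starts at the root rss element, channel selects its direct child and //pubDate then matches pubDate elements at any depth beneath it.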
Looking good; we can tell we have a structured date and time format, and we could already filter by what is in the string, but let's dig a little more into this. Like most languages, Python has a time library that allows strings to be converted to Python time objects. How can we tell whether these values aren't date objects already? Let's wrap date.text in the type() function inside our print call so that it renders the type of the object rather than the object's value, like this: print type(date.text). Let's rerun the code and take a look at the results:
Before moving on, we need to convert any string we take in to a usable type in Python; for instance, the publication date we pull in from our RSS feed. Wouldn't it be nice to have some ready-made functions to format or search our dates? Well, Python has a built-in module for this called time. Converting strings to time objects is pretty easy, but first, we need to add an import statement for time at the top of our code, that is, import time. Next, as our pubDate node is one single string rather than separate pieces that we can easily parse, let's split it up into an array using the split() method.
We will remove any commas in our string using the replace() method. If we run this, our output window will show brackets around each pubDate with commas between the array items, showing that we successfully split our single string into a string array.
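As a quick illustration of the two steps just described, the following sketch applies replace() and split() to a single date string. The pub_date value is a hypothetical sample in the usual RSS date layout, not a value read from the feed.

```python
# Hypothetical sample value standing in for the text of one pubDate element.
pub_date = "Thu, 05 Jun 2014 09:30:00 +0000"

# Strip the comma after the weekday, then split on single spaces.
datestr_array = pub_date.replace(',', '').split(' ')

print(datestr_array)
# ['Thu', '05', 'Jun', '2014', '09:30:00', '+0000']
```

With this layout, index 1 holds the day, index 2 the month abbreviation, index 3 the year, and index 4 the time of day.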
Here, we loop over the pubDate elements pulled from the RSS feed and split each one into its separate time elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #List of post dates.
    news_post_date = root.findall("channel//pubDate")
    print(news_post_date)

    '''Iterate in all our searched elements and print the inner text for each.'''
    for date in news_post_date:
        '''Create a variable stripping out commas, and generating a new array item for every space.'''
        datestr_array = date.text.replace(',', '').split(' ')

        '''Show each array in the Command Prompt(Terminal).'''
        print(datestr_array)

    #Finally, close our open network.
    response.close()

except Exception as e:
    '''If we have an issue show a message and alert the user.'''
    print(e)
Here, we can see that as we loop through our code, we get a list for each pubDate holding the day, month, year, time, and so on; this will help us grab specific time values from the RSS feed we've parsed:
The strptime() function (short for "string parse time") is found in the time module, and it allows us to create a time object from our string arrays. All we need to do is specify the year, month, day, and time in the format we pass to strptime(). Let's create a variable for the string array created in our for loop. Then, we create a time variable with strptime(), formatting it with our string array.
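The following sketch shows this on a single value, using Python 3. The datestr_array contents are a made-up sample of what one split pubDate might look like. Note that %b matches month abbreviations in the current locale, so the sketch assumes an English (or default C) locale.

```python
import time

# Hypothetical split pubDate; index 1 is the day, 2 the month, 3 the year, 4 the time.
datestr_array = ['Thu', '05', 'Jun', '2014', '09:30:00', '+0000']

# Rebuild a string in "month day, year, time" order.
formatted_date = "{0} {1}, {2}, {3}".format(
    datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

# The format string must mirror formatted_date exactly, including the commas.
blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

print(formatted_date)
print(blog_datetime.tm_year, blog_datetime.tm_mon, blog_datetime.tm_mday)
```

If the format string and the date string disagree, even by one comma or space, strptime() raises a ValueError rather than guessing.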
Take a look at the following code and notice how we structured the for loop over the news_post_date list: for each date received from the RSS feed, we build a string from our string array and parse it into a Python time object with the strptime() method. Let's go ahead and add the following code and take a look at our results:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2, time
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #Array of post dates
    news_post_date = root.findall("channel//pubDate")

    #Iterate in all our searched elements and print the inner text for each.
    for date in news_post_date:
        '''Create a variable stripping out commas, and generating a new array item for every space.'''
        datestr_array = date.text.replace(',', '').split(' ')

        '''Create a formatted string to match up with our strptime method.'''
        formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

        #Parse a time object from our string.
        blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")
        print blog_datetime

    #Finally, close our open network.
    response.close()

except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)
As we can see, each pass through the loop shows a time.struct_time object. The struct_time object allows us to pick out which part of the time we want to work with; let's print only the month to the console:
We can now do this easily by printing blog_datetime.tm_mon, which references the tm_mon named attribute of our struct_time object. For example, here we get the number of the month for each post like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree
import time

def get_all_rss_dates():
    '''Create a list inside our function to save our month counts.'''
    month_count = []

    try:
        #Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()

        #Array of post dates.
        news_post_date = root.findall("channel//pubDate")

        '''Iterate in all our searched elements and print the inner text for each.'''
        for date in news_post_date:
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')

            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)

        #Finally, close our open network.
        response.close()

    except Exception as e:
        '''If we have an issue show a message and alert the user.'''
        print(e)

    for mth in month_count:
        print(mth)

#Call our function.
get_all_rss_dates()
The following screenshot shows the results of our script:
With this output, we can see the number 6, which indicates June, and the number 5, which indicates May. Great! We have now modified our code to consume data from the Web and display it using the data type we specified. If you're curious about the string formats used for blog_datetime, you can reference the string format index, which I've included in the following table. There is also a detailed list available at https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior.
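To get a feel for the format codes, here is a small sketch (Python 3) that parses one made-up date string and then formats the resulting struct_time back out with time.strftime(), which uses the same codes. The sample string and the variable names other than blog_datetime are illustrative assumptions.

```python
import time

# Parse a hypothetical date string using the same format as our feed code.
blog_datetime = time.strptime("Jun 05, 2014, 09:30:00", "%b %d, %Y, %H:%M:%S")

# Named attributes give direct access to each field as a number.
month_number = blog_datetime.tm_mon

# strftime() applies the same codes in the other direction.
month_name = time.strftime("%B", blog_datetime)               # full month name (locale dependent)
iso_style = time.strftime("%Y-%m-%d %H:%M:%S", blog_datetime)  # numeric year-month-day and time

print(month_number)
print(month_name)
print(iso_style)
```

Codes such as %b and %B depend on the active locale, whereas numeric codes such as %m, %d, and %Y do not.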
With our data types in order, we want to count how many posts happen in a given month. For this, we will need to put each post's month into a list we can use outside our for loop. We can do this by creating an empty array outside the for loop and adding each blog_datetime.tm_mon value to our array.
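The pattern can be sketched on its own, without the network call, as follows (Python 3). The sample_pub_dates list and the months_from_pub_dates helper are hypothetical names introduced only for this illustration.

```python
import time

# Hypothetical pubDate strings standing in for the text of each feed element.
sample_pub_dates = [
    "Thu, 05 Jun 2014 09:30:00 +0000",
    "Mon, 02 Jun 2014 16:45:00 +0000",
    "Wed, 28 May 2014 14:05:00 +0000",
]

def months_from_pub_dates(pub_dates):
    """Return the month number of each date string, in order."""
    # The list is created before the loop so it is still available afterwards.
    month_count = []
    for pub_date in pub_dates:
        datestr_array = pub_date.replace(',', '').split(' ')
        formatted_date = "{0} {1}, {2}, {3}".format(
            datestr_array[2], datestr_array[1],
            datestr_array[3], datestr_array[4])
        blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")
        month_count.append(blog_datetime.tm_mon)
    return month_count

month_count = months_from_pub_dates(sample_pub_dates)
print(month_count)
```

Returning the list instead of printing it inside the helper is a design choice made here so that later steps, such as counting, can reuse the result.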
Let's do this in the following code, but first, we will wrap this in a function as our code files are starting to get a bit large. If you remember, back in Chapter 2, Python Refresher, we wrapped our large code blocks in functions so that we could reuse or clean up our code. We will wrap our code in a function named get_all_rss_dates and call it on the last line. Also, we will add the month_count array variable prior to our try block, ready to have values appended to it in our for loop, and then print the month_count array variable. Let's take a look at what this renders:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree
import time

def get_all_rss_dates():
    #Create a list inside our function to save our month counts.
    month_count = []

    try:
        #Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()

        #Array of post dates.
        news_post_date = root.findall("channel//pubDate")

        '''Iterate in all our searched elements and print the inner text for each.'''
        for date in news_post_date:
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')

            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)

        '''Finally, close our open network.'''
        response.close()

    except Exception as e:
        '''If we have an issue show a message and alert the user.'''
        print(e)

    print month_count

#Call our function
get_all_rss_dates()
The following is a screenshot that shows our list of month numbers. In this case, 5 is for the month of May and 6 is for the month of June (your numbers may differ depending on the month):
Now that our array is ready to work with, let's count the posts in both June and May. At the time of writing this book, we have seven posts in June and quite a lot in May.
Let's print out the number of blog posts the month of May has had on the Packt Publishing website news feed. To do this, we will use the count() method, which lets us count how many times a specific value appears in our array. In this case, 5 is the value we are looking for:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree
import time

def get_all_rss_dates():
    '''Create a list inside our function to save our month counts.'''
    month_count = []

    try:
        #Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()

        #Array of post dates.
        news_post_date = root.findall("channel//pubDate")

        '''Iterate in all our searched elements and print the inner text for each.'''
        for date in news_post_date:
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')

            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)

        '''Finally, close our open network.'''
        response.close()

    except Exception as e:
        '''If we have an issue show a message and alert the user.'''
        print(e)

    print month_count.count(5)

#Call our function
get_all_rss_dates()
As we can see in the following console, we get the number of posts that were written in the given month (in the screen and code, this is the month of May):
In our output window, we can see our result was 43 posts for the month of May in 2014. What if we change our count search to June, or rather, 6 in our code? Let's update our code and rerun:
Our output shows 7 as the total number of blog posts for the month of June. At this point, we've tested our code, and we now have a working dataset to display for both May and June.
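Rather than editing the script and rerunning it for each month, one option is to tally every month in a single pass. The sketch below (Python 3) compares count() with collections.Counter from the standard library; Counter is a swapped-in alternative, not the approach used in this chapter, and the month_count values are a small made-up sample rather than the real feed totals.

```python
from collections import Counter

# Hypothetical month numbers, as collected by our for loop.
month_count = [6, 6, 5, 5, 5]

# count() answers the question for one month at a time.
may_posts = month_count.count(5)
june_posts = month_count.count(6)

# Counter tallies every distinct month in one pass.
posts_per_month = Counter(month_count)

print(may_posts, june_posts)
for month, total in sorted(posts_per_month.items()):
    print(month, total)
```

Both approaches give the same totals; Counter simply saves repeating the count() call once the number of months grows.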