Chapter 7. Putting It All Together

We have come a long way together: learning Python data visualization development, handling data, and creating charts with the pygal library. Now it's time to put these skills to work. Our goal for this chapter is to use what we have learned to create a chart with dynamic data from the Web. In this chapter, we will cover the following topics:

  • Import two months of RSS blog posts from the Web
  • Format our RSS data for a new bar chart's dataset
  • Build a simple bar chart to display blog posts for the past two months, passing in the number of posts
  • Create a main application script to handle the execution and separate our code into modules, which we will import into our main script

Chart usage for a blog

We are going to start with the RSS feed from Packt Publishing and create a chart using data from that feed. Specifically, the chart will show how many articles are posted in a month. At this point, we are familiar with parsing XML from a location on the Web using HTTP, so we will use that skill to build our own dataset from the feed.

To accomplish this, we will need to perform the following tasks:

  • Find out how many posts are made in a given month
  • Filter the count of posts for each month
  • Finally, generate a chart based on this data for both months

As with any programming task, the key to both ease of writing and code maintainability is breaking the job into smaller tasks. For this, we will break up the work into Python modules.

Getting our data in order

Our first task is to pull the RSS feed from the Packt Publishing website into our Python code. For this, we can reuse our Python HTTP and XML parser example from Chapter 6, Importing Dynamic Data, but instead of grabbing the text of each title node, we will grab the date from pubDate. The pubDate element is the RSS convention for indicating the publication date and time of an item in an XML-based RSS feed.

Let's modify our code from Chapter 6, Importing Dynamic Data, and grab the pubDate object. We will create a new Python script file and call it LoadRssFeed.py. Then, we will use this code in our editor of choice:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #List of post dates.

    news_post_date = root.findall("channel//pubDate")

    '''Iterate in all our searched elements and print the inner text for each.'''
    for date in news_post_date:
        print(date.text)

    #Finally, close our open network connection (once, after the loop).
    response.close()

except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)

Notice that rather than finding all titles, we are now finding pubDate using the XPath path channel//pubDate. We also renamed our list to news_post_date, which helps clarify our code. Let's run the code and see our results:

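
The same lookup can be tried without a network connection. The following is a minimal sketch that runs the channel//pubDate path against a small inline feed; the SAMPLE_RSS text and its dates are made up for illustration, and the real Packt feed will contain different items:

```python
from xml.etree import ElementTree

# Hypothetical two-item feed; the real Packt feed will differ.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <pubDate>Tue, 03 Jun 2014 09:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Fri, 30 May 2014 14:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

root = ElementTree.fromstring(SAMPLE_RSS)

# "channel//pubDate" matches every pubDate at any depth below channel.
post_dates = [node.text for node in root.findall("channel//pubDate")]

for text in post_dates:
    print(text)
```

Each printed line is the raw text of one pubDate element, in the order the items appear in the feed.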

Looking good; we can tell we have a structured date and time format, and we could already filter by what is in the string, but let's dig a little more into this. Like most languages, Python has a time library that can convert strings into time objects. How can we tell whether these values are already date objects? Let's wrap date.text in the type() function so that we print the type of the object rather than its value, like this: print(type(date.text)). Let's rerun the code and take a look at the results:

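
As a rough offline check of the same idea, the sketch below collects the type name of each pubDate value from a one-item feed. The feed text is a made-up sample, and text_types is a name introduced here only for illustration:

```python
from xml.etree import ElementTree

# Hypothetical one-item feed used only for this check.
root = ElementTree.fromstring(
    "<rss><channel><item>"
    "<pubDate>Tue, 03 Jun 2014 09:00:00 +0000</pubDate>"
    "</item></channel></rss>"
)

# Record the type name of each element's text instead of its value.
text_types = [type(date.text).__name__
              for date in root.findall("channel//pubDate")]

print(text_types)
```

The values come back as plain strings, which is why the next step converts them before we try to work with months.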

Converting date strings to dates

Before moving on, we need to convert the strings we take in to usable types in Python; for instance, the publication date we pull from our RSS feed. Wouldn't it be nice to have some ready-made functions to format or search our dates? Well, Python has a built-in module for this called time. Converting strings to time objects is pretty easy, but first we need to add an import statement for time at the top of our code: import time. Next, as the text of our pubDate node is a single string rather than separate pieces that are easy to parse, let's split it into an array using the split() method.

We will also remove any commas in our string using the replace() method. If we run this, the output window shows brackets around each pubDate, with commas between the array items, showing that we successfully split our single string into a string array.
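
To see the replace() and split() steps in isolation, here is a small sketch on a single string. The pub_date value is an assumed sample in the date style used by RSS, not a value taken from the Packt feed:

```python
# Assumed sample value in the style that RSS uses for pubDate.
pub_date = "Tue, 03 Jun 2014 09:00:00 +0000"

# Drop the comma, then break the string apart at each space.
datestr_array = pub_date.replace(',', '').split(' ')

print(datestr_array)
# Index: 0=weekday, 1=day, 2=month, 3=year, 4=time, 5=zone offset
```

Knowing which index holds which piece is what lets us rearrange them into a format of our choosing later on.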

Here, we loop over the list of pubDate elements pulled from the RSS feed and split each one into its time elements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #List of post dates.
    news_post_date = root.findall("channel//pubDate")
    print(news_post_date)
    '''Iterate in all our searched elements and print the inner text for each.'''
    for date in news_post_date:
        '''Create a variable stripping out commas, and generating a new array item for every space.'''
        datestr_array = date.text.replace(',', '').split(' ')
        '''Show each array in the Command Prompt(Terminal).'''
        print(datestr_array)
        
    #Finally, close our open network.
    response.close()
    
except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)

Here, we can see that as we loop through our code, we get a list for each pubDate containing the weekday, day, month, year, time, and so on; this will help us grab specific time values from the RSS feed we've parsed:


Using strptime

The strptime() function (short for string parse time) is found in the time module, and it allows us to create a time value from our string arrays. All we need to do is pass it a date string together with a format string that describes where the month, day, year, and time appear. Let's build a formatted string from the string array created in our for loop, and then create a time variable by passing that string to strptime().
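
As a minimal sketch of that step on its own, the snippet below starts from an assumed sample array (the kind produced by the replace() and split() calls), rebuilds the string, and parses it. It assumes an English locale, since %b matches abbreviated month names such as Jun:

```python
import time

# Assumed sample pieces, as produced by the replace()/split() step.
datestr_array = ['Tue', '03', 'Jun', '2014', '09:00:00', '+0000']

# Rearrange into "Mon DD, YYYY, HH:MM:SS" to match the format string.
formatted_date = "{0} {1}, {2}, {3}".format(
    datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

# Each placeholder in the format string consumes one part of the text.
blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

print(formatted_date)
print(blog_datetime.tm_year, blog_datetime.tm_mon, blog_datetime.tm_mday)
```

The format string has to mirror the text exactly, including the commas and spaces; if the two disagree, strptime() raises a ValueError.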

Take a look at the following code and notice how the for loop over the news_post_date list builds a string array for each date received from the RSS feed, which we then parse into a Python time object with strptime(). Let's go ahead and add the following code and take a look at our results:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2, time
from xml.etree import ElementTree
 
try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()

    #Array of post dates
    news_post_date = root.findall("channel//pubDate")

    #Iterate in all our searched elements and print the inner text for each.
    for date in news_post_date:
        '''Create a variable stripping out commas, and generating a new array item for every space.'''
        datestr_array = date.text.replace(',', '').split(' ')
        '''Create a formatted string to match up with our strptime method.'''
        formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

        #Parse a time object from our string.
        blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")
 
        print(blog_datetime)

    #Finally, close our open network.
    response.close()

except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)

As we can see, each loop through the RSS feed shows a time.struct_time object. The struct_time object allows us to specify which part of the time object we want to work with; let's print only the month to the console:


We can now do this easily by printing blog_datetime.tm_mon, which reads the tm_mon attribute of our struct_time object. For example, here we get the number of the month for each post like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from xml.etree import ElementTree
import time

def get_all_rss_dates():

    '''Create a list, local to our function, to save the month of each post.'''
    month_count = []

    try:
        #Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()
    
        #Array of post dates.
        news_post_date = root.findall("channel//pubDate")
    
        '''Iterate in all our searched elements and print the inner text for each.'''
        for date in news_post_date:
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')
            
            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])
            
            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)
        
        #Finally, close our open network.
        response.close()
    
    except Exception as e:
        '''If we have an issue show a message and alert the user.'''
        print(e)
    for mth in month_count:
        print(mth)

#Call our function.
get_all_rss_dates()

The following screenshot shows the results of our script:


With this output, we can see the number 6, which indicates June, and the number 5, which indicates May. Great! We have now modified our code to consume data from the Web and to output it as the data type we specify. If you're curious about the format strings used with strptime(), you can reference the placeholder list, which I've included in the following table. There is also a detailed list available at https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior.

Placeholder   Description
%a            Abbreviated weekday name
%A            Full weekday name
%b            Abbreviated month name
%B            Full month name
%c            Preferred date and time representation
%C            Century of the date (2000 would return 20, 1900 would return 19)
%d            Day of the month (01 to 31)
%D            Same as %m/%d/%y
%g            Just like %G, but without the century
%G            ISO 8601 week-based year as four digits, such as 2014
%H            Hour in the 24-hour format (00 to 23); most blogs use a 24-hour clock
%I            Hour in the 12-hour format (01 to 12)
%j            Day of the year (001 to 366)
%m            Month (01 to 12)
%M            Minute (00 to 59)
%p            Either AM or PM
%S            Seconds of the time
%T            Time, equal to %H:%M:%S
%W            Week number of the year, with Monday as the first day of the week
%w            Day of the week as a decimal, Sunday=0
%x            Preferred date representation without the time
%X            Preferred time representation without the date
%y            Last two digits of the year (for example, 2014 would return 14)
%Y            Year including the century (if the year is 2014, the output would be 2014)
%Z or %z      %Z gives the name or abbreviation of the time zone (for example, EST); %z gives the offset from UTC (for example, -0500)

Note that some of these placeholders (such as %C, %D, %g, and %T) depend on the platform's C library, so they may not be accepted by strptime() on every system.
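
The placeholders can also be combined to describe a pubDate string directly. The following sketch is an alternative to the split-and-reformat approach used in this chapter, not a replacement for it; the pub_date value is an assumed sample, the trailing zone field is simply dropped rather than interpreted, and an English locale is assumed for %a and %b:

```python
import time

# Assumed sample pubDate value; the real feed's zone suffix may differ.
pub_date = "Tue, 03 Jun 2014 09:00:00 +0000"

# Drop the trailing zone field, then describe the rest with placeholders.
without_zone = pub_date.rsplit(' ', 1)[0]
parsed = time.strptime(without_zone, "%a, %d %b %Y %H:%M:%S")

# strftime() uses the same placeholders in the other direction.
reformatted = time.strftime("%Y-%m-%d %H:%M", parsed)
print(reformatted)
```

Because the time zone is discarded here, this sketch is only suitable when, as in this chapter, we care about the month and not the exact instant.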

Saving the output as a counted array

With our data types in order, we want to count how many posts happen in a given month. For this, we will need to put each post into a grouped list we can use outside our for loop. We can do this by creating an empty array outside the for loop and adding each blog_datetime.tm_mon value to it.

Let's do this in the following code, but first, we will wrap it in a function, as our code files are starting to get a bit large. If you remember, back in Chapter 2, Python Refresher, we wrapped our large code blocks in functions so that we could reuse or clean up our code. We will wrap our code in a function named get_all_rss_dates and call it on the last line. We will also add the month_count array variable before our try/except block, ready for the values we append in our for loop, and then print the month_count array. Let's take a look at what this renders:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from xml.etree import ElementTree
import time

def get_all_rss_dates():
    #Create a list, local to our function, to save the month of each post.
    month_count = []

    try:
        #Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()

        #Array of post dates.
        news_post_date = root.findall("channel//pubDate")

        '''Iterate in all our searched elements and print the inner text for each.'''
        for date in news_post_date:
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')
            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])

            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)

        '''Finally, close our open network (once, after the loop).'''
        response.close()

    except Exception as e:
        '''If we have an issue show a message and alert the user.'''
        print(e)

    print(month_count)

#Call our function
get_all_rss_dates()

The following screenshot shows our list of month numbers. In this case, 5 stands for the month of May and 6 for the month of June (your numbers may differ depending on the month):

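
The same collecting logic can be sketched as a function that takes the feed text as an argument and returns the list, which makes it easy to try without a network connection. The months_from_rss name and the SAMPLE feed are hypothetical and introduced only for this illustration:

```python
import time
from xml.etree import ElementTree

def months_from_rss(xml_text):
    """Return the month number of every pubDate in an RSS string."""
    month_count = []  # local list, filled inside the loop
    root = ElementTree.fromstring(xml_text)
    for date in root.findall("channel//pubDate"):
        parts = date.text.replace(',', '').split(' ')
        formatted = "{0} {1}, {2}, {3}".format(
            parts[2], parts[1], parts[3], parts[4])
        parsed = time.strptime(formatted, "%b %d, %Y, %H:%M:%S")
        month_count.append(parsed.tm_mon)
    return month_count

# Hypothetical three-post feed for illustration.
SAMPLE = (
    "<rss><channel>"
    "<item><pubDate>Tue, 03 Jun 2014 09:00:00 +0000</pubDate></item>"
    "<item><pubDate>Fri, 30 May 2014 14:30:00 +0000</pubDate></item>"
    "<item><pubDate>Thu, 29 May 2014 08:15:00 +0000</pubDate></item>"
    "</channel></rss>"
)

print(months_from_rss(SAMPLE))
```

Returning the list, rather than printing it inside the function, is a design choice that suits the plan of importing this code into a main script later.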

Counting the array

Now that our array is ready to work with, let's count the posts in both June and May. At the time of writing this book, we have seven posts in June and quite a lot in May.

Let's print out the number of blog posts the month of May has had on the Packt Publishing website news feed. To do this, we will use the count() method, which tells us how many times a specific value appears in our array. In this case, 5 is the value we are looking for:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import urllib2 
from xml.etree import ElementTree 
import time

def get_all_rss_dates():

    '''Create a list, local to our function, to save the month of each post.'''
    month_count = []

    try: 
        #Open the file via HTTP. 
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml') 
        tree = ElementTree.parse(response) 
        root = tree.getroot() 
    
        #Array of post dates.
        news_post_date = root.findall("channel//pubDate") 
    
        '''Iterate in all our searched elements and print the inner text for each. '''
        for date in news_post_date: 
            '''Create a variable stripping out commas, and generating a new array item for every space.'''
            datestr_array = date.text.replace(',', '').split(' ')
            
            '''Create a formatted string to match up with our strptime method.'''
            formatted_date = "{0} {1}, {2}, {3}".format(datestr_array[2], datestr_array[1], datestr_array[3], datestr_array[4])
            
            '''Parse a time object from our string.'''
            blog_datetime = time.strptime(formatted_date, "%b %d, %Y, %H:%M:%S")

            '''Add this date's month to our count array'''
            month_count.append(blog_datetime.tm_mon)

        '''Finally, close our open network. '''
        response.close() 

    except Exception as e: 
        '''If we have an issue show a message and alert the user. '''
        print(e)

    print(month_count.count(5))

#Call our function
get_all_rss_dates()

As we can see in the following console output, we get the number of posts that were written in the given month (in the screen and code, this is the month of May):


In our output window, we can see our result was 43 posts for the month of May in 2014. What if we change our count search to June, or rather, 6 in our code? Let's update our code and rerun:


Our output shows 7 as the total blog posts for the month of June. At this point, we've tested our code and now we have a working dataset to display for both May and June.
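
The counting step can be sketched on its own with a short, made-up list of month numbers standing in for month_count. The second half uses collections.Counter from the standard library, which is not the approach taken in this chapter but tallies every month in a single pass:

```python
from collections import Counter

# Made-up month numbers standing in for the month_count list.
month_count = [6, 6, 5, 5, 5, 6, 5]

# count() returns how many times a value appears in the list.
may_posts = month_count.count(5)
june_posts = month_count.count(6)
print(may_posts, june_posts)

# Counter tallies every month in one pass.
posts_per_month = Counter(month_count)
print(posts_per_month[5], posts_per_month[6])
```

Either form gives us one number per month, which is the shape of data a two-bar chart needs.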
