Collecting Restaurant Reviews

This section concludes our study of microformats (and Thai food) by briefly introducing hReview. Yelp is a popular service that implements hReview to expose the ratings customers have left for restaurants. Example 2-9 demonstrates how to extract hReview information as implemented by Yelp. The sample code includes a URL you might try; it points to a Thai restaurant you definitely don’t want to miss if you ever have the opportunity to visit.

Warning

Although the spec is pretty stable, hReview implementations seem to vary and include arbitrary deviations. In particular, Example 2-9 does not parse the reviewer as an hCard because Yelp’s implementation did not include it as such.

Example 2-9. Parsing hReview data for a Thai restaurant (microformats__yelp_hreview.py)

# -*- coding: utf-8 -*-

import sys
import re
import urllib2
import json
import HTMLParser
from BeautifulSoup import BeautifulSoup

# Pass in a URL that contains hReview info such as
# http://www.yelp.com/biz/bangkok-golden-fort-washington-2

url = sys.argv[1]

# Parse out some of the pertinent information for a Yelp review.
# Unfortunately, the quality of hReview implementations varies
# widely, so your mileage may vary. This code is *not* a spec
# parser by any stretch. See http://microformats.org/wiki/hreview

def parse_hreviews(url):
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print 'Failed to fetch ' + url
        raise e

    try:
        soup = BeautifulSoup(page)
    except HTMLParser.HTMLParseError, e:
        print 'Failed to parse ' + url
        raise e

    hreviews = soup.findAll(True, 'hreview')

    all_hreviews = []
    for hreview in hreviews:
        if hreview and len(hreview) > 1:

            # As of 1 Jan 2010, Yelp does not mark up the reviewer as an
            # hCard, as the spec calls for, so just grab the node's text

            reviewer = hreview.find(True, 'reviewer').text

            dtreviewed = hreview.find(True, 'dtreviewed').text
            rating = hreview.find(True, 'rating').find(True, 'value-title')['title']
            description = hreview.find(True, 'description').text

            all_hreviews.append({
                'reviewer': reviewer,
                'dtreviewed': dtreviewed,
                'rating': rating,
                'description': description,
                })
    return all_hreviews

reviews = parse_hreviews(url)

# Do something interesting like plot out reviews over time
# or mine the text in the descriptions...

print json.dumps(reviews, indent=4)

Truncated sample results for Example 2-9 are shown in Example 2-10. The reviewer appears as a plain text string rather than as its own object because, as noted earlier, Yelp’s markup does not expose it as an hCard.

Example 2-10. Sample hReview results corresponding to Example 2-9

[
    {
        "reviewer": "Nick L.", 
        "description": "Probably the best Thai food in the metro area...", 
        "dtreviewed": "4/27/2009", 
        "rating": "5"
    }, 

    ...truncated...

]

Unfortunately, neither Yelp, nor the Food Network, nor anyone else provides specific enough information to tie reviewers together in very meaningful ways, but hopefully that will change soon, opening up additional possibilities for the social web. In the meantime, you might make the most of the data you have available and plot the average rating for a restaurant over time to see whether it has improved or declined. Another idea might be to mine the text in the description fields. See Chapters 7 and 8 for some fodder on how that might work.
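As a rough illustration of the rating-over-time idea, the following sketch groups reviews by month and averages the ratings. It assumes a list shaped like the output of Example 2-9, with dtreviewed values in the month/day/year form shown in Example 2-10; the function name average_rating_by_month and every sample record other than the first are hypothetical, and the code is written for a current Python interpreter rather than the Python 2 used in the listing.

```python
from collections import defaultdict
from datetime import datetime


def average_rating_by_month(reviews):
    """Return [((year, month), mean_rating), ...] sorted chronologically."""
    buckets = defaultdict(list)
    for review in reviews:
        try:
            # Assumes dates look like "4/27/2009", as in Example 2-10
            reviewed = datetime.strptime(review['dtreviewed'], '%m/%d/%Y')
            rating = float(review['rating'])
        except (KeyError, TypeError, ValueError):
            # Skip records with missing or unparseable fields
            continue
        buckets[(reviewed.year, reviewed.month)].append(rating)
    return [(key, sum(vals) / len(vals)) for key, vals in sorted(buckets.items())]


# Hypothetical sample data; only the first record comes from Example 2-10
sample_reviews = [
    {'reviewer': 'Nick L.', 'dtreviewed': '4/27/2009', 'rating': '5'},
    {'reviewer': 'Hypothetical A.', 'dtreviewed': '4/2/2009', 'rating': '4'},
    {'reviewer': 'Hypothetical B.', 'dtreviewed': '6/15/2009', 'rating': '3'},
    {'reviewer': 'Hypothetical C.', 'dtreviewed': 'unknown', 'rating': '5'},
]

for (year, month), mean in average_rating_by_month(sample_reviews):
    print('%d-%02d  %.2f' % (year, month, mean))
```

The resulting (month, average) pairs can be handed to whatever plotting tool or spreadsheet you prefer. Records with malformed dates or ratings are skipped here rather than guessed at, which seems the safer default given how much hReview implementations vary.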

Note

For brevity, the grunt work in massaging the JSON data into a single list of output and loading it into a spreadsheet isn’t shown here, but a spreadsheet of the data is available for download if you’re feeling especially lazy.
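If you would rather do that grunt work yourself, one possible approach is to write the reviews out as CSV, which any spreadsheet program can open. This is only a sketch, not the process used to produce the downloadable spreadsheet: the function name reviews_to_csv and the second sample record are hypothetical, the column list simply mirrors the keys produced by Example 2-9, and the code targets a current Python interpreter.

```python
import csv
import io

# Column order mirrors the keys emitted by Example 2-9
FIELDS = ['reviewer', 'dtreviewed', 'rating', 'description']


def reviews_to_csv(reviews, out):
    """Write a list of review dicts to the file-like object `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS,
                            extrasaction='ignore',  # drop unexpected keys
                            restval='')             # blank for missing keys
    writer.writeheader()
    for review in reviews:
        writer.writerow(review)


# First record is from Example 2-10; the second is hypothetical
sample_reviews = [
    {'reviewer': 'Nick L.', 'dtreviewed': '4/27/2009', 'rating': '5',
     'description': 'Probably the best Thai food in the metro area...'},
    {'reviewer': 'Hypothetical A.', 'dtreviewed': '4/2/2009', 'rating': '4',
     'description': 'Good, but slow'},
]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open('reviews.csv', 'w', newline='') instead
buf = io.StringIO()
reviews_to_csv(sample_reviews, buf)
print(buf.getvalue())
```

The csv module quotes any field that contains a comma, so free-form review text survives the round trip into a spreadsheet.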

There’s no limit to the innovation that can happen when you combine geeks and food data, as evidenced by the popularity of the recently published Cooking for Geeks, also from O’Reilly. As the capabilities of food sites evolve to provide additional APIs, so will the innovations that we see in this space.
