Slicing and Dicing Recipes (for the Health of It)

Since Google’s Rich Snippets initiative took off, awareness of microformats has grown steadily, and many of the most popular foodie websites have made solid progress in exposing recipes and reviews with hRecipe and hReview. Consider the potential for a fictitious online dating service that crawls blogs and other social hubs, attempting to pair people together for dinner dates. One could reasonably expect that having access to enough geo and hRecipe information linked to specific people would make a real difference in the “success rate” of first dates. People could be paired according to two criteria: how close they live to one another and what kinds of foods they eat. For example, you might expect a dinner date between two individuals who prefer to cook vegetarian meals with organic ingredients to go a lot better than a date between a BBQ lover and a vegan. Dining preferences, and whether specific allergens or organic ingredients are used, could be useful clues to power the right business idea. While we won’t be trying to launch a new online dating service, we’ll get the ball rolling in case you decide to take this idea and run with it.

The Food Network is one of many websites that have embraced microformat initiatives for the betterment of the entire Web, exposing recipe information in the hRecipe microformat and using the hReview microformat for reviews of the recipes.[11] This section demonstrates how search engines (or you) might parse the structured data out of recipes and reviews contained in Food Network web pages for indexing or analysis. Although we won’t analyze the free text in the recipes or reviews, or permanently store the extracted information, later chapters demonstrate how to do these things if you’re interested. In particular, Chapter 3 introduces CouchDB, a great way to store and share data (and analysis) you extract from microformat-powered web content, and Chapter 7 introduces some fundamentals of natural language processing (NLP) that you can use to gain a deeper understanding of the content in the reviews. (Coincidentally, a recipe even shows up in that chapter.)

An adaptation of Example 2-6 that parses out hRecipe-formatted data is shown in Example 2-7.

Example 2-7. Parsing hRecipe data for a Pad Thai recipe (microformats__foodnetwork_hrecipe.py)

# -*- coding: utf-8 -*-

import sys
import urllib2
import json
import HTMLParser
import BeautifulSoup

# Pass in a URL such as
# http://www.foodnetwork.com/recipes/alton-brown/pad-thai-recipe/index.html

url = sys.argv[1]

# Parse out some of the pertinent information for a recipe
# See http://microformats.org/wiki/hrecipe


def parse_hrecipe(url):
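    # Fetch the page; report the failure and re-raise with the
    # original traceback intact
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        print 'Failed to fetch ' + url
        raise

    # Build the parse tree; report the failure and re-raise
    try:
        soup = BeautifulSoup.BeautifulSoup(page)
    except HTMLParser.HTMLParseError:
        print 'Failed to parse ' + url
        raise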

    hrecipe = soup.find(True, 'hrecipe')

    if hrecipe and len(hrecipe) > 1:
        fn = hrecipe.find(True, 'fn').string
        author = hrecipe.find(True, 'author').find(text=True)
        ingredients = [i.string
                       for i in hrecipe.findAll(True, 'ingredient')
                       if i.string is not None]

        instructions = []
        for i in hrecipe.find(True, 'instructions'):
            if type(i) == BeautifulSoup.Tag:
                s = ''.join(i.findAll(text=True)).strip()
            elif type(i) == BeautifulSoup.NavigableString:
                s = i.string.strip()
            else:
                continue

            if s:
                instructions.append(s)

        return {
            'name': fn,
            'author': author,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    else:
        return {}


recipe = parse_hrecipe(url)
print json.dumps(recipe, indent=4)

For a sample URL such as Alton Brown’s acclaimed Pad Thai recipe, you should get the results shown in Example 2-8.

Example 2-8. Parsed results for the Pad Thai recipe from Example 2-7

{
    "instructions": [
        "Place the tamarind paste in the boiling water and set aside ...", 
        "Combine the fish sauce, palm sugar, and rice wine vinegar in ...", 
        "Place the rice stick noodles in a mixing bowl and cover with ...", 
        "Press the tamarind paste through a fine mesh strainer and add ...", 
        "Place a wok over high heat. Once hot, add 1 tablespoon of the ...", 
        "If necessary, add some more peanut oil to the pan and heat until ..."
    ], 
    "ingredients": [
        "1-ounce tamarind paste", 
        "3/4 cup boiling water", 
        "2 tablespoons fish sauce", 
        "2 tablespoons palm sugar", 
        "1 tablespoon rice wine vinegar", 
        "4 ounces rice stick noodles", 
        "6 ounces Marinated Tofu, recipe follows", 
        "1 to 2 tablespoons peanut oil", 
        "1 cup chopped scallions, divided", 
        "2 teaspoons minced garlic", 
        "2 whole eggs, beaten", 
        "2 teaspoons salted cabbage", 
        "1 tablespoon dried shrimp", 
        "3 ounces bean sprouts, divided", 
        "1/2 cup roasted salted peanuts, chopped, divided", 
        "Freshly ground dried red chile peppers, to taste", 
        "1 lime, cut into wedges"
    ], 
    "name": "Pad Thai", 
    "author": "Recipe courtesy Alton Brown, 2005"
}
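Example 2-7 depends on Python 2 and the BeautifulSoup 3 package. As a rough sketch of the same idea using only the Python 3 standard library, the class below collects the text of elements whose class attribute contains fn, author, or ingredient. The names HRecipeParser, parse_hrecipe_html, and SAMPLE_HTML are hypothetical, and the sample markup is invented for illustration rather than copied from a Food Network page. The sketch also assumes well-formed markup with every tag closed, which real pages do not guarantee.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real pages are larger and messier
SAMPLE_HTML = """
<div class="hrecipe">
  <h1 class="fn">Pad Thai</h1>
  <p class="author">Recipe courtesy Alton Brown, 2005</p>
  <ul>
    <li class="ingredient">2 tablespoons fish sauce</li>
    <li class="ingredient">4 ounces rice stick noodles</li>
  </ul>
</div>
"""


class HRecipeParser(HTMLParser):
    """Collect the text of elements classed fn, author, or ingredient."""

    FIELDS = ('fn', 'author', 'ingredient')

    def __init__(self):
        super().__init__()
        self.found = {field: [] for field in self.FIELDS}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # tag nesting depth inside the captured element
        self._text = []     # text fragments seen so far for that element

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: track nesting so we know when it ends.
            # Assumes nested tags are closed (an unclosed <br> would
            # throw the depth count off).
            self._depth += 1
            return
        classes = (dict(attrs).get('class') or '').split()
        for field in self.FIELDS:
            if field in classes:
                self._field, self._depth, self._text = field, 1, []
                break

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = ''.join(self._text).strip()
            if text:
                self.found[self._field].append(text)
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)


def parse_hrecipe_html(html):
    """Return name, author, and ingredients found in an HTML string."""
    parser = HRecipeParser()
    parser.feed(html)
    parser.close()
    found = parser.found
    return {
        'name': found['fn'][0] if found['fn'] else None,
        'author': found['author'][0] if found['author'] else None,
        'ingredients': found['ingredient'],
    }


recipe = parse_hrecipe_html(SAMPLE_HTML)
print(json.dumps(recipe, indent=4))
```

Unlike Example 2-7, this sketch does not restrict the search to descendants of the hrecipe element and does not extract instructions, so treat it as a starting point rather than a replacement.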

Although it’s not really a form of social analysis, it could be interesting to analyze variations of the same recipe and see whether there are any correlations between the appearance or lack of certain ingredients and ratings/reviews for the recipes. For example, you might try to pull down a few different Pad Thai recipes and determine which ingredients are common to all recipes and which are less common.
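One way to sketch that comparison, written for Python 3, is to reduce each ingredient line to a rough name and count how many recipes mention it. The helper names normalize and split_common, the UNITS set, and the second recipe's ingredient list are all hypothetical. The normalization is deliberately crude (it drops digits, a few unit words, and anything after the first comma), so it is an assumption about what counts as "the same ingredient" rather than a robust parser.

```python
import re
from collections import Counter

# Hypothetical, deliberately small list of words to ignore
UNITS = {'cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon',
         'teaspoons', 'ounce', 'ounces', 'to'}


def normalize(ingredient):
    """Crudely reduce '2 tablespoons fish sauce' to 'fish sauce'."""
    text = ingredient.lower().split(',')[0]   # drop trailing preparation notes
    words = re.findall(r'[a-z]+', text)       # keep alphabetic words only
    return ' '.join(w for w in words if w not in UNITS)


def split_common(recipes):
    """Return (names in every recipe, names in only some recipes)."""
    counts = Counter()
    for recipe in recipes:
        names = set(normalize(i) for i in recipe.get('ingredients', []))
        names.discard('')
        counts.update(names)
    total = len(recipes)
    common = sorted(n for n, c in counts.items() if c == total)
    less_common = sorted(n for n, c in counts.items() if c < total)
    return common, less_common


# The first list reuses lines from Example 2-8; the second is invented
recipes = [
    {'name': 'Pad Thai',
     'ingredients': ['2 tablespoons fish sauce',
                     '4 ounces rice stick noodles',
                     '1 tablespoon dried shrimp']},
    {'name': 'Hypothetical Pad Thai variation',
     'ingredients': ['3 tablespoons fish sauce',
                     '8 ounces rice stick noodles',
                     '1/2 cup chicken broth']},
]

common, less_common = split_common(recipes)
print('Common to all recipes:', common)
print('Less common:', less_common)
```

With dictionaries shaped like the output of parse_hrecipe in Example 2-7, the same functions could be pointed at recipes parsed from several pages, and the counts joined with ratings once review data is available.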



[11] In mid-2010, the Food Network implemented hReview in much the same fashion as Yelp, which is introduced in the next section; however, as of early January 2011, the Food Network’s implementation changed to include only hreview-aggregate.
