Search Engine in 100 Lines of Code

Let's build on the previous example and create an actual search engine. We can use parse actions to compile the search string into an intermediate data structure, and then use that structure to generate and evaluate Python sets to extract matching items by keyword.

The example data set this time will not be all the wooden, steel, or iron things in the universe that are red, blue, or lime green, but a collection of recipes with their respective ingredients.

These will be defined using simple Python lists:

recipes = "Tuna casserole/Hawaiian pizza/Chicken a la King/"
    "Pepperoni pizza/Baked ham/Tuna melt/Eggs Benedict"
        .split("/")
ingredients = "eggs/pineapple/pizza crust/pepperoni/ham/bacon/"
    "English muffins/noodles/tuna/cream of mushroom soup/chicken/"
    "mixed vegetables/cheese/tomato sauce/mayonnaise/Hollandaise sauce"
        .split("/")

The contents of each recipe will be defined using a list of tuples, each tuple mapping the index of a recipe to the index of one of its ingredients:

recipe_ingredients_map = [
    (0,8),(0,9),(0,7),(1,2),(1,1),(1,4),(2,7),(2,9),(2,10),(2,11),
    (3,3),(3,2),(3,12),(3,13),(4,1),(4,4),(5,6),(5,8),(5,14),(5,12),
    (6,6),(6,0),(6,12),(6,4),(6,15),
    ]

So, recipe 0 ("Tuna casserole") contains ingredients 8, 9, and 7 ("tuna," "cream of mushroom soup," and "noodles"). Not exactly the Cordon Bleu, but enough to get us started.

Our search engine will focus on finding recipes that contain given ingredients, so we'll define a Python dict named recipesByIngredient, which will serve as our database index, to make it easier to find which recipes reference a particular ingredient:

recipesByIngredient = dict((i,[]) for i in ingredients)
for recIndex,ingIndex in recipe_ingredients_map:
    recipesByIngredient[ ingredients[ingIndex] ].append( recipes[recIndex] )

With our basic data reference established, we can work on applying the search string parser to extract data by search string.

The BNF for the search string is exactly as before, and we can reuse the operatorPrecedence implementation as is:

searchExpr ::= searchAnd [ OR searchAnd ]...
searchAnd  ::= searchTerm [ AND searchTerm ]...
searchTerm ::= [NOT] ( single-word | quotedString | '(' searchExpr ')' )

and_ = CaselessLiteral("and")
or_  = CaselessLiteral("or")
not_ = CaselessLiteral("not")
searchTerm = Word(alphanums) | quotedString.setParseAction( removeQuotes )
searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT),
       (and_, 2, opAssoc.LEFT),
       (or_, 2, opAssoc.LEFT),
       ])

The next step is to modify the operation descriptions in the operator-Precedence definition to add parse actions to perform the creation of the search data structure. These parse actions will be different from previous parse actions we have created. Instead of modifying the string tokens or returning new string tokens, these parse actions will be class constructors that take a ParseResults object and return an object that will perform some form of data search evaluation. Each of the three operators NOT, AND, and OR will have its own search evaluation class, SearchNot, SearchAnd, and SearchOr.

Let's abstract the construction of these objects into two classes, BinaryOperation and UnaryOperation. The purpose of these classes will simply be to have an initializer that pulls the correct arguments from the tokens passed to a parse action, and saves them to suitable instance attributes:

class UnaryOperation(object):
    def __init__(self, tokens):
        self.op, self.a = tokens[0]

This simple initializer will extract the operator into an attribute named op, and the argument into attribute a.

class BinaryOperation(object):
    def __init__(self, t):
        self.op = tokens[0][1]
        self.operands = tokens[0][0::2]

The initializer for BinaryOperation looks a little more complicated. Binary operations in operatorPrecedence can return multiple operation/operands if multiple operations of the same precedence level occur together. For instance, a binary operation for addition will parse "a+b+c" as [ 'a', '+', 'b', '+', 'c' ], not [[ 'a', '+', 'b' ], '+', 'c' ], so BinaryOperation needs to be able to recognize and process a chain of like operations, not just a simple a op b operation.

The Python [0::2] slicing notation indicates that the operands will be taken from the list of tokens beginning with the 0th element, and then stepping by 2 until the end of the list. So, this will process a token list such as ['a', '+', 'b', '+', 'c'], and give us the tokens ['a', 'b', 'c'].

For each operator type, we create a handler class that will derive from one of these classes. For now, let's just define handler classes that can display their own repr strings:

class SearchAnd(BinaryOperation):
    def __repr__(self):
        return "AND:(%s)" % (",".join(str(oper) for oper in self.operands))


class SearchOr(BinaryOperation):
    def __repr__(self):
        return "OR:(%s)" % (",".join(str(oper) for oper in self.operands))

class SearchNot(UnaryOperation):
    def __repr__(self):
        return "NOT:(%s)" % str(self.a)

To construct these objects during a parse action, you might consider creating parse actions that look something like:

def makeSearchAnd(tokens):
    searchAnd = SearchAnd(tokens)
    return searchAnd
...

and then pass makeSearchAnd, etc., into the operatorPrecedence call:

searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT, makeSearchNot),
       (and_, 2, opAssoc.LEFT, makeSearchAnd),
       (or_, 2, opAssoc.LEFT, makeSearchOr),
       ])

In fact, the classes themselves are sufficient objects to pass as parse actions, since their initializers already have the proper argument lists to accept the arguments passed to a parse action. So, we can discard the makeSearchXXX methods, and just pass the classes directly as the parse actions for operatorPrecedence:

searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT, SearchNot),
       (and_, 2, opAssoc.LEFT, SearchAnd),
       (or_, 2, opAssoc.LEFT, SearchOr),
       ])

For completeness, we will create one more class, to be added as a parse action for the base operand search terms themselves:

class SearchTerm(object):
    def __init__(self, tokens):
        self.term = tokens[0]
    def __repr__(self):
        return self.term

searchTerm.setParseAction( SearchTerm )

If we add these classes and the modified operatorPrecedence call to the existing parser code, we can visualize the search expressions data structures now being built. Once again, we will need some sample search strings to test the parser:

pineapple
pineapple and 'pizza crust'
noodles and tuna or ham
noodles and (tuna or ham)
pineapple and noodles
tuna or ham or pineapple
tuna or (ham and not pineapple)
not tuna
not (pineapple or tuna)

Since our returned objects implement suitable repr methods, we can just print the ParseResults object returned from calling parseString, and see the data structures that are created:

pineapple -> pineapple
pineapple and 'pizza crust' -> AND:(pineapple,pizza crust)
noodles and tuna or ham -> OR:(AND:(noodles,tuna),ham)
noodles and (tuna or ham) -> AND:(noodles,OR:(tuna,ham))
pineapple and noodles -> AND:(pineapple,noodles)
tuna or ham or pineapple -> OR:(tuna,ham,pineapple)
tuna or (ham and not pineapple) -> OR:(tuna,AND:(ham,NOT:(pineapple)))
not tuna -> NOT:(tuna)
not (pineapple or tuna) -> NOT:(OR:(pineapple,tuna))

Now that we are parsing the search string into a data structure of objects, we can put those objects to work searching for matching data.

The design we will implement will be very similar to the one used in the Venn diagrams that illustrated the search results matching the test strings in the first search string example. Each region in the diagram represents a set of objects. We will use Python sets to model these separate groups. For each SearchXXX class, we will generate a fragment Python set expression, using Python set operations such as & for intersection, | for union, and - for negation. The concatenation of these fragments will give us an overall set expression, which can be evaluated using the Python eval command. The results of the evaluated set expression will be the desired results of the original search string. We will in essence have compiled the search string into a Python expression, which we then execute to get the requested search results.

SearchTerm is the simplest to implement and will set the stage for the more complex operations. When a term such as "pineapple" is included in the search string, we would like to select all of the recipes that include "pineapple" as one of their ingredients. Given the in-memory mini-database we created at the beginning of this section, we can find the set of all recipes directly from the global dict variable recipesByIngredient:

set( recipesByIngredient['pineapple'] )

We should also guard against searches for ingredients that are not in the database so that if a term is not in recipesByIngredient, we return an empty set.

So, the implementation of generateSetExpression for SearchTerm is:

    def generateSetExpression(self):
        if self.term in recipesByIngredient:
            return "set(recipesByIngredient['%s'])" % self.term
        else:
            return "set()"

SearchAnd and SearchOr are almost as easy. Since these binary operations have an operands attribute initialized to contain the terms or subexpressions that are to be "and"ed or "or"ed, the generateSetExpression method for SearchAnd and SearchOr can be the results of the operands' generateSetExpression methods, joined by & and |, respectively:

# class SearchAnd
    def generateSetExpression(self):
        return "(%s)" % 
                  " & ".join(oper.generateSetExpression() for oper in self.operands)

# class SearchOr
    def generateSetExpression(self):
        return "(%s)" % 
                  " | ".join(oper.generateSetExpression() for oper in self.operands)

SearchNot is a bit more difficult. The set expression for the set of items that are not in set X is the set of all items minus X. In Python, we would implement ∼X as:

set(everything) - set(X)

To implement generateSetExpression for SearchNot, we will return the negation of the operand's set expression as the operand's set subtracted from the set of all recipes:

# class SearchNot
    def generateSetExpression(self):
        return "(set(recipes) - %s)" % self.a.generateSetExpression()

That completes the work on the search classes. Let's rerun the tests to see how the generated set expressions look:

pineapple ->
     set(recipesByIngredient['pineapple'])

pineapple and 'pizza crust' ->
     (set(recipesByIngredient['pineapple']) & set(recipesByIngredient['pizza crust']))

noodles and tuna or ham ->
     ((set(recipesByIngredient['noodles']) & set(recipesByIngredient['tuna']))
| set(recipesByIngredient['ham']))

noodles and (tuna or ham) ->
     (set(recipesByIngredient['noodles']) & (set(recipesByIngredient['tuna']) |
set(recipesByIngredient['ham'])))

pineapple and noodles ->
     (set(recipesByIngredient['pineapple']) & set(recipesByIngredient['noodles']))

tuna or ham or pineapple ->
     (set(recipesByIngredient['tuna']) | set(recipesByIngredient['ham']) |
set(recipesByIngredient['pineapple']))

tuna or (ham and not pineapple) ->
     (set(recipesByIngredient['tuna']) | (set(recipesByIngredient['ham']) &
(set(recipes) - set(recipesByIngredient['pineapple']))))

not tuna ->
     (set(recipes) - set(recipesByIngredient['tuna']))

not (pineapple or tuna) ->
     (set(recipes) - (set(recipesByIngredient['pineapple']) |
set(recipesByIngredient['tuna'])))

anchovies ->
     set()

(I added the last search string as a test of the nonexistent ingredient search.)

The only remaining step for each search string is to evaluate its Python set expression, and list out the resulting items. The following code lists the complete parser, including test data initialization and test search strings:

from pyparsing import *

# populate ingredients->recipes "database"
recipes = "Tuna casserole/Hawaiian pizza/Chicken a la King/"
    "Pepperoni pizza/Baked ham/Tuna melt/Eggs Benedict"
        .split("/")
ingredients = "eggs/pineapple/pizza crust/pepperoni/ham/bacon/"
    "English muffins/noodles/tuna/cream of mushroom soup/chicken/"
    "mixed vegetables/cheese/tomato sauce/mayonnaise/Hollandaise sauce"
        .split("/")
recipe_ingredients_map = [
    (0,8),(0,9),(0,7),(1,2),(1,1),(1,4),(2,7),(2,9),(2,10),(2,11),
    (3,3),(3,2),(3,12),(3,13),(4,1),(4,4),(5,6),(5,8),(5,14),(5,12),
    (6,6),(6,0),(6,12),(6,4),(6,15),
    ]
recipesByIngredient = dict((i,[]) for i in ingredients)
for recIndex,ingIndex in recipe_ingredients_map:
    recipesByIngredient[ ingredients[ingIndex] ].append( recipes[recIndex] )

# classes to be constructed at parse time, from intermediate ParseResults
class UnaryOperation(object):
    def __init__(self, t):
        self.op, self.a = t[0]

class BinaryOperation(object):
    def __init__(self, t):
        self.op = t[0][1]
        self.operands = t[0][0::2]

class SearchAnd(BinaryOperation):
    def generateSetExpression(self):
        return "(%s)" % 
                  " & ".join(oper.generateSetExpression() for oper in self.operands)
    def __repr__(self):
        return "AND:(%s)" % (",".join(str(oper) for oper in self.operands))

class SearchOr(BinaryOperation):
    def generateSetExpression(self):
        return "(%s)" % 
                  " | ".join(oper.generateSetExpression() for oper in self.operands)
    def __repr__(self):
        return "OR:(%s)" % (",".join(str(oper) for oper in self.operands))

class SearchNot(UnaryOperation):
    def generateSetExpression(self):
        return "(set(recipes) - %s)" % self.a.generateSetExpression()
    def __repr__(self):
        return "NOT:(%s)" % str(self.a)

class SearchTerm(object):
    def __init__(self, tokens):
        self.term = tokens[0]
    def generateSetExpression(self):
        if self.term in recipesByIngredient:
            return "set(recipesByIngredient['%s'])" % self.term
        else:
            return "set()"
    def __repr__(self):
        return self.term

# define the grammar
and_ = CaselessLiteral("and")
or_  = CaselessLiteral("or")
not_ = CaselessLiteral("not")
searchTerm = Word(alphas) | quotedString.setParseAction( removeQuotes )
searchTerm.setParseAction(SearchTerm)
searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT, SearchNot),
       (and_, 2, opAssoc.LEFT, SearchAnd),
       (or_, 2, opAssoc.LEFT, SearchOr),
       ])

# test the grammar and selection logic
test = """
    pineapple
    pineapple and 'pizza crust'
    noodles and tuna or ham
    noodles and (tuna or ham)
    pineapple and noodles
    tuna or ham or pineapple
    tuna or (ham and not pineapple)
    not tuna
    not (pineapple or tuna)
    anchovies""".splitlines()

for t in test:
    try:
        evalStack = (searchExpr + stringEnd).parseString(t)[0]
    except ParseException, pe:
        print "Invalid search string"
        continue
    print "Search string:", t
    # print "Eval stack:    ", evalStack
    evalExpr = evalStack.generateSetExpression()
    # print "Eval expr:    ", evalExpr
    matchingRecipes = eval(evalExpr)
    if matchingRecipes:
        for r in matchingRecipes: print "-", r
    else:
        print "  (none)"
    print

And here are the test results:

Search string:     pineapple
- Baked ham
- Hawaiian pizza

Search string:     pineapple and 'pizza crust'
- Hawaiian pizza

Search string:     noodles and tuna or ham
- Baked ham
- Eggs Benedict
- Hawaiian pizza
- Tuna casserole

Search string:     noodles and (tuna or ham)
- Tuna casserole

Search string:     pineapple and noodles
  (none)

Search string:     tuna or ham or pineapple
- Eggs Benedict
- Tuna melt
- Hawaiian pizza
- Tuna casserole
- Baked ham

Search string:     tuna or (ham and not pineapple)
- Eggs Benedict
- Tuna melt
- Tuna casserole

Search string:     not tuna
- Pepperoni pizza
- Baked ham
- Eggs Benedict
- Hawaiian pizza
- Chicken a la King

Search string:     not (pineapple or tuna)
- Pepperoni pizza
- Eggs Benedict
- Chicken a la King

Search string:     anchovies
  (none)

At the end, we find that we have created a complete search engine in less than 100 lines of code—time to go chase some venture capital!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.239.226