Let's build on the previous example and create an actual search engine. We can use parse actions to compile the search string into an intermediate data structure, and then use that structure to generate and evaluate Python sets to extract matching items by keyword.
The example data set this time will not be all the wooden, steel, or iron things in the universe that are red, blue, or lime green, but a collection of recipes with their respective ingredients.
These will be defined using simple Python lists:
recipes = "Tuna casserole/Hawaiian pizza/Chicken a la King/" "Pepperoni pizza/Baked ham/Tuna melt/Eggs Benedict" .split("/") ingredients = "eggs/pineapple/pizza crust/pepperoni/ham/bacon/" "English muffins/noodles/tuna/cream of mushroom soup/chicken/" "mixed vegetables/cheese/tomato sauce/mayonnaise/Hollandaise sauce" .split("/")
The contents of each recipe will be defined using a list of tuples, each tuple mapping the index of a recipe to the index of one of its ingredients:
recipe_ingredients_map = [ (0,8),(0,9),(0,7),(1,2),(1,1),(1,4),(2,7),(2,9),(2,10),(2,11), (3,3),(3,2),(3,12),(3,13),(4,1),(4,4),(5,6),(5,8),(5,14),(5,12), (6,6),(6,0),(6,12),(6,4),(6,15), ]
So, recipe 0 ("Tuna casserole") contains ingredients 8, 9, and 7 ("tuna," "cream of mushroom soup," and "noodles"). Not exactly the Cordon Bleu, but enough to get us started.
Our search engine will focus on finding recipes that contain given ingredients, so we'll define a Python dict named recipesByIngredient
, which will serve as our database index, to make it easier to find which recipes reference a particular ingredient:
recipesByIngredient = dict((i,[]) for i in ingredients) for recIndex,ingIndex in recipe_ingredients_map: recipesByIngredient[ ingredients[ingIndex] ].append( recipes[recIndex] )
With our basic data reference established, we can work on applying the search string parser to extract data by search string.
The BNF for the search string is exactly as before, and we can reuse the operatorPrecedence
implementation as is:
searchExpr ::= searchAnd [ OR searchAnd ]... searchAnd ::= searchTerm [ AND searchTerm ]... searchTerm ::= [NOT] ( single-word | quotedString | '(' searchExpr ')' ) and_ = CaselessLiteral("and") or_ = CaselessLiteral("or") not_ = CaselessLiteral("not") searchTerm = Word(alphanums) | quotedString.setParseAction( removeQuotes ) searchExpr = operatorPrecedence( searchTerm, [ (not_, 1, opAssoc.RIGHT), (and_, 2, opAssoc.LEFT), (or_, 2, opAssoc.LEFT), ])
The next step is to modify the operation descriptions in the operator-Precedence
definition to add parse actions to perform the creation of the search data structure. These parse actions will be different from previous parse actions we have created. Instead of modifying the string tokens or returning new string tokens, these parse actions will be class constructors that take a ParseResults
object and return an object that will perform some form of data search evaluation. Each of the three operators NOT, AND, and OR will have its own search evaluation class, SearchNot
, SearchAnd
, and SearchOr
.
Let's abstract the construction of these objects into two classes, BinaryOperation
and UnaryOperation
. The purpose of these classes will simply be to have an initializer that pulls the correct arguments from the tokens passed to a parse action, and saves them to suitable instance attributes:
class UnaryOperation(object): def __init__(self, tokens): self.op, self.a = tokens[0]
This simple initializer will extract the operator into an attribute named op
, and the argument into attribute a
.
class BinaryOperation(object): def __init__(self, t): self.op = tokens[0][1] self.operands = tokens[0][0::2]
The initializer for BinaryOperation
looks a little more complicated. Binary operations in operatorPrecedence
can return multiple operation/operands if multiple operations of the same precedence level occur together. For instance, a binary operation for addition will parse "a+b+c"
as [ 'a', '+', 'b', '+', 'c' ]
, not [[ 'a', '+', 'b' ], '+', 'c' ]
, so BinaryOperation
needs to be able to recognize and process a chain of like operations, not just a simple a op b
operation.
The Python [0::2]
slicing notation indicates that the operands will be taken from the list of tokens beginning with the 0th element, and then stepping by 2 until the end of the list. So, this will process a token list such as ['a', '+', 'b', '+', 'c']
, and give us the tokens ['a', 'b', 'c']
.
For each operator type, we create a handler class that will derive from one of these classes. For now, let's just define handler classes that can display their own repr
strings:
class SearchAnd(BinaryOperation): def __repr__(self): return "AND:(%s)" % (",".join(str(oper) for oper in self.operands)) class SearchOr(BinaryOperation): def __repr__(self): return "OR:(%s)" % (",".join(str(oper) for oper in self.operands)) class SearchNot(UnaryOperation): def __repr__(self): return "NOT:(%s)" % str(self.a)
To construct these objects during a parse action, you might consider creating parse actions that look something like:
def makeSearchAnd(tokens): searchAnd = SearchAnd(tokens) return searchAnd ...
and then pass makeSearchAnd
, etc., into the operatorPrecedence
call:
searchExpr = operatorPrecedence( searchTerm, [ (not_, 1, opAssoc.RIGHT, makeSearchNot), (and_, 2, opAssoc.LEFT, makeSearchAnd), (or_, 2, opAssoc.LEFT, makeSearchOr), ])
In fact, the classes themselves are sufficient objects to pass as parse actions, since their initializers already have the proper argument lists to accept the arguments passed to a parse action. So, we can discard the makeSearchXXX
methods, and just pass the classes directly as the parse actions for operatorPrecedence
:
searchExpr = operatorPrecedence( searchTerm, [ (not_, 1, opAssoc.RIGHT, SearchNot), (and_, 2, opAssoc.LEFT, SearchAnd), (or_, 2, opAssoc.LEFT, SearchOr), ])
For completeness, we will create one more class, to be added as a parse action for the base operand search terms themselves:
class SearchTerm(object): def __init__(self, tokens): self.term = tokens[0] def __repr__(self): return self.term searchTerm.setParseAction( SearchTerm )
If we add these classes and the modified operatorPrecedence
call to the existing parser code, we can visualize the search expressions data structures now being built. Once again, we will need some sample search strings to test the parser:
pineapple pineapple and 'pizza crust' noodles and tuna or ham noodles and (tuna or ham) pineapple and noodles tuna or ham or pineapple tuna or (ham and not pineapple) not tuna not (pineapple or tuna)
Since our returned objects implement suitable repr
methods, we can just print the ParseResults
object returned from calling parseString
, and see the data structures that are created:
pineapple -> pineapple pineapple and 'pizza crust' -> AND:(pineapple,pizza crust) noodles and tuna or ham -> OR:(AND:(noodles,tuna),ham) noodles and (tuna or ham) -> AND:(noodles,OR:(tuna,ham)) pineapple and noodles -> AND:(pineapple,noodles) tuna or ham or pineapple -> OR:(tuna,ham,pineapple) tuna or (ham and not pineapple) -> OR:(tuna,AND:(ham,NOT:(pineapple))) not tuna -> NOT:(tuna) not (pineapple or tuna) -> NOT:(OR:(pineapple,tuna))
Now that we are parsing the search string into a data structure of objects, we can put those objects to work searching for matching data.
The design we will implement will be very similar to the one used in the Venn diagrams that illustrated the search results matching the test strings in the first search string example. Each region in the diagram represents a set of objects. We will use Python sets to model these separate groups. For each SearchXXX
class, we will generate a fragment Python set expression, using Python set operations such as &
for intersection, |
for union, and -
for negation. The concatenation of these fragments will give us an overall set expression, which can be evaluated using the Python eval
command. The results of the evaluated set expression will be the desired results of the original search string. We will in essence have compiled the search string into a Python expression, which we then execute to get the requested search results.
SearchTerm
is the simplest to implement and will set the stage for the more complex operations. When a term such as "pineapple" is included in the search string, we would like to select all of the recipes that include "pineapple" as one of their ingredients. Given the in-memory mini-database we created at the beginning of this section, we can find the set of all recipes directly from the global dict variable recipesByIngredient
:
set( recipesByIngredient['pineapple'] )
We should also guard against searches for ingredients that are not in the database so that if a term is not in recipesByIngredient
, we return an empty set.
So, the implementation of generateSetExpression
for SearchTerm
is:
def generateSetExpression(self): if self.term in recipesByIngredient: return "set(recipesByIngredient['%s'])" % self.term else: return "set()"
SearchAnd
and SearchOr
are almost as easy. Since these binary operations have an operands
attribute initialized to contain the terms or subexpressions that are to be "and"ed or "or"ed, the generateSetExpression
method for SearchAnd
and SearchOr
can be the results of the operands' generateSetExpression
methods, joined by &
and |
, respectively:
# class SearchAnd def generateSetExpression(self): return "(%s)" % " & ".join(oper.generateSetExpression() for oper in self.operands) # class SearchOr def generateSetExpression(self): return "(%s)" % " | ".join(oper.generateSetExpression() for oper in self.operands)
SearchNot
is a bit more difficult. The set expression for the set of items that are not in set X is the set of all items minus X. In Python, we would implement ∼X as:
set(everything) - set(X)
To implement generateSetExpression
for SearchNot
, we will return the negation of the operand's set expression as the operand's set subtracted from the set of all recipes:
# class SearchNot def generateSetExpression(self): return "(set(recipes) - %s)" % self.a.generateSetExpression()
That completes the work on the search classes. Let's rerun the tests to see how the generated set expressions look:
pineapple -> set(recipesByIngredient['pineapple']) pineapple and 'pizza crust' -> (set(recipesByIngredient['pineapple']) & set(recipesByIngredient['pizza crust'])) noodles and tuna or ham -> ((set(recipesByIngredient['noodles']) & set(recipesByIngredient['tuna'])) | set(recipesByIngredient['ham'])) noodles and (tuna or ham) -> (set(recipesByIngredient['noodles']) & (set(recipesByIngredient['tuna']) | set(recipesByIngredient['ham']))) pineapple and noodles -> (set(recipesByIngredient['pineapple']) & set(recipesByIngredient['noodles'])) tuna or ham or pineapple -> (set(recipesByIngredient['tuna']) | set(recipesByIngredient['ham']) | set(recipesByIngredient['pineapple'])) tuna or (ham and not pineapple) -> (set(recipesByIngredient['tuna']) | (set(recipesByIngredient['ham']) & (set(recipes) - set(recipesByIngredient['pineapple'])))) not tuna -> (set(recipes) - set(recipesByIngredient['tuna'])) not (pineapple or tuna) -> (set(recipes) - (set(recipesByIngredient['pineapple']) | set(recipesByIngredient['tuna']))) anchovies -> set()
(I added the last search string as a test of the nonexistent ingredient search.)
The only remaining step for each search string is to evaluate its Python set expression, and list out the resulting items. The following code lists the complete parser, including test data initialization and test search strings:
from pyparsing import * # populate ingredients->recipes "database" recipes = "Tuna casserole/Hawaiian pizza/Chicken a la King/" "Pepperoni pizza/Baked ham/Tuna melt/Eggs Benedict" .split("/") ingredients = "eggs/pineapple/pizza crust/pepperoni/ham/bacon/" "English muffins/noodles/tuna/cream of mushroom soup/chicken/" "mixed vegetables/cheese/tomato sauce/mayonnaise/Hollandaise sauce" .split("/") recipe_ingredients_map = [ (0,8),(0,9),(0,7),(1,2),(1,1),(1,4),(2,7),(2,9),(2,10),(2,11), (3,3),(3,2),(3,12),(3,13),(4,1),(4,4),(5,6),(5,8),(5,14),(5,12), (6,6),(6,0),(6,12),(6,4),(6,15), ] recipesByIngredient = dict((i,[]) for i in ingredients) for recIndex,ingIndex in recipe_ingredients_map: recipesByIngredient[ ingredients[ingIndex] ].append( recipes[recIndex] ) # classes to be constructed at parse time, from intermediate ParseResults class UnaryOperation(object): def __init__(self, t): self.op, self.a = t[0] class BinaryOperation(object): def __init__(self, t): self.op = t[0][1] self.operands = t[0][0::2] class SearchAnd(BinaryOperation): def generateSetExpression(self): return "(%s)" % " & ".join(oper.generateSetExpression() for oper in self.operands) def __repr__(self): return "AND:(%s)" % (",".join(str(oper) for oper in self.operands)) class SearchOr(BinaryOperation): def generateSetExpression(self): return "(%s)" % " | ".join(oper.generateSetExpression() for oper in self.operands) def __repr__(self): return "OR:(%s)" % (",".join(str(oper) for oper in self.operands)) class SearchNot(UnaryOperation): def generateSetExpression(self): return "(set(recipes) - %s)" % self.a.generateSetExpression() def __repr__(self): return "NOT:(%s)" % str(self.a) class SearchTerm(object): def __init__(self, tokens): self.term = tokens[0] def generateSetExpression(self): if self.term in recipesByIngredient: return "set(recipesByIngredient['%s'])" % self.term else: return "set()" def __repr__(self): return self.term # define the grammar and_ = CaselessLiteral("and") or_ = CaselessLiteral("or") not_ = CaselessLiteral("not") searchTerm = Word(alphas) | quotedString.setParseAction( removeQuotes ) searchTerm.setParseAction(SearchTerm) searchExpr = operatorPrecedence( searchTerm, [ (not_, 1, opAssoc.RIGHT, SearchNot), (and_, 2, opAssoc.LEFT, SearchAnd), (or_, 2, opAssoc.LEFT, SearchOr), ]) # test the grammar and selection logic test = """ pineapple pineapple and 'pizza crust' noodles and tuna or ham noodles and (tuna or ham) pineapple and noodles tuna or ham or pineapple tuna or (ham and not pineapple) not tuna not (pineapple or tuna) anchovies""".splitlines() for t in test: try: evalStack = (searchExpr + stringEnd).parseString(t)[0] except ParseException, pe: print "Invalid search string" continue print "Search string:", t # print "Eval stack: ", evalStack evalExpr = evalStack.generateSetExpression() # print "Eval expr: ", evalExpr matchingRecipes = eval(evalExpr) if matchingRecipes: for r in matchingRecipes: print "-", r else: print " (none)" print
And here are the test results:
Search string: pineapple - Baked ham - Hawaiian pizza Search string: pineapple and 'pizza crust' - Hawaiian pizza Search string: noodles and tuna or ham - Baked ham - Eggs Benedict - Hawaiian pizza - Tuna casserole Search string: noodles and (tuna or ham) - Tuna casserole Search string: pineapple and noodles (none) Search string: tuna or ham or pineapple - Eggs Benedict - Tuna melt - Hawaiian pizza - Tuna casserole - Baked ham Search string: tuna or (ham and not pineapple) - Eggs Benedict - Tuna melt - Tuna casserole Search string: not tuna - Pepperoni pizza - Baked ham - Eggs Benedict - Hawaiian pizza - Chicken a la King Search string: not (pineapple or tuna) - Pepperoni pizza - Eggs Benedict - Chicken a la King Search string: anchovies (none)
At the end, we find that we have created a complete search engine in less than 100 lines of code—time to go chase some venture capital!
18.223.239.226