Parsing a Search String

The explosion of data made possible by the technologies of the Internet and the World Wide Web has led to the emergence of applications and services for searching and organizing that mass of data. A typical interface to a search service is a string of keywords that is used to retrieve web pages of interest to the searcher. Services such as Google have very simplified search interfaces, in which each separate word is assumed to be a potential keyword, and the search engine will look for pages containing any of the given keywords (perhaps ranking the pages by the number of keywords present on the page).

In this application, I am going to describe a more elaborate search string interface, with support for AND, OR, and NOT keyword qualifiers. Keywords may be single words delimited by whitespace, or a quoted string for keywords that contain spaces or non-alphanumeric characters, or for a search keyword or phrase that includes one of the special qualifier words AND, OR, or NOT. Here are a few sample search phrases for us to parse:

    wood and blue or red
    wood and (blue or red)
    (steel or iron) and "lime green"
    not steel or iron and "lime green"
    not(steel or iron) and "lime green"

describing objects in the simple universe depicted in this figure.

The universe of all possible things

Figure 1. The universe of all possible things

We would also like to have our parser return the parsed results in a hierarchical structure based on the precedence of operations among the AND, OR, and NOT qualifiers. In normal programming practice, the hierarchy of these operations is defined as:

  • NOT has the highest precedence, and is evaluated first

  • AND has the next highest precedence

  • OR has the lowest precedence, and is evaluated last

So, the phrase "wood and blue or red" is interpreted as "retrieve any item that is both wood and blue, OR any item that is red no matter what it is made of."

Things in the universe that are "wood and blue or red"

Figure 2. Things in the universe that are "wood and blue or red"

Parentheses can be used to override this precedence of operations, so that the phrase "wood and (blue or red)" will be interpreted as "only retrieve items that are made of wood AND are either blue or red."

Things in the universe that are "wood and (blue or red)"

Figure 3. Things in the universe that are "wood and (blue or red)"

Following our best practice, here is a BNF for this simple search string parser:

searchExpr ::= searchTerm [ ( AND | OR ) searchTerm ]...
searchTerm ::= [NOT] ( single-word | quotedString | '(' searchExpr ')' )

However, this BNF will not evaluate any precedence between AND and OR expressions. To do this, we must separate the AND and OR operations so that the precedence of evaluation will be represented in the parsed token structure.

To define a BNF that recognizes the precedence of operations, we must define a sequence of expressions in which each operation from lowest precedence to highest is parsed. The uppermost expression will define the operator with the lowest precedence, using as operands an expression that defines the operator with the next lowest precedence, and so on. At parsing time, the parser will recursively evaluate the expressions, so that through recursion, the highest precedence operator found in the input text will be recognized and evaluated first.

In this version of the BNF, AND expressions will be parsed ahead of OR expressions, and NOT expressions will be parsed ahead of AND expressions.

searchExpr ::= searchAnd [ OR searchAnd ]...
searchAnd  ::= searchTerm [ AND searchTerm ]...
searchTerm ::= [NOT] ( single-word | quotedString | '(' searchExpr ')' )

This grammar can be implemented directly into pyparsing expressions:

searchExpr = Forward()
searchTerm = Optional(not_) + ( Word(alphas) |
                                quotedString.setParseAction( removeQuotes ) | 
                                Group( LPAREN + searchExpr + RPAREN ) )
searchAnd = Group( searchTerm + ZeroOrMore( and_ + searchTerm ) )
searchExpr << Group( searchAnd + ZeroOrMore( or_ + searchAnd ) )

Using this grammar, our parsing examples evaluate as:

wood and blue or red
[[['wood', 'and', 'blue'], 'or', ['red']]]

wood and (blue or red)
[[['wood', 'and', [[['blue'], 'or', ['red']]]]]]

(steel or iron) and "lime green"
[[[[[['steel'], 'or', ['iron']]], 'and', 'lime green']]]

not steel or iron and "lime green"
[[['not', 'steel'], 'or', ['iron', 'and', 'lime green']]]

not(steel or iron) and "lime green"
[[['not', [[['steel'], 'or', ['iron']]], 'and', 'lime green']]]

These results can now be evaluated using a depth-first recursive method, and the AND, OR, and NOT expressions will be evaluated in the proper precedence.

Defining and evaluating expressions with operators of various precedence is a common application for parsers, and pyparsing includes a method named operatorPrecedence to simplify the definition of such grammars.

To use operatorPrecedence, you must first define a parse expression for the most basic operand. For this search application, the smallest operand is a search term given as a single word or quoted string:

searchTerm = Word(alphas) | quotedString.setParseAction( removeQuotes ) )

Using this base operand, you then compose the search expression by calling operatorPrecedence with this operand, and a list of operator descriptions. For each operator or set of operators with the same precedence level, you define:

  • An expression to match the operator or set of operators

  • The integer value 1 or 2 to indicate whether this is a unary or binary operator

  • The pyparsing constant opAssoc.LEFT or opAssoc.RIGHT to describe whether this operator is left- or right-associative

  • An optional parse action to be performed when an operation is matched

To define the NOT, AND, and OR operations for search terms, we can call operatorPrecedence using:

searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT),
       (and_, 2, opAssoc.LEFT),
       (or_, 2, opAssoc.LEFT),
       ])

Not only is this simpler to define, it also creates internal grouping, so that the test strings now parse to the simpler structures:

wood and blue or red
[[['wood', 'and', 'blue'], 'or', 'red']]


wood and (blue or red)
[['wood', 'and', ['blue', 'or', 'red']]]

(steel or iron) and "lime green"
[[['steel', 'or', 'iron'], 'and',
    'lime green']]

not steel or iron and "lime green"
[[['not', 'steel'], 'or', ['iron', 'and',
    'lime green']]]

not(steel or iron) and "lime green"
[[['not', ['steel', 'or', 'iron']], 'and',
    'lime green']]

Here is the complete parser code for our simple search expression parser:

from pyparsing import *

and_ = CaselessLiteral("and")
or_  = CaselessLiteral("or")
not_ = CaselessLiteral("not")
searchTerm = Word(alphanums) | quotedString.setParseAction( removeQuotes )
searchExpr = operatorPrecedence( searchTerm,
       [
       (not_, 1, opAssoc.RIGHT),
       (and_, 2, opAssoc.LEFT),
       (or_, 2, opAssoc.LEFT),
       ])

tests = """
    wood and blue or red
    wood and (blue or red)
    (steel or iron) and "lime green"
    not steel or iron and "lime green"
    not(steel or iron) and "lime green"""".splitlines()

for t in tests:
    print t.strip()
    print searchExpr.parseString(t)[0]
    print

Results:

wood and blue or red
[['wood', 'and', 'blue'], 'or', 'red']

wood and (blue or red)
['wood', 'and', ['blue', 'or', 'red']]

(steel or iron) and "lime green"
[['steel', 'or', 'iron'], 'and', 'lime green']

not steel or iron and "lime green"
[['not', 'steel'], 'or', ['iron', 'and', 'lime green']]

not(steel or iron) and "lime green"
[['not', ['steel', 'or', 'iron']], 'and', 'lime green']
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.151.32