4.4. Code and Code Explanation

Code for the search engine is contained in three files. The administrative interface is saved in the publicly accessible directory as admin.php. It should be protected from unauthorized use by including the lib/401.php include file. The front-end code is also saved in the public_files directory as search.php. The crawler/indexer functionality is saved outside the public area as indexer.php.

4.4.1. Administrative Interface

The administrative interface provides an area to enter addresses that will either be included or excluded from the index, and also maintains the list of stop words. The display consists of an HTML form with two textareas. The processing of the input is done with PHP.

The first HTML textarea provides a place to enter the URLs of documents that will be included in the search engine's retrieval efforts. The second textarea provides a place for the list of stop words. Each is pre-populated from the appropriate database records, with each item appearing on a separate line.

<form method="post"
 action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
 <table>
  <tr>
   <td style="vertical-align:top; text-align:right">
    <label for="addresses">Include Addresses</label></td>
   <td><small>Enter addresses to include in crawling, one address per
    line.</small><br/>
    <textarea name="addresses" id="addresses" rows="5" cols="60"><?php

$query = sprintf('SELECT DOCUMENT_URL FROM %sSEARCH_CRAWL ' .
    'ORDER BY DOCUMENT_URL ASC', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo htmlspecialchars($row['DOCUMENT_URL']) . "\n";
}
mysql_free_result($result);

?></textarea>
   </td>
  </tr><tr>
   <td style="vertical-align:top; text-align:right">
    <label for="stop_words">Stop Words</label></td>
   <td><small>Enter words to omit from the index, one per line.</small><br/>
    <textarea name="stop_words" id="stop_words" rows="5" cols="60"><?php

$query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD ORDER BY ' .
    'TERM_VALUE ASC', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo htmlspecialchars($row['TERM_VALUE']) . "\n";
}
mysql_free_result($result);

?></textarea>
   </td>
  </tr><tr>
   <td> </td>
   <td><input type="submit" value="Submit"/></td>
   <td><input type="hidden" name="submitted" value="1"/></td>
  </tr>
 </table>
</form>

The addresses and stop words are updated each time the form is saved. Old records are discarded and the table is reinitialized using TRUNCATE TABLE on both the WROX_SEARCH_CRAWL and WROX_SEARCH_STOP_WORD tables. Alternatively, you can issue a query such as DELETE FROM WROX_SEARCH_CRAWL but TRUNCATE TABLE is oftentimes a more efficient approach than using DELETE if the entire table's data set is targeted. It also has the benefit of conveying more semantic meaning to someone else reading your code than DELETE.

if (isset($_POST['submitted']))
{
    $query = sprintf('TRUNCATE TABLE %sSEARCH_CRAWL', DB_TBL_PREFIX);
    mysql_query($query, $GLOBALS['DB']);

    $addresses = explode_items($_POST['addresses'], "\n", false);
    if (count($addresses))
    {
        $values = array();
        foreach ($addresses as $address)
        {
            $values[] = mysql_real_escape_string($address, $GLOBALS['DB']);
        }
        $query = sprintf('INSERT INTO %sSEARCH_CRAWL (DOCUMENT_URL) ' .
            'VALUES ("%s")', DB_TBL_PREFIX,
            implode ('"), ("', $values));
        mysql_query($query, $GLOBALS['DB']);
    }

    $query = sprintf('TRUNCATE TABLE %sSEARCH_STOP_WORD', DB_TBL_PREFIX);
    mysql_query($query, $GLOBALS['DB']);

    $words = explode_items($_POST['stop_words'], "\n", false);
    if (count($words))
    {
        $values = array();
        foreach ($words as $word)
        {
            $values[] = mysql_real_escape_string($word, $GLOBALS['DB']);
        }
        $query = sprintf('INSERT INTO %sSEARCH_STOP_WORD (TERM_VALUE) ' .
            'VALUES ("%s")', DB_TBL_PREFIX, implode ('"), ("', $values));
        mysql_query($query, $GLOBALS['DB']);
    }
}
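The multi-row INSERT above escapes each value by hand. As a hedged alternative, the same statement can be built with one placeholder per row and handed to a prepared statement. The sketch below only builds the SQL string and parameter list; build_multi_insert() is a hypothetical helper that is not part of this chapter's code, and the PDO connection mentioned in the closing comment is likewise an assumption.

```php
<?php
// Hypothetical helper: build a multi-row INSERT with one "(?)" group per value.
// Returns array($sql, $params), or null when there is nothing to insert.
function build_multi_insert($table, $column, array $values)
{
    if (count($values) === 0)
    {
        return null;
    }

    // one placeholder group for each row to be inserted
    $placeholders = implode(', ', array_fill(0, count($values), '(?)'));
    $sql = sprintf('INSERT INTO %s (%s) VALUES %s', $table, $column,
        $placeholders);

    return array($sql, array_values($values));
}

list($sql, $params) = build_multi_insert('WROX_SEARCH_CRAWL', 'DOCUMENT_URL',
    array('http://example.com/', 'http://example.com/about.html'));

echo $sql, "\n";
// prints: INSERT INTO WROX_SEARCH_CRAWL (DOCUMENT_URL) VALUES (?), (?)

// With a PDO connection in $pdo (an assumption, not used in this chapter):
// $pdo->prepare($sql)->execute($params);
```

The table and column names must still come from trusted code, since placeholders can only stand in for values.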

If you use PHP's explode() function to split the input text on newline characters (\n) into arrays, you may encounter blank lines or trailing carriage returns (\r), depending on what the user entered and his or her platform. Additional processing is required to clean the list. explode_items() is a custom function that can be added to your growing lib/functions.php to augment explode(). It accepts a block of text and an optional separator and parses the text into an array. Because you do not want duplicate values here, an optional third argument can be provided that filters them out:

// convert a list of items (separated by newlines by default) into an array
// omitting blank lines and optionally duplicates
function explode_items($text, $separator = "\n", $preserve = true)
{
    $items = array();
    foreach (explode($separator, $text) as $value)
    {
        // trim() also strips a trailing carriage return
        $tmp = trim($value);
        if ($tmp === '')
        {
            // skip blank lines
            continue;
        }

        if ($preserve)
        {
            $items[] = $tmp;
        }
        else
        {
            // using the value as a key silently discards duplicates
            $items[$tmp] = true;
        }
    }

    if ($preserve)
    {
        return $items;
    }
    else
    {
        return array_keys($items);
    }
}

Each segment is run through trim() to remove any leading and trailing whitespace (including a stray carriage return) and is then collected. If duplicates are not wanted, the value is stored as an array key, which guarantees uniqueness. The keys are then converted to a usable array with array_keys() before they are returned. Alternatively, you could populate an array using the URLs and then call array_unique() to filter duplicates. I chose the key/array_keys() approach because it's more efficient with larger sets of data and it's known beforehand that only unique values should be returned.
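For comparison, here is a minimal sketch of the array_unique() alternative mentioned above. The function name explode_unique_lines() is hypothetical, and splitting with preg_split() on \R (which matches \n, \r\n, and \r) is my own substitution for explode(), not something the chapter's code does.

```php
<?php
// Hypothetical alternative to explode_items(): split on any line ending,
// trim each line, drop blank lines, and remove duplicates.
function explode_unique_lines($text)
{
    // \R matches \n, \r\n and \r, so the user's platform does not matter
    $lines = preg_split('/\R/', $text);
    $lines = array_map('trim', $lines);

    // strlen() returns 0 for empty strings, so blank lines are filtered out
    $lines = array_filter($lines, 'strlen');

    // array_unique() keeps the first occurrence; array_values() reindexes
    return array_values(array_unique($lines));
}

$input = "http://example.com/\r\n\r\n  http://example.com/about.html \n" .
    "http://example.com/\n";
$urls = explode_unique_lines($input);
print_r($urls);
```

The result holds the two distinct addresses in the order they were first entered.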

Here is the complete code for public_files/admin.php:

<?php
// include shared code
include '../lib/common.php';
include '../lib/db.php';
include '../lib/functions.php';

// must be logged in to access this page
include '../lib/401.php';

// processes incoming data if the form has been submitted

if (isset($_POST['submitted']))
{
    // delete existing addresses
    $query = sprintf('TRUNCATE TABLE %sSEARCH_CRAWL', DB_TBL_PREFIX);
    mysql_query($query, $GLOBALS['DB']);

    // add addresses list to database
    $addresses = explode_items($_POST['addresses'], "\n", false);
    if (count($addresses))
    {
        $values = array();
        foreach ($addresses as $address)
        {
            $values[] = mysql_real_escape_string($address, $GLOBALS['DB']);
        }
        $query = sprintf('INSERT INTO %sSEARCH_CRAWL (DOCUMENT_URL) ' .
            'VALUES ("%s")', DB_TBL_PREFIX,
            implode ('"), ("', $values));
        mysql_query($query, $GLOBALS['DB']);
    }

    // delete existing stop words
    $query = sprintf('TRUNCATE TABLE %sSEARCH_STOP_WORD', DB_TBL_PREFIX);
    mysql_query($query, $GLOBALS['DB']);

    // add stop word list to database
    $words = explode_items($_POST['stop_words'], "\n", false);
    if (count($words))
    {
        $values = array();
        foreach ($words as $word)
        {
            $values[] = mysql_real_escape_string($word, $GLOBALS['DB']);
        }
        $query = sprintf('INSERT INTO %sSEARCH_STOP_WORD (TERM_VALUE) ' .
            'VALUES ("%s")', DB_TBL_PREFIX, implode ('"), ("', $values));
        mysql_query($query, $GLOBALS['DB']);
    }
}

// generate HTML form
ob_start();
?>
<form method="post"
 action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
 <table>
  <tr>
   <td style="vertical-align:top; text-align:right">
    <label for="addresses">Include Addresses</label></td>
   <td><small>Enter addresses to include in crawling, one address per
    line.</small><br/>
    <textarea name="addresses" id="addresses" rows="5" cols="60"><?php

//retrieve list of addresses
$query = sprintf('SELECT DOCUMENT_URL FROM %sSEARCH_CRAWL ' .
    'ORDER BY DOCUMENT_URL ASC', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo htmlspecialchars($row['DOCUMENT_URL']) . "\n";
}
mysql_free_result($result);

   ?></textarea>
   </td>
  </tr><tr>
   <td style="vertical-align:top; text-align:right">
    <label for="stop_words">Stop Words</label></td>
   <td><small>Enter words to omit from the index, one per line.</small><br/>
    <textarea name="stop_words" id="stop_words" rows="5" cols="60"><?php

//retrieve list of stop words
$query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD ORDER BY ' .
    'TERM_VALUE ASC', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo htmlspecialchars($row['TERM_VALUE']) . "\n";
}
mysql_free_result($result);

   ?></textarea>
   </td>
  </tr><tr>
   <td> </td>
   <td><input type="submit" value="Submit"/></td>
   <td><input type="hidden" name="submitted" value="1"/></td>
  </tr>
 </table>
</form>
<?php
$GLOBALS['TEMPLATE']['content'] = ob_get_clean();

// display the page
include '../templates/template-page.php';
?>

Figure 4-1 shows the administration page viewed in a web browser with sample addresses and stop words populating the textarea fields.

Figure 4-1

Also, here is a short list of English stop words to get you started. Mostly they are pronouns, prepositions and conjugations of the verb to be.

a       by      I       she       us
about   could   if      so        was
also    do      in      that      we
am      for     is      the       were
an      from    it      their     what
and     had     let     them      when
any     has     me      then      where
are     have    mine    there     which
as      he      my      these     while
at      her     of      they      why
be      him     on      this      with
been    his     or      through   you
being   hers    over    to        your
but     how     put     too

4.4.2. Crawler/Indexer

Typically the crawler is the component of the search engine responsible for going out and retrieving content for the indexer to catalog. It reads the list of addresses from the database and downloads a copy of each document to queue locally to disk where the indexer can access them. Then the indexer component processes each file in the queue. This tag-team approach works well for large search sites with massive amounts of data continuously being indexed or if the crawler scans through the documents in search of links to other documents to retrieve (as is the case with recursive downloading/leeching). I have decided to combine the crawler and indexer components in this project. That is, the document is processed and added to the search index immediately after it is retrieved.

The script starts by deleting the stale index and retrieving the list of stop words. Again I store the values as array keys, which makes checking items against the list more efficient; this matters because every word in every document will be compared against it. One should write code for clarity, maintainability, and scalability first, and then optimize sections once performance bottlenecks are identified. However, it makes sense to make such optimizations beforehand if the situation is as obvious as this.

$query = sprintf('TRUNCATE TABLE %sSEARCH_INDEX', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

$query = sprintf('TRUNCATE TABLE %sSEARCH_TERM', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

$query = sprintf('TRUNCATE TABLE %sSEARCH_DOCUMENT', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

$query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
$stop_words = array();
while ($row = mysql_fetch_array($result))
{
    $stop_words[$row['TERM_VALUE']] = true;
}
mysql_free_result($result);
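To illustrate the array-key technique in isolation, the following sketch compares a key lookup with a value search. The $rows array is a stand-in I made up for the TERM_VALUE rows returned from the database.

```php
<?php
// Stand-in for the TERM_VALUE rows (made-up sample data).
$rows = array('the', 'and', 'of');

// Store each term as a key, exactly as the indexer does.
$stop_words = array();
foreach ($rows as $term)
{
    $stop_words[$term] = true;
}

$word = 'and';

// isset() on a key is a hash lookup and does not scan the array
var_dump(isset($stop_words[$word]));     // bool(true)

// in_array() walks the values one by one until it finds a match
var_dump(in_array($word, $rows, true));  // bool(true)
```

Both checks give the same answer; the difference only shows up when the list is long and the check is repeated for every word of every document.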

The script then retrieves the list of documents that need to be processed from the database and acts on each one. Retrieving files is done with assistance from the Client URL (CURL) library extension, a powerful library for transferring files over a wide variety of protocols. A CURL resource handle is obtained with curl_init() and several options are set with curl_setopt(). Close to 100 options can be set.

CURLOPT_FOLLOWLOCATION instructs CURL to follow HTTP 30x-style redirect responses. Setting CURLOPT_HEADER to false leaves the response headers out of the returned content. CURLOPT_RETURNTRANSFER sets CURL to return content as a string instead of directing it to the output buffer. CURLOPT_USERAGENT sets the User-Agent string the indexer sends with each request. CURLOPT_URL specifies the URL of a document to retrieve. Most of the options are set outside the while loop because they only need to be set once, but CURLOPT_URL changes for each document and so is set inside the loop. (For a list of all available options, go to www.php.net/curl_setopt.)

$ch = curl_init();

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Search Engine Indexer');

$query = sprintf('SELECT DOCUMENT_URL FROM %sSEARCH_CRAWL', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo 'Processing: ' . $row['DOCUMENT_URL'] . "...\n";

    curl_setopt($ch, CURLOPT_URL, $row['DOCUMENT_URL']);
    $file = curl_exec($ch);

...
}

curl_close($ch);

The document's title and description are extracted with the help of SimpleXML. Content is provided to simplexml_load_string(), which parses it as an XML document and returns an easily traversable object. The document's title, description, and address are stored in the WROX_SEARCH_DOCUMENT table, where the new record is automatically assigned a unique ID by MySQL. The ID is retrieved with mysql_insert_id() so it can be used later to link words back to the document.

SimpleXML will generate warnings if the HTML code is not well formed. If the Tidy extension is installed, you can use it to clean up the code before providing it to SimpleXML to parse. Otherwise you'll have to ignore the warnings or suppress them with the @ operator.

$file = tidy_repair_string($file);
$html = simplexml_load_string($file);

// or: $html = @simplexml_load_string($file);

if ($html->head->title)
{
    $title = $html->head->title;
}
else
{
    // use the filename if a title is not found
    $title = basename($row['DOCUMENT_URL']);
}

$description = 'No description provided.';
foreach($html->head->meta as $meta)
{
    if (isset($meta['name']) && $meta['name'] == 'description')
    {
        $description = $meta['content'];
        break;
    }
}

$query = sprintf('INSERT INTO %sSEARCH_DOCUMENT (DOCUMENT_URL, ' .
    'DOCUMENT_TITLE, DESCRIPTION) VALUES ("%s", "%s", "%s")',
    DB_TBL_PREFIX,
    mysql_real_escape_string($row['DOCUMENT_URL'], $GLOBALS['DB']),
    mysql_real_escape_string($title, $GLOBALS['DB']),
    mysql_real_escape_string($description, $GLOBALS['DB']));
mysql_query($query, $GLOBALS['DB']);

// retrieve the document's id
$doc_id = mysql_insert_id($GLOBALS['DB']);
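As a hedged alternative to the @ operator, libxml's internal error handling can keep parser warnings out of the output while still letting the script detect a failed parse. The sketch below is self-contained; extract_title_and_description() and the sample markup are hypothetical and not part of indexer.php.

```php
<?php
// Hypothetical helper: extract the title and description from markup,
// collecting parser errors internally instead of emitting warnings.
function extract_title_and_description($markup, $fallback_title)
{
    $previous = libxml_use_internal_errors(true);
    $html = simplexml_load_string($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false when the markup cannot be parsed
    if ($html === false || !isset($html->head))
    {
        return null;
    }

    // fall back to the supplied title (e.g. the filename) if none is found
    $title = trim((string) $html->head->title);
    if ($title === '')
    {
        $title = $fallback_title;
    }

    $description = 'No description provided.';
    foreach ($html->head->meta as $meta)
    {
        if ((string) $meta['name'] === 'description')
        {
            $description = (string) $meta['content'];
            break;
        }
    }

    return array('title' => $title, 'description' => $description);
}

$sample = '<html><head><title>Sample Page</title>' .
    '<meta name="description" content="A short description."/>' .
    '</head><body><p>Hello</p></body></html>';
$info = extract_title_and_description($sample, 'sample.html');
echo $info['title'], ': ', $info['description'], "\n";
```

Casting the SimpleXML values to strings before storing them avoids passing objects around. A caller would skip the document, or log it, when the function returns null.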

Only text rendered to the viewer should be included in the index. Because it doesn't make sense to include HTML tags in search results (and therefore the index), they are removed from the content with the strip_tags() function. Content can then be split into individual words with str_word_count(). Each word is checked against the stop words list and skipped if it appears there. Otherwise the word is looked up in the database, added if it is not already present, and then recorded in the search index.

$file = strip_tags($file);

foreach (str_word_count($file, 1) as $index => $word)
{
    // words should be stored as lowercase for comparisons
    $word = strtolower($word);

    if (isset($stop_words[$word])) continue;

    $query = sprintf('SELECT TERM_ID FROM %sSEARCH_TERM WHERE ' .
        'TERM_VALUE = "%s"',
        DB_TBL_PREFIX,
        mysql_real_escape_string($word, $GLOBALS['DB']));
    $result2 = mysql_query($query, $GLOBALS['DB']);
    if (mysql_num_rows($result2))
    {
        list($word_id) = mysql_fetch_row($result2);
    }
    else
    {
        $query = sprintf('INSERT INTO %sSEARCH_TERM (TERM_VALUE) ' .
            'VALUE ("%s")',
            DB_TBL_PREFIX,
            mysql_real_escape_string($word, $GLOBALS['DB']));
        mysql_query($query, $GLOBALS['DB']);

        $word_id = mysql_insert_id($GLOBALS['DB']);
    }

    mysql_free_result($result2);

    $query = sprintf('INSERT INTO %sSEARCH_INDEX (DOCUMENT_ID, ' .
        'TERM_ID, OFFSET) VALUE (%d, %d, %d)',
        DB_TBL_PREFIX,
        $doc_id,
        $word_id,
        $index);
    mysql_query($query, $GLOBALS['DB']);
}
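The tokenizing steps can be tried without a database. The following sketch keeps the terms in an array instead of inserting them; index_terms() and the sample markup are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical helper: return the indexable terms of a document keyed by
// their word offset, skipping anything in the stop words list.
function index_terms($markup, array $stop_words)
{
    $terms = array();

    // remove HTML tags, then split the remaining text into words
    $text = strip_tags($markup);
    foreach (str_word_count($text, 1) as $offset => $word)
    {
        // words are stored as lowercase for comparisons
        $word = strtolower($word);

        if (isset($stop_words[$word])) continue;

        $terms[$offset] = $word;
    }

    return $terms;
}

$stop_words = array('the' => true, 'over' => true);
$terms = index_terms(
    '<p>The quick <b>brown</b> fox jumps over the dog</p>', $stop_words);
print_r($terms);
```

Note that the offsets (1, 2, 3, 4, and 7 here) count every word in the document, including the stop words that were skipped, so they still reflect each term's position in the original text. Also be aware that str_word_count() does not treat digits as part of a word unless they are supplied in its optional third argument.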

Here is the complete code for indexer.php. As it is meant to be run as a shell script, the first line must point to the location of the PHP interpreter and the file's execute permissions must be set.

#! /usr/bin/php
<?php
// include shared code
include 'lib/common.php';
include 'lib/db.php';

// clear index tables
$query = sprintf('TRUNCATE TABLE %sSEARCH_INDEX', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

$query = sprintf('TRUNCATE TABLE %sSEARCH_TERM', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

$query = sprintf('TRUNCATE TABLE %sSEARCH_DOCUMENT', DB_TBL_PREFIX);
mysql_query($query, $GLOBALS['DB']);

// retrieve the list of stop words
$query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
$stop_words = array();
while ($row = mysql_fetch_array($result))
{
    // since this list will be checked for each word, use term as the array
    // key-- isset($stop_words[<term>]) is more efficient than using
    // in_array(<term>, $stop_words)
    $stop_words[$row['TERM_VALUE']] = true;
}
mysql_free_result($result);

// open CURL handle for downloading
$ch = curl_init();

// set curl options
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Search Engine Indexer');

// fetch list of documents to index
$query = sprintf('SELECT DOCUMENT_URL FROM %sSEARCH_CRAWL', DB_TBL_PREFIX);
$result = mysql_query($query, $GLOBALS['DB']);
while ($row = mysql_fetch_array($result))
{
    echo 'Processing: ' . $row['DOCUMENT_URL'] . "...\n";

    // retrieve the document's content
    curl_setopt($ch, CURLOPT_URL, $row['DOCUMENT_URL']);
    $file = curl_exec($ch);

    $file = tidy_repair_string($file);
    $html = simplexml_load_string($file);

    // or: $html = @simplexml_load_string($file);

    // extract the title
    if ($html->head->title)
    {
        $title = $html->head->title;
    }
    else
    {
        // use the filename if a title is not found
        $title = basename($row['DOCUMENT_URL']);
    }

    // extract the description
    $description = 'No description provided.';
    foreach($html->head->meta as $meta)
    {
        if (isset($meta['name']) && $meta['name'] == 'description')
        {
            $description = $meta['content'];
            break;
        }
    }

    // add the document to the index
    $query = sprintf('INSERT INTO %sSEARCH_DOCUMENT (DOCUMENT_URL, ' .
        'DOCUMENT_TITLE, DESCRIPTION) VALUES ("%s", "%s", "%s")',
        DB_TBL_PREFIX,
        mysql_real_escape_string($row['DOCUMENT_URL'], $GLOBALS['DB']),
        mysql_real_escape_string($title, $GLOBALS['DB']),
        mysql_real_escape_string($description, $GLOBALS['DB']));
    mysql_query($query, $GLOBALS['DB']);

    // retrieve the document's id
    $doc_id = mysql_insert_id($GLOBALS['DB']);

    // strip HTML tags out from the content
    $file = strip_tags($file);

    // break content into individual words
    foreach (str_word_count($file, 1) as $index => $word)
    {
        // words should be stored as lowercase for comparisons
        $word = strtolower($word);

        // skip word if it appears in the stop words list
        if (isset($stop_words[$word])) continue;

        // determine if the word already exists in the database
        $query = sprintf('SELECT TERM_ID FROM %sSEARCH_TERM WHERE ' .
            'TERM_VALUE = "%s"',
            DB_TBL_PREFIX,
            mysql_real_escape_string($word, $GLOBALS['DB']));
        $result2 = mysql_query($query, $GLOBALS['DB']);
        if (mysql_num_rows($result2))
        {
            // word exists so retrieve its id
            list($word_id) = mysql_fetch_row($result2);
        }
        else
        {
            // add word to the database
            $query = sprintf('INSERT INTO %sSEARCH_TERM (TERM_VALUE) ' .
                'VALUE ("%s")',
                DB_TBL_PREFIX,
                mysql_real_escape_string($word, $GLOBALS['DB']));
            mysql_query($query, $GLOBALS['DB']);

            // determine the word's id
            $word_id = mysql_insert_id($GLOBALS['DB']);
        }
        mysql_free_result($result2);

        // add the index record
        $query = sprintf('INSERT INTO %sSEARCH_INDEX (DOCUMENT_ID, ' .
            'TERM_ID, OFFSET) VALUE (%d, %d, %d)',
            DB_TBL_PREFIX,
            $doc_id,
            $word_id,
            $index);
        mysql_query($query, $GLOBALS['DB']);
    }
}

mysql_free_result($result);
curl_close($ch);
echo 'Indexing complete.' . "\n";
?>

You may choose to run the script manually from the command line, or schedule it as a cron job or scheduled task to run periodically. Even if you decide to automate it, I recommend running indexer.php manually the first time to make sure everything indexes properly. It can be run from the command line like this, with output redirected to indexlog.txt for later analysis:

./indexer.php > indexlog.txt 2>&1 &

4.4.3. Front End

The front-end interface is what will be used by the site's users. It accepts the search terms from the user, queries the inverted index, and returns the matching results.

If a query has been submitted, the script uses the custom explode_items() function to break it down into an array of terms using a space as the separator. It then retrieves the list of stop words from the database and compares the search terms against it. Matching words are removed and will not be included in the eventual query of the index.

The number of JOIN clauses and conditions in the WHERE clause needed to query the index depends on the number of search words entered by the user, so the query statement is built from three variables that are concatenated together. Afterwards, the last four characters must be trimmed from the query to remove the WHERE clause's trailing AND and the space that follows it.

Finally, the number of matching entries and the result set are returned to the user.

$words = array();
if (isset($_GET['query']) && trim($_GET['query']))
{
    $words = explode_items($_GET['query'], ' ', false);

    $query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD',
        DB_TBL_PREFIX);
    $result = mysql_query($query, $GLOBALS['DB']);
    $stop_words = array();
    while ($row = mysql_fetch_assoc($result))
    {
        $stop_words[$row['TERM_VALUE']] = true;
    }
    mysql_free_result($result);

    $words_removed = array();
    foreach ($words as $index => $word)
    {
        if (isset($stop_words[strtolower($word)]))
        {
            $words_removed[] = $word;
            unset($words[$index]);
        }
    }
}
ob_start();
?>
<form method="get"
 action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
 <div>
  <input type="text" name="query" id="query" value="<?php
   echo (count($words)) ? htmlspecialchars(join(' ', $words)) : '';?>"/>
  <input type="submit" value="Search"/>
 </div>
</form>
<?php

if (count($words))
{
    $join = '';
    $where = '';
    $query = 'SELECT DISTINCT D.DOCUMENT_URL, D.DOCUMENT_TITLE, ' .
        'D.DESCRIPTION FROM WROX_SEARCH_DOCUMENT D ';
    foreach ($words as $index => $word)
    {
        $join .= sprintf(
            'JOIN WROX_SEARCH_INDEX I%d ON D.DOCUMENT_ID = I%d.DOCUMENT_ID ' .
            'JOIN WROX_SEARCH_TERM T%d ON I%d.TERM_ID = T%d.TERM_ID ',
            $index, $index, $index, $index, $index);

        $where .= sprintf('T%d.TERM_VALUE = "%s" AND ', $index,
            mysql_real_escape_string(strtolower($word), $GLOBALS['DB']));
    }
    $query .= $join . 'WHERE ' . $where;

    // trim 4 characters to remove the trailing 'AND '
    $query = substr($query, 0, strlen($query) - 4);

    $result = mysql_query($query, $GLOBALS['DB']);

    echo '<hr/>';
    $num_rows = mysql_num_rows($result);
    echo '<p>Search for <b>' . htmlspecialchars(join(' ', $words)) .
        '</b> yielded ' . $num_rows . ' result' .
        (($num_rows != 1) ? 's' : '') . ':</p>';

    echo '<ul>';
    while ($row = mysql_fetch_assoc($result))
    {
        echo '<li><b><a href="' .
            htmlspecialchars($row['DOCUMENT_URL']) . '">' .
            htmlspecialchars($row['DOCUMENT_TITLE']) . '</a></b>- ' .
            htmlspecialchars($row['DESCRIPTION']) . '<br/><i>' .
            htmlspecialchars($row['DOCUMENT_URL']) . '</i></li>';
    }
    echo '</ul>';
}
$GLOBALS['TEMPLATE']['content'] = ob_get_clean();
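The trailing AND can be avoided altogether by collecting the pieces in arrays and joining them with implode(). The sketch below shows that variation under a few assumptions of my own: build_search_query() is a hypothetical helper, the table prefix is passed in as a parameter, and the terms are returned as a separate parameter list for use with placeholders rather than being escaped into the string.

```php
<?php
// Hypothetical helper: build the index query for any number of search words.
// Returns array($sql, $params) with one "?" placeholder per word.
function build_search_query(array $words, $prefix = 'WROX_')
{
    $joins = array();
    $conditions = array();
    $params = array();

    // array_values() renumbers the words in case stop words were unset
    foreach (array_values($words) as $i => $word)
    {
        $joins[] = sprintf(
            'JOIN %1$sSEARCH_INDEX I%2$d ON D.DOCUMENT_ID = I%2$d.DOCUMENT_ID ' .
            'JOIN %1$sSEARCH_TERM T%2$d ON I%2$d.TERM_ID = T%2$d.TERM_ID',
            $prefix, $i);
        $conditions[] = sprintf('T%d.TERM_VALUE = ?', $i);
        $params[] = strtolower($word);
    }

    $sql = sprintf('SELECT DISTINCT D.DOCUMENT_URL, D.DOCUMENT_TITLE, ' .
        'D.DESCRIPTION FROM %sSEARCH_DOCUMENT D ', $prefix) .
        implode(' ', $joins) .
        ' WHERE ' . implode(' AND ', $conditions);

    return array($sql, $params);
}

list($sql, $params) = build_search_query(array('PHP', 'search'));
echo $sql, "\n";
print_r($params);
```

Because implode() only places the glue between elements, no trimming is needed afterwards regardless of how many words the user entered.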

Each term may be checked for spelling using the Pspell extension, with suggested spellings collected in a separate array. A resource handle to the spelling dictionary is obtained with pspell_new(), which I have initialized with an English dictionary. You may provide the appropriate two-letter ISO 639 language code to specify an alternate language instead, such as de for German, fr for French, or eo for Esperanto.

The spelling of a word is checked with pspell_check() and an array of suggested alternates is retrieved with pspell_suggest(). Only the first spelling suggestion is used so as not to overwhelm the user. The suggestion is also compared against the original word to ignore capitalization-related spelling "mistakes."

// spell check the query words
$spell_error = false;
$suggest_words = array();
$ps = pspell_new('en');
foreach ($words as $index => $word)
{
    if (!pspell_check($ps, $word))
    {
        if ($s = pspell_suggest($ps, $word))
        {
            if (strtolower($s[0]) != strtolower($word))
            {
                // (ignore capitalization-related spelling errors)
                $spell_error = true;
                $suggest_words[$index] = $s[0];
            }
        }
    }
}

The $spell_error variable is set as a flag any time a spelling error is encountered, so the script can decide whether a query showing the suggested corrections should be returned to the user alongside the results.

if ($spell_error)
{
    foreach ($words as $index => $word)
    {
        if (isset($suggest_words[$index]))
        {
            $words[$index] = $suggest_words[$index];
        }
    }
    echo '<p>Possible misspelling. Did you mean <a href="' .
        htmlspecialchars($_SERVER['PHP_SELF']) . '?query=' .
        urlencode(join(' ', $words)) . '">' .
        htmlspecialchars(join(' ', $words)) . '</a>?</p>';
}
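The order of escaping in the suggestion link matters: the query value should be URL-encoded for use in the query string, and only the finished attribute value HTML-escaped. Here is a minimal sketch of that order; build_suggestion_link() is a hypothetical helper and the script path is a made-up example.

```php
<?php
// Hypothetical helper: build the "Did you mean" link for a list of words.
function build_suggestion_link($script, array $words)
{
    $phrase = join(' ', $words);

    // urlencode() prepares the value for the query string; htmlspecialchars()
    // then makes the whole attribute safe to embed in HTML
    return '<a href="' . htmlspecialchars($script) . '?query=' .
        htmlspecialchars(urlencode($phrase)) . '">' .
        htmlspecialchars($phrase) . '</a>';
}

$link = build_suggestion_link('/search.php', array('tom', 'jerry&co'));
echo $link, "\n";
// prints: <a href="/search.php?query=tom+jerry%26co">tom jerry&amp;co</a>
```

Applying htmlspecialchars() before urlencode() would instead send the literal text &amp; to the search script as part of the query.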

Here is the complete code for public_files/search.php. Figure 4-2 shows the front end in action displaying search results and Figure 4-3 shows a suggested spelling correction.

<?php
// include shared code
include '../lib/common.php';
include '../lib/db.php';
include '../lib/functions.php';

// accept incoming search terms if the form has been submitted
$words = array();
if (isset($_GET['query']) && trim($_GET['query']))
{
    $words = explode_items($_GET['query'], ' ', false);

    // remove stop words from query
    $query = sprintf('SELECT TERM_VALUE FROM %sSEARCH_STOP_WORD',
        DB_TBL_PREFIX);
    $result = mysql_query($query, $GLOBALS['DB']);
    $stop_words = array();
    while ($row = mysql_fetch_assoc($result))
    {
        $stop_words[$row['TERM_VALUE']] = true;
    }
    mysql_free_result($result);

    $words_removed = array();
    foreach ($words as $index => $word)
    {
        if (isset($stop_words[strtolower($word)]))
        {
            $words_removed[] = $word;
            unset($words[$index]);
        }
    }
}
// generate HTML form
ob_start();
?>
<form method="get"
 action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
 <div>
  <input type="text" name="query" id="query" value="<?php
   echo (count($words)) ? htmlspecialchars(join(' ', $words)) : '';?>"/>
  <input type="submit" value="Search"/>
 </div>
</form>
<?php

// begin processing query
if (count($words))
{
    // spell check the query words
    $spell_error = false;
    $suggest_words = array();

    $ps = pspell_new('en');
    foreach ($words as $index => $word)
    {
        if (!pspell_check($ps, $word))
        {
            if ($s = pspell_suggest($ps, $word))
            {
                if (strtolower($s[0]) != strtolower($word))
                {
                    // (ignore capitalization-related spelling errors)
                    $spell_error = true;
                    $suggest_words[$index] = $s[0];
                }
            }
        }
    }

    // formulate the search query using provided terms and submit it
    $join = '';
    $where = '';
    $query = 'SELECT DISTINCT D.DOCUMENT_URL, D.DOCUMENT_TITLE, ' .
        'D.DESCRIPTION FROM WROX_SEARCH_DOCUMENT D ';
    foreach ($words as $index => $word)
    {
        $join .= sprintf(
            'JOIN WROX_SEARCH_INDEX I%d ON D.DOCUMENT_ID = I%d.DOCUMENT_ID ' .
            'JOIN WROX_SEARCH_TERM T%d ON I%d.TERM_ID = T%d.TERM_ID ',
            $index, $index, $index, $index, $index);

        $where .= sprintf('T%d.TERM_VALUE = "%s" AND ', $index,
            mysql_real_escape_string(strtolower($word), $GLOBALS['DB']));
    }
    $query .= $join . 'WHERE ' . $where;
    // trim 4 characters to remove the trailing 'AND '
    $query = substr($query, 0, strlen($query) - 4);
    $result = mysql_query($query, $GLOBALS['DB']);

    // display results
    echo '<hr/>';
    $num_rows = mysql_num_rows($result);
    echo '<p>Search for <b>' . htmlspecialchars(join(' ', $words)) .
        '</b> yielded ' . $num_rows . ' result' .
        (($num_rows != 1) ? 's' : '') . ':</p>';

    // show suggested query if a possible misspelling was found
    if ($spell_error)
    {
        foreach ($words as $index => $word)
        {
            if (isset($suggest_words[$index]))
            {
                $words[$index] = $suggest_words[$index];
            }
        }
        echo '<p>Possible misspelling. Did you mean <a href="' .
            htmlspecialchars($_SERVER['PHP_SELF']) . '?query=' .
            urlencode(join(' ', $words)) . '">' .
            htmlspecialchars(join(' ', $words)) . '</a>?</p>';
    }

    echo '<ul>';
    while ($row = mysql_fetch_assoc($result))
    {
        echo '<li><b><a href="' .
            htmlspecialchars($row['DOCUMENT_URL']) . '">' .
            htmlspecialchars($row['DOCUMENT_TITLE']) . '</a></b>- ' .
            htmlspecialchars($row['DESCRIPTION']) . '<br/><i>' .
            htmlspecialchars($row['DOCUMENT_URL']) . '</i></li>';
    }
    echo '</ul>';
}
$GLOBALS['TEMPLATE']['content'] = ob_get_clean();

// display the page
include '../templates/template-page.php';
?>

Figure 4-2

Figure 4-3

To integrate the search engine into your existing web site, place a form that submits its query to search.php on any page you see fit. Afterwards, log into admin.php and enter the address of each page you want to be searched, as well as a list of stop words. After the addresses and stop words have been entered into the database, indexer.php can be run to build the inverted index.
