Building the full-text search feature

As you've learned earlier, full-text search consists of two parts—indexing and searching. To index a file, you will need to be able to extract keywords from the file. As files come in all types and formats (for example: .PDF, .DOCX, .XLSX, .CAD), it will be difficult to try to extract keywords from every type of file out there.

However, what you can do is build an extensible framework that lets you easily add support for more file formats in the future. You will build a common interface for Keyword Extractor classes—the IKeywordExtractor interface. Each keyword extractor will handle a specific file format, and its sole function will be to retrieve a set of keywords from that file.

Creating the Keyword Extractor classes

The Keyword Extractor interface defines three abstract methods. These methods are self-explanatory: a Keyword Extractor class must be able to open a file, extract keywords (into a string), and subsequently close the file.

public interface IKeywordExtractor
{
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}

We also create a KeywordExtractorBase class that offers common functionality across all keyword extractors. When you extract keywords from a file, you would most likely need to throw away common words that you don't need to index. For example, words like 'a', 'the', and 'an' do not need to be indexed. These words are called stop words. We can strip away stop words using Regular Expressions (as shown in the following highlighted function):

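using System.Text.RegularExpressions;

public class KeywordExtractorBase
{
    private string[] _stopWords;

    public string RemoveStopWords(string RawText)
    {
        string _regexPattern;
        int _counter;
        //Build the patterns in a separate array so that the stop
        //word list itself is not modified on every call
        string[] _patterns = new string[_stopWords.Length];
        for (_counter = 0; _counter <= (_stopWords.Length - 1); _counter++)
        {
            //\b marks a word boundary. The verbatim string (@"...")
            //is required: in a regular C# string, "\b" is the
            //backspace character
            _patterns[_counter] = @"\b" + Regex.Escape(_stopWords[_counter]) + @"\b";
        }
        _regexPattern = "(" + string.Join("|", _patterns) + ")";
        return Regex.Replace(RawText, _regexPattern, "", RegexOptions.IgnoreCase);
    }

    public KeywordExtractorBase()
    {
        _stopWords = new string[] {"a", "and", "the", "of", "but"};
    }
}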
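
The stop-word removal described above can also be tried in isolation. The following is a minimal sketch rather than part of the application: the StopWordFilter class and its Remove() method are hypothetical names, and collapsing the leftover whitespace is an extra step that the chapter's code does not perform.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the chapter's framework
public static class StopWordFilter
{
    private static readonly string[] StopWords = { "a", "and", "the", "of", "but" };

    // Built once: \b(?:a|and|the|of|but)\b matches whole words only.
    // The verbatim strings keep \b as a regex word boundary.
    private static readonly Regex StopWordPattern = new Regex(
        @"\b(?:" + string.Join("|", StopWords) + @")\b",
        RegexOptions.IgnoreCase);

    public static string Remove(string rawText)
    {
        // Strip the stop words, then collapse the whitespace runs left behind
        string stripped = StopWordPattern.Replace(rawText, "");
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

For example, StopWordFilter.Remove("The history of a town and the river") returns "history town river", while words that merely contain a stop word (such as "bands") are left untouched.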

A sample keyword extractor—the HTML Keyword Extractor

When the user uploads an HTML file (for instance, a downloaded web page) to your application, it is readily indexable as it contains mostly text instead of binary information. However, even a simple web page may contain lots of HTML tags that don't need to be indexed. Take a look at the following sample:

<HTML>
<BODY>
This is my <font class='MainFont'>sample website</font>,
there are simply too many <b>tags</b> here
</BODY>
</HTML>

In a heavily formatted web page, the HTML tags not only add up to a lot of wasted space but also increase search inaccuracy, as these tags are picked up by the search as well.

The HTML Keyword Extractor is an implementation of a keyword extractor that strips away the HTML tags from a document that is about to be indexed. This can be quite easily done via pattern matching using regular expressions. The following snippet of code shows how the HTMLKeywordExtractor class can be implemented.

using System.IO;
using System.Text.RegularExpressions;

public class HTMLKeywordExtractor : KeywordExtractorBase, IKeywordExtractor
{
    private StreamReader _reader;

    public bool Open(string filePath)
    {
        _reader = File.OpenText(filePath);
        //Report success to the caller
        return true;
    }

    public bool Close()
    {
        _reader.Close();
        return true;
    }

    private string RemoveHTMLTags(string RawText)
    {
        //Lazily match everything between '<' and the next '>',
        //including tags that span multiple lines
        return Regex.Replace(RawText, "<(.|\n)*?>", string.Empty);
    }

    public string ExtractKeywords()
    {
        string _rawText;
        _rawText = _reader.ReadToEnd();
        _rawText = RemoveHTMLTags(_rawText);
        //Call the stop word removal inherited from the base class
        _rawText = base.RemoveStopWords(_rawText);
        return _rawText;
    }
}

You can also build your own keyword extractors to handle other file formats. For example, a PDF file is not readily indexable as its content is stored in binary form. It's a great boon to users if they can search the content within their PDF files. To implement this, you could build another keyword extractor to extract the text from a PDF file. This is out of the scope of this book of course, but with the Keyword Extractor interface and base classes, you have the infrastructure to easily add support for more file formats in the future.

Indexing the file

We can have the application automatically index a file each time the user uploads one via the FileDetailViewer window created in the previous chapter (shown in the following screenshot).

You need to add the IndexFile() function to the File class that you've created in the previous chapter. This function checks the extension of the uploaded file and instantiates the corresponding keyword extractor (if one is available). The extracted keywords are then saved to the Keywords property, which eventually ends up being written to the Keywords column of the AccountFiles table.

Tip

You first need to add a new Keywords column to the AccountFiles table in your database. You should make it an NTEXT field (in SQL Server CE) or a CLOB field (in Oracle Lite) as you might end up with a very large set of keywords.

public void IndexFile()
{
    string _attachmentName;
    IKeywordExtractor _keywordExtractor;
    _attachmentName = Attachment;
    //Path.GetExtension() returns the extension including the
    //leading period, for example ".html"
    switch (System.IO.Path.GetExtension(_attachmentName).ToLower())
    {
        case ".html":
            _keywordExtractor = new HTMLKeywordExtractor();
            break;
        default:
            //No keyword extractor is available for this file type
            return;
    }
    _keywordExtractor.Open(_attachmentName);
    Keywords = _keywordExtractor.ExtractKeywords();
    _keywordExtractor.Close();
}
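
As more file formats are added, the switch statement has to grow with them. One possible alternative, sketched here as an assumption rather than as the book's design, is a small registry that maps file extensions to keyword extractors. The KeywordExtractorRegistry and PlainTextKeywordExtractor names are hypothetical, and the interface is repeated only so that the sketch compiles on its own. For brevity, the plain-text extractor does not strip stop words.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Repeated from the chapter so that this sketch is self-contained
public interface IKeywordExtractor
{
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}

// Hypothetical extractor for plain text files: the whole file is the keyword set
public class PlainTextKeywordExtractor : IKeywordExtractor
{
    private StreamReader _reader;

    public bool Open(string filePath)
    {
        _reader = File.OpenText(filePath);
        return true;
    }

    public string ExtractKeywords()
    {
        return _reader.ReadToEnd();
    }

    public bool Close()
    {
        _reader.Close();
        return true;
    }
}

// Hypothetical registry that maps a file extension to an extractor factory
public static class KeywordExtractorRegistry
{
    // Extensions are compared case-insensitively (".HTML" equals ".html")
    private static readonly Dictionary<string, Func<IKeywordExtractor>> _factories =
        new Dictionary<string, Func<IKeywordExtractor>>(StringComparer.OrdinalIgnoreCase);

    public static void Register(string extension, Func<IKeywordExtractor> factory)
    {
        _factories[extension] = factory;
    }

    // Returns a new extractor for the file, or null when the format is unsupported
    public static IKeywordExtractor Create(string fileName)
    {
        string extension = Path.GetExtension(fileName);
        Func<IKeywordExtractor> factory;
        if (!string.IsNullOrEmpty(extension) && _factories.TryGetValue(extension, out factory))
        {
            return factory();
        }
        return null;
    }
}
```

With this in place, supporting a new format becomes a single call such as KeywordExtractorRegistry.Register(".txt", () => new PlainTextKeywordExtractor()), and IndexFile() could ask the registry for an extractor instead of switching on the extension.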

As a last step, you can call the IndexFile() function from the click event handler of the Save button in the FileDetailViewer window to invoke the indexing process.

Now that you have managed to extract keywords from uploaded files and save them to the database, we will look at how you can create an SQL query that can make use of these keywords to perform a Boolean search.

Creating the full-text search query for SQL Server CE

As you've learned earlier, SQL Server CE does not support any full-text search functionality and, therefore, does not come with the CONTAINS() and FREETEXT() T-SQL functions that would otherwise have made full-text search an easier task.

The closest equivalent you are given is the CHARINDEX(sequence,expression) and PATINDEX(pattern,expression) T-SQL functions. These functions search text-based SQL fields for a given sequence or pattern and return the starting point of the pattern in the text (if it is found). Consider the following table for instance:

TABLE [Fruits]
----------------------------
FruitName FruitDescription
----------------------------
Apple I love apples

Running the following SQL would produce the result 8 (the starting position of the word 'apple' in the preceding text):

SELECT CHARINDEX('apple',FruitDescription) FROM [Fruits]

The PATINDEX() function gives you a little more control, in that you can specify wildcard patterns (a small subset of what regular expressions offer). An example follows:

SELECT PATINDEX('%apple[0-9]%',FruitDescription) FROM [Fruits]

So, how does the CHARINDEX() function translate to the full-text search feature? Let's take a look at an example. If the user, for instance, wanted to search for all documents containing either the Medical or Certificates keyword, he or she would key in Medical OR Certificates as the search phrase. You could basically retrieve a listing of the uploaded files matching these keywords using the following SQL:

SELECT * FROM AccountFiles WHERE CHARINDEX( 'Medical', Keywords)>0 OR CHARINDEX('Certificates', Keywords)>0

Note

The CHARINDEX() and PATINDEX() T-SQL functions, unlike the LIKE keyword, work well on large data types such as TEXT or NTEXT.

You now need to find a way to translate a Boolean search phrase into an SQL WHERE clause. The user can key in search phrases in many different ways. A few are shown in the following table:

Search phrase example                      Description
---------------------------------------------------------------------------
Medical OR Certificates OR Receipts        Contains any of the 'Medical', 'Certificates', or 'Receipts' keywords
Medical OR Certificates NOT Clinic         Contains either the 'Medical' or the 'Certificates' keyword, but must not contain the 'Clinic' keyword
Medical Certificates OR Medical Receipts   Contains either the 'Medical Certificates' phrase or the 'Medical Receipts' phrase
NOT Medical                                Any document that does not contain the 'Medical' keyword
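
Before generating any SQL, it can help to see how such a search phrase breaks down. The sketch below is an illustration only: the SearchGroup and SearchPhraseParser names are hypothetical. It splits a phrase at standalone NOT operators first and at standalone OR operators second, so every group after the first one is an exclusion.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container: one group of OR-ed terms, optionally negated
public class SearchGroup
{
    public bool Negated;
    public List<string> Terms = new List<string>();
}

public static class SearchPhraseParser
{
    public static List<SearchGroup> Parse(string searchPhrase)
    {
        var groups = new List<SearchGroup>();
        // \b restricts the match to standalone words; the verbatim
        // string keeps \b from being read as a backspace character
        string[] notParts = Regex.Split(searchPhrase, @"\bNOT\b", RegexOptions.IgnoreCase);
        for (int i = 0; i < notParts.Length; i++)
        {
            // Everything to the right of a NOT operator is an exclusion
            var group = new SearchGroup { Negated = i > 0 };
            foreach (string part in Regex.Split(notParts[i], @"\bOR\b", RegexOptions.IgnoreCase))
            {
                string term = part.Trim();
                if (term.Length > 0)
                {
                    group.Terms.Add(term);
                }
            }
            if (group.Terms.Count > 0)
            {
                groups.Add(group);
            }
        }
        return groups;
    }
}
```

Parsing "Medical OR Certificates NOT Clinic" yields two groups: a regular group holding 'Medical' and 'Certificates', and a negated group holding 'Clinic'. A multi-word phrase such as 'Medical Certificates' stays together as a single term.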

You can create a BuildWhereClause() function that takes in a search phrase keyed in by the user and converts it into SQL. The code is as follows:

private string BuildWhereClause(string SearchPhrase)
{
    int _counter;
    int _counter2;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _NOTPhrase;
    string _ORPhrase;
    string _WhereClause;
    string _SubWhereClause;
    _WhereClause = "";
    //Split the search phrase using 'NOT' as the delimiter
    //The \b Regex special character specifies a word boundary
    //and basically tells Regex to only consider occurrences of
    //'NOT' that are standalone words. The pattern is a verbatim
    //string (@"...") because "\b" in a regular C# string is the
    //backspace character
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b", RegexOptions.IgnoreCase);
    //Loop through the split parts
    for (_counter = 0; _counter <= (_NOTPhrases.Length - 1); _counter++)
    {
        _NOTPhrase = _NOTPhrases[_counter].Trim();
        if (_NOTPhrase.Length > 0)
        {
            //Split this part further using 'OR' as the delimiter
            _ORPhrases = Regex.Split(_NOTPhrase, @"\bOR\b", RegexOptions.IgnoreCase);
            _SubWhereClause = "";
            //Here we enter another loop
            for (_counter2 = 0; _counter2 <= (_ORPhrases.Length - 1); _counter2++)
            {
                _ORPhrase = _ORPhrases[_counter2].Trim();
                if (_ORPhrase.Length > 0)
                {
                    //Double up single quotes so that the phrase
                    //cannot terminate the SQL string literal
                    _ORPhrase = _ORPhrase.Replace("'", "''");
                    //Generate the CHARINDEX() expression for each
                    //phrase and attach them together using the
                    //OR operator
                    _SubWhereClause += ((_SubWhereClause.Length > 0) ? " OR " : "") +
                        "CHARINDEX('" + _ORPhrase + "',Keywords)>0";
                }
            }
            if (_SubWhereClause.Length > 0)
            {
                //As we are still inside a loop that runs through
                //phrases separated by the NOT operator, we
                //reattach each formatted portion using AND NOT.
                //Each portion is wrapped in parentheses because
                //SQL evaluates AND before OR
                _WhereClause += ((_WhereClause.Length > 0) ? " AND " : "") +
                    (_counter > 0 ? "NOT " : "") + "(" + _SubWhereClause + ")";
            }
        }
    }
    if (_WhereClause.Length > 0)
    {
        //If the WHERE clause is not empty, we append the
        //'WHERE' SQL keyword at the front
        _WhereClause = "WHERE " + _WhereClause;
    }
    return _WhereClause;
}

As an example, passing the search phrase "Medical OR Certificate NOT Clinic" through the preceding function will produce the following. The parentheses ensure that the NOT condition applies to the whole group of OR keywords rather than only to the last one:

WHERE (CHARINDEX('Medical',Keywords)>0 OR CHARINDEX('Certificate',Keywords)>0) AND NOT (CHARINDEX('Clinic',Keywords)>0)

As you've seen in the brief walk-through earlier on in this chapter, the full-text search results are displayed differently. Instead of the usual DataGrid listing, we display them in the style of online search engines: each search result item appears with a text summary beneath it.

The text summary displayed isn't just a random snippet of text from the Keywords field. It must contain at least one of the search phrase keywords typed in by the user for the display to be meaningful. These keywords are highlighted in bold in the text summary.

To retrieve these text summaries, you can use another T-SQL function available in SQL Server CE—the SUBSTRING(expression, starting_position, length) function. This function simply extracts a chunk of text (with a specified length) from a specified field and starting position. Hence, assuming you have the following data:

TABLE [AccountFiles]
---------------------------------------
Keywords
---------------------------------------
This Medical Certificate proves that...

If you combine the CHARINDEX() function with the SUBSTRING() function in the following manner:

SELECT SUBSTRING(Keywords,CHARINDEX('Medical',Keywords),12) FROM AccountFiles

You will get the following result:

Medical Cert
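
To double-check the position arithmetic without a database, the two functions can be imitated in C#. This is only an approximation under stated assumptions: the TSqlStringFunctions class is a hypothetical name, the comparison is assumed to be case-insensitive (in the database this depends on the collation), and a start position below 1 is simply clamped to 1, which may not match your database exactly.

```csharp
using System;

// Hypothetical stand-ins for the T-SQL functions, for illustration only
public static class TSqlStringFunctions
{
    // Returns the 1-based position of sequence in expression, or 0 when absent
    public static int CharIndex(string sequence, string expression)
    {
        // IndexOf() is 0-based and returns -1 when nothing is found,
        // so adding 1 gives the 1-based position (or 0 for "not found")
        return expression.IndexOf(sequence, StringComparison.OrdinalIgnoreCase) + 1;
    }

    // Returns up to length characters, starting at the 1-based position start
    public static string Substring(string expression, int start, int length)
    {
        // Simplification: start positions below 1 are clamped to 1
        int from = Math.Max(start, 1) - 1;
        if (from >= expression.Length || length <= 0)
        {
            return "";
        }
        return expression.Substring(from, Math.Min(length, expression.Length - from));
    }
}
```

Here CharIndex("apple", "I love apples") gives 8, and combining both functions on "This Medical Certificate proves that..." with the keyword 'Medical' and a length of 12 gives "Medical Cert", matching the SQL results shown in this section.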

You can use these two functions to extract the first chunk of text that contains the search phrase keywords. Keeping in mind that a search phrase may consist of more than one set of keywords (attached via the OR operator), you need to call these two functions for every OR keyword in the search phrase. The search phrase Medical OR Certificate NOT Clinic will hence translate to the following:

SELECT SUBSTRING(Keywords,CHARINDEX('Medical',Keywords),100) AS [TextSummary1], SUBSTRING(Keywords,CHARINDEX('Certificate',Keywords), 100) AS [TextSummary2] FROM AccountFiles

Let's write some code to do this translation:

private string BuildSelectClause(string SearchPhrase)
{
    int _counter;
    int _counter2 = 1;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _ActivePhrase;
    string _ORPhrase;
    string _SelectClause;
    _SelectClause = "";
    //Split the search phrase at the NOT operator. The pattern is
    //a verbatim string so that \b is read as a word boundary
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b", RegexOptions.IgnoreCase);
    //Everything after the NOT is not needed. We only need the
    //part on the left of the NOT operator
    _ActivePhrase = _NOTPhrases[0].Trim();
    if (_ActivePhrase.Length > 0)
    {
        //Split the phrase using the OR operator
        _ORPhrases = Regex.Split(_ActivePhrase, @"\bOR\b", RegexOptions.IgnoreCase);
        for (_counter = 0; _counter <= (_ORPhrases.Length - 1); _counter++)
        {
            _ORPhrase = _ORPhrases[_counter].Trim();
            if (_ORPhrase.Length > 0)
            {
                //Double up single quotes so that the phrase
                //cannot terminate the SQL string literal
                _ORPhrase = _ORPhrase.Replace("'", "''");
                //For each OR operator, we call the SUBSTRING and
                //CHARINDEX functions. Take note that we deduct 30
                //characters from the starting point so that the
                //keyword does not appear exactly as the first
                //word in the text summary. This is purely for
                //aesthetic value
                _SelectClause += ((_SelectClause.Length > 0) ? "," : "") +
                    "SUBSTRING(Keywords,CHARINDEX('" + _ORPhrase +
                    "',Keywords)-30,120) AS [TextSummary" +
                    _counter2.ToString() + "]";
                _counter2++;
            }
        }
    }
    if (_SelectClause.Length == 0)
    {
        //If no keyword was specified in the search phrase, we
        //simply return the first 120 characters of the text
        //(SUBSTRING positions start at 1)
        _SelectClause = "SUBSTRING(Keywords,1,120) AS TextSummary1";
    }
    return _SelectClause;
}
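
Deducting 30 from the CHARINDEX() result gives a start position of zero or less whenever the keyword sits within the first 30 characters, or is missing from a row that matched through another OR keyword. How SUBSTRING() treats such a start position is worth verifying on your target database. One defensive option, sketched here under the assumption that the CASE expression is available on your database, is to clamp the start position to 1 inside the SQL itself. The SummaryClauseBuilder class and its BuildSummaryColumn() method are hypothetical names.

```csharp
using System;

// Hypothetical helper that builds one text summary column
public static class SummaryClauseBuilder
{
    // keyword: the phrase to locate, index: the TextSummary column number,
    // lead: characters to show before the keyword, length: summary length
    public static string BuildSummaryColumn(string keyword, int index, int lead, int length)
    {
        // Double up single quotes so that the keyword stays inside its literal
        string escaped = keyword.Replace("'", "''");
        string position = "CHARINDEX('" + escaped + "',Keywords)";
        // Start 'lead' characters before the keyword when there is room,
        // otherwise start at the first character
        return "SUBSTRING(Keywords,CASE WHEN " + position + ">" + lead +
            " THEN " + position + "-" + lead + " ELSE 1 END," + length +
            ") AS [TextSummary" + index + "]";
    }
}
```

Calling BuildSummaryColumn("Medical", 1, 30, 120) produces a column expression that starts 30 characters before the keyword when possible and falls back to the beginning of the text otherwise.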

Now that you've created this function, you can proceed to build the full-text search queries. The full-text search function is expected to return paged data, so this means that you will again implement two functions—one to return the search result count and another to return paged data. The code is very similar to the previous ones you've done. The code for the count function follows. It makes use of the BuildWhereClause() function you've created earlier.

public int GetAccountFilesCountBySearchPhrase(string SearchPhrase)
{
    .
    .
    .
    _whereClause = BuildWhereClause(SearchPhrase);
    _command.CommandText = "SELECT COUNT(*) AS RecordCount FROM " +
        "AccountFiles " + _whereClause;
    .
    .
    .
}

The code to retrieve the paged data follows next. It makes use of both the BuildWhereClause() and BuildSelectClause() functions you've created.

public DataSet GetAccountFilesBySearchPhrase(string SearchPhrase,
    int TotalRecords, int PageNumber, int PageSize,
    string SortColumn, GlobalVariables.SortingOrder SortDirection)
{
    .
    .
    .
    _command = _globalConnection.CreateCommand();
    _whereClause = BuildWhereClause(SearchPhrase);
    _textSummaryClause = BuildSelectClause(SearchPhrase);
    _command.CommandText = "SELECT * FROM (SELECT TOP(" +
        _pageRecordCount + ") * FROM (SELECT TOP(" +
        _initialSelectSize + ") b.FirstName, b.LastName, " +
        "b.AccountType, a.AccountGUID,a.AttachmentID, " +
        "a.AttachmentName, a.AttachmentSize, a.Attachment," +
        _textSummaryClause + " FROM AccountFiles a LEFT JOIN " +
        "Accounts b ON a.AccountGUID=b.AccountGUID " +
        _whereClause + " ORDER BY " + _sortColumn + " " +
        _sortDirection + ", AttachmentID DESC) AS [mytable] " +
        "ORDER BY " + _sortColumn + " " + _sortOppDirection + ", " +
        "AttachmentID ASC) AS [mytable2] ORDER BY " + _sortColumn +
        " " + _sortDirection + ",AttachmentID DESC";
    .
    .
    .
}

Let's now take a look at the equivalent for Oracle Lite.

Creating the full-text search query for Oracle Lite

In Oracle Lite, the equivalents of the CHARINDEX() and SUBSTRING() functions are the INSTR() and SUBSTR() functions. Unfortunately, these two functions work only on the CHAR or VARCHAR data types and not on the LONG or CLOB data types. This presents a problem for us.

CHAR or VARCHAR data types are small by nature (they can only fit up to a maximum of 4,096 bytes), whereas the LONG and CLOB data types allow us to store data up to roughly 2 gigabytes in size. If we don't use the LONG and CLOB data types, we cannot store keywords of any significant size. (A long text document, for instance, can easily contain thousands of unique words).

So, how do we use the CLOB data type to store keywords and yet retain the ability to perform a full-text search? Fortunately, Oracle Lite allows you to use the SQL LIKE operator to search for an expression within a CLOB field. We will use this in place of the CHARINDEX() function.

However, there is no other equivalent workaround to obtain the location of a phrase in a CLOB field. So, in other words, we can retrieve a list of files matching the search phrase but we can't determine where the search phrase occurs in the file.

Hence, for the text summary that is displayed with the search results, we will simply display the first 120 bytes of the text in the matching file. Let's take a look at the BuildWhereClause() function for Oracle Lite. The differences in the Oracle Lite code are highlighted in the following code:

private string BuildWhereClause(string SearchPhrase)
{
    int _counter;
    int _counter2;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _NOTPhrase;
    string _ORPhrase;
    string _WhereClause;
    string _SubWhereClause;
    _WhereClause = "";
    //Split the search phrase using the NOT operator. The pattern
    //is a verbatim string so that \b is read as a word boundary
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b", RegexOptions.IgnoreCase);
    for (_counter = 0; _counter <= (_NOTPhrases.Length - 1); _counter++)
    {
        _NOTPhrase = _NOTPhrases[_counter].Trim();
        if (_NOTPhrase.Length > 0)
        {
            //Split the phrase further using the OR operator
            _ORPhrases = Regex.Split(_NOTPhrase, @"\bOR\b", RegexOptions.IgnoreCase);
            _SubWhereClause = "";
            for (_counter2 = 0; _counter2 <= (_ORPhrases.Length - 1); _counter2++)
            {
                _ORPhrase = _ORPhrases[_counter2].Trim();
                if (_ORPhrase.Length > 0)
                {
                    //Double up single quotes so that the phrase
                    //cannot terminate the SQL string literal
                    _ORPhrase = _ORPhrase.Replace("'", "''");
                    //We use the LIKE operator to do the comparison
                    _SubWhereClause += ((_SubWhereClause.Length > 0) ? " OR " : "") +
                        "a.Keywords LIKE '%" + _ORPhrase + "%'";
                }
            }
            if (_SubWhereClause.Length > 0)
            {
                //Each portion is wrapped in parentheses because
                //SQL evaluates AND before OR
                _WhereClause += ((_WhereClause.Length > 0) ? " AND " : "") +
                    (_counter > 0 ? "NOT " : "") + "(" + _SubWhereClause + ")";
            }
        }
    }
    if (_WhereClause.Length > 0)
    {
        _WhereClause = "WHERE " + _WhereClause;
    }
    return _WhereClause;
}

The GetAccountFilesBySearchPhrase() function in Oracle is similar to the standard paged data retrieval function. The main differences are shown below. For Oracle Lite, you do not have to create a corresponding BuildSelectClause() function. It will always return the first 120 characters in the matching file.

Tip

SQL joins in Oracle Lite

Take note that Oracle Lite does not support the ANSI join syntax (a LEFT JOIN b). It uses the ODBC join syntax instead ({OJ a LEFT JOIN b}).

public DataSet GetAccountFilesBySearchPhrase(string SearchPhrase,
    int TotalRecords, int PageNumber, int PageSize,
    string SortColumn, GlobalVariables.SortingOrder SortDirection)
{
    .
    .
    .
    _whereClause = BuildWhereClause(SearchPhrase);
    _command.CommandText = "SELECT ROWNUM, " +
        "TO_CHAR(a.AccountGUID) AS AccountGUID, b.FirstName, " +
        "b.LastName, b.AccountType, a.AttachmentID, " +
        "a.AttachmentName, a.AttachmentSize, " +
        "a.Attachment,SUBSTR(CAST(a.Keywords AS " +
        "VARCHAR(120)),1,120) AS TextSummary1 FROM {OJ " +
        "AccountFiles a LEFT JOIN Accounts b ON " +
        "a.AccountGUID=b.AccountGUID} " + _whereClause +
        ((_whereClause.Length > 0) ? " AND " : " WHERE ") +
        "ROWNUM>=" + _lowerlimit + " AND ROWNUM<=" + _upperlimit +
        " ORDER BY " + _sortColumn + " " + _sortDirection;
    .
    .
    .
}

The GetAccountFilesCountBySearchPhrase() function is also similar in many respects. The main differences are shown in the following code snippet:

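public int GetAccountFilesCountBySearchPhrase(string SearchPhrase)
{
    .
    .
    .
    _whereclause = BuildWhereClause(SearchPhrase);
    //The table alias 'a' is required here because the generated
    //WHERE clause refers to the column as a.Keywords
    _command.CommandText = "SELECT COUNT(*) AS RecordCount FROM " +
        "AccountFiles a " + _whereclause;
    .
    .
    .
}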

Encapsulating the retrieved dataset using business objects

As usual, you will need to encapsulate the retrieved dataset using a custom business object class (and corresponding collection object). It's fairly easy to create two new classes: AccountFileSummary and AccountFileSummaryCollection based on what you've learned in this chapter and the previous one.

As a last step, you also need to add the two corresponding functions in the global Application class to retrieve the dataset and pass them to the instantiated business objects above.

Creating the full-text search forms

There are two forms you will need to build for the full-text search. The first form, the FullTextSearch form, is simple: it presents a text box and a button that let the user key in a search phrase and run the search.

On clicking the Search button, it simply passes the search phrase to the second form (which is the full-text search results summary page).

public void btnSearch_Click(System.Object sender,
System.EventArgs e)
{
NavigationService.ShowDialog("FullTextSearchSummary",
txtSearch.Text);
}

In your second form, the FullTextSearchSummary form, we no longer use a DataGrid control. The easiest way to display the search results as shown in the next screenshot would be to use HTML. To display HTML in our application, we will need to use the .NET Compact Framework's WebBrowser control.

You will still need to use the Pager control as you need to be able to flip between multiple pages of data. You can generate HTML code from the retrieved dataset using the following code:

private void RefreshPage()
{
    string _webOutput;
    string _textSummary;
    //Get a business object containing the list of account
    //files matching the search phrase
    _accountFiles = GlobalArea.Application.GetAccountFilesBySearchPhrase(
        _searchPhrase, _totalRecords, pgPager.CurrentPage,
        _recordsPerPage, "", GlobalVariables.SortingOrder.Ascending);
    _webOutput = "<html><body>";
    foreach (AccountFileSummary _accountFile in _accountFiles)
    {
        //Retrieve the text summary and highlight the
        //keywords in bold
        _textSummary = _accountFile.TextSummary;
        _textSummary = HighlightKeywords(_textSummary, _searchPhrase);
        //Display a link that holds the account GUID and the
        //account type of the account
        _webOutput = _webOutput + "<a href='http://AccountInfo/" +
            _accountFile.AccountGUID.ToString() + "," +
            _accountFile.AccountType + "'><font " +
            "color='#0000FF' face='Tahoma' size='2'><b>" +
            _accountFile.FirstName + " ";
        _webOutput = _webOutput + _accountFile.LastName +
            "</b></font></a><br>";
        //Display the attachment name and size, followed by a link
        //that holds the full path of the file attachment.
        //Strings.Format() comes from the Microsoft.VisualBasic
        //namespace, so the project needs a reference to it
        _webOutput = _webOutput + "<font color='#000000' " +
            "face='Tahoma' size='1'><b>" +
            _accountFile.AttachmentName + " (" +
            Strings.Format(_accountFile.AttachmentSize, "###,###,###") +
            ") bytes" + "</b></font>" +
            "&nbsp;&nbsp;<a href='http://FileID/" +
            _accountFile.Attachment + "'><font color='#0000FF' " +
            "face='Tahoma' size='2'><b>Open file</b></font></a><br>";
        _webOutput = _webOutput + "<font color='#000000' " +
            "face='Tahoma' size='1'>" + _textSummary + "</font><br>";
        _webOutput = _webOutput + "<br>";
    }
    _webOutput = _webOutput + "</body></html>";
    //Assign the HTML string to the WebBrowser control, and
    //call the Show() function to display the HTML
    wbSearchSummary.DocumentText = _webOutput;
    wbSearchSummary.Show();
}

To highlight the list of keywords in the text summary, create the HighlightKeywords() function as in the following code:

private string HighlightKeywords(string TextSummary, string SearchPhrase)
{
    string[] _keywords;
    string _keyword;
    int _counter;
    //Split on standalone OR and NOT operators only, so that
    //words such as 'Doctor' are not broken apart
    _keywords = Regex.Split(SearchPhrase, @"\b(?:OR|NOT)\b", RegexOptions.IgnoreCase);
    for (_counter = 0; _counter <= (_keywords.Length - 1); _counter++)
    {
        _keyword = _keywords[_counter].Trim();
        if (_keyword.Length > 0)
        {
            //Escape the keyword so that it is matched literally.
            //$0 reinserts the matched text, which preserves the
            //casing found in the summary
            TextSummary = Regex.Replace(TextSummary, Regex.Escape(_keyword),
                "<b>$0</b>", RegexOptions.IgnoreCase);
        }
    }
    return TextSummary;
}

When the user clicks on any of the links on this page, the WebBrowser control will receive a Navigating event notification. We can use this mechanism to determine which link the user clicked on by inspecting the URL of the link (which contains the desired information written by the RefreshPage() function shown previously).

public void wbSearchSummary_Navigating(object sender,
    System.Windows.Forms.WebBrowserNavigatingEventArgs e)
{
    string _temp;
    System.Guid _accountGUID;
    BaseAccount _account;
    string _filePath;
    string[] _info;
    BaseAccount.AccountTypes _type;
    //User clicks on the 'Account Name' link
    if (e.Url.ToString().ToLower().StartsWith("http://accountinfo/"))
    {
        e.Cancel = true;
        _temp = e.Url.ToString().Substring("http://accountinfo/".Length);
        _info = Regex.Split(_temp, ",", RegexOptions.IgnoreCase);
        //We expect exactly two parts: the GUID and the account type
        if (_info.Length == 2)
        {
            _accountGUID = new System.Guid(_info[0]);
            //A string cannot be cast directly to an enumeration.
            //Enum.Parse() accepts both the name and the numeric
            //value of an enumeration member
            _type = (CRMLive.BaseAccount.AccountTypes)Enum.Parse(
                typeof(CRMLive.BaseAccount.AccountTypes), _info[1], true);
            _account = GlobalArea.Application.GetAccount(_accountGUID, _type);
            NavigationService.ShowDialog("Edit" +
                BaseAccount.AccountTypeToString(_type), ((object)_account));
        }
    }
    //User clicks on an 'Open File' link
    if (e.Url.ToString().ToLower().StartsWith("http://fileid/"))
    {
        e.Cancel = true;
        _filePath = e.Url.ToString().Substring("http://fileid/".Length);
        if (MessageBox.Show("Are you sure you wish to open this file?",
            "Open file", MessageBoxButtons.YesNo,
            MessageBoxIcon.Question,
            MessageBoxDefaultButton.Button1) == DialogResult.Yes)
        {
            Process.Start(_filePath, "");
        }
    }
}

Trying out the full-text search

To try out the full-text search form, you will need to add another icon to the main menu form. You will also need to create a new entry in the NavigationService class to launch the FullTextSearch form.

Once you have this set up, try launching the full-text search form. You will, of course, need to index a file before you can search for it, so create a new lead account and upload a few text-based files (such as HTML files) in the file attachments tab.

Once you have done that, navigate to the full-text search form, and type in a search phrase. You can use the OR and NOT operators to narrow down your search. Assuming the search keywords exist in the indexed files, you will be able to see a list of the matching files.

Improving the full-text search engine

The search engine you have created so far works, but there is certainly much room for improvement. A few ways you can further improve this search engine are as follows:

  • Creating a DOCX and PDF keyword extractor:

    The Microsoft Word DOCX and Adobe PDF formats are popular file attachment types. Letting your users search through DOCX and PDF content can indeed provide a lot of business value. You can easily create new keyword extractors for these file formats by leveraging the framework you have created.

  • Supporting nested Boolean queries:

    You can also build support for nested Boolean queries. A sample nested Boolean query might look like this: ((Medical OR Certificate) NOT Clinic) OR Hospital.

  • Creating a more comprehensive list of stop words to reduce the size of generated keywords:

    By increasing the size of the stop words list, you can increase the accuracy of your search by stripping away irrelevant text. Take note, however, that increasing the size of the stop words list means your application will need to incur more processing cycles to strip away this data. You will need to find the right balance between accuracy and performance for your project.

  • Implementing search result ranking:

    You can also implement a simple search-result ranking system based on popularity. By keeping track of a popularity counter for each search result, and incrementing it every time the user opens that particular item, you can sort the search results by this popularity counter in descending order to show the most popular result at the top.
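
To give an idea of what supporting nested Boolean queries might involve, the following is a rough sketch of a small recursive-descent parser. It is one possible approach and not a finished implementation: the NestedQueryParser class is a hypothetical name, NOT is treated as AND NOT with OR binding more tightly (mirroring the order in which BuildWhereClause() splits the phrase), and malformed input simply raises a FormatException. The generated SQL may contain redundant parentheses, which are harmless.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical parser that turns a nested search phrase into a SQL condition
public class NestedQueryParser
{
    private readonly List<string> _tokens = new List<string>();
    private int _pos;

    public static string ToWhereExpression(string searchPhrase)
    {
        var parser = new NestedQueryParser(searchPhrase);
        string sql = parser.ParseExpression();
        if (parser._pos < parser._tokens.Count)
            throw new FormatException("Unexpected token: " + parser._tokens[parser._pos]);
        return sql;
    }

    private NestedQueryParser(string searchPhrase)
    {
        // Tokens are parentheses or runs of non-space characters
        foreach (Match m in Regex.Matches(searchPhrase, @"\(|\)|[^\s()]+"))
            _tokens.Add(m.Value);
    }

    private string Peek()
    {
        return _pos < _tokens.Count ? _tokens[_pos] : null;
    }

    private static bool Is(string token, string keyword)
    {
        return string.Equals(token, keyword, StringComparison.OrdinalIgnoreCase);
    }

    // expression := [orGroup] { NOT orGroup }
    private string ParseExpression()
    {
        string sql = Is(Peek(), "NOT") ? "" : ParseOrGroup();
        while (Is(Peek(), "NOT"))
        {
            _pos++;
            string excluded = "NOT " + ParseOrGroup();
            sql = sql.Length > 0 ? sql + " AND " + excluded : excluded;
        }
        return sql;
    }

    // orGroup := operand { OR operand }
    private string ParseOrGroup()
    {
        var parts = new List<string> { ParseOperand() };
        while (Is(Peek(), "OR"))
        {
            _pos++;
            parts.Add(ParseOperand());
        }
        return parts.Count == 1 ? parts[0] : "(" + string.Join(" OR ", parts.ToArray()) + ")";
    }

    // operand := "(" expression ")" | word { word }
    private string ParseOperand()
    {
        if (Peek() == "(")
        {
            _pos++;
            string inner = ParseExpression();
            if (Peek() != ")") throw new FormatException("Missing closing parenthesis");
            _pos++;
            return "(" + inner + ")";
        }
        // Consecutive words form one phrase, as in 'Medical Certificates'
        var words = new List<string>();
        while (Peek() != null && Peek() != "(" && Peek() != ")" &&
               !Is(Peek(), "OR") && !Is(Peek(), "NOT"))
        {
            words.Add(_tokens[_pos++]);
        }
        if (words.Count == 0) throw new FormatException("Expected a keyword");
        string phrase = string.Join(" ", words.ToArray()).Replace("'", "''");
        return "CHARINDEX('" + phrase + "',Keywords)>0";
    }
}
```

Passing the sample query ((Medical OR Certificate) NOT Clinic) OR Hospital to NestedQueryParser.ToWhereExpression() yields a condition in which the Clinic exclusion applies only to the inner Medical/Certificate group, while Hospital matches on its own. For Oracle Lite, the CHARINDEX() expression in ParseOperand() would need to be swapped for the LIKE comparison used earlier.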
