As you learned earlier, full-text search consists of two parts: indexing and searching. To index a file, you need to be able to extract keywords from it. As files come in all types and formats (for example, .PDF, .DOCX, .XLSX, and .CAD), it would be difficult to extract keywords from every type of file out there.
However, what you can do is build an extensible framework that lets you easily add support for more file formats in the future. You will build a common interface for Keyword Extractor classes, the IKeywordExtractor interface. Each keyword extractor will handle a specific file format, and its sole function will be to retrieve a set of keywords from that file.
The Keyword Extractor interface defines three abstract methods. These methods are self-explanatory: a Keyword Extractor class must be able to open a file, extract keywords (into a string), and subsequently close the file.
public interface IKeywordExtractor
{
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}
We also create a KeywordExtractorBase class that offers common functionality across all keyword extractors. When you extract keywords from a file, you will most likely need to throw away common words that you don't need to index. For example, words like 'a', 'the', and 'an' do not need to be indexed. These words are called stop words. We can strip away stop words using regular expressions (as shown in the following highlighted function):
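using System.Text.RegularExpressions;

public class KeywordExtractorBase
{
    private string[] _stopWords;

    public string RemoveStopWords(string RawText)
    {
        string[] _escapedWords;
        string _regexPattern;
        int _counter;

        //Escape each stop word into a local copy so that the
        //_stopWords field is not modified on every call
        _escapedWords = new string[_stopWords.Length];
        for (_counter = 0; _counter < _stopWords.Length; _counter++)
        {
            _escapedWords[_counter] = Regex.Escape(_stopWords[_counter]);
        }

        //\b is the Regex word boundary. The verbatim (@) string stops
        //C# from reading "\b" as a backspace character
        _regexPattern = @"\b(" + string.Join("|", _escapedWords) + @")\b";
        return Regex.Replace(RawText, _regexPattern, "",
            RegexOptions.IgnoreCase);
    }

    public KeywordExtractorBase()
    {
        _stopWords = new string[] {"a", "and", "the", "of", "but"};
    }
}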
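One detail of the pattern is easy to get wrong in C#: in a regular string literal, "\b" is the backspace character, whereas the Regex word-boundary anchor must be written as @"\b" or "\\b". The following is a minimal, self-contained sketch of the difference. The StopWordDemo class, its stop-word list, and the whitespace clean-up step are illustrative assumptions, not part of the application built in this chapter.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the sample application.
public static class StopWordDemo
{
    // Illustrative stop-word list; a real list would be much longer.
    private static readonly string[] StopWords = { "a", "and", "the", "of", "but" };

    public static string RemoveStopWords(string rawText)
    {
        // Escape each word, then wrap the alternation in word boundaries so
        // that "a" matches the standalone word but not the "a" in "apple".
        string[] escaped = new string[StopWords.Length];
        for (int i = 0; i < StopWords.Length; i++)
        {
            escaped[i] = Regex.Escape(StopWords[i]);
        }
        string pattern = @"\b(?:" + string.Join("|", escaped) + @")\b";
        string stripped = Regex.Replace(rawText, pattern, "", RegexOptions.IgnoreCase);

        // Collapse the runs of whitespace left behind by the removed words.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // "\b" in a regular literal is a backspace, so this pattern never matches.
        Console.WriteLine(Regex.IsMatch("the cat", "\bthe\b"));   // False
        // The verbatim literal passes \b through to the Regex engine.
        Console.WriteLine(Regex.IsMatch("the cat", @"\bthe\b"));  // True

        Console.WriteLine(RemoveStopWords("The price of an apple and a pear"));
        // prints "price an apple pear" ("an" is not in the illustrative list)
    }
}
```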
When the user uploads an HTML file (for instance, a downloaded web page) to your application, it is readily indexable as it contains mostly text instead of binary information. However, even a simple web page may contain lots of HTML tags that don't need to be indexed. Take a look at the following sample:
<HTML> <BODY> This is my <font class='MainFont'>sample website</font>, there are simply too many <b>tags</b> here </BODY> </HTML>
In a heavily formatted web page, the HTML tags not only add up to a lot of wasted space but also increase search inaccuracy, as these tags are picked up by the search as well.
The HTML Keyword Extractor is an implementation of a keyword extractor that strips away the HTML tags from a document that is about to be indexed. This can be quite easily done via pattern matching using regular expressions. The following snippet of code shows how the HTMLKeywordExtractor class can be implemented.
using System.IO;
using System.Text.RegularExpressions;

public class HTMLKeywordExtractor : KeywordExtractorBase, IKeywordExtractor
{
    private StreamReader _reader;

    public bool Open(string filePath)
    {
        //File.OpenText() throws an exception if the file cannot be
        //opened, so reaching the next line means success
        _reader = File.OpenText(filePath);
        return true;
    }

    public bool Close()
    {
        _reader.Close();
        return true;
    }

    private string RemoveHTMLTags(string RawText)
    {
        //Lazily match everything between '<' and '>', including
        //tags that span line breaks
        return Regex.Replace(RawText, "<(.|\n)*?>", string.Empty);
    }

    public string ExtractKeywords()
    {
        string _rawText;
        _rawText = _reader.ReadToEnd();
        _rawText = RemoveHTMLTags(_rawText);
        _rawText = base.RemoveStopWords(_rawText);
        return _rawText;
    }
}
You can also build your own keyword extractors to handle other file formats. For example, a PDF file is not readily indexable as its content is stored in binary form. It's a great boon to users if they can search the content within their PDF files. To implement this, you could build another keyword extractor to extract the text from a PDF file. This is out of the scope of this book of course, but with the Keyword Extractor interface and base classes, you have the infrastructure to easily add support for more file formats in the future.
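As a rough illustration of how little is needed to support another format, here is a sketch of an extractor for plain-text files. The TextKeywordExtractor and TextExtractorDemo names are hypothetical, the interface is repeated only so that the listing compiles on its own, and stop-word removal is left out to keep the example short.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IKeywordExtractor
{
    bool Open(string filePath);
    string ExtractKeywords();
    bool Close();
}

// Hypothetical extractor for plain-text (.txt) files.
public class TextKeywordExtractor : IKeywordExtractor
{
    private StreamReader _reader;

    public bool Open(string filePath)
    {
        if (!File.Exists(filePath)) return false;
        _reader = File.OpenText(filePath);
        return true;
    }

    public string ExtractKeywords()
    {
        if (_reader == null) return string.Empty;
        string rawText = _reader.ReadToEnd();
        // Plain text needs no tag stripping; just normalize the whitespace.
        return Regex.Replace(rawText, @"\s+", " ").Trim();
    }

    public bool Close()
    {
        if (_reader == null) return false;
        _reader.Close();
        _reader = null;
        return true;
    }
}

public static class TextExtractorDemo
{
    // Runs the open / extract / close sequence against one file.
    public static string ExtractFrom(string filePath)
    {
        IKeywordExtractor extractor = new TextKeywordExtractor();
        if (!extractor.Open(filePath)) return string.Empty;
        try { return extractor.ExtractKeywords(); }
        finally { extractor.Close(); }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "Medical\r\n  certificate   for John");
        Console.WriteLine(ExtractFrom(path)); // prints "Medical certificate for John"
        File.Delete(path);
    }
}
```

A real PDF or DOCX extractor would follow the same three-method shape, with the format-specific parsing confined to ExtractKeywords().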
We can have the application automatically index a file each time the user uploads one via the FileDetailViewer window created in the previous chapter (shown in the following screenshot).
You need to add the IndexFile() function to the File class that you created in the previous chapter. This function checks the extension of the uploaded file and instantiates the corresponding keyword extractor (if one is available). The extracted keywords are then saved to the Keywords property, which eventually ends up being written to the Keywords column of the AccountFiles table.
You first need to add a new Keywords column to the AccountFiles table in your database. You should make it an NTEXT field (in SQL Server CE) or a CLOB field (in Oracle Lite) as you might end up with a very large set of keywords.
public void IndexFile()
{
    string _attachmentName;
    IKeywordExtractor _keywordExtractor;

    _attachmentName = Attachment;

    //Path.GetExtension() returns the extension including the
    //leading period (for example, ".html")
    switch (System.IO.Path.GetExtension(_attachmentName).ToLower())
    {
        case ".html":
            _keywordExtractor = new HTMLKeywordExtractor();
            break;
        default:
            //No keyword extractor is available for this file type
            return;
    }

    _keywordExtractor.Open(_attachmentName);
    Keywords = _keywordExtractor.ExtractKeywords();
    _keywordExtractor.Close();
}
As a last step, you can call the IndexFile() function from the click event handler of the Save button in the FileDetailViewer window to invoke the indexing process.
Now that you have managed to extract keywords from uploaded files and save them to the database, we will look at how you can create an SQL query that can make use of these keywords to perform a Boolean search.
As you learned earlier, SQL Server CE does not support any full-text search functionality and, therefore, does not come with the CONTAINS() and FREETEXT() T-SQL functions that would otherwise have made full-text search an easier task.
The closest equivalents you are given are the CHARINDEX(sequence,expression) and PATINDEX(pattern,expression) T-SQL functions. These functions search text-based SQL fields for a given sequence or pattern and return the starting position of the match in the text (if it is found). Consider the following table for instance:
TABLE [Fruits]
----------------------------
FruitName   FruitDescription
----------------------------
Apple       I love apples
Running the following SQL would produce the result 8 (the starting position of the word 'apple' in the preceding text):
SELECT CHARINDEX('apple',FruitDescription) FROM [Fruits]
The PATINDEX() function gives you a little more control, in that you can specify a small set of wildcard patterns (a limited relative of regular expressions). An example follows:
SELECT PATINDEX('%apple[0-9]%',FruitDescription) FROM [Fruits]
So, how does the CHARINDEX() function translate to the full-text search feature? Let's take a look at an example. If the user, for instance, wanted to search for all documents containing either the Medical or Certificates keyword, he or she would key in Medical OR Certificates as the search phrase. You could retrieve a listing of the uploaded files matching these keywords using the following SQL:
SELECT * FROM AccountFiles WHERE CHARINDEX( 'Medical', Keywords)>0 OR CHARINDEX('Certificates', Keywords)>0
The CHARINDEX() and PATINDEX() T-SQL functions, unlike the LIKE keyword, work well on large data types such as TEXT or NTEXT.
You now need to find a way to translate a Boolean search phrase into an SQL WHERE clause. The user can key in search phrases in many different ways. A few are shown in the following table:
Search phrase example | Description |
---|---|
Medical OR Certificates OR Receipts | Containing any of the 'Medical', 'Certificates', or 'Receipts' keywords |
Medical OR Certificates NOT Clinic | Containing either the 'Medical' or the 'Certificates' keyword, but must not contain the 'Clinic' keyword |
Medical Certificates OR Medical Receipts | Containing either the 'Medical Certificates' phrase or the 'Medical Receipts' phrase |
NOT Medical | Any document that does not contain the 'Medical' keyword |
You can create a BuildWhereClause() function that takes in a search phrase keyed in by the user and converts it into SQL. The code is as follows:
private string BuildWhereClause(string SearchPhrase)
{
    int _counter;
    int _counter2;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _NOTPhrase;
    string _ORPhrase;
    string _WhereClause;
    string _SubWhereClause;

    _WhereClause = "";

    //Split the search phrase using 'NOT' as the delimiter.
    //The \b Regex special character specifies a word boundary
    //and tells Regex to only consider occurrences of 'NOT' that
    //are standalone words. The verbatim (@) string keeps C#
    //from treating \b as a backspace character
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b",
        RegexOptions.IgnoreCase);

    //Loop through the split parts
    for (_counter = 0; _counter <= (_NOTPhrases.Length - 1); _counter++)
    {
        _NOTPhrase = _NOTPhrases[_counter].Trim();
        if (_NOTPhrase.Length > 0)
        {
            //Split this part further using 'OR' as the delimiter
            _ORPhrases = Regex.Split(_NOTPhrase, @"\bOR\b",
                RegexOptions.IgnoreCase);
            _SubWhereClause = "";

            //Here we enter another loop
            for (_counter2 = 0; _counter2 <= (_ORPhrases.Length - 1);
                _counter2++)
            {
                _ORPhrase = _ORPhrases[_counter2].Trim();
                if (_ORPhrase.Length > 0)
                {
                    //Generate the CHARINDEX() expression for each
                    //keyword in the search phrase and attach them
                    //together using the OR operator. Single quotes
                    //are doubled so that they cannot break the SQL
                    _SubWhereClause +=
                        ((_SubWhereClause.Length > 0) ? " OR " : "") +
                        "CHARINDEX('" + _ORPhrase.Replace("'", "''") +
                        "',Keywords)>0";
                }
            }

            if (_SubWhereClause.Length > 0)
            {
                //As we are still inside a loop that runs through
                //phrases separated by the NOT operator, we reattach
                //each formatted portion using AND NOT. The
                //parentheses keep the OR terms grouped, because
                //SQL evaluates AND before OR
                _WhereClause +=
                    ((_WhereClause.Length > 0) ? " AND " : "") +
                    (_counter > 0 ? "NOT " : "") +
                    "(" + _SubWhereClause + ")";
            }
        }
    }

    if (_WhereClause.Length > 0)
    {
        //If the WHERE clause is not empty, we append the
        //'WHERE' SQL keyword at the front
        _WhereClause = "WHERE " + _WhereClause;
    }
    return _WhereClause;
}
As an example, passing the search phrase "Medical OR Certificate NOT Clinic" through the preceding function will produce the following:
WHERE (CHARINDEX('Medical',Keywords)>0 OR CHARINDEX('Certificate',Keywords)>0) AND NOT (CHARINDEX('Clinic',Keywords)>0)
As you saw in the brief walk-through earlier in this chapter, the full-text search results are displayed differently. Instead of the usual DataGrid listing, we display them in a fashion similar to online search engines: each search result item is shown with a text summary beneath it.
The text summary displayed isn't just a random snippet of text from the Keywords field. It must contain at least one of the search phrase keywords typed in by the user for the display to be meaningful. These keywords are highlighted in bold in the text summary.
To retrieve these text summaries, you can use another T-SQL function available in SQL Server CE: the SUBSTRING(expression, starting_position, length) function. This function simply extracts a chunk of text (with a specified length) from a specified field and starting position. Hence, assuming you have the following data:
TABLE [AccountFiles]
---------------------------------------
Keywords
---------------------------------------
This Medical Certificate proves that...
If you combine the CHARINDEX() function with the SUBSTRING() function in the following manner:
SELECT SUBSTRING(Keywords,CHARINDEX('Medical',Keywords),12) FROM AccountFiles
You will get the following result:
Medical Cert
You can use these two functions to extract the first chunk of text that contains the search phrase keywords. Keeping in mind that a search phrase may consist of more than one set of keywords (attached via the OR operator), you need to call these two functions for every OR keyword in the search phrase. The search phrase Medical OR Certificate NOT Clinic will hence translate to the following:
SELECT SUBSTRING(Keywords,CHARINDEX('Medical',Keywords),100) AS [TextSummary1], SUBSTRING(Keywords,CHARINDEX('Certificate',Keywords), 100) AS [TextSummary2] FROM AccountFiles
Let's write some code to do this translation:
private string BuildSelectClause(string SearchPhrase)
{
    int _counter;
    int _counter2 = 1;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _ActivePhrase;
    string _ORPhrase;
    string _SelectClause;

    _SelectClause = "";

    //Split the search phrase at the NOT operator. The verbatim (@)
    //string keeps C# from treating \b as a backspace character
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b",
        RegexOptions.IgnoreCase);

    //Everything after the NOT is not needed. We only need the
    //part on the left of the NOT operator
    _ActivePhrase = _NOTPhrases[0].Trim();
    if (_ActivePhrase.Length > 0)
    {
        //Split the phrase using the OR operator
        _ORPhrases = Regex.Split(_ActivePhrase, @"\bOR\b",
            RegexOptions.IgnoreCase);
        for (_counter = 0; _counter <= (_ORPhrases.Length - 1); _counter++)
        {
            _ORPhrase = _ORPhrases[_counter].Trim();
            if (_ORPhrase.Length > 0)
            {
                //For each OR keyword, we call the SUBSTRING and
                //CHARINDEX functions. Take note that we deduct 30
                //characters from the starting point so that the
                //keyword does not appear exactly as the first
                //word in the text summary. This is purely for
                //aesthetic value. Single quotes in the keyword
                //are doubled so that they cannot break the SQL
                _SelectClause +=
                    ((_SelectClause.Length > 0) ? "," : "") +
                    "SUBSTRING(Keywords,CHARINDEX('" +
                    _ORPhrase.Replace("'", "''") +
                    "',Keywords)-30,120) AS [TextSummary" +
                    _counter2.ToString() + "]";
                _counter2++;
            }
        }
    }

    if (_SelectClause.Length == 0)
    {
        //If no keyword was specified in the search phrase, we
        //simply return the first 120 characters of the text
        //(SUBSTRING positions start at 1)
        _SelectClause = "SUBSTRING(Keywords,1,120) AS [TextSummary1]";
    }
    return _SelectClause;
}
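To make the offset arithmetic concrete, the following sketch performs the in-memory equivalent of the SUBSTRING/CHARINDEX combination on an ordinary string. The TextSummaryDemo class and its Summarize() method are illustrative assumptions. Clamping the starting offset to the beginning of the text is a choice made in this sketch; it is not a statement about how SQL Server CE treats a starting position below 1.

```csharp
using System;

// Hypothetical helper; mirrors the SQL expression on an in-memory string.
public static class TextSummaryDemo
{
    // Returns up to 'length' characters of 'text', starting 'lead'
    // characters before the first case-insensitive occurrence of 'keyword'.
    // If the keyword is not found, the summary starts at the beginning.
    public static string Summarize(string text, string keyword, int lead, int length)
    {
        int hit = text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        int start = Math.Max(0, hit - lead);            // hit is -1 when not found
        int count = Math.Min(length, text.Length - start);
        return text.Substring(start, count);
    }

    public static void Main()
    {
        string keywords = "This Medical Certificate proves that...";

        // Same result as SUBSTRING(Keywords,CHARINDEX('Medical',Keywords),12)
        Console.WriteLine(Summarize(keywords, "Medical", 0, 12)); // Medical Cert

        // With a 30-character lead the window is clamped to the start
        // of this short text, so the whole string is returned.
        Console.WriteLine(Summarize(keywords, "Certificate", 30, 120));
    }
}
```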
Now that you've created this function, you can proceed to build the full-text search queries. The full-text search function is expected to return paged data, so you will again implement two functions: one to return the search result count and another to return paged data. The code is very similar to the previous paged queries you've written. The code for the count function follows. It makes use of the BuildWhereClause() function you created earlier.
public int GetAccountFilesCountBySearchPhrase(string SearchPhrase)
{
    .
    .
    .
    _whereClause = BuildWhereClause(SearchPhrase);
    _command.CommandText =
        "SELECT COUNT(*) AS RecordCount FROM AccountFiles " + _whereClause;
    .
    .
    .
}
The code to retrieve the paged data follows next. It makes use of both the BuildWhereClause() and BuildSelectClause() functions you've created.
public DataSet GetAccountFilesBySearchPhrase(string SearchPhrase,
    int TotalRecords, int PageNumber, int PageSize, string SortColumn,
    GlobalVariables.SortingOrder SortDirection)
{
    .
    .
    .
    _command = _globalConnection.CreateCommand();
    _whereClause = BuildWhereClause(SearchPhrase);
    _textSummaryClause = BuildSelectClause(SearchPhrase);
    //The SQL is built from concatenated pieces because a regular
    //C# string literal cannot span multiple lines
    _command.CommandText =
        "SELECT * FROM (SELECT TOP(" + _pageRecordCount + ") * FROM " +
        "(SELECT TOP(" + _initialSelectSize + ") b.FirstName, b.LastName, " +
        "b.AccountType, a.AccountGUID, a.AttachmentID, a.AttachmentName, " +
        "a.AttachmentSize, a.Attachment, " + _textSummaryClause +
        " FROM AccountFiles a LEFT JOIN Accounts b " +
        "ON a.AccountGUID=b.AccountGUID " + _whereClause +
        " ORDER BY " + _sortColumn + " " + _sortDirection +
        ", AttachmentID DESC) AS [mytable] " +
        "ORDER BY " + _sortColumn + " " + _sortOppDirection +
        ", AttachmentID ASC) AS [mytable2] " +
        "ORDER BY " + _sortColumn + " " + _sortDirection +
        ", AttachmentID DESC";
    .
    .
    .
}
In Oracle Lite, the equivalents of the CHARINDEX() and SUBSTRING() functions are the INSTR() and SUBSTR() functions. Unfortunately, these two functions can only work on the CHAR or VARCHAR data types and not on the LONG or CLOB data types. This presents a problem for us.
CHAR or VARCHAR data types are small by nature (they can only fit up to a maximum of 4,096 bytes), whereas the LONG and CLOB data types allow us to store data up to roughly 2 gigabytes in size. If we don't use the LONG and CLOB data types, we cannot store keywords of any significant size. (A long text document, for instance, can easily contain thousands of unique words.)
So, how do we use the CLOB data type to store keywords and yet retain the ability to perform a full-text search? Fortunately, Oracle Lite allows you to use the SQL LIKE operator to search for an expression within a CLOB field. We will use this in place of the CHARINDEX() function.
However, there is no equivalent workaround to obtain the location of a phrase in a CLOB field. In other words, we can retrieve a list of files matching the search phrase, but we can't determine where the search phrase occurs in the file.
Hence, for the text summary that is displayed with the search results, we will simply display the first 120 characters of the text in the matching file. Let's take a look at the BuildWhereClause() function for Oracle Lite. The differences in the Oracle Lite code are highlighted in the following code:
private string BuildWhereClause(string SearchPhrase)
{
    int _counter;
    int _counter2;
    string[] _NOTPhrases;
    string[] _ORPhrases;
    string _NOTPhrase;
    string _ORPhrase;
    string _WhereClause;
    string _SubWhereClause;

    _WhereClause = "";

    //Split the search phrase using the NOT operator. The verbatim
    //(@) string keeps C# from treating \b as a backspace character
    _NOTPhrases = Regex.Split(SearchPhrase, @"\bNOT\b",
        RegexOptions.IgnoreCase);
    for (_counter = 0; _counter <= (_NOTPhrases.Length - 1); _counter++)
    {
        _NOTPhrase = _NOTPhrases[_counter].Trim();
        if (_NOTPhrase.Length > 0)
        {
            //Split the phrase further using the OR operator
            _ORPhrases = Regex.Split(_NOTPhrase, @"\bOR\b",
                RegexOptions.IgnoreCase);
            _SubWhereClause = "";
            for (_counter2 = 0; _counter2 <= (_ORPhrases.Length - 1);
                _counter2++)
            {
                _ORPhrase = _ORPhrases[_counter2].Trim();
                if (_ORPhrase.Length > 0)
                {
                    //We use the LIKE operator to do the comparison.
                    //Single quotes in the keyword are doubled so
                    //that they cannot break the SQL
                    _SubWhereClause +=
                        ((_SubWhereClause.Length > 0) ? " OR " : "") +
                        "a.Keywords LIKE '%" +
                        _ORPhrase.Replace("'", "''") + "%'";
                }
            }
            if (_SubWhereClause.Length > 0)
            {
                //The parentheses keep the OR terms grouped, because
                //SQL evaluates AND before OR
                _WhereClause +=
                    ((_WhereClause.Length > 0) ? " AND " : "") +
                    (_counter > 0 ? "NOT " : "") +
                    "(" + _SubWhereClause + ")";
            }
        }
    }
    if (_WhereClause.Length > 0)
    {
        _WhereClause = "WHERE " + _WhereClause;
    }
    return _WhereClause;
}
The GetAccountFilesBySearchPhrase() function in Oracle Lite is similar to the standard paged data retrieval function. The main differences are shown below. For Oracle Lite, you do not have to create a corresponding BuildSelectClause() function. The query will always return the first 120 characters in the matching file.
SQL joins in Oracle Lite
Take note that Oracle Lite does not support the ANSI join syntax (a LEFT JOIN b). It uses the ODBC join syntax instead ({OJ a LEFT JOIN b}).
public DataSet GetAccountFilesBySearchPhrase(string SearchPhrase,
    int TotalRecords, int PageNumber, int PageSize, string SortColumn,
    GlobalVariables.SortingOrder SortDirection)
{
    .
    .
    .
    _whereClause = BuildWhereClause(SearchPhrase);
    _command.CommandText =
        "SELECT ROWNUM, TO_CHAR(a.AccountGUID) AS AccountGUID, " +
        "b.FirstName, b.LastName, b.AccountType, a.AttachmentID, " +
        "a.AttachmentName, a.AttachmentSize, a.Attachment, " +
        "SUBSTR(CAST(a.Keywords AS VARCHAR(120)),1,120) AS TextSummary1 " +
        "FROM {OJ AccountFiles a LEFT JOIN Accounts b " +
        "ON a.AccountGUID=b.AccountGUID} " + _whereClause +
        ((_whereClause.Length > 0) ? " AND " : " WHERE ") +
        "ROWNUM>=" + _lowerlimit + " AND ROWNUM<=" + _upperlimit +
        " ORDER BY " + _sortColumn + " " + _sortDirection;
    .
    .
    .
}
The GetAccountFilesCountBySearchPhrase() function is also similar in many respects. The main differences are shown in the following code snippet:
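public int GetAccountFilesCountBySearchPhrase(string SearchPhrase)
{
    .
    .
    .
    _whereclause = BuildWhereClause(SearchPhrase);
    //The 'a' alias is needed because the Oracle Lite version of
    //BuildWhereClause() refers to the column as a.Keywords
    _command.CommandText =
        "SELECT COUNT(*) AS RecordCount FROM AccountFiles a " + _whereclause;
    .
    .
    .
}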
As usual, you will need to encapsulate the retrieved dataset using a custom business object class (and corresponding collection object). It's fairly easy to create two new classes, AccountFileSummary and AccountFileSummaryCollection, based on what you've learned in this chapter and the previous one.
As a last step, you also need to add the two corresponding functions in the global Application class to retrieve the dataset and pass it to the business objects instantiated above.
There are two forms you will need to build for the full-text search. The first form, the FullTextSearch form, is simple: it presents a text box and a button for the user to key in a search phrase and run the search.
On clicking the Search button, it simply passes the search phrase to the second form (which is the full-text search results summary page).
public void btnSearch_Click(System.Object sender, System.EventArgs e)
{
    NavigationService.ShowDialog("FullTextSearchSummary", txtSearch.Text);
}
In your second form, the FullTextSearchSummary form, we no longer use a DataGrid control. The easiest way to display the search results as shown in the next screenshot would be to use HTML. To display HTML in our application, we will need to use the .NET Compact Framework's WebBrowser control.
You will still need to use the Pager control as you need to be able to flip between multiple pages of data. You can generate HTML code from the retrieved dataset using the following code:
private void RefreshPage()
{
    string _webOutput;
    string _textSummary;

    //Get a business object containing the list of account
    //files matching the search phrase
    _accountFiles = GlobalArea.Application.GetAccountFilesBySearchPhrase(
        _searchPhrase, _totalRecords, pgPager.CurrentPage,
        _recordsPerPage, "", GlobalVariables.SortingOrder.Ascending);

    _webOutput = "<html><body>";
    foreach (AccountFileSummary _accountFile in _accountFiles)
    {
        //Retrieve the text summary and highlight the
        //keywords in bold
        _textSummary = _accountFile.TextSummary;
        _textSummary = HighlightKeywords(_textSummary, _searchPhrase);

        //Display a link that holds the account GUID and the
        //account type of the account
        _webOutput = _webOutput + "<a href='http://AccountInfo/" +
            _accountFile.AccountGUID.ToString() + "," +
            _accountFile.AccountType +
            "'><font color='#0000FF' face='Tahoma' size='2'><b>" +
            _accountFile.FirstName + " ";
        _webOutput = _webOutput + _accountFile.LastName +
            "</b></font></a><br>";

        //Display the attachment name and size, followed by a link
        //that carries the full path of the file attachment.
        //Strings.Format() comes from the Microsoft.VisualBasic namespace
        _webOutput = _webOutput +
            "<font color='#000000' face='Tahoma' size='1'><b>" +
            _accountFile.AttachmentName + " (" +
            Strings.Format(_accountFile.AttachmentSize, "###,###,###") +
            " bytes)" + "</b></font>" +
            " <a href='http://FileID/" + _accountFile.Attachment +
            "'><font color='#0000FF' face='Tahoma' size='2'>" +
            "<b>Open file</b></font></a><br>";
        _webOutput = _webOutput +
            "<font color='#000000' face='Tahoma' size='1'>" +
            _textSummary + "</font><br>";
        _webOutput = _webOutput + "<br>";
    }
    _webOutput = _webOutput + "</body></html>";

    //Assign the HTML string to the WebBrowser control, and
    //call the Show() function to display the HTML
    wbSearchSummary.DocumentText = _webOutput;
    wbSearchSummary.Show();
}
To highlight the list of keywords in the text summary, create the HighlightKeywords() function as in the following code:
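private string HighlightKeywords(string TextSummary, string SearchPhrase)
{
    string[] _keywords;
    string _keyword;
    int _counter;

    //Split the search phrase on standalone OR and NOT operators.
    //The word boundaries (\b) stop words such as 'Doctor' from
    //being split at the 'or' they contain
    _keywords = Regex.Split(SearchPhrase, @"\b(?:OR|NOT)\b",
        RegexOptions.IgnoreCase);
    for (_counter = 0; _counter < _keywords.Length; _counter++)
    {
        _keyword = _keywords[_counter].Trim();
        if (_keyword.Length > 0)
        {
            //Escape the keyword so that it is matched literally, and
            //wrap every match ($0) in bold tags. Using $0 keeps the
            //casing found in the text summary
            TextSummary = Regex.Replace(TextSummary,
                Regex.Escape(_keyword), "<b>$0</b>",
                RegexOptions.IgnoreCase);
        }
    }
    return TextSummary;
}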
When the user clicks on any of the links on this page, the WebBrowser control will receive a Navigating event notification. We can use this mechanism to determine which link the user clicked on by inspecting the URL of the link (which contains the desired information written by the RefreshPage() function shown previously).
public void wbSearchSummary_Navigating(object sender,
    System.Windows.Forms.WebBrowserNavigatingEventArgs e)
{
    string _temp;
    System.Guid _accountGUID;
    BaseAccount _account;
    string _filePath;
    string[] _info;
    BaseAccount.AccountTypes _type;

    //User clicks on the 'Account Name' link
    if (e.Url.ToString().ToLower().StartsWith("http://accountinfo/"))
    {
        e.Cancel = true;
        _temp = e.Url.ToString().Substring("http://accountinfo/".Length);

        //The link holds the account GUID and the account type,
        //separated by a comma
        _info = _temp.Split(',');
        if (_info.Length == 2)
        {
            _accountGUID = new System.Guid(_info[0]);

            //A string cannot be cast directly to an enumeration,
            //so convert it using Enum.Parse()
            _type = (CRMLive.BaseAccount.AccountTypes)System.Enum.Parse(
                typeof(CRMLive.BaseAccount.AccountTypes), _info[1], true);
            _account = GlobalArea.Application.GetAccount(_accountGUID,
                _type);
            NavigationService.ShowDialog("Edit" +
                BaseAccount.AccountTypeToString(_type), ((object) _account));
        }
    }

    //User clicks on an 'Open File' link
    if (e.Url.ToString().ToLower().StartsWith("http://fileid/"))
    {
        e.Cancel = true;
        _filePath = e.Url.ToString().Substring("http://fileid/".Length);
        if (MessageBox.Show("Are you sure you wish to open this file?",
            "Open file", MessageBoxButtons.YesNo, MessageBoxIcon.Question,
            MessageBoxDefaultButton.Button1) == DialogResult.Yes)
        {
            Process.Start(_filePath, "");
        }
    }
}
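The link-decoding step can be exercised on its own, without a WebBrowser control. The sketch below is a hedged illustration: LinkParserDemo and TryParseAccountLink() are hypothetical names, and the account type is returned as plain text because the real AccountTypes enumeration is not reproduced here.

```csharp
using System;

// Hypothetical helper; not part of the sample application.
public static class LinkParserDemo
{
    private const string AccountPrefix = "http://accountinfo/";

    // Splits "http://accountinfo/<guid>,<type>" into its two parts.
    // Returns false if the URL does not have that shape.
    public static bool TryParseAccountLink(string url, out Guid accountGuid,
        out string accountType)
    {
        accountGuid = Guid.Empty;
        accountType = null;

        if (!url.StartsWith(AccountPrefix, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        string[] parts = url.Substring(AccountPrefix.Length).Split(',');
        if (parts.Length != 2)
        {
            return false;
        }

        try
        {
            accountGuid = new Guid(parts[0]);
        }
        catch (FormatException)
        {
            return false;   // the first part is not a well-formed GUID
        }
        catch (OverflowException)
        {
            return false;   // some malformed GUIDs surface as overflows
        }

        accountType = parts[1];
        return true;
    }

    public static void Main()
    {
        Guid id;
        string type;
        bool ok = TryParseAccountLink(
            "http://AccountInfo/6f9619ff-8b86-d011-b42d-00c04fc964ff,Lead",
            out id, out type);
        Console.WriteLine(ok + " " + id + " " + type);
    }
}
```

Validating the link before acting on it keeps a malformed URL from raising an exception inside the Navigating event handler.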
To try out the full-text search form, you will need to add another icon to the main menu form. You will also need to create a new entry in the NavigationService class to launch the FullTextSearch form.
Once you have this set up, try launching the full-text search form. You will, of course, need to index a file before you can search for it, so create a new lead account and upload a few text-based files (such as HTML files) in the file attachments tab.
Once you have done that, navigate to the full-text search form, and type in a search phrase. You can use the OR and NOT operators to narrow down your search. Assuming the search keywords exist in the indexed files, you will be able to see a list of the matching files.
The search engine you have created so far works, but there is certainly much room for improvement. A few ways you can further improve this search engine are as follows:
The Microsoft Word DOCX and Adobe PDF formats are popular file attachment types. Letting your users search through DOCX and PDF content can indeed provide a lot of business value. You can easily create new keyword extractors for these file formats by leveraging the framework you have created.
You can also build support for nested Boolean queries. A sample nested Boolean query might look like this: ((Medical OR Certificate) NOT Clinic) OR Hospital.
By increasing the size of the stop words list, you can increase the accuracy of your search by stripping away irrelevant text. Take note, however, that increasing the size of the stop words list means your application will need to incur more processing cycles to strip away this data. You will need to find the right balance between accuracy and performance for your project.
You can also easily implement a search-result ranking system roughly similar to that used by Google. By keeping track of a popularity counter for each search result, and incrementing it every time the user opens that particular item, you can sort the search results by this popularity counter in descending order to show the most popular result at the top.
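As a starting point for the nested Boolean query idea, here is a minimal recursive-descent sketch that turns a parenthesized search phrase into a CHARINDEX() condition. The NestedQueryParser class and its grammar are assumptions made for illustration: NOT binds loosest (A NOT B means A AND NOT B), OR binds tighter, consecutive words form one phrase, and malformed input raises a FormatException.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical parser; a sketch rather than a complete query language.
public class NestedQueryParser
{
    private readonly List<string> _tokens = new List<string>();
    private int _pos;

    private NestedQueryParser(string searchPhrase)
    {
        // A token is a parenthesis or a run of non-space, non-parenthesis characters.
        foreach (Match m in Regex.Matches(searchPhrase, @"[()]|[^\s()]+"))
        {
            _tokens.Add(m.Value);
        }
    }

    public static string ToSqlCondition(string searchPhrase)
    {
        NestedQueryParser parser = new NestedQueryParser(searchPhrase);
        string sql = parser.ParseNot();
        if (parser._pos < parser._tokens.Count)
        {
            throw new FormatException("Unexpected token: " + parser._tokens[parser._pos]);
        }
        return sql;
    }

    private bool NextIs(string word)
    {
        return _pos < _tokens.Count &&
            string.Equals(_tokens[_pos], word, StringComparison.OrdinalIgnoreCase);
    }

    // notExpr := [orExpr] { NOT orExpr }
    private string ParseNot()
    {
        string sql = NextIs("NOT") ? "" : ParseOr();
        while (NextIs("NOT"))
        {
            _pos++;
            string right = "NOT " + ParseOr();
            sql = (sql.Length > 0) ? "(" + sql + " AND " + right + ")" : right;
        }
        return sql;
    }

    // orExpr := primary { OR primary }
    private string ParseOr()
    {
        string sql = ParsePrimary();
        bool grouped = false;
        while (NextIs("OR"))
        {
            _pos++;
            sql += " OR " + ParsePrimary();
            grouped = true;
        }
        return grouped ? "(" + sql + ")" : sql;
    }

    // primary := "(" notExpr ")" | phrase
    private string ParsePrimary()
    {
        if (NextIs("("))
        {
            _pos++;
            string inner = ParseNot();
            if (!NextIs(")")) throw new FormatException("Missing closing parenthesis");
            _pos++;
            return inner;
        }

        // A phrase is one or more consecutive words that are not operators.
        List<string> words = new List<string>();
        while (_pos < _tokens.Count && !NextIs("OR") && !NextIs("NOT") &&
            !NextIs("(") && !NextIs(")"))
        {
            words.Add(_tokens[_pos++]);
        }
        if (words.Count == 0) throw new FormatException("Keyword expected");

        // Double any single quotes so the phrase cannot break the SQL.
        string phrase = string.Join(" ", words.ToArray()).Replace("'", "''");
        return "CHARINDEX('" + phrase + "',Keywords)>0";
    }

    public static void Main()
    {
        Console.WriteLine(ToSqlCondition("((Medical OR Certificate) NOT Clinic) OR Hospital"));
    }
}
```

The returned condition can be prefixed with WHERE in the same way BuildWhereClause() does. Every combined expression is parenthesized, so the result does not rely on SQL operator precedence.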