Chapter 12. Searching and indexing

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 12. Searching and indexing

This chapter covers

Searching databases
Indexing content using Ferret and Solr
Searching with other technologies
The scraping technique

Throughout this book, and throughout your entire programming life, you’ve been used to dealing with data, because it forms the input and output for all computer programs. So far, we have mostly looked at transforming data from one state to another, but in this chapter we’re going to investigate Ruby’s abilities to let you search through data.

Unfortunately, as a language that has reached maturity only in the last few years, Ruby is not blessed with hundreds of search-related libraries. This is not necessarily a bad thing, as search technologies progress quickly, and most of the available Ruby search solutions are up to date and ready to use in production immediately.

In this chapter, we’ll look at Ruby-specific techniques for searching and indexing data, and we’ll examine some solutions to common search-related problems.

We’re going to look at standalone libraries and techniques available to Ruby developers, and we’ll walk through the process of indexing content using two Apache Lucene-based libraries, Ferret and Solr, as well as a performance-driven Ruby-only library called FTSearch. We’ll also look at integrating search features with other technologies, and at searching the web, searching databases, and adding indexing and search features to Ruby on Rails applications.

First, we’ll define searching and indexing, as well as other related terminology.

Note

We will only cover search libraries that are under active development and that have been updated in the last year. There are several older libraries that you may hear about, but we feel it’s important to focus on up-to-date, well-supported tools, so that the level of documentation and support you would expect is still present.

12.1. The principles of searching

Searching refers to the process of taking a collection of data, processing it so that it can be scanned quickly, then enabling a program or a user to find small elements of data within the larger set. Google provides the most ubiquitous form of search technology in our lives today. If you want to find a web page that contains a certain word, or set of words, you go to http://www.google.com, type in your query, and receive your results almost instantaneously.

The amount of data Google can search across is somewhere in the range of billions of pages, which take up thousands of gigabytes of space. Google, therefore, certainly can’t scan every page, byte by byte, looking for results to your query. It could take days (or longer!) to get your search results that way, although it would still, technically, be a “search engine.”

The process that allows Google to produce search results in under a second is indexing, and the performance benefits of indexing are so significant that every search library or tool must provide indexing services of one form or another.

Indexing technologies can be extremely advanced, but at their most basic, they work in the same way as an index in a book. For example, if you turn to the index of this book and search for a particular term, you know roughly where in the index to look, because the index is in alphabetical order. Furthermore, once you find your desired term, you’re provided with a set of page numbers to refer to. This contrasts significantly with having to read through every page of the book to find something. Computers use similar techniques. A search engine’s indexer makes a note of words and phrases on a page, and links those terms (using, for performance and efficiency, a unique numeric ID that references each distinct term) to the page it is indexing. When a search is run, the query engine can quickly look up the IDs of pages that match the terms provided by a user’s search query.

12.2. Standalone and high-performance searching

In this section, we’re going to look at generic, standalone searching and indexing scenarios with the simplest problem we can provide: indexing, and then querying, a corpus of documents. This contrasts with the latter half of this chapter, where we will look at how to use and integrate search techniques in busier situations, such as on the web or within a database-driven web application.

12.2.1. Standalone indexing and search with Ferret

Ferret is a Ruby implementation of Apache Lucene, an open source search and indexing library written in Java. Lucene is incredibly popular in the open source world, and a significant amount of software, and many libraries, use it for implementing large and small search systems. Lucene is an indexing and search library, and so does not include any features relating to obtaining or specially parsing content. These features are provided by other libraries, or by the software using Lucene. As a Ruby implementation of Lucene, Ferret shares the same characteristics.

Ferret will not crawl the web for you, download emails, or index different types of content in unique ways. Instead, you have to use Ferret from your Ruby programs in a generic way. Like Lucene, Ferret can deal with different data formats, and as long as you can extract the textual content of the data you wish to index, Ferret can handle it. Ferret works with the concepts of documents and fields, where documents represent individual groups of content to be indexed (such as a single web page, an email message, or the lyrics of a song) and fields are more detailed elements of data with documents (such as dates, author information, and other metadata).

In this section, we’re going to look at using Ferret to index and search through documents we provide.

Problem

You wish to be able to index, and then search via query, an arbitrary set of documents (that may or may not contain multiple fields of metadata, such as titles, descriptions, and author information) quickly and efficiently. You do not care if the index is usable only from Ruby.

Solution

We’ll look at three solutions to the problem. The first, in listing 12.1, implements a basic text-only indexing and searching system. The second, in listing 12.2, looks at indexing and searching through content that contains metadata and multiple fields. The third solution, in listing 12.3, looks at storing an index to disk and then loading it from another program (which allows the index to persist).

Each solution assumes that you have installed the ferret gem. This is very simple to do on a system running Ruby and RubyGems; use gem install ferret.

Listing 12.1. Basic document search

Listing 12.2. Multifield document search

Listing 12.3. Separate indexer and query client programs

These three solutions are very similar, but they show different approaches and levels of complexity. All of them use a Ferret index, with the first two solutions storing the index in memory for immediate use only, and the final solution storing the index to and loading it from the disk.

In listing 12.1, we can see how simple it is to create an index by creating a new object from the Ferret::Index::Index class .

Next, we supply the index with multiple documents to be indexed . We used strings, but we could have used almost any form of data in Ruby that can translate to a string (such as an array or a hash).

Whether we use an in-memory or on-disk index, “pushing” documents to the index causes them to be indexed immediately. Finally, we query the index using the search_each method, which performs a search and iterates over each result, passing in the document ID of the matching document, along with a quality score, each time .

We did not give our documents ID numbers, but Ferret did this for us in the order that we supplied the documents. For example, the first solution performs a query of “test”, which nets the following results:

Found test in document 0 (scoring 0.70710676908493) Found test in document 1 (scoring 0.5)

Because the word “test” wasn’t mentioned in the third document (the document that has an ID of 2, due to 0-indexing), it’s not returned as a result.

Listing 12.2 shows how the previous solution can be extended with an option to include information about a set of fields that exist on the supplied documents. We define these fields first into a group using class Ferret::Index::FieldInfos, then pass that composite object through to the index as an option.

Defining fields to be indexed and managed separately by Ferret is easy. First, we create the FieldInfos object that will hold all of the information .

This constructor takes many different options, but the important ones are :store and :index. These options act as the default choices for all the fields we define from here on out. The :store option lets us choose whether the index will store the actual content of a field (or whether the content should be compressed and/or processed and then discarded entirely) and :index specifies whether a field should be indexed at all. In our case, we want the default to be yes for both.

Next, we define three fields: a “title” field, a “text” or content field, and an author field . Adding the fields is as easy as calling add_field on the FieldInfos object. Then we specify the field name, along with any options. In this case, the default options of :store => :yes and :index => :yes are used on all of the fields, but on two of the fields we provide a “boost.”

Boosting is useful when you have columns that contain data that’s more important than data in other columns. In this case, for example, we give document titles more importance in the rankings than the main content (in the text field). We then give the author name even more importance than the title, so that if one document contains “Fred” in the title, and another was written by someone named “Fred,” the latter document would probably score more highly.

Once the fields are defined in the FieldInfo set, we define the index much like in our first example, but we also pass through the field data .

Then, to add documents with defined fields, we use hashes . Because the fields are delimited in these sample documents, Ferret knows how to handle them in relation to the fields defined in the index.

If we run this example and provide it with a test query, we can see how the boosts affect the results:

What do you want to search for? > document Found 'Third Document' by 'Fred Bloggs' (score: 1.97577941417694) Found 'My Test Document' by 'Fred Bloggs' (score: 0.922030389308929) Found 'Irrelevant Title' by 'Anon' (score: 0.065859317779541)

Notice that even though each document contains the word “document,” the document that ranks highest is the one with the word in both the title and the content, whereas the second result lacks the word in the content, and the third document, with an extremely low score, merely contains it in the text field.

Listing 12.3 demonstrates how to store and retrieve an index from disk. Ferret makes it extremely simple; it is only necessary to specify the pathname within the construction of the index object .

If the directory specified using the :path parameter doesn’t exist, it’ll be created, as it is in the first example of this third solution. In the second example, the index is loaded in much a similar fashion .

Because the index should already exist, and the field information is predefined, we don’t need to construct and pass through the field information to Ferret, as it’s already part of the index’s structure. The rest of the third solution then uses the same querying code as used earlier.

Discussion

While the queries we performed in the solutions were simple, single-word queries, Ferret has support for complex queries, as you’d expect from a search tool. You can search for phrases, perform Boolean operations (“fred OR martha” or “foo AND bar”), and use wildcards (such as “fre*”).

You can learn more about the query syntax supported by Lucene at http://lucene.apache.org/java/docs/queryparsersyntax.html.

You can also learn more about how to use Ferret by looking at the official tutorial at http://ferret.davebalmain.com/api/files/TUTORIAL.html.

Next, we’re going to look at a true Apache Lucene instance, installed and made available remotely by another Apache product: Solr.

12.2.2. Integrating with the Solr search engine

In the previous section, we looked at using Ferret, a Ruby implementation of Apache Lucene, to index and search data. As we discovered, Ferret and Lucene are indexing and searching libraries, and the ability to parse the data to be indexed, as well as to interpret the results of searches, rests with the client application.

In this section, we’re going to look at Lucene from a different angle, using the Solr search server. Solr, another Apache project, is an open source search server that uses Lucene. Whereas Lucene is only a searching and indexing library, Solr provides higher-level features such as XML and JSON HTTP-based APIs, replication, caching, and a web-based administration interface. If Lucene is the guts of a searching and indexing system, Solr provides the friendly face necessary to use it at a higher level.

Whether you choose to use Ferret or Solr depends on your preferences and the fit between your requirements and the pros and cons of each technology. Even though both provide Lucene-based functionality, their interfaces are so radically different that careful consideration is required. Solr wins out if you need features like replication and the ability to rapidly scale or to easily access indexes over a network, or if you want to provide the same index to multiple applications, including non-Ruby applications. Ferret wins out if you want a simple, single-machine, Ruby-only solution, because it can be installed in one step using RubyGems, whereas Solr requires you to install several pieces of software just to get started.

The solution covered in this section expects that you have Apache Solr installed and running correctly. The installation of Solr is beyond the scope of this chapter, but the official home page is at http://lucene.apache.org/solr/, and information about its dependencies, such as Java 1.5 (or higher), and how to download Solr, is available in the main tutorial provided on the site.