Ruby on Rails integrations

There has been a lot of churn in the Ruby on Rails world for adding Solr support, with a number of competing libraries attempting to support Solr in the most Rails-native way. Rails brought to the forefront the idea of Convention over Configuration, the principle that sane defaults and simple rules should suffice in most situations versus complex configuration expressed in long XML files. The various libraries for integrating Solr in Ruby on Rails applications establish conventions in how they interact with Solr. However, there are often a lot of conventions to learn, such as suffixing String object field names with _s to match up with the dynamic field definition for String in Solr's schema.xml.

Solr's Ruby response writer

The Ruby hash structure looks very similar to the JSON data structure, with some tweaks to fit Ruby, such as translating nulls to nils, using single quotes for escaping content, and using the Ruby => operator to separate key/value pairs in maps. Adding a wt=ruby parameter to a standard search request returns results that can be passed to eval() to produce a Ruby hash structure like this:

{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>1,
    'params'=>{
      'wt'=>'ruby',
      'indent'=>'on',
      'rows'=>'1',
      'start'=>'0',
      'q'=>'Pete Moutso'}},
  'response'=>{'numFound'=>523,'start'=>0,'docs'=>[
    {
      'a_name'=>'Pete Moutso',
      'a_type'=>'1',
      'id'=>'Artist:371203',
      'type'=>'Artist'}]
}}

Note

Beware: running eval() has security implications!
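As a rough illustration of this warning, the following sketch evaluates a trimmed copy of a wt=ruby response and contrasts it with parsing the same data as JSON. The response strings are hard-coded stand-ins for what an HTTP call would return, and the variable names are made up for this example.

```ruby
require 'json'

# A trimmed wt=ruby response body, as it would arrive over HTTP (a plain String).
ruby_body = <<~BODY
  {
    'responseHeader'=>{'status'=>0,'QTime'=>1},
    'response'=>{'numFound'=>523,'start'=>0,'docs'=>[
      {'a_name'=>'Pete Moutso','id'=>'Artist:371203','type'=>'Artist'}]}
  }
BODY

# eval turns the text into a Hash, but it would also run any other Ruby code
# found in the body, so only use it on responses from a Solr you trust.
ruby_result = eval(ruby_body)

# The same data requested with wt=json can be parsed without executing code.
json_body = '{"response":{"numFound":523,"start":0,' \
            '"docs":[{"a_name":"Pete Moutso","id":"Artist:371203"}]}}'
json_result = JSON.parse(json_body)

puts ruby_result['response']['numFound']
puts json_result['response']['docs'].first['a_name']
```

Both approaches end up with the same nested Hash; the difference is that JSON.parse treats the body purely as data.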

The sunspot_rails gem

The sunspot_rails gem hooks into the lifecycle of the ActiveRecord model objects and transparently indexes them in Solr as they are created, updated, and deleted. This allows you to do queries that are backed by Solr searches, but still work with your normal ActiveRecord objects. Let's go ahead and build a small Rails application that we'll call myFaves, which allows you to store your favorite MusicBrainz artists in a relational model and also to search for them using Solr.

Sunspot comes bundled with a full install of Solr as part of the gem, which you can easily start by running rake sunspot:solr:start, running Solr on port 8982. This is great for quickly doing development since you don't need to download and set up your own Solr. Typically, you are starting with a relational database already stuffed with content that you want to make searchable. However, in our case, we already have a fully populated index of artist information, so we are actually going to take the basic artist information out of the mbartists index of Solr and populate our local myfaves database used by the Rails application. We'll then fire up the version of Solr shipped with Sunspot, and see how sunspot_rails manages the lifecycle of ActiveRecord objects to keep Solr's indexed content in sync with the content stored in the relational database. Don't worry, we'll take it step by step! The completed application is at /examples/9/myfaves for your reference.

Setting up the myFaves project

This example assumes you have Rails 3.x already installed. We'll start with the standard plumbing to get a Rails application set up with our basic data model:

>>rails new myfaves
>>cd myfaves
>>rails generate scaffold artist name:string group_type:string release_date:datetime image_url:string
>>rake db:migrate

This generates a basic application backed by a SQLite database. Now, we need to specify that our application depends on Sunspot. Edit Gemfile and add the following code:

gem 'sunspot_rails', '~> 1.2.1'

Next, update your dependencies and generate the config/sunspot.yml configuration file:

>>bundle install
>>rails generate sunspot_rails:install

We'll also be working with roughly 399,000 artists, so we'll need pagination to manage that list; otherwise, pulling up the artists' /index listing page will time out. We'll use the popular will_paginate gem to manage pagination. Add the will_paginate gem declaration to your Gemfile and re-run bundle install:

gem "will_paginate", "~> 3.0.pre4"

Edit the ./app/controllers/artists_controller.rb file, and replace the call to @artists = Artist.all in the index method with:

@artists = Artist.paginate :page => params[:page], :order => 'created_at DESC'

Also, add a call to the view helper at ./app/views/artists/index.html.erb to generate the page links:

<%= will_paginate @artists %>
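To make the pagination arithmetic concrete, here is a minimal sketch of how a page number turns into the window of rows to load. The page size of 30 is an assumption for illustration (check your own will_paginate configuration), and page_window is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper: translate a 1-based page number into the
# offset/limit pair that a paginated query would use.
def page_window(page, per_page = 30)
  page = [page.to_i, 1].max # treat nil or bad input as page 1
  { :offset => (page - 1) * per_page, :limit => per_page }
end

total_artists = 399_000
per_page = 30
total_pages = (total_artists / per_page.to_f).ceil

puts total_pages
puts page_window(1).inspect
puts page_window(3).inspect
```

Each request then touches only one page of rows instead of the whole table, which is what keeps the listing page responsive.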

Start the application using ./script/rails server, and visit the page http://localhost:3000/artists/. You should see an empty listing page for all of the artists. Now that we know that the basics are working, let's go ahead and actually leverage Solr.

Populating the myFaves relational database from Solr

Step one will be to import data into our relational database from the mbartists Solr index. Add the following code to ./app/models/artist.rb:

class Artist < ActiveRecord::Base
  searchable do
    text :name, :default_boost => 2
    string :group_type
    time :release_date
  end
end

The searchable block maps the attributes of the Artist ActiveRecord object to the artist fields in Solr's schema.xml. Since Sunspot is designed to store any kind of data in Solr that is stored in your database, it needs a way of distinguishing among various types of data model objects. For example, if we wanted to store information about our User model object in Solr, in addition to the Artist object, then we would need to provide a field in the schema to distinguish the Solr document for the artist with the primary key of 5 from the Solr document for the user, which also has the primary key of 5. Fortunately, the mbartists schema has a field named type that stores the value Artist, which maps directly to our ActiveRecord class name of Artist.
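The primary key collision described here can be sketched in plain Ruby. A Hash stands in for the Solr index, and the document key combines the class name with the primary key in the same Artist:5 style that the mbartists index uses for its id field. The solr_key helper and the sample records are hypothetical.

```ruby
# Stand-in for a Solr index: unique key => document.
index = {}

# Hypothetical helper that builds a unique key from class name and primary key.
def solr_key(type, primary_key)
  "#{type}:#{primary_key}"
end

records = [
  { :type => 'Artist', :pk => 5, :name => 'Some Artist' },
  { :type => 'User',   :pk => 5, :name => 'Some User' }
]

records.each do |r|
  key = solr_key(r[:type], r[:pk])
  index[key] = { 'id' => key, 'type' => r[:type], 'name' => r[:name] }
end

# Both documents survive, and a filter on type separates them.
artists = index.values.select { |doc| doc['type'] == 'Artist' }
puts index.keys.inspect
puts artists.size
```

Without the type in the key, the second record would overwrite the first, since both share the primary key 5.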

There is a simple script called populate.rb at the root of /examples/9/myfaves that you can run, which will copy the artist data from the existing Solr mbartists index into the myFaves database:

>>./populate.rb

The populate.rb script is a good example of the kind of script you may need to develop to transfer data in and out of Solr. Most such scripts work with batches of records that are pulled from one system and then inserted into the other. The larger the batch size, the more efficient the pulling and processing of data typically is, at the cost of more memory being consumed and slower commit and optimize operations. When you run the populate.rb script, play with the batch size parameter to get a sense of resource consumption in your environment. Try a batch size of 10 versus 10000 to see the changes. The parameters for populate.rb are available at the top of the script:

MBARTISTS_SOLR_URL = 'http://localhost:8983/solr/mbartists'
BATCH_SIZE = 1500
MAX_RECORDS = 100000

There are roughly 399,000 artists in the mbartists index, so if you are impatient, then you can set MAX_RECORDS to a more reasonable number to complete running the script faster.

The connection to Solr is handled by the RSolr library. A request to Solr is simply a hash of parameters that is passed as part of the GET request. We use the *:* query to find all of the artists in the index and then iterate through the results using the start parameter:

rsolr = RSolr.connect :url => MBARTISTS_SOLR_URL
response = rsolr.select({
  :q => '*:*',
  :rows => BATCH_SIZE,
  :start => offset,
  :fl => ['*','score']
})
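The paging loop around this request might look like the following sketch. To keep it runnable without a Solr server, a lambda named fetch_page (hypothetical) stands in for rsolr.select and serves slices of an in-memory array. The batch_size and max_records variables play the roles of BATCH_SIZE and MAX_RECORDS, with small values chosen for illustration.

```ruby
batch_size  = 4
max_records = 10

# Stand-in for the Solr index and for rsolr.select (assumption for this sketch).
fake_index = (1..25).map { |n| { 'id' => "Artist:#{n}" } }
fetch_page = lambda do |start, rows|
  { 'response' => { 'numFound' => fake_index.size,
                    'docs' => fake_index[start, rows] || [] } }
end

offset = 0
imported = []
while offset < max_records
  # Never ask for more rows than are still allowed by max_records.
  rows = [batch_size, max_records - offset].min
  docs = fetch_page.call(offset, rows)['response']['docs']
  break if docs.empty? # ran past the end of the index

  imported.concat(docs)
  offset += docs.size
end

puts imported.size
puts imported.last['id']
```

Advancing offset by the number of documents actually returned, rather than by batch_size, keeps the loop correct when the final page is short.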

In order to create our new Artist model objects, we iterate through the results of response['response']['docs'], parsing each document so that the unique identifiers stay the same between Solr and the database, and creating new ActiveRecord objects. In our MusicBrainz Solr schema, the id field functions as the primary key and looks like Artist:11650 for The Smashing Pumpkins. To keep the two in sync, we need to insert the Artist into the database with the ID of 11650. We wrap the insert statement a.save! in a begin/rescue/end structure so that if we've already inserted an artist with that primary key, the script continues. This allows us to run the populate script multiple times without erroring out:

response['response']['docs'].each do |doc|
  id = doc["id"]
  id = id[7..(id.length)] # strip the leading "Artist:" prefix
  a = Artist.new(
    :name => doc["a_name"],
    :group_type => doc["a_type"],
    :release_date => doc["a_release_date_latest"])
  a.id = id # set explicitly; Rails protects id from mass assignment

  begin
    a.save!
  rescue ActiveRecord::StatementInvalid => err
    raise err unless err.to_s.include?("PRIMARY KEY must be unique") # sink duplicates
  end
end
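The two ideas in this loop, stripping the type prefix from the Solr ID and tolerating rows that already exist, can be tried out in isolation. In this sketch a Hash stands in for the artists table, and primary_key_from and insert_once are hypothetical helpers rather than part of populate.rb.

```ruby
# Hypothetical helper: "Artist:11650" -> 11650
def primary_key_from(solr_id)
  Integer(solr_id.split(':', 2).last)
end

# Hypothetical helper: insert a row unless the primary key is taken.
# Returns true when a row was added, false for a duplicate.
def insert_once(table, row)
  return false if table.key?(row[:id])

  table[row[:id]] = row
  true
end

table = {}
docs = [
  { 'id' => 'Artist:11650', 'a_name' => 'The Smashing Pumpkins' },
  { 'id' => 'Artist:11650', 'a_name' => 'The Smashing Pumpkins' } # seen again on a second run
]

added = docs.map do |doc|
  insert_once(table, { :id => primary_key_from(doc['id']), :name => doc['a_name'] })
end

puts added.inspect
puts table.keys.inspect
```

Splitting on the colon instead of slicing at a fixed position would also work for type names of other lengths, should more models be imported later.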

We've successfully migrated the data we need for our myFaves application out of Solr and we're ready to use the version of Solr that's bundled with Sunspot.

Solr configuration information is listed in ./myfaves/config/sunspot.yml. Sunspot establishes the convention that development is on port 8982, unit tests that use Solr connect on port 8981, and then production connects on the traditional 8983 port:

development:
  solr:
    hostname: localhost
    port: 8982

Start the included Solr by running rake sunspot:solr:start. To shut down Solr, run the corresponding rake command, sunspot:solr:stop. On the initial startup, rake will create a new top level ./solr directory and populate the conf directory with default configuration files for Solr (including schema.xml, stopwords.txt, and so on) pulled from the Sunspot gem.

Building Solr indexes from a relational database

Now, we are ready to trigger a full index of the data from the relational database into Solr. Sunspot provides a very convenient rake task for this, with a variety of parameters that you can learn about by running rake -D sunspot:reindex:

>>rake sunspot:solr:start
>>rake sunspot:reindex

Browse to http://localhost:8982/solr/admin/schema.jsp to see the list of dynamic fields generated by following the Convention over Configuration pattern of Rails applied to Solr. Some of the conventions that are established by Sunspot and expressed by Solr in ./solr/conf/schema.xml are as follows:

  • The primary key field of the model object in Solr is always called id.
  • The type field that stores the disambiguating class name of the model object is called type.
  • Sunspot makes heavy use of the dynamic field support in Solr. The data type of an ActiveRecord model attribute is based on the database column type. Therefore, when sunspot_rails indexes a model object, it sends a document to Solr whose field names carry type-specific suffixes that match the dynamic field definitions. In ./solr/conf/schema.xml, the only fields defined outside of the management fields are dynamic fields:
    <dynamicField name="*_text" type="text" indexed="true" stored="false"/>
  • The default search field is called text. However, you need to define what fields are copied into the text field. Sunspot's DSL is oriented towards naming each model field you'd like to search from Ruby.

The document that gets sent to Solr for our Artist records creates the dynamic fields such as name_text, group_type_s and release_date_d, for a text, string, and date field, respectively. You can see the list of dynamic fields generated through the schema browser at http://localhost:8982/solr/admin/schema.jsp.
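This naming convention can be expressed as a small lookup. The sketch below uses only the three suffixes that appear in these field names (_text, _s, and _d). The suffixes table and the dynamic_field_name lambda are hypothetical illustrations, not Sunspot's internal code.

```ruby
# Suffixes taken from the field names name_text, group_type_s and release_date_d.
suffixes = { :text => 'text', :string => 's', :time => 'd' }

# Hypothetical helper: build the dynamic field name for an attribute.
dynamic_field_name = lambda do |attribute, type|
  suffix = suffixes.fetch(type) { raise ArgumentError, "no suffix known for #{type}" }
  "#{attribute}_#{suffix}"
end

# The declarations from the searchable block in artist.rb.
declared = { :name => :text, :group_type => :string, :release_date => :time }

field_names = declared.map { |attribute, type| dynamic_field_name.call(attribute, type) }
puts field_names.inspect
```

Because every generated name matches a dynamic field pattern such as *_text, no per-model changes to schema.xml are needed.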

Now we are ready to perform some searches. Sunspot adds some new methods to our ActiveRecord model objects, such as search(), which lets us load ActiveRecord model objects by sending a query to Solr. Here we find the group Smashing Atoms by searching for matches to the word smashing:

% ./script/rails console
Loading development environment (Rails 3.0.9)
>>search= Artist.search{keywords "smashing"}
=><Sunspot::Search:{:fq=>["type:Artist"], :q=>"smashing",
:fl=>"* score", :qf=>"name_text^2", :defType=>"dismax", :start=>0, :rows=>30}>
>>search.results.first
=>#<Artist id: 93855, name: "Smashing Atoms", group_type: nil, release_date: nil, image_url: nil, created_at: "2011-07-21 05:15:21", updated_at: "2011-07-21 05:15:21">

The raw results from Solr are available through search.hits, while search.results returns the corresponding ActiveRecord objects loaded from the database.

Let's also verify that Sunspot is managing the full lifecycle of our objects. Assuming Susan Boyle isn't yet entered as an artist, let's go ahead and create her:

>>Artist.search{keywords  'Susan Boyle', :fields => [:name]}.hits
=>[]
>>susan = Artist.create(:name => "Susan Boyle", :group_type =>'1', :release_date => Date.new)
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2011-07-22 21:05:53", updated_at: "2011-07-22 21:05:53">

Check the log output from the Solr instance running on port 8982, and you should see an update query triggered by the insert of the new Susan Boyle record:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=24 

Now, delete Susan's record from your database:

>>susan.destroy
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09">

As a result, there should be another corresponding update issued to Solr to remove the document:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=57 

You can verify this by doing a search for Susan Boyle directly, which should return no rows at http://localhost:8982/solr/select/?q=Susan+Boyle.

Completing the myFaves website

Now, let's go ahead and put in the rest of the logic for using our Solr-ized model objects to simplify finding our favorite artists. We'll store the list of favorite artists in the browser's session space for convenience. If you are following along with your own generated version of the myFaves application, then the remaining files you'll want to copy over from /examples/9/myfaves are as follows:

  • ./app/controllers/myfaves_controller.rb: This contains the controller logic for picking your favorite artists.
  • ./app/views/myfaves/: This contains the display files for picking and showing the artists.
  • ./app/views/layouts/myfaves.html.erb: This is the layout of the myFaves views. We use the Autocomplete widget again so that this layout embeds the appropriate JavaScript and CSS files.
  • ./public/stylesheets/jquery.autocomplete.css and ./public/stylesheets/indicator.gif: They are stored locally in order to fix pathing issues with the indicator.gif showing up when the autocompletion search is running.

The only other edits you need to make are:

  • Edit ./config/routes.rb by adding resources :myfaves and root :to => "myfaves#index".
  • Delete ./public/index.html to use the new root route you just defined.
  • Copy the index method out of ./app/controllers/artists_controller.rb because we want the index method to respond with both HTML and JSON response types.
  • Run rake db:sessions:create to generate a sessions table, then rake db:migrate to update the database with the new sessions table. Edit ./config/initializers/session_store.rb and change it to use :active_record_store for preserving the session state.

You should now be able to run ./script/rails server and browse to http://localhost:3000/. You will be prompted to search by entering an artist's name. If you don't receive any results, then make sure you have started Solr using rake sunspot:solr:start. Also, if you have only loaded a subset of the full 399,000 artists, then your choices may be limited. You can load all of the artists through the populate.rb script and then run rake sunspot:reindex, although it will take a long time to complete. This is a good thing to do just before you head out for lunch or home for the evening!

If you look at ./app/views/myfaves/index.rhtml, then you can see that the jQuery autocomplete call is a bit different:

$("#artist_name").autocomplete( '/artists.json?callback=?', {

The URL we are hitting is /artists.json, with the .json suffix telling Rails that we want the JSON data back instead of normal HTML. If we ended the URL with .xml, then we would have received XML-formatted data about the artists. We provide a slightly different parameter to Rails to specify the JSONP callback to use. Unlike the previous example, where we used json.wrf, which is Solr's parameter name for the callback method to call, we use the more standard parameter name callback. We changed the ArtistsController index method to handle the autocomplete widget's data needs through JSONP. If there is a q parameter, then we know that the request was from the autocomplete widget, and we ask Solr for @artists to respond with. Later on, we render @artists into JSON objects, returning only the name and id attributes to keep the payload small.

We also specify that the JSONP callback method is what was passed when using the callback parameter:

def index
  if params[:q]
    @artists = Artist.search{ keywords params[:q] }.results
  else
    @artists = Artist.paginate :page => params[:page], :order => 'created_at DESC'
  end

  respond_to do |format|
    format.html # index.html.erb
    format.xml { render :xml => @artists }
    format.json { render :json => @artists.to_json(:only => [:name, :id]), :callback => params[:callback] }
  end
end
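To see roughly what the browser receives, the following sketch builds a JSONP body by hand. It uses plain hashes in place of ActiveRecord objects, and jsonp_body is a hypothetical helper that mimics the effect of the :only and :callback options rather than reproducing Rails' own implementation. The callback name jsonp123 is also made up.

```ruby
require 'json'

# Hypothetical helper: keep only the listed keys and wrap the JSON
# in a call to the named callback function.
def jsonp_body(records, callback, only)
  trimmed = records.map { |r| r.select { |key, _| only.include?(key) } }
  json = JSON.generate(trimmed)
  callback ? "#{callback}(#{json})" : json
end

artists = [
  { :id => 93855, :name => 'Smashing Atoms', :group_type => nil, :image_url => nil }
]

body = jsonp_body(artists, 'jsonp123', [:name, :id])
puts body
```

The browser executes the returned text as a script, so the function named in the callback parameter receives the artist list as its argument.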

At the end of all of this, you should have a nice autocomplete interface for quickly picking artists.

When you are selecting Sunspot as your integration method, you are implicitly agreeing to the various conventions established for indexing data into Solr. If you are used to working with Solr directly, you may find understanding the Sunspot DSL for querying a bit of an obstacle. However, if your background is in Rails, or you are building very complex queries, then learning the DSL will pay off in productivity and the ability to maintain complex expressive queries.

Which Rails/Ruby library should I use?

The two most common high-level libraries for interacting with Solr are acts_as_solr and Sunspot. However, in the last couple of years, Sunspot has become the more popular choice, and comes in a version designed to work explicitly with Rails called sunspot_rails that allows Rails' ActiveRecord database objects to be transparently backed by a Solr index for full text search.

For a lower-level client interface to Solr from Ruby environments, there are two libraries duking it out to be the client of choice: solr-ruby, a client library developed by the Apache Solr project, and rsolr, which is a reimplementation of a Ruby-centric client library. Both of these solutions are solid and act as great low-level API libraries. However, rsolr has gained more attention, has better documentation, and has some nice features such as a direct embedded Solr connection through JRuby. rsolr also has support for using curb (Ruby bindings to curl, a very fast HTTP library) instead of the standard Net::HTTP library for the HTTP transport layer.

In order to perform a select using solr-ruby, you need to issue the following code:

response = solr.query('washington', {
  :start => 0,
  :rows => 10
})

In order to perform a select using rsolr, you need to issue the following code:

response = solr.select({
  :q => 'washington',
  :start => 0,
  :rows => 10
})

So you can see that doing a basic search is pretty much the same in either library. Differences crop up more as you dig into the details on parsing and indexing records. You can learn more about solr-ruby on the Solr wiki at http://wiki.apache.org/solr/solr-ruby and learn more about rsolr at http://github.com/mwmitchell/rsolr/.

Tip

Consider whether you really need another layer of abstraction between you and Solr. Making a call to Solr using wt=ruby and evaluating the results may be the simplest solution.
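If you take the direct route, the request is just a URL with query parameters. The sketch below builds such a URL with Ruby's standard library. The base URL reuses the mbartists address from populate.rb, select_uri is a hypothetical helper, and the actual HTTP call is left as a comment so that the example runs without a Solr server.

```ruby
require 'uri'

# Hypothetical helper: build the URL for a Solr select request
# that asks for the Ruby response format.
def select_uri(base_url, params)
  uri = URI.parse("#{base_url}/select")
  uri.query = URI.encode_www_form(params.merge(:wt => 'ruby'))
  uri
end

uri = select_uri('http://localhost:8983/solr/mbartists',
                 { :q => 'washington', :start => 0, :rows => 10 })
puts uri

# With a trusted Solr running, the response could then be fetched and evaluated:
#   require 'net/http'
#   result = eval(Net::HTTP.get(uri))
#   puts result['response']['numFound']
```

This keeps the dependency list empty, at the price of handling parameter encoding, errors, and the eval() security concern yourself.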
