Book Description

Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more

  • Comprehensive information on Apache Solr 3 with examples and tips so you can focus on the important parts
  • Integration examples with databases, web crawlers, XSLT, Java and embedded Solr, PHP and Drupal, JavaScript, and Ruby frameworks
  • Advice on data modeling; deployment considerations including security, logging, and monitoring; and guidance on scaling Solr and measuring performance
  • An update of the best-selling title on Solr 1.4

In Detail

If you are a developer building an app today, you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query completion, query spell-check, relevancy tuning, and more.

Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for every feature Solr has to offer. It takes the reader from first steps through development to deployment. It also comes with complete running examples that demonstrate Solr's use and show how to integrate it with other languages and frameworks.

Using a large set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including using Solr's rich query syntax and "boosting" match scores based on record data.

Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration, that will enable you to scale Solr to meet the needs of a high-volume site.

Table of Contents

  1. Apache Solr 3 Enterprise Search Server
    1. Apache Solr 3 Enterprise Search Server
    2. Credits
    3. About the Authors
    4. Acknowledgement
    5. Acknowledgement
    6. About the Reviewers
    7. www.PacktPub.com
      1. Discounts
      2. Free eBooks
      3. Newsletters
      4. Code Downloads, Errata and Support
    8. PacktLib.PacktPub.com
    9. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    10. 1. Quick Starting Solr
      1. An introduction to Solr
        1. Lucene, the underlying engine
        2. Solr, a Lucene-based search server
        3. Comparison to database technology
      2. Getting started
        1. Solr's installation directory structure
        2. Solr's home directory and Solr cores
        3. Running Solr
      3. A quick tour of Solr
        1. Loading sample data
        2. A simple query
        3. Some statistics
        4. The sample browse interface
      4. Configuration files
      5. Resources outside this book
      6. Summary
    11. 2. Schema and Text Analysis
      1. MusicBrainz.org
      2. One combined index or separate indices
        1. One combined index
          1. Problems with using a single combined index
        2. Separate indices
      3. Schema design
        1. Step 1: Determine which searches are going to be powered by Solr
        2. Step 2: Determine the entities returned from each search
        3. Step 3: Denormalize related data
          1. Denormalizing—'one-to-one' associated data
          2. Denormalizing—'one-to-many' associated data
        4. Step 4: (Optional) Omit the inclusion of fields only used in search results
      4. The schema.xml file
        1. Defining field types
        2. Built-in field type classes
          1. Numbers and dates
          2. Geospatial
        3. Field options
        4. Field definitions
          1. Dynamic field definitions
        5. Our MusicBrainz field definitions
        6. Copying fields
        7. The unique key
        8. The default search field and query operator
      5. Text analysis
        1. Configuration
        2. Experimenting with text analysis
        3. Character filters
        4. Tokenization
        5. WordDelimiterFilter
        6. Stemming
          1. Correcting and augmenting stemming
        7. Synonyms
          1. Index-time versus query-time, and to expand or not
        8. Stop words
        9. Phonetic sounds-like analysis
        10. Substring indexing and wildcards
          1. ReversedWildcardFilter
          2. N-grams
          3. N-gram costs
        11. Sorting Text
        12. Miscellaneous token filters
      6. Summary
    12. 3. Indexing Data
      1. Communicating with Solr
        1. Direct HTTP or a convenient client API
        2. Push data to Solr or have Solr pull it
        3. Data formats
        4. HTTP POSTing options to Solr
        5. Remote streaming
      2. Solr's Update-XML format
        1. Deleting documents
      3. Commit, optimize, and rollback
      4. Sending CSV formatted data to Solr
        1. Configuration options
      5. The Data Import Handler Framework
        1. Setup
        2. The development console
        3. Writing a DIH configuration file
          1. Data Sources
          2. Entity processors
          3. Fields and transformers
        4. Example DIH configurations
          1. Importing from databases
          2. Importing XML from a file with XSLT
          3. Importing multiple rich document files (crawling)
        5. Importing commands
          1. Delta imports
      6. Indexing documents with Solr Cell
        1. Extracting text and metadata from files
        2. Configuring Solr
        3. Solr Cell parameters
        4. Extracting karaoke lyrics
        5. Indexing richer documents
      7. Update request processors
      8. Summary
    13. 4. Searching
      1. Your first search, a walk-through
      2. Solr's generic XML structured data representation
      3. Solr's XML response format
        1. Parsing the URL
      4. Request handlers
      5. Query parameters
        1. Search criteria related parameters
        2. Result pagination related parameters
        3. Output related parameters
        4. Diagnostic related parameters
      6. Query parsers and local-params
      7. Query syntax (the lucene query parser)
        1. Matching all the documents
        2. Mandatory, prohibited, and optional clauses
          1. Boolean operators
        3. Sub-queries
          1. Limitations of prohibited clauses in sub-queries
        4. Field qualifier
        5. Phrase queries and term proximity
        6. Wildcard queries
          1. Fuzzy queries
        7. Range queries
          1. Date math
        8. Score boosting
        9. Existence (and non-existence) queries
        10. Escaping special characters
      8. The Dismax query parser (part 1)
        1. Searching multiple fields
        2. Limited query syntax
        3. Min-should-match
          1. Basic rules
          2. Multiple rules
          3. What to choose
        4. A default search
      9. Filtering
      10. Sorting
      11. Geospatial search
        1. Indexing locations
        2. Filtering by distance
        3. Sorting by distance
      12. Summary
    14. 5. Search Relevancy
      1. Scoring
        1. Query-time and index-time boosting
        2. Troubleshooting queries and scoring
      2. Dismax query parser (part 2)
        1. Lucene's DisjunctionMaxQuery
        2. Boosting: Automatic phrase boosting
          1. Configuring automatic phrase boosting
          2. Phrase slop configuration
          3. Partial phrase boosting
        3. Boosting: Boost queries
        4. Boosting: Boost functions
          1. Add or multiply boosts?
      3. Function queries
        1. Field references
        2. Function reference
          1. Mathematical primitives
          2. Other math
          3. ord and rord
          4. Miscellaneous functions
        3. Function query boosting
          1. Formula: Logarithm
          2. Formula: Inverse reciprocal
          3. Formula: Reciprocal
          4. Formula: Linear
        4. How to boost based on an increasing numeric field
          1. Step by step…
          2. External field values
        5. How to boost based on recent dates
          1. Step by step…
      4. Summary
    15. 6. Faceting
      1. A quick example: Faceting release types
        1. MusicBrainz schema changes
      2. Field requirements
      3. Types of faceting
      4. Faceting field values
        1. Alphabetic range bucketing
      5. Faceting numeric and date ranges
        1. Range facet parameters
      6. Facet queries
      7. Building a filter query from a facet
        1. Field value filter queries
        2. Facet range filter queries
      8. Excluding filters (multi-select faceting)
      9. Hierarchical faceting
      10. Summary
    16. 7. Search Components
      1. About components
      2. The Highlight component
        1. A highlighting example
        2. Highlighting configuration
          1. The regex fragmenter
          2. The fast vector highlighter with multi-colored highlighting
      3. The SpellCheck component
        1. Schema configuration
        2. Configuration in solrconfig.xml
          1. Configuring spellcheckers (dictionaries)
            1. IndexBasedSpellChecker options
            2. FileBasedSpellChecker options
          2. Processing of the q parameter
          3. Processing of the spellcheck.q parameter
        3. Building the dictionary from its source
        4. Issuing spellcheck requests
        5. Example usage for a misspelled query
      4. Query complete / suggest
        1. Query term completion via facet.prefix
        2. Query term completion via the Suggester
        3. Query term completion via the Terms component
      5. The QueryElevation component
        1. Configuration
      6. The MoreLikeThis component
        1. Configuration parameters
          1. Parameters specific to the MLT search component
          2. Parameters specific to the MLT request handler
          3. Common MLT parameters
        2. MLT results example
      7. The Stats component
        1. Configuring the stats component
        2. Statistics on track durations
      8. The Clustering component
      9. Result grouping/Field collapsing
        1. Configuring result grouping
      10. The TermVector component
      11. Summary
    17. 8. Deployment
      1. Deployment methodology for Solr
        1. Questions to ask
      2. Installing Solr into a Servlet container
        1. Differences between Servlet containers
          1. Defining solr.home property
      3. Logging
        1. HTTP server request access logs
        2. Solr application logging
          1. Configuring logging output
          2. Logging using Log4j
          3. Jetty startup integration
          4. Managing log levels at runtime
      4. A SearchHandler per search interface?
      5. Leveraging Solr cores
        1. Configuring solr.xml
          1. Property substitution
          2. Include fragments of XML with XInclude
        2. Managing cores
        3. Why use multicore?
      6. Monitoring Solr performance
        1. Stats.jsp
        2. JMX
          1. Starting Solr with JMX
            1. Take a walk on the wild side! Use JRuby to extract JMX information
      7. Securing Solr from prying eyes
        1. Limiting server access
          1. Securing public searches
          2. Controlling JMX access
        2. Securing index data
          1. Controlling document access
          2. Other things to look at
      8. Summary
    18. 9. Integrating Solr
      1. Working with included examples
        1. Inventory of examples
      2. Solritas, the integrated search UI
        1. Pros and Cons of Solritas
      3. SolrJ: Simple Java interface
        1. Using Heritrix to download artist pages
        2. SolrJ-based client for Indexing HTML
        3. SolrJ client API
          1. Embedding Solr
          2. Searching with SolrJ
          3. Indexing
            1. Indexing POJOs
        4. When should I use embedded Solr?
          1. In-process indexing
          2. Standalone desktop applications
          3. Upgrading from legacy Lucene
      4. Using JavaScript with Solr
        1. Wait, what about security?
        2. Building a Solr powered artists autocomplete widget with jQuery and JSONP
        3. AJAX Solr
      5. Using XSLT to expose Solr via OpenSearch
        1. OpenSearch based Browse plugin
          1. Installing the Search MBArtists plugin
      6. Accessing Solr from PHP applications
        1. solr-php-client
        2. Drupal options
          1. Apache Solr Search integration module
          2. Hosted Solr by Acquia
      7. Ruby on Rails integrations
        1. The Ruby query response writer
        2. sunspot_rails gem
          1. Setting up MyFaves project
          2. Populating MyFaves relational database from Solr
          3. Build Solr indexes from a relational database
          4. Complete MyFaves website
        3. Which Rails/Ruby library should I use?
      8. Nutch for crawling web pages
      9. Maintaining document security with ManifoldCF
        1. Connectors
        2. Putting ManifoldCF to use
      10. Summary
    19. 10. Scaling Solr
      1. Tuning complex systems
      2. Testing Solr performance with SolrMeter
      3. Optimizing a single Solr server (Scale up)
        1. Configuring JVM settings to improve memory usage
          1. MMapDirectoryFactory to leverage additional virtual memory
        2. Enabling downstream HTTP caching
        3. Solr caching
          1. Tuning caches
        4. Indexing performance
          1. Designing the schema
          2. Sending data to Solr in bulk
          3. Don't overlap commits
          4. Disabling unique key checking
          5. Index optimization factors
        5. Enhancing faceting performance
        6. Using term vectors
        7. Improving phrase search performance
      4. Moving to multiple Solr servers (Scale horizontally)
        1. Replication
        2. Starting multiple Solr servers
          1. Configuring replication
        3. Load balancing searches across slaves
          1. Indexing into the master server
          2. Configuring slaves
        4. Configuring load balancing
        5. Sharding indexes
          1. Assigning documents to shards
          2. Searching across shards (distributed search)
      5. Combining replication and sharding (Scale deep)
        1. Near real time search
          1. Near real time search
      6. Where next for scaling Solr?
      7. Summary
    20. A. Search Quick Reference
      1. Quick reference