Book Description

Where do you start with Apache Solr? We'd suggest with this book, which assumes no prior knowledge and takes you step by careful step through all the essentials, putting you on the road towards successful implementation.

  • Learn to use Solr in real-world contexts, even if you are not a programmer, using simple configuration examples
  • Define simple configurations for searching data in several ways in your specific context, from suggestions to advanced faceted navigation
  • Learn in an easy-to-follow style, full of examples, illustrations, and tips suited to beginners

In Detail

With over 40 billion web pages to search, optimizing a search engine's performance is essential.

Solr is an open source enterprise search platform from the Apache Lucene project. Full-text search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document handling are just some of its many features. Solr is highly scalable thanks to its distributed search and index replication.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable with most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.
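As a minimal sketch of what "REST-like HTTP" means in practice, the Python snippet below only builds the URL for a search request; it does not contact a server. The base URL, the core name paintings, and the helper name build_select_url are assumptions made for illustration. The q, rows, and wt parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, core, query, rows=10, fmt="json"):
    """Build a URL for a Solr core's /select handler.

    base_url and core are illustrative assumptions; adjust them
    to match your own Solr installation.
    """
    params = {
        "q": query,    # the search query
        "rows": rows,  # maximum number of documents to return
        "wt": fmt,     # response writer, for example json or xml
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical local instance and core name.
url = build_select_url("http://localhost:8983/solr", "paintings", "title:monet", rows=5)
print(url)

# Parse the URL back to confirm the parameters survived encoding.
print(parse_qs(urlsplit(url).query))
```

Any language that can issue an HTTP GET request and parse JSON or XML could send such a URL to a running Solr instance, which is why no Java code is needed on the client side.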

With Apache Solr Beginner's Guide you will learn how to configure your own search engine experience. Using real data as an example, you will have the chance to start writing step-by-step, simple, real-world configurations and understand when and where to adopt this technology.

Apache Solr Beginner's Guide will start by letting you explore a simple search over real data. You will then go through a step-by-step description that gives you the chance to explore several practical features. At the end of the book you will see how Solr is used in different real-world contexts.

Using data from public sources such as DBpedia, you will define several different configurations, exploring some of the most interesting Solr features, such as faceted search and navigation, auto-suggestion, and rich document indexing. You will see how to configure different analyzers for handling different data types, without programming.

You will learn the basics of Solr, focusing on real-world examples and practical configurations.

Table of Contents

  1. Apache Solr Beginner's Guide
    1. Table of Contents
    2. Apache Solr Beginner's Guide
    3. Credits
    4. About the Author
    5. Acknowledgments
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Time for action – heading
        1. What just happened?
        2. Pop quiz – heading
      6. Reader feedback
      7. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Getting Ready with the Essentials
      1. Understanding Solr
      2. Learning the powerful aspects of Solr
      3. Working with Java installation
        1. Downloading and installing Java
        2. Configuring CLASSPATH and PATH variables for Java
      4. Installing and testing Solr
      5. Time for action – starting Solr for the first time
        1. What just happened?
        2. Taking a glance at the Solr interface
      6. Time for action – posting some example data
        1. What just happened?
      7. Time for action – testing Solr with cURL
        1. What just happened?
      8. Who uses Solr?
      9. Resources on Solr
      10. How will we use Solr?
        1. Pop quiz
      11. Summary
    10. 2. Indexing with Local PDF Files
      1. Understanding and using an index
      2. Posting example documents to the first Solr core
        1. Analyzing the elements we need in Solr core
      3. Time for action – configuring Solr Home and Solr core discovery
        1. What just happened?
        2. Knowing the legacy solr.xml format
      4. Time for action – writing a simple solrconfig.xml file
        1. What just happened?
      5. Time for action – writing a simple schema.xml file
        1. What just happened?
      6. Time for action – starting the new core
        1. What just happened?
      7. Time for action – defining an example document
        1. What just happened?
      8. Time for action – indexing an example document with cURL
        1. What just happened?
        2. Executing the first search on the new core
        3. Adding documents to the index from the web UI
      9. Time for action – updating an existing document
        1. What just happened?
      10. Time for action – cleaning an index
        1. What just happened?
      11. Creating an index prototype from PDF files
      12. Time for action – defining the schema.xml file with only dynamic fields and tokenization
        1. What just happened?
      13. Time for action – writing a simple solrconfig.xml file with an update handler
        1. What just happened?
        2. Testing the PDF file core with dummy data and an example query
        3. Defining a new tokenized field for fulltext
      14. Time for action – using Tika and cURL to extract text from PDFs
        1. What just happened?
        2. Using cURL to index some PDF data
      15. Time for action – finding copies of the same files with deduplication
        1. What just happened?
      16. Time for action – looking inside an index with SimpleTextCodec
        1. What just happened?
      17. Understanding the structure of an inverted index
        1. Understanding how optimization affects the segments of an index
      18. Writing the full configuration for our PDF index example
        1. Writing the solrconfig.xml file
        2. Writing the schema.xml file
      19. Summarizing some easy recipes for the maintenance of an index
        1. Pop quiz
      20. Summary
    11. 3. Indexing Example Data from DBpedia – Paintings
      1. Harvesting paintings' data from DBpedia
      2. Analyzing the entities that we want to index
        1. Analyzing the first entity – Painting
      3. Writing Solr core configurations for the first tests
      4. Time for action – defining the basic solrconfig.xml file
        1. What just happened?
        2. Looking at the differences between commits and soft commits
      5. Time for action – defining the simple schema.xml file
        1. What just happened?
        2. Introducing analyzers, tokenizers, and filters
        3. Thinking fields for atomic updates
        4. Indexing a test entity with JSON
        5. Understanding the update chain
        6. Using the atomic update
        7. Understanding how optimistic concurrency works
      6. Time for action – listing all the fields with the CSV output
        1. What just happened?
      7. Defining a new Solr core for our Painting entity
      8. Time for action – refactoring the schema.xml file for the paintings core by introducing tokenization and stop words
        1. What just happened?
        2. Using common field attributes for different use cases
        3. Testing the paintings schema
      9. Collecting the paintings data from DBpedia
        1. Downloading data using the DBpedia SPARQL endpoint
        2. Creating Solr documents for example data
        3. Indexing example data
      10. Testing our paintings core
      11. Time for action – looking at a field using the Schema browser in the web interface
        1. What just happened?
      12. Time for action – searching the new data in the paintings core
        1. What just happened?
        2. Using the Solr web interface for simple maintenance tasks
        3. Pop quiz
      13. Summary
    12. 4. Searching the Example Data
      1. Looking at Solr's standard query parameters
        1. Adding a timestamp field for tracking the last modified time
        2. Sending Solr's query parameters over HTTP
          1. Testing HTTP parameters on browsers
        3. Choosing a format for the output
      2. Time for action – searching for all documents with pagination
        1. What just happened?
      3. Time for action – projecting fields with fl
        1. What just happened?
        2. Introducing pseudo-fields and DocTransformers
          1. Adding a constant field using transformers
      4. Time for action – adding a custom DocTransformer to hide empty fields in the results
        1. What just happened?
        2. Looking at core parameters for queries
        3. Using the Lucene query parser with defType
      5. Time for action – searching for terms with a Boolean query
        1. What just happened?
      6. Time for action – using q.op for the default Boolean operator
        1. What just happened?
      7. Time for action – selecting documents with the filter query
        1. What just happened?
      8. Time for action – searching for incomplete terms with the wildcard query
        1. What just happened?
      9. Time for action – using the Boost options
        1. What just happened?
          1. Understanding the basic Lucene score
      10. Time for action – searching for similar terms with fuzzy search
        1. What just happened?
      11. Time for action – writing a simple phrase query example
        1. What just happened?
      12. Time for action – playing with range queries
        1. What just happened?
      13. Time for action – sorting documents with the sort parameter
        1. What just happened?
          1. Playing with the request
      14. Time for action – adding a default parameter to a handler
        1. What just happened?
          1. Playing with the response
        2. Summarizing the parameters that affect result presentation
        3. Analyzing response format
      15. Time for action – enabling XSLT Response Writer with Luke
        1. What just happened?
        2. Listing all fields names with CSV output
        3. Listing all field details for a core
        4. Exploring Solr for Open Data publishing
          1. Publishing results in CSV format
          2. Publishing results with an RSS feed
        5. Good resources on Solr Query Syntax
        6. Pop quiz
      16. Summary
    13. 5. Extending Search
      1. Looking at different search parsers – Lucene, Dismax, and Edismax
        1. Starting from the previous core definition
      2. Time for action – inspecting results using the stats and debug components
        1. What just happened?
        2. Looking at Lucene and Solr query parsers
      3. Time for action – debugging a query with the Lucene parser
        1. What just happened?
      4. Time for action – debugging a query with the Dismax parser
        1. What just happened?
        2. Using an Edismax default handler
      5. Time for action – executing a nested Edismax query
        1. What just happened?
      6. A short list of search components
        1. Adding the blooming filter and real-time Get
      7. Time for action – executing a simple pseudo-join query
        1. What just happened?
        2. Highlighting results to improve the search experience
      8. Time for action – generating highlighted snippets over a term
        1. What just happened?
      9. Some idea about geolocalization with Solr
      10. Time for action – creating a repository of cities
        1. What just happened?
        2. Playing more with spatial search
        3. Looking at the new Solr 4 spatial features – from points to polygons
      11. Time for action – expanding the original data with coordinates during the update process
        1. What just happened?
      12. Performing editorial correction on boosting
      13. Introducing the spellcheck component
      14. Time for action – playing with spellchecks
        1. What just happened?
        2. Using a file to spellcheck against a list of controlled words
        3. Collecting some hints for spellchecking analysis
        4. Pop quiz
      15. Summary
    14. 6. Using Faceted Search – from Searching to Finding
      1. Exploring documents suggestion and matching with faceted search
      2. Time for action – prototyping an auto-suggester with facets
        1. What just happened?
      3. Time for action – creating wordclouds on facets to view and analyze data
        1. What just happened?
      4. Thinking about faceted search and findability
        1. Faceting for narrowing searches and exploring data
      5. Time for action – defining facets over enumerated fields
        1. What just happened?
      6. Performing data normalization for the keyword field during the update phase
        1. Reading more about Solr faceting parameters
      7. Time for action – finding interesting topics using faceting on tokenized fields with a filter query
        1. What just happened?
      8. Using filter queries for caching filters
      9. Time for action – finding interesting subjects using a facet query
        1. What just happened?
      10. Time for action – using range queries and facet range queries
        1. What just happened?
      11. Time for action – using a hierarchical facet (pivot)
        1. What just happened?
      12. Introducing group and field collapsing
      13. Time for action – grouping results
        1. What just happened?
      14. Playing with terms
      15. Time for action – playing with a term suggester
        1. What just happened?
        2. Thinking about term vectors and similarity
          1. Moving to semantics with vector space models
          2. Looking at the next step – customizing similarity
      16. Time for action – having a look at the term vectors
        1. What just happened?
        2. Reading about functions
      17. Introducing the More Like This component and recommendations
      18. Time for action – obtaining similar documents by More Like This
        1. What just happened?
          1. Adopting a More Like This handler
        2. Pop quiz
      19. Summary
    15. 7. Working with Multiple Entities, Multicores, and Distributed Search
      1. Working with multiple entities
      2. Time for action – searching for cities using multiple core joins
        1. What just happened?
        2. Preparing example data for multiple entities
          1. Downloading files for multiple entities
          2. Generating Solr documents
        3. Playing with joins on multicores (a core for every entity)
      3. Using sharding for distributed search
      4. Time for action – playing with sharding (distributed search)
        1. What just happened?
      5. Time for action – finding a document from any shard
        1. What just happened?
      6. Collecting some ideas on schemaless versus normalization
        1. Creating a single denormalized index
          1. Adding a field to track entity type
          2. Analyzing, designing, and refactoring our domain
          3. Using document clustering as a domain analysis tool
          4. Managing index replication
          5. Clustering Solr for distributed search using SolrCloud
        2. Taking a journey from single core to SolrCloud
        3. Understanding why we need Zookeeper
      7. Time for action – testing SolrCloud and Zookeeper locally
        1. What just happened?
        2. Looking at the suggested configurations for SolrCloud
          1. Changing the schema.xml file
          2. Changing the solrconfig.xml file
        3. Knowing the pros and cons of SolrCloud
        4. Pop quiz
      8. Summary
    16. 8. Indexing External Data sources
      1. Stepping further into the real world
        1. Collecting example data from the Web Gallery of Art site
      2. Time for action – indexing data from a database (for example, a blog or an e-commerce website)
        1. What just happened?
      3. Time for action – handling sub-entities (for example, joins on complex data)
        1. What just happened?
      4. Time for action – indexing incrementally using delta imports
        1. What just happened?
      5. Time for action – indexing CSV (for example, open data)
        1. What just happened?
      6. Time for action – importing Solr XML document files
        1. What just happened?
        2. Importing data from another Solr instance
        3. Indexing emails
      7. Time for action – indexing rich documents (for example, PDF)
        1. What just happened?
      8. Adding more consideration about tuning
        1. Understanding Java Virtual Machine, threads, and Solr
        2. Choosing the correct directory for implementation
        3. Adopting Solr cache
      9. Time for action – indexing artist data from Tate Gallery and DBpedia
        1. What just happened?
        2. Using DataImportHandler
        3. Pop quiz
      10. Summary
    17. 9. Introducing Customizations
      1. Looking at the Solr customizations
        1. Adding some more details to the core discovery
      2. Playing with specific languages
      3. Time for action – detecting language with Tika and LangDetect
        1. What just happened?
      4. Introducing stemming for query expansion
      5. Time for action – adopting a stemmer
        1. What just happened?
        2. Testing language analysis with JUnit and Scala
        3. Writing new Solr plugins
        4. Introducing Solr plugin structure and lifecycle
        5. Implementing interfaces for obtaining information
      6. Following an example plugin lifecycle
      7. Time for action – writing a new ResponseWriter plugin with the Thymeleaf library
        1. What just happened?
      8. Using Maven for development
      9. Time for action – integrating Stanford NER for Named Entity extraction
        1. What just happened?
        2. Pointing ideas for Solr's customizations
        3. Pop quiz
      10. Summary
    18. A. Solr Clients and Integrations
      1. Introducing SolrJ – an embedded or remote Solr client using the Java (JVM) API
      2. Time for action – playing with an embedded Solr instance
        1. What just happened?
      3. Choosing between an embedded or remote Solr instance
      4. Time for action – playing with an external HttpSolrServer
        1. What just happened?
      5. Time for action – using Bean Scripting Framework and JavaScript
        1. What just happened?
        2. Jahia CMS
        3. Magnolia CMS
        4. Alfresco DMS and CMS
        5. Liferay
        6. Broadleaf
        7. Apache Jena
        8. Solr Groovy or the Grails plugin
        9. Solr scala
        10. Spring data
      6. Writing Solr clients and integrations outside JVM
        1. JavaScript
          1. Taking a glance at ajax-solr, solrstrap, facetview, and jcloud
        2. Ruby
        3. Python
        4. C# and .NET
        5. PHP
        6. Drupal
        7. WordPress
        8. Magento e-commerce
        9. Platforms for analyzing, manipulating, and enhancing text
        10. Hydra
        11. UIMA
        12. Apache Stanbol
        13. Carrot2
        14. VuFind
        15. Pop quiz
      7. Summary
    19. B. Pop Quiz Answers
      1. Chapter 1, Getting Ready with the Essentials
        1. Pop quiz
      2. Chapter 2, Indexing with Local PDF Files
        1. Pop quiz
      3. Chapter 3, Indexing Example Data from DBpedia – Paintings
        1. Pop quiz
      4. Chapter 4, Searching the Example Data
        1. Pop quiz
      5. Chapter 5, Extending Search
        1. Pop quiz
      6. Chapter 6, Using Faceted Search – from Searching to Finding
        1. Pop quiz
      7. Chapter 7, Working with Multiple Entities, Multicores, and Distributed Search
        1. Pop quiz
      8. Chapter 8, Indexing External Data sources
        1. Pop quiz
      9. Chapter 9, Introducing Customizations
        1. Pop quiz
      10. Appendix, Solr Clients and Integrations
        1. Pop quiz
    20. Index