by Alfredo Serafini
Apache Solr Beginner's Guide
Table of Contents
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Ready with the Essentials
Understanding Solr
Learning the powerful aspects of Solr
Working with Java installation
Downloading and installing Java
Configuring CLASSPATH and PATH variables for Java
Installing and testing Solr
Time for action – starting Solr for the first time
What just happened?
Taking a glance at the Solr interface
Time for action – posting some example data
What just happened?
Time for action – testing Solr with cURL
What just happened?
Who uses Solr?
Resources on Solr
How will we use Solr?
Pop quiz
Summary
2. Indexing with Local PDF Files
Understanding and using an index
Posting example documents to the first Solr core
Analyzing the elements we need in Solr core
Time for action – configuring Solr Home and Solr core discovery
What just happened?
Knowing the legacy solr.xml format
Time for action – writing a simple solrconfig.xml file
What just happened?
Time for action – writing a simple schema.xml file
What just happened?
Time for action – starting the new core
What just happened?
Time for action – defining an example document
What just happened?
Time for action – indexing an example document with cURL
What just happened?
Executing the first search on the new core
Adding documents to the index from the web UI
Time for action – updating an existing document
What just happened?
Time for action – cleaning an index
What just happened?
Creating an index prototype from PDF files
Time for action – defining the schema.xml file with only dynamic fields and tokenization
What just happened?
Time for action – writing a simple solrconfig.xml file with an update handler
What just happened?
Testing the PDF file core with dummy data and an example query
Defining a new tokenized field for fulltext
Time for action – using Tika and cURL to extract text from PDFs
What just happened?
Using cURL to index some PDF data
Time for action – finding copies of the same files with deduplication
What just happened?
Time for action – looking inside an index with SimpleTextCodec
What just happened?
Understanding the structure of an inverted index
Understanding how optimization affects the segments of an index
Writing the full configuration for our PDF index example
Writing the solrconfig.xml file
Writing the schema.xml file
Summarizing some easy recipes for the maintenance of an index
Pop quiz
Summary
3. Indexing Example Data from DBpedia – Paintings
Harvesting paintings' data from DBpedia
Analyzing the entities that we want to index
Analyzing the first entity – Painting
Writing Solr core configurations for the first tests
Time for action – defining the basic solrconfig.xml file
What just happened?
Looking at the differences between commits and soft commits
Time for action – defining the simple schema.xml file
What just happened?
Introducing analyzers, tokenizers, and filters
Thinking fields for atomic updates
Indexing a test entity with JSON
Understanding the update chain
Using the atomic update
Understanding how optimistic concurrency works
Time for action – listing all the fields with the CSV output
What just happened?
Defining a new Solr core for our Painting entity
Time for action – refactoring the schema.xml file for the paintings core by introducing tokenization and stop words
What just happened?
Using common field attributes for different use cases
Testing the paintings schema
Collecting the paintings data from DBpedia
Downloading data using the DBpedia SPARQL endpoint
Creating Solr documents for example data
Indexing example data
Testing our paintings core
Time for action – looking at a field using the Schema browser in the web interface
What just happened?
Time for action – searching the new data in the paintings core
What just happened?
Using the Solr web interface for simple maintenance tasks
Pop quiz
Summary
4. Searching the Example Data
Looking at Solr's standard query parameters
Adding a timestamp field for tracking the last modified time
Sending Solr's query parameters over HTTP
Testing HTTP parameters on browsers
Choosing a format for the output
Time for action – searching for all documents with pagination
What just happened?
Time for action – projecting fields with fl
What just happened?
Introducing pseudo-fields and DocTransformers
Adding a constant field using transformers
Time for action – adding a custom DocTransformer to hide empty fields in the results
What just happened?
Looking at core parameters for queries
Using the Lucene query parser with defType
Time for action – searching for terms with a Boolean query
What just happened?
Time for action – using q.op for the default Boolean operator
What just happened?
Time for action – selecting documents with the filter query
What just happened?
Time for action – searching for incomplete terms with the wildcard query
What just happened?
Time for action – using the Boost options
What just happened?
Understanding the basic Lucene score
Time for action – searching for similar terms with fuzzy search
What just happened?
Time for action – writing a simple phrase query example
What just happened?
Time for action – playing with range queries
What just happened?
Time for action – sorting documents with the sort parameter
What just happened?
Playing with the request
Time for action – adding a default parameter to a handler
What just happened?
Playing with the response
Summarizing the parameters that affect result presentation
Analyzing response format
Time for action – enabling XSLT Response Writer with Luke
What just happened?
Listing all field names with CSV output
Listing all field details for a core
Exploring Solr for Open Data publishing
Publishing results in CSV format
Publishing results with an RSS feed
Good resources on Solr Query Syntax
Pop quiz
Summary
5. Extending Search
Looking at different search parsers – Lucene, Dismax, and Edismax
Starting from the previous core definition
Time for action – inspecting results using the stats and debug components
What just happened?
Looking at Lucene and Solr query parsers
Time for action – debugging a query with the Lucene parser
What just happened?
Time for action – debugging a query with the Dismax parser
What just happened?
Using an Edismax default handler
Time for action – executing a nested Edismax query
What just happened?
A short list of search components
Adding the Bloom filter and real-time Get
Time for action – executing a simple pseudo-join query
What just happened?
Highlighting results to improve the search experience
Time for action – generating highlighted snippets over a term
What just happened?
Some ideas about geolocalization with Solr
Time for action – creating a repository of cities
What just happened?
Playing more with spatial search
Looking at the new Solr 4 spatial features – from points to polygons
Time for action – expanding the original data with coordinates during the update process
What just happened?
Performing editorial correction on boosting
Introducing the spellcheck component
Time for action – playing with spellchecks
What just happened?
Using a file to spellcheck against a list of controlled words
Collecting some hints for spellchecking analysis
Pop quiz
Summary
6. Using Faceted Search – from Searching to Finding
Exploring documents suggestion and matching with faceted search
Time for action – prototyping an auto-suggester with facets
What just happened?
Time for action – creating wordclouds on facets to view and analyze data
What just happened?
Thinking about faceted search and findability
Faceting for narrowing searches and exploring data
Time for action – defining facets over enumerated fields
What just happened?
Performing data normalization for the keyword field during the update phase
Reading more about Solr faceting parameters
Time for action – finding interesting topics using faceting on tokenized fields with a filter query
What just happened?
Using filter queries for caching filters
Time for action – finding interesting subjects using a facet query
What just happened?
Time for action – using range queries and facet range queries
What just happened?
Time for action – using a hierarchical facet (pivot)
What just happened?
Introducing group and field collapsing
Time for action – grouping results
What just happened?
Playing with terms
Time for action – playing with a term suggester
What just happened?
Thinking about term vectors and similarity
Moving to semantics with vector space models
Looking at the next step – customizing similarity
Time for action – having a look at the term vectors
What just happened?
Reading about functions
Introducing the More Like This component and recommendations
Time for action – obtaining similar documents by More Like This
What just happened?
Adopting a More Like This handler
Pop quiz
Summary
7. Working with Multiple Entities, Multicores, and Distributed Search
Working with multiple entities
Time for action – searching for cities using multiple core joins
What just happened?
Preparing example data for multiple entities
Downloading files for multiple entities
Generating Solr documents
Playing with joins on multicores (a core for every entity)
Using sharding for distributed search
Time for action – playing with sharding (distributed search)
What just happened?
Time for action – finding a document from any shard
What just happened?
Collecting some ideas on schemaless versus normalization
Creating a single denormalized index
Adding a field to track entity type
Analyzing, designing, and refactoring our domain
Using document clustering as a domain analysis tool
Managing index replication
Clustering Solr for distributed search using SolrCloud
Taking a journey from single core to SolrCloud
Understanding why we need Zookeeper
Time for action – testing SolrCloud and Zookeeper locally
What just happened?
Looking at the suggested configurations for SolrCloud
Changing the schema.xml file
Changing the solrconfig.xml file
Knowing the pros and cons of SolrCloud
Pop quiz
Summary
8. Indexing External Data Sources
Stepping further into the real world
Collecting example data from the Web Gallery of Art site
Time for action – indexing data from a database (for example, a blog or an e-commerce website)
What just happened?
Time for action – handling sub-entities (for example, joins on complex data)
What just happened?
Time for action – indexing incrementally using delta imports
What just happened?
Time for action – indexing CSV (for example, open data)
What just happened?
Time for action – importing Solr XML document files
What just happened?
Importing data from another Solr instance
Indexing emails
Time for action – indexing rich documents (for example, PDF)
What just happened?
Adding more consideration about tuning
Understanding Java Virtual Machine, threads, and Solr
Choosing the correct directory for implementation
Adopting Solr cache
Time for action – indexing artist data from Tate Gallery and DBpedia
What just happened?
Using DataImportHandler
Pop quiz
Summary
9. Introducing Customizations
Looking at the Solr customizations
Adding some more details to the core discovery
Playing with specific languages
Time for action – detecting language with Tika and LangDetect
What just happened?
Introducing stemming for query expansion
Time for action – adopting a stemmer
What just happened?
Testing language analysis with JUnit and Scala
Writing new Solr plugins
Introducing Solr plugin structure and lifecycle
Implementing interfaces for obtaining information
Following an example plugin lifecycle
Time for action – writing a new ResponseWriter plugin with the Thymeleaf library
What just happened?
Using Maven for development
Time for action – integrating Stanford NER for Named Entity extraction
What just happened?
Pointing ideas for Solr's customizations
Pop quiz
Summary
A. Solr Clients and Integrations
Introducing SolrJ – an embedded or remote Solr client using the Java (JVM) API
Time for action – playing with an embedded Solr instance
What just happened?
Choosing between an embedded or remote Solr instance
Time for action – playing with an external HttpSolrServer
What just happened?
Time for action – using Bean Scripting Framework and JavaScript
What just happened?
Jahia CMS
Magnolia CMS
Alfresco DMS and CMS
Liferay
Broadleaf
Apache Jena
Solr Groovy or the Grails plugin
Solr scala
Spring Data
Writing Solr clients and integrations outside JVM
JavaScript
Taking a glance at ajax-solr, solrstrap, facetview, and jcloud
Ruby
Python
C# and .NET
PHP
Drupal
WordPress
Magento e-commerce
Platforms for analyzing, manipulating, and enhancing text
Hydra
UIMA
Apache Stanbol
Carrot2
VuFind
Pop quiz
Summary
B. Pop Quiz Answers
Chapter 1, Getting Ready with the Essentials
Pop quiz
Chapter 2, Indexing with Local PDF Files
Pop quiz
Chapter 3, Indexing Example Data from DBpedia – Paintings
Pop quiz
Chapter 4, Searching the Example Data
Pop quiz
Chapter 5, Extending Search
Pop quiz
Chapter 6, Using Faceted Search – from Searching to Finding
Pop quiz
Chapter 7, Working with Multiple Entities, Multicores, and Distributed Search
Pop quiz
Chapter 8, Indexing External Data Sources
Pop quiz
Chapter 9, Introducing Customizations
Pop quiz
Appendix, Solr Clients and Integrations
Pop quiz
Index