Table of Contents

Copyright

Brief Table of Contents

Table of Contents

Praise for the First Edition

More Praise for the First Edition

Foreword

Preface

Preface to the First Edition

Acknowledgments

About this Book

JUnit primer

About the Authors

Part 1. Core Lucene

Chapter 1. Meet Lucene

1.1. Dealing with information explosion

1.2. What is Lucene?

1.2.1. What Lucene can do

1.2.2. History of Lucene

1.3. Lucene and the components of a search application

1.3.1. Components for indexing

1.3.2. Components for searching

1.3.3. The rest of the search application

1.3.4. Where Lucene fits into your application

1.4. Lucene in action: a sample application

1.4.1. Creating an index

1.4.2. Searching an index

1.5. Understanding the core indexing classes

1.5.1. IndexWriter

1.5.2. Directory

1.5.3. Analyzer

1.5.4. Document

1.5.5. Field

1.6. Understanding the core searching classes

1.6.1. IndexSearcher

1.6.2. Term

1.6.3. Query

1.6.4. TermQuery

1.6.5. TopDocs

1.7. Summary

Chapter 2. Building a search index

2.1. How Lucene models content

2.1.1. Documents and fields

2.1.2. Flexible schema

2.1.3. Denormalization

2.2. Understanding the indexing process

2.2.1. Extracting text and creating the document

2.2.2. Analysis

2.2.3. Adding to the index

2.3. Basic index operations

2.3.1. Adding documents to an index

2.3.2. Deleting documents from an index

2.3.3. Updating documents in the index

2.4. Field options

2.4.1. Field options for indexing

2.4.2. Field options for storing fields

2.4.3. Field options for term vectors

2.4.4. Reader, TokenStream, and byte[] field values

2.4.5. Field option combinations

2.4.6. Field options for sorting

2.4.7. Multivalued fields

2.5. Boosting documents and fields

2.5.1. Boosting documents

2.5.2. Boosting fields

2.5.3. Norms

2.6. Indexing numbers, dates, and times

2.6.1. Indexing numbers

2.6.2. Indexing dates and times

2.7. Field truncation

2.8. Near-real-time search

2.9. Optimizing an index

2.10. Other directory implementations

2.11. Concurrency, thread safety, and locking issues

2.11.1. Thread and multi-JVM safety

2.11.2. Accessing an index over a remote file system

2.11.3. Index locking

2.12. Debugging indexing

2.13. Advanced indexing concepts

2.13.1. Deleting documents with IndexReader

2.13.2. Reclaiming disk space used by deleted documents

2.13.3. Buffering and flushing

2.13.4. Index commits

2.13.5. ACID transactions and index consistency

2.13.6. Merging

2.14. Summary

Chapter 3. Adding search to your application

3.1. Implementing a simple search feature

3.1.1. Searching for a specific term

3.1.2. Parsing a user-entered query expression: QueryParser

3.2. Using IndexSearcher

3.2.1. Creating an IndexSearcher

3.2.2. Performing searches

3.2.3. Working with TopDocs

3.2.4. Paging through results

3.2.5. Near-real-time search

3.3. Understanding Lucene scoring

3.3.1. How Lucene scores

3.3.2. Using explain() to understand hit scoring

3.4. Lucene’s diverse queries

3.4.1. Searching by term: TermQuery

3.4.2. Searching within a term range: TermRangeQuery

3.4.3. Searching within a numeric range: NumericRangeQuery

3.4.4. Searching on a string: PrefixQuery

3.4.5. Combining queries: BooleanQuery

3.4.6. Searching by phrase: PhraseQuery

3.4.7. Searching by wildcard: WildcardQuery

3.4.8. Searching for similar terms: FuzzyQuery

3.4.9. Matching all documents: MatchAllDocsQuery

3.5. Parsing query expressions: QueryParser

3.5.1. Query.toString

3.5.2. TermQuery

3.5.3. Term range searches

3.5.4. Numeric and date range searches

3.5.5. Prefix and wildcard queries

3.5.6. Boolean operators

3.5.7. Phrase queries

3.5.8. Fuzzy queries

3.5.9. MatchAllDocsQuery

3.5.10. Grouping

3.5.11. Field selection

3.5.12. Setting the boost for a subquery

3.5.13. To QueryParse or not to QueryParse?

3.6. Summary

Chapter 4. Lucene’s analysis process

4.1. Using analyzers

4.1.1. Indexing analysis

4.1.2. QueryParser analysis

4.1.3. Parsing vs. analysis: when an analyzer isn’t appropriate

4.2. What’s inside an analyzer?

4.2.1. What’s in a token?

4.2.2. TokenStream uncensored

4.2.3. Visualizing analyzers

4.2.4. TokenFilter order can be significant

4.3. Using the built-in analyzers

4.3.1. StopAnalyzer

4.3.2. StandardAnalyzer

4.3.3. Which core analyzer should you use?

4.4. Sounds-like querying

4.5. Synonyms, aliases, and words that mean the same

4.5.1. Creating SynonymAnalyzer

4.5.2. Visualizing token positions

4.6. Stemming analysis

4.6.1. StopFilter leaves holes

4.6.2. Combining stemming and stop-word removal

4.7. Field variations

4.7.1. Analysis of multivalued fields

4.7.2. Field-specific analysis

4.7.3. Searching on unanalyzed fields

4.8. Language analysis issues

4.8.1. Unicode and encodings

4.8.2. Analyzing non-English languages

4.8.3. Character normalization

4.8.4. Analyzing Asian languages

4.8.5. Zaijian

4.9. Nutch analysis

4.10. Summary

Chapter 5. Advanced search techniques

5.1. Lucene’s field cache

5.1.1. Loading field values for all documents

5.1.2. Per-segment readers

5.2. Sorting search results

5.2.1. Sorting search results by field value

5.2.2. Sorting by relevance

5.2.3. Sorting by index order

5.2.4. Sorting by a field

5.2.5. Reversing sort order

5.2.6. Sorting by multiple fields

5.2.7. Selecting a sorting field type

5.2.8. Using a nondefault locale for sorting

5.3. Using MultiPhraseQuery

5.4. Querying on multiple fields at once

5.5. Span queries

5.5.1. Building block of spanning, SpanTermQuery

5.5.2. Finding spans at the beginning of a field

5.5.3. Spans near one another

5.5.4. Excluding span overlap from matches

5.5.5. SpanOrQuery

5.5.6. SpanQuery and QueryParser

5.6. Filtering a search

5.6.1. TermRangeFilter

5.6.2. NumericRangeFilter

5.6.3. FieldCacheRangeFilter

5.6.4. Filtering by specific terms

5.6.5. Using QueryWrapperFilter

5.6.6. Using SpanQueryFilter

5.6.7. Security filters

5.6.8. Using BooleanQuery for filtering

5.6.9. PrefixFilter

5.6.10. Caching filter results

5.6.11. Wrapping a filter as a query

5.6.12. Filtering a filter

5.6.13. Beyond the built-in filters

5.7. Custom scoring using function queries

5.7.1. Function query classes

5.7.2. Boosting recently modified documents using function queries

5.8. Searching across multiple Lucene indexes

5.8.1. Using MultiSearcher

5.8.2. Multithreaded searching using ParallelMultiSearcher

5.9. Leveraging term vectors

5.9.1. Books like this

5.9.2. What category?

5.9.3. TermVectorMapper

5.10. Loading fields with FieldSelector

5.11. Stopping a slow search

5.12. Summary

Chapter 6. Extending search

6.1. Using a custom sort method

6.1.1. Indexing documents for geographic sorting

6.1.2. Implementing custom geographic sort

6.1.3. Accessing values used in custom sorting

6.2. Developing a custom Collector

6.2.1. The Collector base class

6.2.2. Custom collector: BookLinkCollector

6.2.3. AllDocCollector

6.3. Extending QueryParser

6.3.1. Customizing QueryParser’s behavior

6.3.2. Prohibiting fuzzy and wildcard queries

6.3.3. Handling numeric field-range queries

6.3.4. Handling date ranges

6.3.5. Allowing ordered phrase queries

6.4. Custom filters

6.4.1. Implementing a custom filter

6.4.2. Using our custom filter during searching

6.4.3. An alternative: FilteredQuery

6.5. Payloads

6.5.1. Producing payloads during analysis

6.5.2. Using payloads during searching

6.5.3. Payloads and SpanQuery

6.5.4. Retrieving payloads via TermPositions

6.6. Summary

Part 2. Applied Lucene

Chapter 7. Extracting text with Tika

7.1. What is Tika?

7.2. Tika’s logical design and API

7.3. Installing Tika

7.4. Tika’s built-in text extraction tool

7.5. Extracting text programmatically

7.5.1. Indexing a Lucene document

7.5.2. The Tika utility class

7.5.3. Customizing parser selection

7.6. Tika’s limitations

7.7. Indexing custom XML

7.7.1. Parsing using SAX

7.7.2. Parsing and indexing using Apache Commons Digester

7.8. Alternatives

7.9. Summary

Chapter 8. Essential Lucene extensions

8.1. Luke, the Lucene Index Toolbox

8.1.1. Overview: seeing the big picture

8.1.2. Document browsing

8.1.3. Using QueryParser to search

8.1.4. Files and plugins view

8.2. Analyzers, tokenizers, and TokenFilters

8.2.1. SnowballAnalyzer

8.2.2. Ngram filters

8.2.3. Shingle filters

8.2.4. Obtaining the contrib analyzers

8.3. Highlighting query terms

8.3.1. Highlighter components

8.3.2. Standalone highlighter example

8.3.3. Highlighting with CSS

8.3.4. Highlighting search results

8.4. FastVectorHighlighter

8.5. Spell checking

8.5.1. Generating a suggestions list

8.5.2. Selecting the best suggestion

8.5.3. Presenting the result to the user

8.5.4. Some ideas to improve spell checking

8.6. Fun and interesting Query extensions

8.6.1. MoreLikeThis

8.6.2. FuzzyLikeThisQuery

8.6.3. BoostingQuery

8.6.4. TermsFilter

8.6.5. DuplicateFilter

8.6.6. RegexQuery

8.7. Building contrib modules

8.7.1. Get the sources

8.7.2. Ant in the contrib directory

8.8. Summary

Chapter 9. Further Lucene extensions

9.1. Chaining filters

9.2. Storing an index in Berkeley DB

9.3. Synonyms from WordNet

9.3.1. Building the synonym index

9.3.2. Tying WordNet synonyms into an analyzer

9.4. Fast memory-based indices

9.5. XML QueryParser: Beyond “one box” search interfaces

9.5.1. Using XmlQueryParser

9.5.2. Extending the XML query syntax

9.6. Surround query language

9.7. Spatial Lucene

9.7.1. Indexing spatial data

9.7.2. Searching spatial data

9.7.3. Performance characteristics of Spatial Lucene

9.8. Searching multiple indexes remotely

9.9. Flexible QueryParser

9.10. Odds and ends

9.11. Summary

Chapter 10. Using Lucene from other programming languages

10.1. Ports primer

10.1.1. Trade-offs

10.1.2. Choosing the right port

10.2. CLucene (C++)

10.2.1. Motivation

10.2.2. API and index compatibility

10.2.3. Supported platforms

10.2.4. Current and future work

10.3. Lucene.Net (C# and other .NET languages)

10.3.1. API compatibility

10.3.2. Index compatibility

10.4. KinoSearch and Lucy (Perl)

10.4.1. KinoSearch

10.4.2. Lucy

10.4.3. Other Perl options

10.5. Ferret (Ruby)

10.6. PHP

10.6.1. Zend Framework

10.6.2. PHP Bridge

10.7. PyLucene (Python)

10.7.1. API compatibility

10.7.2. Other Python options

10.8. Solr (many programming languages)

10.9. Summary

Chapter 11. Lucene administration and performance tuning

11.1. Performance tuning

11.1.1. Simple performance-tuning steps

11.1.2. Testing approach

11.1.3. Tuning for index-to-search delay

11.1.4. Tuning for indexing throughput

11.1.5. Tuning for search latency and throughput

11.2. Threads and concurrency

11.2.1. Using threads for indexing

11.2.2. Using threads for searching

11.3. Managing resource consumption

11.3.1. Disk space

11.3.2. File descriptors

11.3.3. Memory

11.4. Hot backups of the index

11.4.1. Creating the backup

11.4.2. Restoring the index

11.5. Common errors

11.5.1. Index corruption

11.5.2. Repairing an index

11.6. Summary

Part 3. Case studies

Chapter 12. Case study 1: Krugle

12.1. Introducing Krugle

12.2. Appliance architecture

12.3. Search performance

12.4. Parsing source code

12.5. Substring searching

12.6. Query vs. search

12.7. Future improvements

12.7.1. FieldCache memory usage

12.7.2. Combining indexes

12.8. Summary

Chapter 13. Case study 2: SIREn

13.1. Introducing SIREn

13.2. SIREn’s benefits

13.2.1. Searching across all fields

13.2.2. A single efficient lexicon

13.2.3. Flexible fields

13.2.4. Efficient handling of multivalued fields

13.3. Indexing entities with SIREn

13.3.1. Data model

13.3.2. Implementation issues

13.3.3. Index schema

13.3.4. Data preparation before indexing

13.4. Searching entities with SIREn

13.4.1. Searching content

13.4.2. Restricting search within a cell

13.4.3. Combining cells into tuples

13.4.4. Querying an entity description

13.5. Integrating SIREn in Solr

13.6. Benchmark

13.7. Summary

Chapter 14. Case study 3: LinkedIn

14.1. Faceted search with Bobo Browse

14.1.1. Bobo Browse design

14.1.2. Beyond simple faceting

14.2. Real-time search with Zoie

14.2.1. Zoie architecture

14.2.2. Real-time vs. near-real-time

14.2.3. Documents and indexing requests

14.2.4. Custom IndexReaders

14.2.5. Comparison with Lucene near-real-time search

14.2.6. Distributed search

14.3. Summary

Appendix A. Installing Lucene

A.1. Binary installation

A.2. Running the command-line demo

A.3. Running the web application demo

A.4. Building from source

A.5. Troubleshooting

Appendix B. Lucene index format

B.1. Logical index view

B.2. About index structure

B.2.1. Understanding the multifile index structure

B.2.2. Understanding the compound index structure

B.2.3. Converting from one index structure to the other

B.3. Inverted index

Field Names (.FNM)

Term Dictionary (.TIS, .TII)

Term Frequencies

Term Positions

Stored Fields

Term Vectors

Norms

Deletions

B.4. Summary

Appendix C. Lucene/contrib benchmark

C.1. Running an algorithm

C.2. Parts of an algorithm file

C.2.1. Content source and document maker

C.2.2. Query maker

C.3. Control structures

C.4. Built-in tasks

C.4.1. Creating and using line files

C.4.2. Built-in reporting tasks

C.5. Evaluating search quality

C.6. Errors

C.7. Summary

Appendix D. Resources

D.1. Lucene knowledgebases

D.2. Internationalization

D.3. Language detection

D.4. Term vectors

D.5. Lucene ports

D.6. Case studies

D.7. Miscellaneous

D.8. IR software

D.9. Doug Cutting’s publications

D.9.1. Conference papers

D.9.2. U.S. Patents

Index

List of Figures

List of Tables

List of Listings
