List of Figures

Chapter 1. Meet Lucene

Figure 1.1. Searching the internet with Google

Figure 1.2. Mac OS X Finder with its embedded search capability

Figure 1.3. Apple’s iTunes intuitively embeds search functionality.

Figure 1.4. Typical components of a search application; the shaded components show which parts Lucene handles.

Figure 1.5. Classes used when indexing documents with Lucene

Chapter 2. Building a search index

Figure 2.1. Indexing with Lucene breaks down into three main operations: extracting text from source documents, analyzing it, and saving it to the index.

Figure 2.2. Segmented structure of a Lucene inverted index

Figure 2.3. A single IndexWriter can be shared by multiple threads.

Figure 2.4. An in-memory document buffer helps improve Lucene’s indexing performance.

Chapter 3. Adding search to your application

Figure 3.1. QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching.

Figure 3.2. The relationship between the common classes used for searching

Figure 3.3. Lucene uses this formula to determine a document score based on a query.

Figure 3.4. Illustrating the PhraseQuery slop factor: “quick fox” requires a slop of 1 to match, whereas “fox quick” requires a slop of 3 to match.

Figure 3.5. Sloppy phrase scoring formula

Figure 3.6. FuzzyQuery distance formula

Figure 3.7. A Query can have an arbitrary nested structure, easily expressed with QueryParser’s grouping. This query is achieved by parsing the expression (+"brown fox" +quick) "red dog".

Chapter 4. Lucene’s analysis process

Figure 4.1. Analysis process during indexing. Fields 1 and 2 are analyzed, producing a sequence of tokens; Field 3 is unanalyzed, causing its entire value to be indexed as a single token.

Figure 4.2. A token stream with positional and offset information

Figure 4.3. The hierarchy of classes used to produce tokens: TokenStream is the abstract base class; Tokenizer creates tokens from a Reader; and TokenFilter filters any other TokenStream.

Figure 4.4. An analyzer chain starts with a Tokenizer, to produce initial tokens from the characters read from a Reader, then modifies the tokens with any number of chained TokenFilters.

Figure 4.5. TokenFilter and Tokenizer class hierarchy

Figure 4.6. SynonymAnalyzer visualized as factory automation

Figure 4.7. ChineseDemo illustrating analysis of the title Tao Te Ching

Figure 4.8. Analysis chain that includes character normalization

Chapter 5. Advanced search techniques

Figure 5.1. SpanTermQuery for brown

Figure 5.2. SpanFirstQuery requires that the positional match occur near the start of the field

Figure 5.3. SpanNearQuery requires positional matches to be close to one another.

Figure 5.4. One clause of the SpanOrQuery

Figure 5.5. Term vectors for two documents containing the terms cat and dog

Figure 5.6. Formula for computing the angle between two term vectors

Chapter 6. Extending search

Figure 6.1. Which Mexican restaurant is closest to home (at 0,0) or work (at 10,10)?

Figure 6.2. A filter provides a bit for every document in the index. Only documents with 1 are accepted.

Chapter 7. Extracting text with Tika

Figure 7.1. You can drag and drop any binary document onto Tika’s built-in text extraction tool GUI in order to see what text and metadata Tika extracts.

Chapter 8. Essential Lucene extensions

Figure 8.1. This Luke dialog box provides interesting options for opening the index.

Figure 8.2. Luke’s Overview tab allows you to browse fields and terms.

Figure 8.3. Luke’s Documents tab shows all fields for the document you select.

Figure 8.4. Searching: an easy way to experiment with QueryParser

Figure 8.5. Lucene’s scoring explanation details how the score for a specified document was computed.

Figure 8.6. Luke includes several useful built-in plug-ins.

Figure 8.7. Highlighting matching query terms within text

Figure 8.8. Java classes and interfaces used by Highlighter

Figure 8.9. FastVectorHighlighter supports multicolored hit highlighting out of the box.

Chapter 9. Further Lucene extensions

Figure 9.1. WordNet shows word interconnections, such as this entry for the word search.

Figure 9.2. Viewing the synonyms for search using Luke’s Documents tab

Figure 9.3. Three common options for building a Lucene query from a search UI

Figure 9.4. Advanced search user interface for a job search site, implemented with XmlQueryParser

Figure 9.5. Projecting the globe’s three-dimensional surface into two dimensions is necessary for spatial search.

Figure 9.6. Tiers and grid boxes recursively divide two dimensions into smaller and smaller areas.

Figure 9.7. Remote searching through RMI, with the server searching multiple indexes

Chapter 11. Lucene administration and performance tuning

Figure 11.1. Steps to test indexing throughput on Wikipedia articles

Figure 11.2. ThreadedIndexWriter manages multiple threads for you.

Figure 11.3. Disk usage while building an index of all Wikipedia documents, with optimize called at the end

Figure 11.4. File descriptor consumption while building an index of Wikipedia articles

Figure 11.5. File descriptor usage by an IndexReader reopening every 30 seconds while Wikipedia articles are indexed

Chapter 12. Case study 1: Krugle

Figure 12.1. The Krugle.org search result page showing matches in multiple projects and multiple source code files

Figure 12.2. Krugle runs two JVMs in a single appliance and indexes content previously collected and digested by external agents.

Chapter 13. Case study 2: SIREn

Figure 13.1. A visual representation of an RDF graph. The RDF graph is split (dashed lines) into three entities identified by the nodes renaud, giovanni, and DERI.

Figure 13.2. Star-shaped query matching the entity renaud, where ? is the bound variable and * a wildcard

Chapter 14. Case study 3: LinkedIn

Figure 14.1. The LinkedIn.com search result page with facets, facet values, and their counts on the right

Figure 14.2. Zoie’s three-index architecture: two in-memory indexes, and one disk-based index

Figure 14.3. The read-only JMX view of Zoie’s attributes, as rendered by JConsole

Figure 14.4. Zoie exposes controls via JMX, allowing an operator to change its behavior at runtime.

Figure 14.5. Distributed search with Zoie

Appendix B. Lucene index format

Figure B.1. The logical, black-box view of a Lucene index

Figure B.2. Unoptimized index with three segments, holding 24 documents

Figure B.3. Detailed look inside the Lucene index format
