About this Book

Lucene in Action, Second Edition delivers details, best practices, caveats, tips, and tricks for using the best open-source search engine available.

This book assumes the reader is familiar with basic Java programming. Lucene’s core is a single Java Archive (JAR) file, less than 1MB in size with no dependencies, and it integrates into the simplest Java stand-alone console program as well as the most sophisticated enterprise application.

Roadmap

We organized part 1 of this book to cover the core Lucene Application Programming Interface (API) in the order you’re likely to encounter it as you integrate Lucene into your applications:

  • In chapter 1, you meet Lucene. We introduce basic information-retrieval terminology and describe the components of modern search applications. Without wasting any time, we immediately build simple indexing and searching applications that you can put right to use or adapt to your needs. This example application opens the door for exploring the rest of Lucene’s capabilities.
  • Chapter 2 familiarizes you with Lucene’s indexing operations. We describe the various field types and techniques for indexing numbers and dates. We also cover tuning the indexing process, optimizing an index, using near-real-time search, and handling thread safety.
  • Chapter 3 takes you through basic searching, including details of how Lucene ranks documents based on a query. We discuss the fundamental query types as well as how they can be created through human-entered query expressions using Lucene’s QueryParser.
  • Chapter 4 delves deep into the heart of Lucene’s indexing magic, the analysis process. We cover the analyzer building blocks, including tokens, token streams, and token filters. Each of the built-in analyzers gets its share of attention and detail. We build several custom analyzers, showcasing synonym injection and Metaphone (similar to Soundex) replacement. Analysis of non-English languages is covered, with specific examples of analyzing Chinese text.
  • Chapter 5 picks up where the searching chapter left off, with analysis now in mind. We cover several advanced searching features, including sorting, filtering, and term vectors. The advanced query types make their appearance, including the spectacular SpanQuery family. Finally, we cover Lucene’s built-in support for querying multiple indexes, even in parallel.
  • Chapter 6 goes well beyond advanced searching, showing you how to extend Lucene’s searching capabilities. You’ll learn how to customize search results sorting, extend query expression parsing, implement hit collecting, and tune query performance. Whew!

Part 2 goes beyond Lucene’s built-in facilities and shows you what can be done around and above Lucene:

  • In chapter 7, we show how to use Tika, another open-source project under the same Apache Lucene umbrella, to parse documents in many formats, in order to extract their text and metadata.
  • Chapter 8 shows the important and popular set of extensions and tools around Lucene. Most of these are referred to as “contrib modules”, in reference to the contrib subdirectory that houses them in Lucene’s source control system. We start with Luke, an amazingly useful standalone tool for interacting with a Lucene index, and then move on to contrib modules that enable highlighting search terms and applying spelling correction, along with other goodies like non-English-language analyzers and several new query types.
  • Chapter 9 covers additional functionality offered by Lucene’s contrib modules, including chaining multiple filters together, storing an index in a Berkeley database, and leveraging synonyms from WordNet. We show two fast options for storing an index entirely in RAM, and then move on to xml-query-parser, which enables creating queries from XML. We see how to do spatial searching with Lucene and touch on a new modular QueryParser, plus a few odds and ends.
  • Chapter 10 demonstrates how to access Lucene functionality from various programming languages, such as C++, C#, Python, Perl and Ruby.
  • Chapter 11 covers the administrative side of Lucene, including how to understand disk, memory, and file descriptor usage. We see how to tune Lucene for various metrics like indexing throughput and latency, show you how to make a hot backup of the index without pausing indexing, and explain how to easily take advantage of multiple threads during indexing and searching.

Part 3 (chapters 12, 13, and 14) brings all the technical details of Lucene back into focus with case studies contributed by those who have built interesting, fast, and scalable applications with Lucene at their core.

What’s new in the second edition?

Much has changed in Lucene in the five years since this book was originally published. As is often the case with a successful open-source project with a strong technical architecture, a robust community of users and developers has thrived over time, and from all that energy a number of amazing improvements have emerged. Here’s a sampling of the changes:

  • Using near real-time searching
  • Using Tika to extract text from documents
  • Indexing with NumericField and performing fast numeric range querying with NumericRangeQuery (see the sketch after this list)
  • Updating and deleting documents using IndexWriter
  • Working with IndexWriter’s new transactional semantics (commit, rollback)
  • Improving search concurrency with read-only IndexReaders and NIOFSDirectory
  • Enabling pure Boolean searching
  • Adding payloads to your index and using them with BoostingTermQuery
  • Using IndexReader.reopen to efficiently open a new reader from an existing one
  • Understanding resource usage, like memory, disk, and file descriptors
  • Using Function queries
  • Tuning for performance metrics like indexing and searching throughput
  • Making a hot backup of your index without pausing indexing
  • Using new ports of Lucene to other programming languages
  • Measuring performance using the “benchmark” contrib package
  • Understanding the new reusable TokenStream API
  • Using threads to gain concurrency during indexing and searching
  • Using FieldSelector to speed up loading of stored fields
  • Using TermVectorMapper to customize how term vectors are loaded
  • Understanding simplifications to Lucene’s locking
  • Using custom LockFactory, IndexDeletionPolicy, MergePolicy, and MergeScheduler implementations
  • Using new contrib modules, like XMLQueryParser and Local Lucene search
  • Debugging common problems
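
To give you a taste of how compact these new APIs feel in practice, here is a minimal, self-contained sketch of the NumericField/NumericRangeQuery pairing mentioned in the list above, written against the Lucene 3.0 API. The field name pubmonth and the sample values are ours, chosen only for illustration; this is not code from the book’s source package.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NumericRangeSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();                        // hold the index in RAM for this sketch
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new NumericField("pubmonth").setIntValue(201005)); // trie-encoded numeric field
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange(
            "pubmonth", 201001, 201012, true, true);               // inclusive range over the numeric field
        TopDocs hits = searcher.search(query, 10);
        System.out.println("matching books: " + hits.totalHits);
        searcher.close();
      }
    }

Chapter 2 returns to numeric and date indexing in detail, and chapter 3 picks up the query side.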

Entirely new case studies have been added in chapters 12, 13, and 14. A new chapter (11) has been added to cover the administrative aspects of Lucene. Chapter 7, which previously described a custom framework for parsing different document types, has been rewritten entirely around Tika. In addition, all code samples have been updated to Lucene’s 3.0.1 APIs. And of course, lots of great feedback from our readers has been folded in (thank you, and please keep it coming!).

Who should read this book?

Developers who need powerful search capabilities embedded in their applications should read this book. Lucene in Action, Second Edition is also suitable for developers who are curious about Lucene or about indexing and search techniques but who may not have an immediate need to use it. Adding Lucene know-how to your toolbox is valuable for future projects—search is a hot topic and will remain one.

This book primarily uses the Java version of Lucene (from Apache), and the majority of the code examples use the Java language. Readers familiar with Java will be right at home. Java expertise will be helpful; however, Lucene has been ported to a number of other languages including C++, C#, Python, and Perl. The concepts, techniques, and even the API itself are comparable between the Java and other language versions of Lucene.

Code examples

The source code for this book is available from Manning’s website at http://www.manning.com/LuceneinActionSecondEdition or http://www.manning.com/hatcher3. Instructions for using this code are provided in the README file included with the source-code package.

The majority of the code shown in this book was written by us and is included in the source-code package, licensed under the Apache Software License (http://www.apache.org/licenses/LICENSE-2.0). Some code (particularly the case-study code, and the examples from Lucene’s ports to other programming languages) isn’t provided in our source-code package; the code snippets shown there are owned by the contributors and are donated as is. In a couple of cases, we have included a small snippet of code from Lucene’s codebase, which is also licensed under Apache Software License 2.0.

Code examples don’t include package and import statements, to conserve space; refer to the actual source code for these details. Likewise, in the name of brevity and keeping examples focused on Lucene’s code, there are numerous places where we simply declare throws Exception, whereas for production code you should declare and catch only specific exceptions and implement proper handling when exceptions occur. In some cases there are fragments of code, inlined in the text, that aren’t full standalone examples; these fragments are included in source files named Fragments.java under each subdirectory.
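
As a purely hypothetical illustration (the class and method names below are ours, not from the book’s source package), the difference between the two styles looks roughly like this:

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;

    public class ExceptionStyleSketch {

      // Book-style brevity: let any failure propagate out of the example.
      public void indexDocument(IndexWriter writer, Document doc) throws Exception {
        writer.addDocument(doc);
      }

      // Production-style: handle the specific exceptions IndexWriter declares.
      public void indexDocumentSafely(IndexWriter writer, Document doc) {
        try {
          writer.addDocument(doc);
        } catch (CorruptIndexException e) {
          // the index itself is damaged; surface this to an operator
          throw new RuntimeException("Index is corrupt", e);
        } catch (IOException e) {
          // an I/O problem; log it, then retry or fail the operation
          throw new RuntimeException("Could not index document", e);
        }
      }
    }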

Why JUnit?

We believe code examples in books should be top-notch quality and applicable to the real world. The typical “hello world” examples often insult our intelligence and generally do little to help readers see how to adapt the code to their own environment.

We’ve taken a unique approach to the code examples in Lucene in Action, Second Edition. Many of our examples are actual JUnit test cases (http://www.junit.org), version 4.1. JUnit, the de facto Java unit-testing framework, makes it easy to assert that a particular assumption works as expected in a repeatable fashion. It also cleanly separates what we are trying to accomplish, by showing the small test case up front, from how we accomplish it, by showing the source code behind the APIs invoked by the test case. Automating JUnit test cases through an IDE or Ant allows one-step (or no steps with continuous integration) confidence building. We chose to use JUnit in this book because we use it daily in our other projects and want you to see how we really code. Test-driven development (TDD) is a practice we strongly espouse.
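
If you haven’t seen this style before, here is a minimal sketch, written against the Lucene 3.0 and JUnit 4 APIs, of the kind of test case you’ll find throughout the book; the class name and field values are ours, chosen only for illustration, and this particular test isn’t part of the book’s source package.

    import static org.junit.Assert.assertEquals;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.junit.Test;

    public class IndexingSketchTest {

      @Test
      public void testIndexOneDocumentAndFindIt() throws Exception {
        RAMDirectory dir = new RAMDirectory();                     // in-memory index for the test
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Lucene in Action",
            Field.Store.YES, Field.Index.ANALYZED));               // indexed and stored
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        TermQuery query = new TermQuery(new Term("title", "lucene"));
        assertEquals(1, searcher.search(query, 10).totalHits);     // the assertion documents the expectation
        searcher.close();
      }
    }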

If you’re unfamiliar with JUnit, please read the JUnit primer section. We also suggest that you read Pragmatic Unit Testing in Java with JUnit by Dave Thomas and Andy Hunt, followed by Manning’s JUnit in Action by Vincent Massol and Ted Husted, a second edition of which is in the works by Petar Tahchiev, Felipe Leme, Vincent Massol, and Gary Gregory.

Code conventions and downloads

Source code in listings or in text is in a fixed width font to separate it from ordinary text. Java method names, within text, generally won’t include the full method signature.

In order to accommodate the available page space, code has been formatted with a limited width, including line continuation markers where appropriate.

We don’t include import statements and rarely refer to fully qualified class names—this gets in the way and takes up valuable space. Refer to Lucene’s Javadocs for this information. All decent IDEs have excellent support for automatically adding import statements; Erik blissfully codes without knowing fully qualified class names using IntelliJ IDEA, and Otis and Mike both use XEmacs. Add the Lucene JAR to your project’s classpath, and you’re all set. Also on the classpath issue (which is a notorious nuisance): we assume that the Lucene JAR and any other necessary JARs are available in the classpath and don’t show them explicitly. The lib directory in the source code distribution includes the JARs the sample code uses; when you run the Ant targets, these JARs are placed on the classpath for you.

We’ve created a lot of examples for this book that are freely available to you. A .zip file of all the code is available from Manning’s web site for Lucene in Action: http://www.manning.com/LuceneinActionSecondEdition. Detailed instructions on running the sample code are provided in the main directory of the expanded archive as a README file.

Our test data

Most of our book revolves around a common set of example data to provide consistency and avoid having to grok an entirely new set of data for each section. This example data consists of book details. Table 1 shows the data so that you can reference it and make sense of our examples.

Table 1. Sample data used throughout this book

Title | Author | Category | Subject
------|--------|----------|--------
A Modern Art of Education | Rudolf Steiner | /education/pedagogy | education philosophy psychology practice Waldorf
Lipitor, Thief of Memory | Duane Graveline, Kilmer S. McCully, Jay S. Cohen | /health | cholesterol, statin, lipitor
Nudge: Improving Decisions About Health, Wealth, and Happiness | Richard H. Thaler, Cass R. Sunstein | /health | information architecture, decisions, choices
Imperial Secrets of Health and Longevity | Bob Flaws | /health/alternative/Chinese | diet chinese medicine qi gong health herbs
Tao Te Ching | Stephen Mitchell | /philosophy/eastern | taoism
Gödel, Escher, Bach: an Eternal Golden Braid | Douglas Hofstadter | /technology/computers/ai | artificial intelligence number theory mathematics music
Mindstorms: Children, Computers, And Powerful Ideas | Seymour Papert | /technology/computers/programming/education | children computers powerful ideas LOGO education
Ant in Action | Steve Loughran, Erik Hatcher | /technology/computers/programming | apache ant build tool junit java development
JUnit in Action, Second Edition | Petar Tahchiev, Felipe Leme, Vincent Massol, Gary Gregory | /technology/computers/programming | junit unit testing mock objects
Lucene in Action, Second Edition | Michael McCandless, Erik Hatcher, Otis Gospodnetić | /technology/computers/programming | lucene search java
Extreme Programming Explained | Kent Beck | /technology/computers/programming/methodology | extreme programming agile test driven development methodology
Tapestry in Action | Howard Lewis-Ship | /technology/computers/programming | tapestry web user interface components
The Pragmatic Programmer | Dave Thomas, Andy Hunt | /technology/computers/programming | pragmatic agile methodology developer tools

The data, besides the fields shown in the table, includes fields for ISBN, URL, and publication month. When you unzip the source code available for download at www.manning.com/hatcher3, the books are represented as *.properties files under the data subdirectory, and the command-line tool at src/lia/common/CreateTestIndex.java creates the test index used throughout the book. The fields for category and subject are our own subjective values, but the other information is objectively factual about the books.
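
To give a feel for how that tool works, here is a rough, hypothetical sketch, not the book’s actual CreateTestIndex code, of turning one of those .properties files into a Lucene Document; the property keys and field choices below are our assumptions, so check the files under data and the real CreateTestIndex.java for the exact names it uses.

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BookDocumentSketch {

      // Hypothetical property keys; the real ones are defined by the
      // *.properties files in the data directory and by CreateTestIndex.java.
      public static Document toDocument(File propertiesFile) throws Exception {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(propertiesFile);
        try {
          props.load(in);                                  // one book per .properties file
        } finally {
          in.close();
        }

        Document doc = new Document();
        doc.add(new Field("title", props.getProperty("title"),
            Field.Store.YES, Field.Index.ANALYZED));       // searchable and retrievable
        doc.add(new Field("author", props.getProperty("author"),
            Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("category", props.getProperty("category"),
            Field.Store.YES, Field.Index.NOT_ANALYZED));   // keep the category path as one token
        doc.add(new Field("subject", props.getProperty("subject"),
            Field.Store.YES, Field.Index.ANALYZED));
        return doc;
      }
    }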

Author Online

The purchase of Lucene in Action, Second Edition includes free access to a web forum run by Manning Publications, where you can discuss the book with the authors and other readers. To access the forum and subscribe to it, point your web browser to http://www.manning.com/LuceneinActionSecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

About the title

By combining introductions, overviews, and how-to examples, the In Action books are designed to aid learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.

Although no one at Manning is a cognitive scientist, we are convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, re-telling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action guide is that it is example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.

There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.

About the cover illustration

The figure on the cover of Lucene in Action, Second Edition is “An inhabitant of the coast of Syria.” The illustration is taken from a collection of costumes of the Ottoman Empire published on January 1, 1802, by William Miller of Old Bond Street, London. The title page is missing from the collection and we have been unable to track it down to date. The book’s table of contents identifies the figures in both English and French, and each illustration bears the names of two artists who worked on it, both of whom would no doubt be surprised to find their art gracing the front cover of a computer programming book, two hundred years later.

The collection was purchased by a Manning editor at an antiquarian flea market in the “Garage” on West 26th Street in Manhattan. The seller was an American based in Ankara, Turkey, and the transaction took place just as he was packing up his stand for the day. The Manning editor did not have on his person the substantial amount of cash that was required for the purchase and a credit card and check were both politely turned down.

With the seller flying back to Ankara that evening the situation was getting hopeless. What was the solution? It turned out to be nothing more than an old-fashioned verbal agreement sealed with a handshake. The seller simply proposed that the money be transferred to him by wire and the editor walked out with the seller’s bank information on a piece of paper and the portfolio of images under his arm. Needless to say, we transferred the funds the next day, and we remain grateful and impressed by this unknown person’s trust in one of us. It recalls something that might have happened a long time ago.

The pictures from the Ottoman collection, like the other illustrations that appear on our covers, bring to life the richness and variety of dress customs of two centuries ago. They recall the sense of isolation and distance of that period—and of every other historic period except our own hyperkinetic present.

Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.

We at Manning celebrate the inventiveness, the initiative, and, yes, the fun of the computer business with book covers based on the rich diversity of regional life of two centuries ago—brought back to life by the pictures from this collection.
