Appendix D. Resources

Web search engines are your friends. Type lucene in your favorite web search engine and you’ll find many interesting Lucene-related projects. Other good places to look are SourceForge, Google Code, and GitHub; a search for lucene on any of those sites displays a number of open source projects written on top of Lucene.

D.1. Lucene knowledge bases

Search Lucene: http://search-lucene.com/

LucidFind: http://search.lucidimagination.com/

D.2. Internationalization

Unicode page in Wikipedia: http://en.wikipedia.org/wiki/Unicode

The Unicode Consortium: http://unicode.org

Bray, Tim, “Characters vs. Bytes”: www.tbray.org/ongoing/When/200x/2003/04/26/UTF

Green, Dale, “Trail: Internationalization”: http://java.sun.com/docs/books/tutorial/i18n/index.html

Lindenberg, Norbert, and Masayoshi Okutsu, “Supplementary Characters in the Java Platform”: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Peterson, Erik, “Chinese Character Dictionary—Unicode Version”: www.mandarin-tools.com/chardict_u8.html

Spolsky, Joel, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”: www.joelonsoftware.com/articles/Unicode.html

Davis, Mark, “Globalization Gotchas”: http://macchiato.com/slides/GlobalizationGotchas.ppt

D.3. Language detection

Rosette Language Identifier: http://basistech.com/language-identification

Marr, Rich, “Creating a Language Detection API in 30 minutes”: http://richmarr.wordpress.com/2008/10/24/creating-a-language-detection-api-in-30-minutes/

Prager, John M., “Linguini: Language Identification for Multilingual Documents”: ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf

Java Text Categorization Library: http://textcat.sourceforge.net/

NGramJ: http://ngramj.sourceforge.net

Google Ajax Language API: http://code.google.com/apis/ajaxlanguage/documentation/

Sematext Language Identifier: www.sematext.com/products/language-identifier/index.html

Language identification on Wikipedia: http://en.wikipedia.org/wiki/Language_identification

D.4. Term vectors

Vector Space Model on Wikipedia: http://en.wikipedia.org/wiki/Vector_space_model

Latent Semantic Analysis on Wikipedia: http://en.wikipedia.org/wiki/Latent_semantic_analysis

The Latent Semantic Indexing home page: http://lsa.colorado.edu/

“Latent Semantic Indexing (LSI)”: www.cs.utk.edu/~lsi

Stata, Raymie, Krishna Bharat, and Farzin Maghoul, “The Term Vector Database: Fast Access to Indexing Terms for Web Pages”: www9.org/w9cdrom/159/159.html

D.5. Lucene ports

CLucene: www.sourceforge.net/projects/clucene/

Lucene.Net: http://incubator.apache.org/lucene.net/

KinoSearch: www.rectangular.com/kinosearch

Apache Lucy: http://lucene.apache.org/lucy/

PyLucene: http://lucene.apache.org/pylucene/

Ferret: http://ferret.davebalmain.com

PHP (Zend_Search_Lucene, part of Zend Framework): http://framework.zend.com/

D.6. Case studies

Krugle: www.krugle.org/

DERI, SIREn: http://siren.sindice.com/

LinkedIn, Bobo-Browse: http://snaprojects.jira.com/browse/BOBO/

LinkedIn, Zoie: http://snaprojects.jira.com/browse/ZOIE

D.7. Miscellaneous

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008). See www-nlp.stanford.edu/IR-book/.

Calishain, Tara, and Rael Dornfest, Google Hacks (O’Reilly, 2003).

Gilleland, Michael, “Levenshtein Distance, in Three Flavors”: www.merriampark.com/ld.htm

GNU Compiler for the Java Programming Language: http://gcc.gnu.org/java/

Google search results for Lucene: www.google.com/search?q=lucene

Apache Lucene Java: http://lucene.apache.org/java

Lucene Sandbox: http://lucene.apache.org/java/3_0_1/lucene-contrib/index.html

Suffix trees on Wikipedia: http://en.wikipedia.org/wiki/Suffix_tree

D.8. IR software

dmoz results for information retrieval: http://dmoz.org/Computers/Software/Information_Retrieval/

Egothor: www.egothor.org/

Minion: https://minion.dev.java.net/

Google Directory results for information retrieval: http://directory.google.com/Top/Computers/Software/Information_Retrieval/

ht://Dig: www.htdig.org

Managing Gigabytes for Java (MG4J): http://mg4j.dsi.unimi.it

Terrier: http://ir.dcs.gla.ac.uk/terrier

Namazu: www.namazu.org

Hounder: http://hounder.org

Search Tools for Web Sites and Intranets: www.searchtools.com

SWISH++: http://swishplusplus.sourceforge.net/

SWISH-E: http://swish-e.org/

Autonomy: www.autonomy.com

Aperture: http://aperture.sourceforge.net/

WebGlimpse: http://webglimpse.net

Xapian: www.xapian.org

The Lemur Toolkit: www.lemurproject.org

D.9. Doug Cutting’s publications

Doug’s official list of publications, from which the following list was derived, is available at http://lucene.sourceforge.net/publications.html.

D.9.1. Conference papers

“An Interpreter for Phonological Rules,” coauthored with J. Harrington, Proceedings of Institute of Acoustics Autumn Conference, November 1986

“Information Theater versus Information Refinery,” coauthored with J. Pedersen, P.-K. Halvorsen, and M. Withgott, AAAI Spring Symposium on Text-Based Intelligent Systems, March 1990

“Optimizations for Dynamic Inverted Index Maintenance,” coauthored with J. Pedersen, Proceedings of SIGIR ’90, September 1990

“An Object-Oriented Architecture for Text Retrieval,” coauthored with J. O. Pedersen and P.-K. Halvorsen, Proceedings of RIAO ’91, April 1991

“Snippet Search: A Single Phrase Approach to Text Access,” coauthored with J. O. Pedersen and J. W. Tukey, Proceedings of the 1991 Joint Statistical Meetings, August 1991

“A Practical Part-of-Speech Tagger,” coauthored with J. Kupiec, J. Pedersen, and P. Sibun, Proceedings of the Third Conference on Applied Natural Language Processing, April 1992

“Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections,” coauthored with D. Karger, J. Pedersen, and J. Tukey, Proceedings of SIGIR ’92, June 1992

“Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections,” coauthored with D. Karger and J. Pedersen, Proceedings of SIGIR ’93, June 1993

“Porting a Part-of-Speech Tagger to Swedish,” Nordic Datalingvistik Dagen 1993, Stockholm, June 1993

“Space Optimizations for Total Ranking,” coauthored with J. Pedersen, Proceedings of RIAO ’97, Montreal, Quebec, June 1997

D.9.2. U.S. Patents

5,278,980: “Iterative technique for phrase query formation and an information retrieval system employing same,” with J. Pedersen, P.-K. Halvorsen, J. Tukey, E. Bier, and D. Bobrow, filed August 1991

5,442,778: “Scatter-gather: a cluster-based method and apparatus for browsing large document collections,” with J. Pedersen, D. Karger, and J. Tukey, filed November 1991

5,390,259: “Methods and apparatus for selecting semantically significant images in a document image without decoding image content,” with M. Withgott, S. Bagley, D. Bloomberg, D. Huttenlocher, R. Kaplan, T. Cass, P.-K. Halvorsen, and R. Rao, filed November 1991

5,625,554: “Finite-state transduction of related word forms for text indexing and retrieval,” with P.-K. Halvorsen, R.M. Kaplan, L. Karttunen, M. Kay, and J. Pedersen, filed July 1992

5,483,650: “Method of Constant Interaction-Time Clustering Applied to Document Browsing,” with J. Pedersen and D. Karger, filed November 1992

5,384,703: “Method and apparatus for summarizing documents according to theme,” with M. Withgott, filed July 1993

5,838,323: “Document summary computer system user interface,” with D. Rose, J. Bornstein, and J. Hatton, filed September 1995

5,867,164: “Interactive document summarization,” with D. Rose, J. Bornstein, and J. Hatton, filed September 1995

5,870,740: “System and method for improving the ranking of information retrieval results for short queries,” with D. Rose, filed September 1996
