Semantic Search
You have reached the last chapter of this book, and in this journey you have learned about the significant features of Solr and the nitty-gritty of using it. In previous chapters, you also learned about information retrieval concepts and relevance ranking, which are essential for understanding Solr’s internals and the how and why of scoring. This knowledge is indispensable for what you will be doing most of the time: tuning the document relevance. With all of this information, you should be able to develop an effective search engine that retrieves relevant documents for the query, ranks them appropriately, and provides other features that add to the user experience.
So far, so good, but users expect more. If you look at some of the search applications on the market, they are implementing lots of innovative capabilities and plugging in diverse frameworks and components to take the search experience to the next level. With expectations set this high, the engine needs to match beyond keywords and understand the underlying semantics and user intent. Google, for example, also acts like a question-answering system. It applies enormous intelligence to understand the semantics of the query and acts accordingly. If you analyze your query logs, you will find that a considerable number of queries express the user’s intent and not just keywords, though that depends on the domain.
At this point, you should be able to develop a decent keyword-based search engine, but with a few limitations. When you put it all together, the system won’t understand the semantics. If your application caters to a medical domain, for example, a user query of heart attack will fail to retrieve results for cardiac arrest, which might be more interesting to the medical practitioners.
A semantic search addresses the limitations of a keyword-based search by understanding the user intent and the contextual meaning of terms. The acquired knowledge can be utilized in many ways to improve the search accuracy and relevance ranking.
Semantic search is an advanced and broad topic, and a full treatment of its text-analytics techniques would require a dedicated book. In this chapter, you will learn about some of these techniques in their simplest forms, along with references to resources you can use to explore further. This chapter covers the following topics:
Limitations of Keyword Systems
Keyword-based document ranking fundamentally depends on statistical information about the query terms. Although such systems are beneficial for many use cases, they often fail to provide valuable results if the user cannot formulate an appropriate query or provides an intent-driven query. If the user rephrases the same query multiple times while searching, it implicitly signals the need to extend the search engine with advanced text processing and semantic capabilities. Computer scientist Hans Peter Luhn describes the limitations of keyword systems:
This rather unsophisticated argument on “significance” avoids such linguistic implications as grammar and syntax. ... No attention is paid to the logical and semantic relationships the author has established.
The primary limitations of a keyword-based system can be categorized as follows:
Semantic Search
Semantic search refers to a set of techniques that interpret the intent, context, concept, meaning, and relationships between terms. The idea is to develop a system that follows a cognitive process to understand terms the way we humans do. Techopedia.com provides this definition of semantic search:
Semantic search is a data searching technique in which a search query aims to not only find keywords, but to determine the intent and contextual meaning of the words a person is using for search.
The potential of leveraging semantic capabilities in your search application is unbounded. The way you utilize them depends a lot on your domain, data, and search requirements. Google, for example, uses semantic capabilities for delivering answers and not just links. Figure 11-1 shows an example of the semantic capabilities of Google, which precisely understands the user intent and answers the query.
Figure 11-1. An example of the question-answering capability of Google
Historically, search engines were developed to cater to keyword-based queries, and because of their limitations, the engines also provided advanced search capabilities. This is acceptable in some verticals (for example, legal search), but very few users find it appealing. Users prefer a single box because of its ease of use and simplicity. Since the search box is open-ended (you can type whatever you want), users provide queries in natural language, using the linguistics of day-to-day life. Google had the advanced search option on its home page until it was hidden in 2011; it is now available in the home page settings and requires an extra click, or you can go to www.google.com/advanced_search. In an advanced search, you provide details about the query terms and the fields to which they apply. It is generally preferred by librarians, lawyers, and medical practitioners.
Figure 11-2 shows an example of an intelligent search, performed by Amazon.com for the user query white formal shirt for men. The search engine finds exactly what you want by using built-in semantic capabilities.
Figure 11-2. An example of query intent mining in Amazon.com
Semantic search techniques perform a deep analysis of text by using technologies such as artificial intelligence, natural language processing, and machine learning. In this chapter, you will learn about a few of them and how to integrate them into Solr.
Semantic capabilities can be integrated in Solr while indexing and searching. If you are dealing with news, articles, blogs, journals, or e-mails, the data will be either unstructured or semistructured. In such cases, you should extract metadata and actionable information from the stream of text. Since unstructured data has been created for human consumption, it can be difficult for machines to interpret, but by using text-processing capabilities such as natural language processing, useful information can be extracted.
Semantic processing broadly depends on the following:
Tools
This section presents some of the tools and technologies that you might want to evaluate while processing text for semantic enrichment. You can extend a Solr component to plug in the tool that suits your text-processing requirements. Before proceeding further in this section, refer to the “Text Processing” section of Chapter 3 for a refresher on these concepts.
The Apache OpenNLP project provides a set of tools for processing natural language text and performing common NLP tasks such as sentence detection, tokenization, part-of-speech tagging, and named-entity extraction, among others. OpenNLP provides a separate component for each of these tasks. The components can be used individually or combined to form a text-analytics pipeline. The library uses machine-learning techniques such as maximum entropy and the perceptron to train models and build advanced text-processing capabilities. OpenNLP distributes a set of common models that perform well for general use cases. If you want to build a custom model for your specific needs, its components provide an API for training and evaluating models.
This project is licensed under the Apache Software License and can be downloaded at https://opennlp.apache.org/. Other NLP libraries are available, such as Stanford NLP, but they are either not open source or require a GPL-like license that might not fit into the licensing requirements of many companies.
OpenNLP’s freely available models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
OpenNLP integration is not yet committed in Solr and is not available as an out-of-the-box feature. Refer to the Solr wiki at https://wiki.apache.org/solr/OpenNLP for more details. Later in this chapter, you will see examples of integrating OpenNLP with Solr.
Note Refer to JIRA https://issues.apache.org/jira/browse/LUCENE-2899 for more details on integration.
As you know, UIMA stands for Unstructured Information Management Architecture, an Apache project that allows you to develop interoperable, complex components and combine them to run together. This framework allows you to develop an analysis engine, which can be used to extract metadata and information from unstructured text.
The analysis engine allows you to develop a pipeline, which you can use to chain the annotators. Each annotator represents an independent component or feature. The annotators can consume and produce annotations, and the output of one annotator can be input to the next in the chain. The chain can be formed by using XML configuration.
UIMA’s pluggable architecture, reusable components, and configurable pipeline allow you to throw away the monolithic structure and design a multistage process in which different modules build on each other to form a powerful analysis chain. This also allows you to scale out and run the components asynchronously. You might find the framework a bit complex; it has a learning curve.
Annotators from various vendors are available for consumption and can be added to the pipeline. Vendors such as OpenCalais and AlchemyAPI provide a variety of annotators for text processing but require licenses.
Solr integration for UIMA is available as a contrib module, and Solr enrichments can be done with just a few configuration changes. Refer to the Solr official documentation at https://cwiki.apache.org/confluence/display/solr/UIMA+Integration for UIMA integration.
Apache Stanbol is an OSGi-based framework that provides a set of reusable components for reasoning and content enhancement. The additional benefit it offers is built-in CMS capabilities and provisions to persist semantic information such as entities and facts and define knowledge models.
No Solr plug-in is available, and none is required, as Stanbol internally uses Solr as its document repository. It also uses Apache OpenNLP for natural language processing and Apache Clerezza and Apache Jena as RDF and storage frameworks. Stanbol offers a GUI for managing the chains and provides additional features such as a web server and security.
You might want to evaluate this framework if you are developing a system with semantic capabilities from scratch, as it offers you the complete suite.
Techniques Applied
Semantic search has been an active area of research for quite some time and is still not a solved problem, but much progress has been made in the area. A typical example is IBM’s Watson; this intelligent system, capable of answering questions posed in natural language, won the Jeopardy! challenge in 2011. Building a semantic capability can be a fairly complex task, depending on what you want to achieve. But you can employ simple techniques to improve result quality, and sometimes a little semantics can take you a long way.
Figure 11-3 provides an overview of how semantic techniques can be combined with different knowledge bases to process input text and build the intelligence. The information gained from these knowledge bases can be used for expanding terms, indicating relationships among concepts, introducing facts, and extracting metadata from the input text. Chapter 3 provides an overview of these knowledge bases. In this section, you will see how to utilize this knowledge to perform a smart search.
Figure 11-3. Semantic techniques
Earlier in this book, you learned that Solr ranks a document by using a model such as the vector space model. This model considers a document as a bag of words. For a user query, it retrieves documents based on factors such as term frequency and inverse document frequency, but it doesn’t understand the relationships between terms. Semantic techniques such as these can be applied in Solr to retrieve more relevant results:
Figure 11-4 depicts how semantic capabilities can be applied to Solr for improving relevancy.
Figure 11-4. Application of semantic techniques in Solr
Next you will look at various natural language processing and semantic techniques that can be integrated in Solr for improving the precision of results.
Each word in a sentence can be classified into a lexical category, also called a part of speech. Common parts of speech include noun, verb, and adjective. These can be further classified—for instance, a noun can be categorized as a common noun or proper noun. This categorization and subcategorization information can be used to discover the significance of terms in context and can even be used to extract lots of interesting information about the words. I suggest that you get a fair understanding of parts of speech, as it may help you decipher the importance and purpose of words. For example, nouns identify people, places, and things (for example, shirt or Joe), and adjectives define the attributes of a noun (such as red or intelligent). Similarly, the subclass matters: a common noun describes a class of entities (such as country or animal), and a proper noun describes instances (such as America or Joe). Figure 11-5 provides sample text and its parts of speech. In Figure 11-5, the tags NNP, VBD, VBN, and IN refer to proper noun (singular), verb (past tense), verb (past participle), and preposition or subordinating conjunction, respectively.
Figure 11-5. Part-of-speech tagging
With an understanding of parts of speech, you can clearly make out that not all words are equally important. If your system can tag parts of speech, this knowledge can be used to control the document ranking based on the significance of the tokens in the context. Currently, while indexing documents, you boost either a document or a field, but ignore the fact that each of the terms may also need a different boost. In a sentence, generally the nouns and verbs are more important; you can extract those terms and index them to separate fields. This gives you a field with a smaller and more focused set of terms, which can be assigned a higher boost while querying. Solr features such as MoreLikeThis would work better on this field with more significant tokens. You can even apply a payload to them while indexing.
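For example, assuming the noun and verb terms were indexed into a separate keywords field as described, that field could be weighted higher at query time with the eDisMax parser (the field names and boost value here are illustrative):

```
q=jamaica&defType=edismax&qf=description keywords^1.4
```

A match in keywords then contributes more to the score than the same match in the full description field.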
Part-of-speech tagging can be a necessary feature and a prerequisite for many types of advanced analysis. The semantic enrichment example, which you will see later in this chapter, requires POS-tagged words, as the meaning and definition of a word may vary based on its part of speech. A POS tagger uses tags from the Penn Treebank Project to label words in sentences with their parts of speech.
In this section, you will learn to extract important parts of speech from a document being indexed and populate the extracted terms to a separate Solr field. This process requires two primary steps:
The steps to be followed for adding the desired part of speech to a separate Solr field are provided next in detail.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Load the pretrained maxent POS model (for example, en-pos-maxent.bin)
InputStream modelIn = new FileInputStream(fileName);
POSModel model = new POSModel(modelIn);
POSTaggerME tagger = new POSTaggerME(model);

// Tokenize naively on whitespace and tag each token
String[] tokens = query.split(" ");
String[] tags = tagger.tag(tokens);

// Pair each token with its tag
int i = 0;
List<PartOfSpeech> posList = new ArrayList<>();
for (String token : tokens) {
    PartOfSpeech pos = new PartOfSpeech();
    pos.setToken(token);
    pos.setPos(tags[i]);
    posList.add(pos);
    i++;
}
private String modelFile;
private String src;
private String dest;
private float boost;
private List<String> allowedPOS;
private PartOfSpeechTagger tagger;

public void init(NamedList args) {
    super.init(args);
    SolrParams param = SolrParams.toSolrParams(args);
    modelFile = param.get("modelFile");
    src = param.get("src");
    dest = param.get("dest");
    boost = param.getFloat("boost", 1.0f);
    String posStr = param.get("pos", "nnp,nn,nns");
    if (null != posStr) {
        allowedPOS = Arrays.asList(posStr.split(","));
    }
    tagger = new PartOfSpeechTagger();
    tagger.setup(modelFile);
}
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object obj = doc.getFieldValue(src);
    StringBuilder tokens = new StringBuilder();
    if (null != obj && obj instanceof String) {
        List<PartOfSpeech> posList = tagger.tag((String) obj);
        for (PartOfSpeech pos : posList) {
            if (allowedPOS.contains(pos.getPos().toLowerCase())) {
                tokens.append(pos.getToken()).append(" ");
            }
        }
        doc.addField(dest, tokens.toString(), boost);
    }
    // pass it up the chain
    super.processAdd(cmd);
}
<lib dir="dir-containing-the-jar" regex="solr-practical-approach-\d.*\.jar" />
<updateRequestProcessorChain name="nlp">
<processor class="com.apress.solr.pa.chapter11.opennlp.POSUpdateProcessorFactory">
<str name="modelFile">path-to-en-pos-maxent.bin</str>
<str name="src">description</str>
<str name="dest">keywords</str>
<str name="pos">nnp,nn,nns</str>
<float name="boost">1.4</float>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<str name="update.chain">nlp</str>
{
"id": "1201",
"description": "Bob Marley was born in Jamaica",
"keywords": "Bob Marley Jamaica "
}
Named-Entity Extraction
If your search engine needs to index unstructured content such as books, journals, or blogs, a crucial task is to extract the important information hidden in the stream of text. In this section, you will learn about different approaches to extract that information and leverage it to improve the precision and overall search experience.
Unstructured content is primarily meant for human consumption, and extracting the important entities and metadata hidden inside it requires complex processing. The entities can be generic information (for example, people, location, organization, money, or temporal information) or information specific to a domain (for example, a disease or anatomy in healthcare). The task of identifying the entities such as person, organization, and location is called named-entity recognition (NER). For example, in the text Bob Marley was born in Jamaica, NER should be able to detect Bob Marley as the person and Jamaica as the location. Figure 11-6 shows the entities extracted in this example.
Figure 11-6. Named entities extracted from content
The extracted named entities can be used in Solr in many ways, and the actual usage depends on your requirements. Some of the common uses in Solr are as follows:
The approaches for NER can be divided into three categories, detailed in the following subsections. The approach and its implementation depend on your requirements, use case, and the entity you are trying to extract. The first two approaches can be implemented by using the out-of-the-box features of Solr or any advanced NLP library. For the third approach, you will customize Solr to integrate OpenNLP and extract named entities. The customization or extraction can vary as per your needs.
Using Rules and Regex
Rules and regular expressions are the simplest approach for NER. You can define a set of rules and a regex pattern, which is matched against the incoming text for entity extraction. This approach works well for extraction of entities that follow a predefined pattern, as in the case of e-mail IDs, URLs, phone numbers, zip codes, and credit card numbers.
A simple regex can be integrated by using PatternReplaceCharFilterFactory or PatternReplaceTokenFilterFactory in the analysis chain. A regex for determining phone numbers can be as simple as this:
^[0-9+()#.\s/ext-]+$
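As a rough illustration, a pattern like the one above (with \s as the whitespace escape) can be exercised with plain java.util.regex; the sample numbers are made up:

```java
import java.util.regex.Pattern;

class PhonePattern {
    // Digits, +, (, ), #, ., whitespace, /, the letters e/x/t, and hyphen
    static final Pattern PHONE = Pattern.compile("^[0-9+()#.\\s/ext-]+$");

    static boolean looksLikePhone(String s) {
        return PHONE.matcher(s).matches();
    }
}
```

Note that a digit-only string such as 12345 also matches, which is exactly the precision limitation of pattern-based NER discussed later in this section.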
For extracting e-mail IDs, you can use UAX29URLEmailTokenizerFactory provided by Lucene. Refer to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory for details of the tokenizer. You may find it interesting that there is an official standard regex for e-mail, known as RFC 5322. Refer to http://tools.ietf.org/html/rfc5322#section-3.4 for details. It describes the syntax that valid e-mail addresses must adhere to, but it’s too complicated to implement.
If you are looking for complex regex rules, you can evaluate the Apache UIMA-provided regular expression annotator, where you can define the rule set. Refer to https://uima.apache.org/downloads/sandbox/RegexAnnotatorUserGuide/RegexAnnotatorUserGuide.html for details of the annotator. If you are using a rule engine such as Drools, you can integrate it into Solr by writing a custom update processor or filter factory.
If you want to use OpenNLP for regex-based NER, you can use RegexNameFinder and specify the pattern instead of using NameFinderME. Refer to the example in the section “Using a Trained Model”, where you can do the substitution by using RegexNameFinder.
The limitation of this NER approach is that anything unrelated that follows the specified pattern or satisfies the rule will be detected as a valid entity; for example, a poorly formatted five-digit salary could be detected as a zip code. Also, this approach is limited to entity types that follow a known pattern. It cannot be used to detect entities such as name or organization.
Using a Dictionary or Gazetteer
The dictionary-based approach, also called a gazetteer-based approach, of entity extraction maintains a list of terms for the applicable category. The input text is matched against the gazetteer for entity extraction. This approach works well for entities that are applicable to a specific domain and have a limited set of terms. Typical examples are job titles in your organization, nationalities, religions, days of the week, or months of the year.
You can build the list by extracting information from your local datasource or external source (such as Wikipedia). The data structure for maintaining the dictionary can be anything that fits your requirements. It can be as simple as a Java collection populated from a text file. An easier implementation for a file-based approach to populate the entities in a separate field can be to use Solr’s KeepWordFilterFactory. The following is a sample text analysis for it:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
The list can be maintained in database tables, but a large list is likely to impair performance. A faster and performance-efficient approach is to build an automaton. You can refer to the FST data structure in the Solr suggestions module or one provided by David Smiley at https://github.com/OpenSextant/SolrTextTagger.
If you want to look up phrases along with individual terms, the incoming text can be processed by using either the OpenNLP Chunker component or ShingleFilterFactory provided by Solr. The following is the filter definition for generating shingles of different sizes. The parameter outputUnigrams="true" is provided so that single tokens are matched as well.
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
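To make the shingle output concrete, here is a stdlib-only sketch of the word n-grams such a filter emits for maxShingleSize="3" with unigrams enabled (the real filter operates on the token stream inside the analysis chain):

```java
import java.util.ArrayList;
import java.util.List;

class Shingles {
    // All word n-grams of size 1..maxShingleSize, mimicking
    // ShingleFilterFactory with outputUnigrams="true"
    static List<String> shingles(String[] tokens, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int size = 1; size <= maxShingleSize && start + size <= tokens.length; size++) {
                if (size > 1) sb.append(' ');
                sb.append(tokens[start + size - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }
}
```

For the tokens white formal shirt this yields six terms (three unigrams, two bigrams, and the trigram), so dictionary entries of up to three words can be matched against the stream.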
The benefit of this approach is that it doesn’t require training but is less popular as it’s difficult to maintain. It cannot be used for common entities such as name or organization, as the terms can be ambiguous and not limited to a defined set of values. This approach also ignores the context. For example, this approach cannot differentiate whether the text Tommy Hilfiger refers to a person or an organization.
OpenNLP offers a better approach for dictionary-based extraction: it scans for names inside the dictionary, so you don’t have to worry about matching phrases yourself.
Here are the steps for NER in OpenNLP using a dictionary:
<dictionary case_sensitive="false">
<entry ref="director">
<token>Director</token>
</entry>
<entry ref="producer">
<token>Producer</token>
</entry>
<entry ref="music director">
<token>Music</token><token>Director</token>
</entry>
<entry ref="singer">
<token>Singer</token>
</entry>
</dictionary>
InputStream modelIn = new FileInputStream(file);
Dictionary dictionary = new Dictionary(modelIn);
Alternatively, the Dictionary object can be created using a no-arg constructor and tokens added to it as shown here:
Dictionary dictionary = new Dictionary();
dictionary.put(new StringList("Director"));
dictionary.put(new StringList("Producer"));
dictionary.put(new StringList("Music", "Director"));
dictionary.put(new StringList("Singer"));
DictionaryNameFinder dnf = new DictionaryNameFinder(dictionary, "JobTitles");
Refer to the implementation steps in the following “Using a Trained Model” section, as the rest of the steps remain the same.
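The lookup that DictionaryNameFinder performs can be approximated with a stdlib-only sketch: at each position, prefer the longest dictionary entry that matches the upcoming tokens. This is a simplified, case-insensitive illustration, not the OpenNLP implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class GazetteerFinder {
    private final Set<String> entries; // lowercase entries, possibly multi-token
    private final int maxLen;          // longest entry length, in tokens

    GazetteerFinder(Set<String> entries) {
        this.entries = entries;
        int m = 1;
        for (String e : entries) {
            m = Math.max(m, e.split(" ").length);
        }
        this.maxLen = m;
    }

    // Scan left to right, emitting the longest match at each position
    List<String> find(String[] tokens) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String matched = null;
            int matchedLen = 0;
            for (int len = Math.min(maxLen, tokens.length - i); len >= 1; len--) {
                String cand = String.join(" ", Arrays.copyOfRange(tokens, i, i + len)).toLowerCase();
                if (entries.contains(cand)) {
                    matched = cand;
                    matchedLen = len;
                    break;
                }
            }
            if (matched != null) {
                found.add(matched);
                i += matchedLen;
            } else {
                i++;
            }
        }
        return found;
    }
}
```

With the job-title entries above, the phrase Music Director matches as one entity rather than as the separate terms Music and Director.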
You can use a hybrid of a rules- and gazetteer-based approach if you feel that it can improve precision.
Using a Trained Model
The approach of using trained models for NER falls under the supervised learning category of machine learning: human intervention is required to train the model, but after the model is trained, it returns a nearly accurate result.
This approach uses a statistical model for extracting entities. It is preferred for extracting entities that are not limited to a set of values, as in the case of names or organizations, and it can find entities that are not defined or tagged in the model. The model considers the semantics and context of the text and easily resolves ambiguity between entities, such as between a person name and an organization name. These problems cannot be addressed using the earlier approaches; the trained model is the only way to go. It also doesn’t require creating large dictionaries that are difficult to maintain.
Solr Plug-in for Entity Extraction
In this section, you will learn to extract the named entities from documents being indexed and populate the extracted terms to a separate Solr field. This process requires two primary steps:
Here are the detailed steps to be followed for adding the extracted named entities to a separate field:
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

// Load the pretrained NER model (for example, en-ner-person.bin)
InputStream modelIn = new FileInputStream(fileName);
TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
NameFinderME nameFinder = new NameFinderME(model);

// Tokenize naively on whitespace and detect entity spans
String[] sentence = query.split(" ");
Span[] spans = nameFinder.find(sentence);

// Convert each span to a NamedEntity holding the matched tokens
List<NamedEntity> neList = new ArrayList<>();
for (Span span : spans) {
    NamedEntity entity = new NamedEntity();
    StringBuilder match = new StringBuilder();
    for (int i = span.getStart(); i < span.getEnd(); i++) {
        match.append(sentence[i]).append(" ");
    }
    entity.setToken(match.toString().trim());
    entity.setEntity(entityName);
    neList.add(entity);
}
// Forget adaptive data collected for the previous document
nameFinder.clearAdaptiveData();
private String modelFile;
private String src;
private String dest;
private String entity;
private float boost;
private NamedEntityTagger tagger;

public void init(NamedList args) {
    super.init(args);
    SolrParams param = SolrParams.toSolrParams(args);
    modelFile = param.get("modelFile");
    src = param.get("src");
    dest = param.get("dest");
    entity = param.get("entity", "person");
    boost = param.getFloat("boost", 1.0f);
    tagger = new NamedEntityTagger();
    tagger.setup(modelFile, entity);
}
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object obj = doc.getFieldValue(src);
    if (null != obj && obj instanceof String) {
        List<NamedEntity> neList = tagger.tag((String) obj);
        for (NamedEntity ne : neList) {
            doc.addField(dest, ne.getToken(), boost);
        }
    }
    super.processAdd(cmd);
}
<lib dir="dir-containing-the-jar" regex="solr-practical-approach-\d.*\.jar" />
<updateRequestProcessorChain name="nlp">
<processor class="com.apress.solr.pa.chapter11.opennlp.NERUpdateProcessorFactory">
<str name="modelFile">path-to-en-ner-person.bin</str>
<str name="src">description</str>
<str name="dest">ext_person</str>
<str name="entity">person</str>
<float name="boost">1.8</float>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<str name="update.chain">nlp</str>
{
"id": "1201",
"description": "Bob Marley and Ricky were born in Jamaica",
"ext_person": [
"Bob Marley ",
"Ricky "
]
}
This source code extracts only one type of entity. In real-life scenarios, you may want to extract multiple entities and populate them to different fields. You can extend this update request processor to accommodate the required capabilities by loading multiple models and adding the extracted terms to separate fields.
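Rather than extending the class, one simple alternative, assuming one model per entity type as OpenNLP distributes them, is to declare a second instance of the same processor in the chain with different settings (en-ner-location.bin follows OpenNLP’s published model naming):

```xml
<processor class="com.apress.solr.pa.chapter11.opennlp.NERUpdateProcessorFactory">
  <str name="modelFile">path-to-en-ner-location.bin</str>
  <str name="src">description</str>
  <str name="dest">ext_location</str>
  <str name="entity">location</str>
  <float name="boost">1.8</float>
</processor>
```

Each instance loads its own model and writes to its own destination field, at the cost of tokenizing the source field once per processor.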
OpenNLP uses a separate model for each entity type. In contrast, Stanford NLP, another NLP package, uses the same model for all entities.
This approach of entity extraction is supposed to return accurate results, but the output also depends on tagging quality and the data on which the model is trained (remember, garbage in, garbage out). Also, the model-based approach can be costly in terms of memory requirements and processing speed.
Semantic Enrichment
In Chapter 4, you learned to use SynonymFilterFactory to generate synonyms for expanding the tokens. The primary limitation of this approach is that it doesn’t consider semantics. A word can be polysemous (have multiple meanings), and synonyms can vary based on part of speech or context. For example, a typical generic synonyms.txt file can specify the word large as a synonym for big, and this would expand the query big brother as large brother, which is semantically incorrect.
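The failure described here can be reproduced with a stdlib sketch of flat, context-free expansion (the mapping is illustrative, and replacement is used instead of Solr’s multi-token expansion for brevity):

```java
import java.util.Map;

class NaiveSynonyms {
    // A flat, context-free mapping, as a generic synonyms file would encode it
    static final Map<String, String> SYNONYMS = Map.of("big", "large");

    static String expand(String query) {
        StringBuilder sb = new StringBuilder();
        for (String token : query.split(" ")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(SYNONYMS.getOrDefault(token, token));
        }
        return sb.toString();
    }
}
```

Here big brother becomes large brother because the mapping has no notion of part of speech or context; controlled vocabularies such as WordNet supply exactly those signals.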
Instead of using a text file that defines the list of terms and their synonyms, more sophisticated approaches can be applied using controlled vocabularies such as WordNet or Medical Subject Headings (MeSH), which require no manual definition or handcrafting. Synonym expansion is just one part of it. The vocabularies and thesauri also contain other useful information, such as hypernyms, hyponyms, and meronyms, which you will learn about shortly. This information can be used for understanding semantic relationships and expanding the query further.
These thesauri are generally maintained by a community or an enterprise that keeps updating the corpus with the latest words. The vocabulary can be generic or it can be specific to a particular domain. WordNet is an example of a generic thesaurus that can be incorporated into any search engine and for any domain. MeSH is a vocabulary applicable to a medical domain.
You can also perform semantic enrichment by building taxonomies and ontologies containing the concept tree applicable to your domain. The query terms can be matched against a taxonomy to extract broader, narrower, or related concepts (such as altLabel or prefLabel) and perform the required enrichment. You can get some of these taxonomies over the Web, or you can have a taxonomist define one for you. You can also consume resources such as DBpedia, which provides structured information extracted from Wikipedia as RDF triples. Wikidata is another linked database, containing structured data from Wikimedia projects, including Wikipedia.
These knowledge bases can be integrated into Solr in various ways. Here are the ways to plug the desired enrichment into Solr:
Figure 11-7. Concept tree
In this section, you will learn how to automatically expand terms to include their synonyms. This process requires implementing the following two tasks:
Before performing these tasks, you’ll learn about the basics of WordNet and the information it offers, and then get hands-on with synonymy expansion.
WordNet is a large lexical database of English. It groups nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms called synsets, each expressing a distinct concept. Synsets are interlinked by means of conceptual, semantic, and lexical relationships. Its structure makes it a useful tool for computational linguistics and natural language processing. It contains 155,287 words, organized in 117,659 synsets, for a total of 206,941 word-sense pairs.
WordNet is released under a BSD-style license and is freely available for download from its web site, https://wordnet.princeton.edu/. You can also evaluate the online version of the thesaurus at http://wordnetweb.princeton.edu/perl/webwn.
The main relationship among words in WordNet is synonymy: each synset groups words that denote the same concept and are interchangeable in many contexts. Apart from synonyms, the thesaurus also records other relationships, such as hypernyms (more general concepts), hyponyms (more specific concepts), meronyms (part-whole relations), and antonyms, along with a short gloss for each synset.
Note that WordNet requires the part of speech to be supplied along with the word when looking up a term.
A handful of Java libraries are available for accessing WordNet, each with its own pros and cons. You can refer to http://projects.csail.mit.edu/jwi/download.php?f=finlayson.2014.procgwc.7.x.pdf for a paper that compares the features and performance of the primary libraries.
Solr Plug-in for Synonym Expansion
This section provides the steps for developing a simple mechanism for expanding terms. The following two steps are needed to implement the feature:
Synonym Expansion Using WordNet
To extract synonyms from WordNet, you have two prerequisites:
The steps provided are in their simplest form; you may need optimizations to make the code production ready. Also, the program extracts only synonyms; you can extend it to extract the other related information discussed previously. Another thing to note while extracting related terms from a generic thesaurus such as WordNet is that you may want to perform disambiguation to identify the appropriate synset for the term being expanded; words such as bank, for example, are polysemous and can mean river bank or banking institution, depending on the context. If you are using a domain-specific vocabulary, disambiguation is less important. (Covering disambiguation is beyond the scope of this book.)
Here are the steps for synonym expansion using WordNet:
<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="3.0" language="en"/>
  <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">
    <param name="dictionary_element_factory"
           value="net.didion.jwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/>
    <param name="file_manager"
           value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
      <param name="file_type"
             value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
      <param name="dictionary_path" value="/path/to/WordNet/dictionary"/>
    </param>
  </dictionary>
  <resource class="PrincetonResource"/>
</jwnl_properties>
JWNL.initialize(new FileInputStream(propFile));
Dictionary dictionary = Dictionary.getInstance();
POS pos = null;
switch (posStr) {
    case "VB":
    case "VBD":
    case "VBG":
    case "VBN":
    case "VBP":
    case "VBZ":
        pos = POS.VERB;
        break;
    case "RB":
    case "RBR":
    case "RBS":
        pos = POS.ADVERB;
        break;
    case "JJS":
    case "JJR":
    case "JJ":
        pos = POS.ADJECTIVE;
        break;
    case "NN":
    case "NNS":
    case "NNP":
    case "NNPS":
        pos = POS.NOUN;
        break;
}
IndexWord word = dictionary.getIndexWord(pos, term);
Set<String> synonyms = new HashSet<>();
if (word != null) {
    Synset[] synsets = word.getSenses();
    for (Synset synset : synsets) {
        Word[] words = synset.getWords();
        for (Word w : words) {
            // getLemma() returns the base form; multiword lemmas use underscores
            String synonym = w.getLemma().replace("_", " ");
            synonyms.add(synonym);
        }
    }
}
Custom Token Filter for Synonym Expansion
In the previous section, you learned to extract synonyms from WordNet. Now the extracted synonyms have to be added to the terms being indexed. In this section, you will learn to write a custom token filter that can be added to a field’s text-analysis chain to enrich the tokens with their synonyms.
Lucene defines the classes for token filters in the org.apache.lucene.analysis package. To write your custom token filter, you will need to extend the following two Lucene classes:
The following are the steps to be followed for writing a custom token filter and plugging it into a field’s text-analysis chain:
public class CVSynonymFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {
}
@Override
public TokenStream create(TokenStream input) {
return new CVSynonymFilter(input, dictionary, tagger, maxExpansion);
}
public CVSynonymFilterFactory(Map<String, String> args) {
    super(args);
    maxExpansion = getInt(args, "maxExpansion", 3);
    propFile = require(args, "wordnetFile");
    modelFile = require(args, "posModel");
}
@Override
public void inform(ResourceLoader loader) throws IOException {
// initialize for wordnet
try {
    JWNL.initialize(new FileInputStream(propFile));
    dictionary = Dictionary.getInstance();
} catch (JWNLException ex) {
    logger.error(ex.getMessage());
    throw new IOException(ex);
}
// initialize the part-of-speech tagger
tagger = new PartOfSpeechTagger();
tagger.setup(modelFile);
}
private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
private final PositionIncrementAttribute posIncrAttr
= addAttribute(PositionIncrementAttribute.class);
private final PositionLengthAttribute posLenAttr
= addAttribute(PositionLengthAttribute.class);
private final TypeAttribute typeAttr = addAttribute(TypeAttribute.class);
private final OffsetAttribute offsetAttr = addAttribute(OffsetAttribute.class);
public CVSynonymFilter(TokenStream input, Dictionary dictionary,
        PartOfSpeechTagger tagger, int maxExpansion) {
    super(input);
    this.maxExpansion = maxExpansion;
    this.tagger = tagger;
    this.vocabulary = new WordnetVocabulary(dictionary);
    if (null == tagger || null == vocabulary) {
        throw new IllegalArgumentException("tagger and vocabulary must be non-null");
    }
    pendingTokens = new ArrayList<>();
    finished = false;
    startOffset = 0;
    endOffset = 0;
    posIncr = 1;
}
@Override
public boolean incrementToken() throws IOException {
    while (!finished) {
        // play back any pending synonym tokens first
        while (pendingTokens.size() > 0) {
            String nextToken = pendingTokens.remove(0);
            termAttr.copyBuffer(nextToken.toCharArray(), 0, nextToken.length());
            offsetAttr.setOffset(startOffset, endOffset);
            posIncrAttr.setPositionIncrement(posIncr);
            posIncr = 0;
            return true;
        }
        // extract synonyms for the next input token
        if (input.incrementToken()) {
            String token = termAttr.toString();
            startOffset = offsetAttr.startOffset();
            endOffset = offsetAttr.endOffset();
            addOutputSynonyms(token);
        } else {
            finished = true;
        }
    }
    // the input stream is exhausted and nothing is pending
    return false;
}
private void addOutputSynonyms(String token) throws IOException {
    pendingTokens.add(token);
    List<PartOfSpeech> posList = tagger.tag(token);
    if (null == posList || posList.size() < 1) {
        return;
    }
    Set<String> synonyms = vocabulary.getSynonyms(token,
            posList.get(0).getPos(), maxExpansion);
    if (null == synonyms) {
        return;
    }
    for (String syn : synonyms) {
        pendingTokens.add(syn);
    }
}
@Override
public void reset() throws IOException {
    super.reset();
    finished = false;
    pendingTokens.clear();
    startOffset = 0;
    endOffset = 0;
    posIncr = 1;
}
The processing provided here tags the part of speech for each token instead of the full sentence, for simplicity. Also, the implementation is suitable for tokens and synonyms of a single word. If you are thinking of getting this feature to production, I suggest you refer to SynonymFilter.java in the org.apache.lucene.analysis.synonym package and extend it using the approach provided there.
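The buffering pattern at the heart of the filter can be illustrated independently of Lucene. The following minimal sketch (the class name, the in-memory vocabulary map, and the token pairs are all illustrative stand-ins for the WordNet-backed implementation) queues each input token together with its synonyms and emits them one by one, giving the original token a position increment of 1 and each synonym an increment of 0 so that synonyms are stacked at the same position in the index:

```java
import java.util.*;

// Lucene-free sketch of the token-playback pattern used by the synonym
// filter: every input token is queued together with its synonyms, and
// queued tokens are emitted in order with their position increments.
class SynonymExpander {
    private final Map<String, Set<String>> vocabulary; // stand-in for WordNet

    SynonymExpander(Map<String, Set<String>> vocabulary) {
        this.vocabulary = vocabulary;
    }

    // Returns (token, positionIncrement) pairs in emission order.
    List<Map.Entry<String, Integer>> expand(List<String> tokens) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        for (String token : tokens) {
            pending.add(token);                               // original token first
            pending.addAll(vocabulary.getOrDefault(token, Collections.emptySet()));
            int posIncr = 1;          // the original token advances the position
            while (!pending.isEmpty()) {
                out.add(Map.entry(pending.remove(), posIncr));
                posIncr = 0;          // synonyms stack at the same position
            }
        }
        return out;
    }
}
```

Because the synonyms carry a position increment of 0, a phrase query that matches the original token will also match any of its stacked synonyms, which is exactly the behavior the Lucene filter produces.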
<lib dir="./lib" />
<fieldType name="text_semantic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="com.apress.solr.pa.chapter11.enrichment.CVSynonymFilterFactory"
            maxExpansion="3" wordnetFile="path-of-jwnl-properties.xml"
            posModel="path-to-en-pos-maxent.bin"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
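Once the field type is registered, map a field to it in the schema so that its query-time analysis performs the synonym expansion; the field name description here is just an example:

```xml
<field name="description" type="text_semantic" indexed="true" stored="true"/>
```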
Summary
In this chapter, you learned about the semantic aspects of search engines. You saw the limitations of the keyword-based engines and learned ways in which semantic search enhances the user experience and findability of documents. Semantic search is an advanced and broad topic. Given the limited scope of this book, we focused on simple natural language processing techniques for identifying important words in a sentence and approaches for extracting metadata from unstructured text. You also learned about a basic semantic enrichment technique for discovering documents that were totally ignored earlier but could be of great interest to the user. To put it all together, this chapter provided sample source code for integrating these features in Solr.
Here you’ve come to the end of this book. I sincerely hope that its content is useful in your endeavor of developing a practical search engine and that it contributes to your knowledge of Apache Solr.