Time for action – refactoring the schema.xml file for the paintings core by introducing tokenization and stop words

We will rewrite the configuration to make it adaptable to real-world text, introducing stop words and a standard tokenization of words:

  1. Starting from a copy of the schema designed before, we added two new field types in the <types> section:
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
    
    <fieldType name="url_text" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory"generateWordParts="1" generateNumberParts="1" catenateWords="0"catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
  2. We can then simply add some fields to the default ones, using the new field types we have just defined:
    <field name="artist" type="url_text" indexed="true" stored="true" multiValued="false" />
    <field name="title" type="text_general" indexed="true" stored="true" multiValued="false" />
    <field name="museum" type="url_text" indexed="true"stored="true" multiValued="false" />
    <field name="city" type="url_text" indexed="true" stored="true" multiValued="false" />
    <field name="year" type="string" indexed="true" stored="true" multiValued="false" />
    <field name="abstract" type="text_general" indexed="true" stored="true" multiValued="true" />
    <field name="wikipedia_link" type="url_text" indexed="true" stored="true" multiValued="true" />

We have basically introduced a filter that excludes certain recurring words in order to improve searches on the text_general field, and a new url_text type designed specifically for handling URL strings.

What just happened?

The string type is intended to represent a single textual token or term that we don't want to split into smaller terms, so we use it only for the year and uri fields. We didn't use the date format provided by Solr because it is intended for range/period queries over dates and requires a specific format; we will see these kinds of queries later. On analyzing our dates, we found that in some cases the year field contains values that describe an uncertain period, such as 1502-1509 or ~1500, so we have to use the string type for this field.

On the other hand, for the fields containing normal text we used the text_general type that we had defined and analyzed for the first version of the schema, this time introducing StopFilterFactory into the analyzer chain. This token filter is designed to exclude from the token list terms that are not interesting for a search; typical examples are articles such as the, or offensive words. In order to intercept and ignore these terms, we list them, one per line, in a dedicated text file called stopwords.txt. Ignoring case gives the filter more flexibility, and enablePositionIncrements="true" makes it keep track of the positions that removed terms occupied between the other words. This is useful for performing queries such as author of Mona Lisa, which we will see when we talk about phrase queries.
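To give a concrete idea of the file format, a stopwords.txt file simply lists one term per line; the following entries are only an illustration (the list shipped with the Solr examples is longer):

    a
    an
    and
    of
    the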

Lastly, several values in our data are URIs, but we need to treat them as searchable values. Think about, for example, the museum field of our first example entity, http://dbpedia.org/resource/Louvre. The url_text type can handle this because we defined an analyzer that first normalizes all the accented characters (for example, in French terms) using a particular character filter called MappingCharFilter. It's important to provide queries that are robust enough to find terms written with or without the right accents, especially when dealing with foreign languages, so the normalization process replaces every accented letter with the corresponding letter without the accent. This filter reads its explicit character substitutions from the mapping-ISOLatin1Accent.txt text file, which should be written with UTF-8 encoding to avoid problems.
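For reference, the entries in this mapping file use a simple "source" => "target" syntax; a few illustrative lines (the file bundled with the Solr examples is much more complete) could be:

    "À" => "A"
    "É" => "E"
    "é" => "e"
    "ç" => "c"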

The field type's analyzer uses a WhitespaceTokenizer (we probably could have used a KeywordTokenizer here, obtaining the same result), and then two token filters: WordDelimiterFilterFactory, which splits a URI into its parts, and the usual LowerCaseFilterFactory for handling terms while ignoring case. The WordDelimiterFilterFactory filter is used to index every part of the URI, since it splits on characters such as / into multiple parts, and we decided not to concatenate the parts but to generate a new token for every one of them.
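To make this concrete, the following is a rough sketch (not actual Solr output) of how the museum value is transformed along the url_text analysis chain, and why a query on the simple term louvre can match the full URI:

    input:                 http://dbpedia.org/resource/Louvre
    WhitespaceTokenizer:   [http://dbpedia.org/resource/Louvre]
    WordDelimiterFilter:   [http] [dbpedia] [org] [resource] [Louvre]
    LowerCaseFilter:       [http] [dbpedia] [org] [resource] [louvre]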

Using common field attributes for different use cases

A combination of the true and false values for the attributes of a field typically has an impact on different functionalities. The following table suggests common values to adopt for a certain functionality to work properly:

Use Case                                        Indexed   Stored   Multi Valued
searching within field                          true
retrieving contents                                       true
using as unique key                             true               false
adding multiple values and maintaining order                       true
sorting on field                                true               false
highlighting                                    true      true
faceting                                        true

In the table you will also find some of the features that we will see in later chapters; it is included just to give you an idea in advance of how to manage the predefined values for fields that are designed to be used in specific contexts.
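For instance, putting the table into practice, a field meant for faceting or sorting should be indexed, single-valued, and not necessarily stored. A hypothetical sketch (the museum_facet field and the copyField directive are our own illustration, not part of the schema built in this chapter) could be:

    <field name="museum_facet" type="string" indexed="true" stored="false" multiValued="false" />
    <copyField source="museum" dest="museum_facet" />

Here the copyField directive fills the facet-oriented field automatically at indexing time, so the original museum field can keep its url_text analysis for searching.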

For an exhaustive list of the possible configurations, you can read the following page from the original wiki: http://wiki.apache.org/solr/FieldOptionsByUseCase.

Testing the paintings schema

Using the command seen before, we can add a test document:

>> curl -X POST 'http://localhost:8983/solr/paintings/update?commit=true&wt=json' -H 'Content-type:application/json' -d '
[
  {
    "uri" : "http://dbpedia.org/resource/Mona_Lisa",
    "title" : "Mona Lisa",
    "artist" : "http://dbpedia.org/resource/Leonardo_Da_Vinci",
    "museum" : "http://dbpedia.org/resource/Louvre"
  }
]'

Then, we would like to search using a simple term, such as Louvre in the museum field (even though that field actually contains a full URI), and be able to retrieve the Mona Lisa document:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=museum:Louvre&wt=json&indent=true' -H 'Content-type:application/json' 
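If the document has been indexed and the url_text analysis works as expected, the query should find it; an abridged sketch of the JSON output (the header details and field order may differ in your installation) looks like the following:

    {
      "responseHeader": { "status": 0 },
      "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
          {
            "uri": "http://dbpedia.org/resource/Mona_Lisa",
            "title": "Mona Lisa",
            ...
          }
        ]
      }
    }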

Now that we have a working schema, it's finally time to collect the data we need for our examples.
