We will rewrite the configuration in order to make it adaptable to real-world text, introducing stop words and a common tokenization of words:
In the `<types>` section:

```xml
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
<fieldType name="url_text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

And in the `<fields>` section:

```xml
<field name="artist" type="url_text" indexed="true" stored="true" multiValued="false" />
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false" />
<field name="museum" type="url_text" indexed="true" stored="true" multiValued="false" />
<field name="city" type="url_text" indexed="true" stored="true" multiValued="false" />
<field name="year" type="string" indexed="true" stored="true" multiValued="false" />
<field name="abstract" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="wikipedia_link" type="url_text" indexed="true" stored="true" multiValued="true" />
```
We have basically introduced a filter that excludes certain recurring words, to improve searches on the text_general field, and a new url_text type designed specifically for handling URL strings.
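To make the effect of this analyzer chain concrete, here is a rough sketch (not actual Solr output) of how the text_general type would process a title, assuming "the" is listed in stopwords.txt:

```
"The Mona Lisa"
  -> StandardTokenizer:  [The] [Mona] [Lisa]
  -> StopFilter:         [Mona] [Lisa]        (drops "The"; ignoreCase="true")
  -> LowerCaseFilter:    [mona] [lisa]
```

A query for the term lisa will therefore match this document, regardless of the original capitalization.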
The string type is intended to represent a single textual token or term that we don't want to split into smaller terms, so we use it only for the year and uri fields. We didn't use the date format provided by Solr because it is intended for range/period queries over dates and requires a specific format; we will see these kinds of queries later. On analyzing our dates, we found that in some cases the year field contains values that describe an uncertain period, such as 1502-1509 or ~1500, so we have to use the string type for these values.
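As a sketch of this trade-off: a date or numeric type would give us range queries, but only if every value parsed cleanly. Since ours don't, the field stays a string, which indexes each value as a single unsplit token:

```xml
<!-- Values such as "1502-1509" or "~1500" would not parse as dates or
     numbers, so the field is kept as a plain string (one exact token): -->
<field name="year" type="string" indexed="true" stored="true" multiValued="false" />
```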
On the other hand, for the fields containing normal text, we used the text_general type that we had defined and analyzed for the first version of the schema. We also introduced StopFilterFactory into the analyzer chain. This token filter is designed to exclude from the token list terms that are not interesting for a search. Typical examples are articles such as the, or offensive words. In order to intercept and ignore these terms, we list them, one per line, in a dedicated text file called stopwords.txt. Ignoring case gives us more flexibility, and enablePositionIncrements=true lets us skip these terms while keeping track of their position relative to the surrounding words. This is useful for queries such as author of Mona Lisa, which we will see when we talk about phrase queries.
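A stopwords.txt file is simply a list of terms, one per line; lines starting with # are treated as comments. The entries below are illustrative, not a recommended list:

```
# stopwords.txt - one term per line
a
an
the
of
```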
Lastly, several values in our data are URIs, but we need to treat them as searchable values. Think about, for example, the museum field of our first example entity, http://dbpedia.org/resource/Louvre. The url_text type can handle this because we defined an analyzer that first normalizes all the accented characters (for example, in French terms) using a particular character filter called MappingCharFilter. It's important to provide queries that are robust enough to find terms typed with or without the right accents, especially when dealing with foreign languages, so the normalization process replaces each accented letter with the corresponding unaccented one. This filter reads its explicit character substitutions from a text file, mapping-ISOLatin1Accent.txt, which should be written with UTF-8 encoding to avoid problems.
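The mapping file uses a simple "source" => "target" syntax, one substitution per line. A few illustrative entries:

```
# mapping-ISOLatin1Accent.txt - sample substitutions
"À" => "A"
"à" => "a"
"é" => "e"
"è" => "e"
```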
The field type's analyzer uses a WhitespaceTokenizer (we could probably have used a KeywordTokenizer here, obtaining the same result), and then two token filters: WordDelimiterFilterFactory, which splits a URI into its parts, and the usual LowerCaseFilterFactory for handling terms while ignoring case. The WordDelimiterFilterFactory filter is used to index every part of the uri, since the filter splits on delimiter characters such as / into multiple parts, and we decided not to concatenate them but to generate a new token for every part.
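As a rough sketch (not actual Solr output) of what this chain produces for the museum value:

```
http://dbpedia.org/resource/Louvre
  -> WhitespaceTokenizer:   [http://dbpedia.org/resource/Louvre]
  -> WordDelimiterFilter:   [http] [dbpedia] [org] [resource] [Louvre]
  -> LowerCaseFilter:       [http] [dbpedia] [org] [resource] [louvre]
```

This is why a plain search for a term such as louvre can match a document whose museum field holds the full URI.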
A combination of the true and false values for the attributes of a field typically has an impact on different functionalities. The following table suggests common values to adopt for a certain functionality to work properly:
Use Case | Indexed | Stored | Multi Valued
---|---|---|---
searching within field | true | |
retrieving contents | | true |
using as unique key | true | | false
adding multiple values and maintaining order | | | true
sorting on field | true | | false
highlighting | true | true |
faceting | true | |
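For instance, following the table, a field meant to serve as the unique key (such as our uri field) must be indexed and single-valued. A minimal sketch of the relevant schema.xml lines, assuming uri is declared as the key:

```xml
<field name="uri" type="string" indexed="true" stored="true" multiValued="false" />
<!-- elsewhere in schema.xml: -->
<uniqueKey>uri</uniqueKey>
```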
In the table you will also find some of the features that we will see in later chapters; this is just to give you an idea in advance of how to manage the predefined values for fields that are designed to be used in specific contexts.
For an exhaustive list of the possible configurations, you can read the following page from the original wiki: http://wiki.apache.org/solr/FieldOptionsByUseCase.
Using the command seen before, we can add a test document:
>> curl -X POST 'http://localhost:8983/solr/paintings/update?commit=true&wt=json' -H 'Content-type:application/json' -d ' [ { "uri" : "http://dbpedia.org/resource/Mona_Lisa", "title" : "Mona Lisa", "artist" : "http://dbpedia.org/resource/Leonardo_Da_Vinci", "museum" : "http://dbpedia.org/resource/Louvre" } ]'
Then, we would like to search for something using the term lisa
and be able to retrieve the Mona Lisa
document:
>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=title:lisa&wt=json&indent=true'
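Assuming the document above was indexed and committed, the JSON response should look roughly like the following (abridged; the exact fields returned depend on the schema):

```json
{
  "responseHeader": { "status": 0 },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "uri": "http://dbpedia.org/resource/Mona_Lisa",
        "title": "Mona Lisa"
      }
    ]
  }
}
```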
Now that we have a working schema, it's finally time to collect the data we need for our examples.