One of the essential parts of the solrconfig.xml file is the definition of an update handler for posting documents:
<config>
  <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
  <directoryFactory name='DirectoryFactory' class='solr.MMapDirectoryFactory' />
  <requestHandler name='standard' class='solr.StandardRequestHandler' default='true' />
  <requestHandler name='/update' class='solr.UpdateRequestHandler' />
  <requestHandler name='/admin/' class='org.apache.solr.handler.admin.AdminHandlers' />
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>
</config>
We need the /update handler to receive the posted documents and index them. This particular solrconfig.xml file, as you can see, defines only the default handler for accepting queries and exposes the simplest update handler. Using this handler in conjunction with our schemaless approach lets us index basically every new field posted to Solr.
The best way to test whether a core is working correctly is to post data to it and verify the results of simple queries. We can start by posting some dummy data to our new index:
>> curl -X POST 'http://localhost:8983/solr/pdfs/update?commit=true&wt=json' -H 'Content-Type: application/xml' -d '<add><doc><field name="title">Dummy Test Document</field><field name="text">Hello World</field></doc></add>'
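When the field values come from real data rather than a hand-typed command, it can be safer to build the update message programmatically so that quoting and special characters are handled for us. The following is only a minimal sketch using the Python standard library; the helper names build_add_message and build_update_url are hypothetical, and the host, port, and core name are simply the ones used in the command above. It prepares the request but does not send it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_add_message(fields):
    """Build a Solr <add> XML update message from a dict of field name -> value.

    Hypothetical helper: ElementTree escapes characters such as & and <
    in the values, which a hand-written string would not.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


def build_update_url(base_url, core):
    """Build the update URL with the same parameters as the curl example."""
    params = urlencode({"commit": "true", "wt": "json"})
    return f"{base_url}/{core}/update?{params}"


if __name__ == "__main__":
    # Assumed values, copied from the curl command in the text.
    url = build_update_url("http://localhost:8983/solr", "pdfs")
    body = build_add_message({"title": "Dummy Test Document", "text": "Hello World"})
    print(url)
    print(body)
```

The resulting body can be sent with any HTTP client as a POST with the Content-Type header set to application/xml, exactly as in the curl command.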
Now we can easily see if the data has been correctly indexed:
>> curl -X GET 'http://localhost:8983/solr/pdfs/select?q=*:*&wt=json&indent=true'
Once we have added at least two or three documents (repeating this process and changing the example values), we can search for the documents containing the term hello. How do we search for it? Remember that the parameter q=*:* matches every term in every field, and therefore every document; a basic search for hello can be written as q=fullText:hello or even as q=fullText:"hello", where the double quotes are used to query for an exact sequence of words (we will see this later with phrase search) but are unnecessary here:
>> curl -X GET 'http://localhost:8983/solr/pdfs/select?q=fullText:hello&wt=json&indent=true'
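As a rough illustration of how the term query and its quoted form differ only in the value of the q parameter, here is a small sketch that builds both URLs with the Python standard library. The function name build_select_url is hypothetical, the base URL and core name are taken from the command above, and no escaping of Lucene query syntax characters is attempted.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, field, text, phrase=False):
    """Build a /select URL for a single-field query.

    Hypothetical helper: with phrase=True the text is wrapped in double
    quotes, which asks for an exact sequence of words.
    """
    value = f'"{text}"' if phrase else text
    params = urlencode({"q": f"{field}:{value}", "wt": "json", "indent": "true"})
    return f"{base_url}/{core}/select?{params}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed local Solr instance
    print(build_select_url(base, "pdfs", "fullText", "hello"))
    print(build_select_url(base, "pdfs", "fullText", "hello world", phrase=True))
```

Percent-encoding the colon and the quotes is harmless, since Solr decodes the parameter before parsing the query.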
If we had simply written q=hello, the defaultSearchField would be used. Note that searching for the term hello will not work at the moment, since we have adopted a string type for the fields, which indexes each textual value as a single term. For this example to work correctly, we need to split the text into separate terms (to find Hello as a single term) and ignore case (to recognize Hello as a match for hello). We will add those capabilities in the next section.
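To make the problem concrete, the behavior of a string field can be imitated in a few lines of Python. This is only a toy model of exact term matching, not Solr code, and the function names are hypothetical.

```python
def index_as_string(value):
    """Imitate a string field: the whole value is stored as one term."""
    return {value}


def term_matches(indexed_terms, query_term):
    """A term query matches only if the exact term was indexed."""
    return query_term in indexed_terms


if __name__ == "__main__":
    terms = index_as_string("Hello World")
    print(term_matches(terms, "hello"))        # False: wrong case, and not the whole value
    print(term_matches(terms, "Hello"))        # False: only part of the stored term
    print(term_matches(terms, "Hello World"))  # True: identical to the stored term
```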
In this section we will introduce a simple analyzer configuration for a fulltext field.
We can now update our schema.xml file, defining a new field type that we will call text. Internally, it uses two very common elements for field types, a tokenizer and a filter:
<types>
  ...
  <fieldtype name='text' class='solr.TextField'>
    <analyzer>
      <tokenizer class='solr.WhitespaceTokenizerFactory' />
      <filter class='solr.LowerCaseFilterFactory' />
    </analyzer>
  </fieldtype>
  …
</types>
The WhitespaceTokenizerFactory provides one of the most common tokenizers. A tokenizer is a component that splits an entire text fragment (a string) into several terms. This particular tokenizer splits on whitespace; for example, the string Hello World will be split into the terms Hello and World.
The LowerCaseFilterFactory is a filter that transforms the case of all the terms to lowercase.
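The effect of chaining the two components can be sketched in plain Python. This is a simplified imitation of the analysis chain, not the Lucene implementation, and the function names are hypothetical; it assumes the same analysis is applied to the query term as to the indexed text.

```python
def whitespace_tokenize(text):
    """Imitate the whitespace tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Imitate the lowercase filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the tokenizer first, then the filter, as in the analyzer definition."""
    return lowercase_filter(whitespace_tokenize(text))


def matches(indexed_text, query_term):
    """A term query matches when the analyzed query term is among the indexed terms."""
    indexed_terms = set(analyze(indexed_text))
    return all(term in indexed_terms for term in analyze(query_term))


if __name__ == "__main__":
    print(analyze("Hello World"))           # ['hello', 'world']
    print(matches("Hello World", "hello"))  # True
    print(matches("Hello World", "HELLO"))  # True
```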
By combining these two simple components, we can perform better searches. The best place to use this new field type in our example is probably the fullText field itself:
<field name='fullText' type='text' multiValued='true' />
Now we should be able to search for the simple term hello by running the earlier query on the fullText field again. With this simple addition, we can now run queries that match a single term.