Field guessing

If you open solrconfig.xml for the schemaless project you created, you will find a section that configures an updateRequestProcessorChain. Its processors automatically apply operations to documents before they are indexed, and this is what makes field guessing work. Let's look at some snippets from solrconfig.xml:

<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>

The preceding processor removes blank (zero-length) field values from a document before it is indexed.
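
To make the effect concrete, here is a hypothetical document in Solr's XML update format. The field names and values are made up for illustration and do not come from the project:

<add>
    <doc>
        <field name="id">book-1</field>
        <field name="title">Solr in Practice</field>
        <!-- Empty value: remove-blank drops it, so no subtitle field is indexed -->
        <field name="subtitle"></field>
    </doc>
</add>

With remove-blank in the chain, the document should be indexed with only the id and title fields, and the empty subtitle never reaches the later processors.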

<updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/>
 <updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
 <updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
 <updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
     <arr name="format">
         <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
         <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
         <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
         <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
         <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
         <str>yyyy-MM-dd'T'HH:mm:ss</str>
         <str>yyyy-MM-dd'T'HH:mmZ</str>
         <str>yyyy-MM-dd'T'HH:mm</str>
         <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
         <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
         <str>yyyy-MM-dd HH:mm:ss.SSS</str>
         <str>yyyy-MM-dd HH:mm:ss,SSS</str>
         <str>yyyy-MM-dd HH:mm:ssZ</str>
         <str>yyyy-MM-dd HH:mm:ss</str>
         <str>yyyy-MM-dd HH:mmZ</str>
         <str>yyyy-MM-dd HH:mm</str>
         <str>yyyy-MM-dd</str>
     </arr>
 </updateProcessor>

Here, several update request processors are declared, each of which tries to parse string values into a specific type: Boolean, Long, Double, or Date. For dates, you can also list the patterns that the processor should recognize.
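
As a rough illustration of what these parsers do, consider the following hypothetical document. The field names and values are assumptions made up for this sketch, and it assumes that none of these fields already exist in the schema:

<add>
    <doc>
        <!-- parse-boolean turns "true" into a Boolean -->
        <field name="in_stock">true</field>
        <!-- parse-long turns "320" into a Long -->
        <field name="pages">320</field>
        <!-- parse-double turns "24.95" into a Double -->
        <field name="price">24.95</field>
        <!-- parse-date matches the yyyy-MM-dd HH:mm:ss pattern and produces a Date -->
        <field name="published">2018-01-15 10:30:00</field>
        <!-- No parser matches, so the value stays a String -->
        <field name="title">Solr in Practice</field>
    </doc>
</add>

The parsers run in the order in which the chain lists them, so a value is first tested as a Boolean, then as a Long, then as a Double, and finally against each date pattern in turn.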

<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
     <lst name="typeMapping">
         <str name="valueClass">java.lang.String</str>
         <str name="fieldType">text_general</str>
         <lst name="copyField">
             <str name="dest">*_str</str>
             <int name="maxChars">256</int>
         </lst>
         <!-- Use as default mapping instead of defaultFieldType -->
         <bool name="default">true</bool>
     </lst>
     <lst name="typeMapping">
         <str name="valueClass">java.lang.Boolean</str>
         <str name="fieldType">booleans</str>
     </lst>
     <lst name="typeMapping">
         <str name="valueClass">java.util.Date</str>
         <str name="fieldType">pdates</str>
     </lst>
     <lst name="typeMapping">
         <str name="valueClass">java.lang.Long</str>
         <str name="valueClass">java.lang.Integer</str>
         <str name="fieldType">plongs</str>
     </lst>
     <lst name="typeMapping">
         <str name="valueClass">java.lang.Number</str>
         <str name="fieldType">pdoubles</str>
     </lst>
 </updateProcessor>

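In the preceding snippet, each typeMapping assigns a field type to values of a given Java class once they have been parsed. The java.lang.String mapping is marked as the default (default is set to true), so string values, and values of any class without a mapping of its own, get the text_general field type. The same mapping defines a copy field rule that copies at most 256 characters of the value to a destination matching *_str, that is, a field named after the source field with a _str suffix.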
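
As a sketch of the result, suppose a document introduces two previously unknown fields: title with a string value and pages with a long value (both names are hypothetical). The schema additions would then be roughly equivalent to the following entries, although the exact attributes that Solr writes to the managed schema may differ:

<!-- String value: default mapping gives text_general -->
<field name="title" type="text_general"/>
<!-- copyField rule from the String mapping: dest *_str, maxChars 256 -->
<copyField source="title" dest="title_str" maxChars="256"/>
<!-- Long value: mapped to plongs -->
<field name="pages" type="plongs"/>

The title_str copy keeps an untokenized version of the text, which is the usual reason for pairing a text field with a string copy, for example for faceting or sorting.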

 <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
 processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

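Finally, the updateRequestProcessorChain named add-unknown-fields-to-the-schema combines the predefined processors in the order listed in its processor attribute; the uuid and field-name-mutating processors that it references are not shown in the snippets above. The chain is the default unless the update.autoCreateFields property is set to false.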
