Time for action – importing Solr XML document files

In the previous chapters of this book, we had the chance to produce and save several XML files containing well-formatted Solr documents. A simple idea would be to configure a DataImportHandler so that it acquires new Solr documents from a specific folder, where we will put them as soon as they are produced.

The basic part of the new configuration will be very similar to the previous one:

<dataSource name="fds" encoding="UTF-8" type="FileDataSource" />
<document>
  <entity name="files" processor="FileListEntityProcessor" 
    baseDir="${solr.solr.home}/../../resources" 
    fileName=".*.xml">

    <entity pk="uri" 
      name="file" 
      format="text" 
      processor="XPathEntityProcessor" 
      url="${files.fileAbsolutePath}" 
      useSolrAddSchema="true" 
      rootEntity="true" 
      dataSource="fds" 
      stream="true"
      onError="skip" />
    
  </entity>
</document>

You can find the full configuration at /SolrStarterBook/solr-app/chp08/solr_docs.
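For this configuration to be usable, a DataImportHandler request handler also has to be registered in solrconfig.xml. A minimal sketch could look like the following (the name of the configuration file, DIH.xml, is only an assumption; point it at the actual file containing the configuration above):

<!-- solrconfig.xml: minimal registration of the DataImportHandler.
     The config parameter must reference the file containing
     the data-config shown above (here assumed to be DIH.xml). -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">DIH.xml</str>
  </lst>
</requestHandler>

Once the handler is registered, the import can be started with a request similar to http://localhost:8983/solr/solr_docs/dataimport?command=full-import.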

What just happened?

In this case, we are listing the saved XML files representing the Solr documents that were created from our initial DBpedia resources, and then we process every single XML file with a specific XPathEntityProcessor. This processor can directly consume XML that is already in the Solr add format (here, we are specifying this option explicitly by using useSolrAddSchema="true"). Because we adopted this format, we really don't have to specify a lot of things: files that cause an error are simply skipped (onError="skip"), and every resource handled by this processor corresponds to a root entity (rootEntity="true"). Most important of all, we adopted the stream="true" parameter to reduce the impact on performance, because large XML files can be too resource intensive to be loaded fully into memory by the XPathEntityProcessor.
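For reference, every file imported in this way is expected to already contain documents in the standard Solr add format, which looks more or less like the following (the field values are just placeholders for the kind of data we produced from DBpedia):

<add>
  <doc>
    <field name="uri">http://dbpedia.org/resource/Example_Painting</field>
    <field name="title">Example Painting</field>
    <field name="artist">Example Artist</field>
  </doc>
</add>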

Once we understand how the XPathEntityProcessor works in practice, it's also simple to use it on a very wide variety of XML files. Let's look at how to index the dataset dump of Wikipedia, which is described in the official wiki at http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia. This example will require some time due to the huge amount of data, but if you follow it you should be able to apply the same approach to your own experiments, so I suggest you play with it as a next step.
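Just to give you an idea of the shape of that configuration, a simplified sketch based on the wiki example could be similar to the following code (the path of the dump file is an assumption, and the full list of fields is on the page cited above):

<entity name="page" 
  processor="XPathEntityProcessor" 
  stream="true" 
  forEach="/mediawiki/page" 
  url="/path/to/enwiki-pages-articles.xml">
  <field column="id" xpath="/mediawiki/page/id" />
  <field column="title" xpath="/mediawiki/page/title" />
  <field column="text" xpath="/mediawiki/page/revision/text" />
</entity>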

Just to give you an idea, a similar extraction of data can also be performed directly on the RDF/XML files we previously downloaded from DBpedia. A simple skeleton of a configuration for the root entity could be similar to the following code:

<entity name="file" 
  processor="XPathEntityProcessor" 
  pk="uri" 
  dataSource="files" 
  stream="true" 
  forEach="/RDF/Description" 
  onError="skip" 
  rootEntity="true" 
  url="${files.fileAbsolutePath}" 
  xsl="xslt/post.xsl"   
  transformer="script:console" >

  <field column="artist" xpath="/RDF/Description/title" commonField="true" />
  <field column="subject" xpath="/RDF/Description/subject" commonField="true" />
  <field column="uri" xpath="/RDF/Description/title" commonField="true" />
  <field column="title" xpath="/RDF/Description/title" />
</entity>

As you can see, it's simple to iterate over a list of root elements in the file (for example, over all the /RDF/Description elements) by specifying a forEach parameter. Then we can project a value for every field by simply pointing at it with its own XPath expression. What is interesting here is that we are not using the explicit form /rdf:RDF/rdf:Description, including prefixes, because they are not required here. Moreover, even if the forEach parameter is used to acquire the list of root elements, the XPath expressions used for the fields are not relative, and there are some minor limitations; for example, the // abbreviation should not be used. Also note that this usage of the XPathEntityProcessor needs well-formed XML data, which is not always what we get with RDF/XML serializations (for example, in our case there are many problems in the files downloaded from DBpedia) or even with simple HTML, which can be written in ways that violate the XML requirements. In these cases, you should write your own code or adopt a custom script to index your data (as we already did for ours), normalize the data on the scripting side when possible, or you can still use the DataImportHandler after introducing a custom component to sanitize the data.
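To make the mapping clearer, the forEach and xpath expressions above are meant to match RDF/XML fragments shaped more or less like the following simplified sketch (the real DBpedia files contain many more properties and namespaces):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Example_Painting">
    <dcterms:title xml:lang="en">Example Painting</dcterms:title>
    <dcterms:subject rdf:resource="http://dbpedia.org/resource/Category:Paintings" />
  </rdf:Description>
</rdf:RDF>

Note how an expression such as /RDF/Description/title matches the dcterms:title element regardless of its prefix.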

Importing data from another Solr instance

As we saw, using the standard Solr XML structure allows us to index documents with the DataImportHandler without problems. A similar and even more direct approach can be used, if needed, for importing data from another Solr instance. For example, if we would like to import documents from the cities core, we can do it using the following configuration:

<dataConfig>
  <document>
    <entity name="city" 
      processor="SolrEntityProcessor" 
      query="*:*" 
      url="http://localhost:8983/solr/cities" />
  </document>
</dataConfig>

As far as I know, this approach is not very widely used, but it is useful, simple to adopt, and good to know about when needed.
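If needed, the entity accepts a few more parameters to tune the import, for example to limit the returned fields and the batch size per request. A possible variation could be the following (the field names are only assumptions about the cities schema):

<entity name="city" 
  processor="SolrEntityProcessor" 
  url="http://localhost:8983/solr/cities" 
  query="*:*" 
  rows="50" 
  fl="id,city,country" 
  timeout="60" />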

Indexing emails

Indexing emails can be as simple as adopting a configuration similar to the following one:

<document>
  <entity 
    name="emails" 
    processor="MailEntityProcessor" 
    user="username@mail_provider.com" 
    password="password" 
    host="imap.gmail.com" 
    protocol="imaps" 
    recurse="true" 
    folders="INBOX,SENT"
    transformer="script:generateId" />
</document>

In this way, we can construct a simple assisted search (even with faceting!) that can really help us manage mail conversations and attachments. As far as I know, there isn't any specific CalendarEntityProcessor yet, and writing one could be a good exercise to start building the knowledge needed to write custom components.
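For example, once the emails are indexed, a first faceted query could be as simple as the following request (the core name and the facet fields are assumptions, as they depend on your schema and on the fields actually emitted by the MailEntityProcessor):

http://localhost:8983/solr/mails/select?q=*:*&facet=true&facet.field=from&facet.field=folder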
