Time for action – handling sub-entities (for example, joins on complex data)

In most cases, the data we read from external sources will not come as plain records. So, we should be able to define sub-entities by "navigating" some kind of data structure to eventually construct our full Solr document. We can collect fields at the document level, at the root entity level, and at every sub-entity level, whenever we are able to read the corresponding data. Because relational databases are used in many contexts, this approach is useful in a wide range of scenarios, such as indexing the existing data of an e-commerce site, a blog, or another application. In these cases in particular, working with a divide-and-conquer approach can be a strong improvement.

In order to keep things as simple as possible, we can use the same data as the previous example, from another point of view. Initially, we used the SQL JOIN query to return a collection of single plain records, creating a tuple that combines data from two different tables (and you know there can be many more such tuples). A tuple is an ordered list of elements, usually an "abstract" record in SQL. A good, simple (and funny) reference site for SQL statements is: http://sqlzoo.net/wiki/Main_Page.

Even if from the SQL point of view we are handling tuples for the actual records, from the DataImportHandler point of view we can simply see them as rows of data. Once it is able to read them, there is not much distinction between them and a line read from a textual file. This idea is simple and yet useful, because even if we are reading content with SQL now, we can apply the same approach to a very different data source, pretty much in the same way as in the following example.

So, another approach could be to read a row of data from the first table (paintings) and then search for the collection of related rows in the second table (authors). This can be done as follows:

<document name="paintingBy_author_document">
  <entity name="painting_by_author" pk="uri" 
    transformer="script:normalizeUri" 
    rootEntity="true" 
    query="SELECT * FROM paintings AS P WHERE (P.FORM='painting')">

    <field column="entity_type" template="painting" />
    <field column="entity_source" template="Web Gallery Of Art" />

    <entity name="author" pk="A.ID" 
      rootEntity="false" 
      transformer="script:normalizeArtistEntity" 
      query="SELECT * FROM authors AS A WHERE (A.ID='${painting_by_author.AUTHOR_ID}')" />

  </entity>
</document>

In the code, I have used SELECT * statements to improve readability. If you want, you can find the full projections in the configuration of the wgarts_sub example core. Note how the transformers are used here on a per-entity basis.

Remember that, at the moment, we are not thinking about performance or about the best way to handle this kind of data from a SQL point of view. We are only exploring possibilities using the same data, introducing sub-entities as an exercise. Our intention is to make a comparison with the previous example.

Tip

However, performance is not a trivial consideration when using sub-entities, especially when we have to manage lots of data and joins.

To get a more specific idea please read: http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport.
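One common way to reduce the cost of the inner query is to let DataImportHandler cache the rows of the sub-entity, so that the authors table is not queried once per painting. The following is only a sketch under assumptions: it relies on the CachedSqlEntityProcessor available in the DataImportHandler releases that ship it, it omits the transformers for brevity, and it has not been measured against the wgarts data.

```xml
<document name="paintingBy_author_document">
  <entity name="painting_by_author" pk="uri"
    query="SELECT * FROM paintings AS P WHERE (P.FORM='painting')">

    <!-- Assumption: the authors table is small enough to be kept in memory.
         The query runs once; each painting row is then matched
         against the cached rows through the "where" lookup. -->
    <entity name="author"
      processor="CachedSqlEntityProcessor"
      query="SELECT * FROM authors"
      where="ID=painting_by_author.AUTHOR_ID" />

  </entity>
</document>
```

Whether this helps depends on the size of the cached table and on the memory available, so it is worth testing on real data before adopting it.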

What just happened?

In this case, we have divided the problem into two parts. Without using the SQL JOIN query, we are designing the reading of data to be indexed in two steps: a main step for the entity paintings and a dependent step for the entity authors.

Selecting a list of "rows" representing paintings is really simple, and it is done as follows:

SELECT * FROM paintings AS P WHERE (P.FORM='painting')

After this, all we have to do is look up the data about the author related to each particular painting, using the following query:

SELECT * FROM authors AS A WHERE (A.ID='${painting_by_author.AUTHOR_ID}')

Here, we are using an internal binding in the ${entity_name.field_name} form to obtain the AUTHOR_ID value of the current painting row. Then, we use it as the selection criterion in a very specific and fast query, which is executed once for every row returned by the outer entity.
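For comparison, a single-entity configuration that delegates the join to the database could look like the following sketch. The entity and document names are hypothetical, the join condition is inferred from the AUTHOR_ID and ID columns used in the two queries above, and the transformers are omitted; it is not the exact configuration of the previous example.

```xml
<document name="painting_join_document">
  <!-- Hypothetical flat variant: one query, one row per painting,
       with the author columns already merged in by the database. -->
  <entity name="painting_with_author" pk="uri"
    query="SELECT * FROM paintings AS P
           JOIN authors AS A ON (P.AUTHOR_ID = A.ID)
           WHERE (P.FORM='painting')" />
</document>
```

With the join, the database builds every tuple in a single query. With the sub-entity, DataImportHandler runs the inner query again for each painting row, replacing the ${painting_by_author.AUTHOR_ID} placeholder every time.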

We have adopted an explicit rootEntity parameter here, although the values used are the ones DataImportHandler would assume anyway: a Solr document is created for every row returned by the root entity. It is spelled out just to suggest that, when you handle two or more entities, you can have situations where you need to read from a container entity that is not the one in which you are most interested. For example, if you're reading a filesystem and you are mostly interested in file metadata, you will still need to explore the folders with an outer entity in order to inspect the files contained in each folder.
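The filesystem case could be sketched as follows. This is a hypothetical configuration, not part of the wgarts examples: the baseDir path, the entity names, and the text field are placeholders, and it assumes plain-text files read with the FileListEntityProcessor and PlainTextEntityProcessor that ship with DataImportHandler.

```xml
<dataConfig>
  <dataSource name="files" type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- Container entity: it only lists the files, so it is not the root. -->
    <entity name="folder" processor="FileListEntityProcessor"
      dataSource="null" baseDir="/path/to/docs" fileName=".*\.txt"
      recursive="true" rootEntity="false">

      <!-- One Solr document is created for every file found. -->
      <entity name="file" processor="PlainTextEntityProcessor"
        dataSource="files" url="${folder.fileAbsolutePath}">
        <!-- "text" is a placeholder for a field defined in your schema. -->
        <field column="plainText" name="text" />
      </entity>

    </entity>
  </document>
</dataConfig>
```

Setting rootEntity="false" on the outer entity moves the role of document root to the entity nested directly inside it.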

The transformers are applied here on a per-entity basis, which probably makes the configuration slightly more readable.
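A name such as script:normalizeUri refers to a JavaScript function declared in a script element of the same configuration file. The function below is only an illustrative sketch: its body and the uri column name are assumptions, and the actual functions are the ones defined in the example core.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical body: removes whitespace from the uri column.
    // DataImportHandler calls the function once for every row of the
    // entity that lists it in its transformer attribute.
    function normalizeUri(row) {
      var uri = row.get('uri');   // assumed column name
      if (uri != null) {
        row.put('uri', ('' + uri).replace(/\s+/g, ''));
      }
      return row;
    }
  ]]></script>
  <!-- dataSource and document definitions follow here -->
</dataConfig>
```

Because each entity names its own function, a change to the author normalization does not touch the code that handles paintings.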

If you want a wider idea of the kinds of transformers you can use in your projects, there is a very informative page on the Solr wiki that provides references: https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler.
