Chapter 13. Template-Driven Extraction

Template-Driven Extraction (TDE), new in MarkLogic 9, lets us put values directly into the row or triple index without modifying the actual document structure. We define a template that tells MarkLogic where to look for values and how to record them. Any transformation code lives in the template itself, so it is applied at the template level rather than by modifying the documents.

One of the tricks with using TDE is good use of collections. Each input source should be in a distinct collection, which enables us to write templates that target each collection individually. As with other strategies to harmonize data from different sources, we want to simplify queries by writing some simple code that maps from multiple input formats to a single query format.

TDE also supports doing data source harmonization in an agile way. When we need another piece of information in the index, we can simply update the template. While this still requires reindexing, keeping the original documents intact simplifies data governance.

Searching on Derived Data

Problem

You want to search on derived data, such as day of the week (Monday, Tuesday, …), but your data only has a date (2017-06-21).

Solution

Applies to MarkLogic versions 9 and higher

The solution is to use Template-Driven Extraction to put the derived information directly into the row or triple index. With that in mind, there are two parts to the solution: the template and the actual search.

My sample documents are game records from a chess site where I used to play (I lost this one):

{
  "Event": "Clan challenge",
  "Date": "2010-06-22",
  "EndDate": "2010.12.04",
  "Round": "?",
  "White": "dmcassel",
  "Black": "ColeFord",
  "WhiteRating": "1544",
  "BlackRating": "1483",
  "WhiteELO": "1544",
  "BlackELO": "1483",
  "Result": "0-1",
  "GameId": "7540292"
}

I built a template that extracts the derived day of the week into a row. Here’s the code to insert the template:

var tde = require("/MarkLogic/tde.xqy");

var dowRowTemplate =
  xdmp.toJSON({
    "template": {
      "context": "/GameId",
      "collections": [ "source1" ],
      "rows": [
        {
          "schemaName": "chess",
          "viewName": "matches",
          "columns": [
            {
              "name": "dayOfWeek",
              "scalarType": "int",
              "val": "xdmp:weekday-from-date(../Date)"
            }
          ]
        }
      ]
    }
  });

tde.templateInsert(
  "/tde/dowRowTemplate.json",
  dowRowTemplate,
  xdmp.defaultPermissions(),
  ["TDE"]
)

And here’s Optic API code to search for documents that have dates that fall on Tuesdays:

const op = require('/MarkLogic/optic');

const dow = 2; // 1=Monday, 7=Sunday

const docId = op.fragmentIdCol('docId');

op.fromView('chess', 'matches', null, docId)
  .where(op.eq(op.col('dayOfWeek'), dow))
  .offsetLimit(0, 10)
  .joinDoc('doc', docId)
  .select(['doc'])
  .result();

This code returns the documents that match the selected day of the week.

We can also use Optic to group the games by day of the week, providing counts of games played on each day:

const op = require('/MarkLogic/optic');

op.fromView('chess', 'matches')
  .groupBy('dayOfWeek', [op.count('dowCount', 'dayOfWeek')])
  .result();

The result of this one is a sequence of items, where each item has a dayOfWeek value (1-7) and a dowCount (day-of-week count), giving the count of how many games took place on each day.

Required Indexes

  • The triples index must be on; it’s on by default in MarkLogic 9 and higher.

Discussion

There’s a lot to look at here. Your data will likely be a bit different, so you’ll need to adjust the template accordingly. See the TDE tutorial for more details on how to build a template. You’ll need to update the context and collections to identify your data, as well as the columns to populate in your schema.

The first thing to note is that, as with any application of TDE, we can put information into the indexes without having to modify the documents themselves. For some MarkLogic users, it is important that the data remain in its original format. While we often address this concern using the Envelope Pattern, TDE provides a no-touch way to accomplish something similar. More generally, we will often use a transform to construct business entities, then apply TDE for fine-tuning what is available in the indexes.

The template I’m using here is very small, extracting just one column, which provides the derived day of week. The template can be revised at any time to provide additional data. Doing so could let us, for instance, find matches played by a particular player as Black on a chosen day of the week. The Optic API is not restricted to working with data in the row index, so this is a question of whether we’d want to index other fields in the row index or use a range index.

The val specification for each column consists of XQuery code. Whatever derived value you want to push into the index, here’s where you do it. In the day-of-week case, xdmp:weekday-from-date takes a properly formatted date and returns a 1-7 value, where 1 = Monday and 7 = Sunday.
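
To check the numbering convention outside MarkLogic, the following plain JavaScript sketch mimics the 1 = Monday through 7 = Sunday scheme. The isoWeekday helper is a hypothetical name used for illustration, not a MarkLogic API, and it assumes dates in YYYY-MM-DD form.

```javascript
// Hypothetical helper (not part of MarkLogic): returns 1 (Monday) through
// 7 (Sunday) for a YYYY-MM-DD date string.
function isoWeekday(dateString) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(dateString);
  if (!match) {
    throw new Error("Expected a YYYY-MM-DD date, got: " + dateString);
  }
  // Skip match[0] (the full string) and keep the three numeric groups.
  const [, year, month, day] = match.map(Number);
  // Date.UTC avoids local time zone shifts; getUTCDay() runs 0 (Sunday) to 6.
  const jsDay = new Date(Date.UTC(year, month - 1, day)).getUTCDay();
  return jsDay === 0 ? 7 : jsDay;
}

// The sample game's Date value; prints 2 (Tuesday).
console.log(isoWeekday("2010-06-22"));
```

The helper deliberately rejects anything that is not in YYYY-MM-DD form, such as the dotted EndDate value in the sample document.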

We can use tde.nodeDataExtract to run the template against a target document and ensure that it does what we want. After that, we insert the template with tde.templateInsert, which triggers applying the templates to documents that match the context, collections, and directories that are part of the template. Notice that I did not need to include a URI column in the schema—MarkLogic tracks the origin of the data internally. The op.fragmentIdCol call gives me access to this. I can choose whatever name I like for the fragment ID column, so long as it doesn’t conflict with other column names in the schema. It’s worth noting that this ID is an internal one, not the document’s URI. The ID is not intended to be shown to users; rather, it’s simply a connector between a document and a row.

This particular data set is very flat: all of the properties are direct children of the root. The TDE template needs to pick out a node, so I selected the GameId property. The template uses relative paths to access the other JSON properties, such as ../Date.

The first query shows a document search for matches that took place on Tuesday. Notice that the where clause is applied before the joinDoc. This is important to minimize the number of documents that need to be loaded. The code also uses offsetLimit to control the number of result rows (in this case, documents) that will be returned. After joining the rows to the documents from which they were generated, the select statement narrows the query to return just the document contents, rather than any of the columns in the row.

The second query shows a groupBy result. From this, we can find out how many matches were played on each day of the week. This query is run fully against the row index, with no need to load up the original documents. The result is seven rows (assuming matches took place on each day) with two columns: dayOfWeek and dowCount.
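
As a mental model of what grouping by dayOfWeek with a count computes, here is a small in-memory sketch in plain JavaScript. The sample rows and the countByDayOfWeek helper are made up for illustration; the real query runs against the row index, and this sketch says nothing about how MarkLogic executes it.

```javascript
// Hypothetical in-memory stand-in for rows in the chess.matches view.
const sampleRows = [
  { dayOfWeek: 2 },
  { dayOfWeek: 5 },
  { dayOfWeek: 2 },
  { dayOfWeek: 7 },
];

// Counts rows per dayOfWeek value and returns one result row per group,
// sorted by day for readability.
function countByDayOfWeek(rows) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row.dayOfWeek, (counts.get(row.dayOfWeek) || 0) + 1);
  }
  return [...counts.entries()]
    .sort(([a], [b]) => a - b)
    .map(([dayOfWeek, dowCount]) => ({ dayOfWeek, dowCount }));
}

console.log(countByDayOfWeek(sampleRows));
```

Days with no games produce no group at all in this sketch, which is why the seven-row result depends on matches having taken place on every day.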

See Also

Using an IRI Namespace with TDE

Problem

An IRI is an Internationalized Resource Identifier, an extension of the URI (Uniform Resource Identifier) used to generate unique identifiers, including the URIs for documents in MarkLogic. IRIs used in RDF triples often have a part that acts as a namespace, followed by a more specific part that identifies an entity or predicate.

While generating triples with Template-Driven Extraction (TDE), you want to define that namespace once and reuse it.

Solution

Applies to MarkLogic versions 9 and higher

Use a template variable to hold the prefix. That way you don’t need to repeat the namespace itself; just concatenate the variable with the rest of the IRI.

var tde = require("/MarkLogic/tde.xqy");

var myTriplesTemplate =
  xdmp.toJSON({
    "template": {
      "context": "/match",
      "collections": [ "source1", "source2" ],
      "vars": [
        {
          "name": "prefix-ex",
          "val": "'http://example.org#'"
        }
      ],
      "triples": [
        {
          "subject": {
            "val": "sem:iri($prefix-ex || 'sample-subject')"
          },
          "predicate": {
            "val": "sem:iri($prefix-ex || 'sample-predicate')"
          },
          "object": {
            "val": "sem:iri($prefix-ex || 'sample-object')"
          }
        }
      ]
    }
  });

Discussion

Templates allow the construction of variables, which can then be used in populating rows or triples. Notice that variable names can be hyphenated, even if you’re working with JavaScript. That’s because of the context in which they will be used. The "val" portion of a triple specification is evaluated as XQuery code. Hence, even in a JSON template, we have "subject": { "val": "sem:iri($prefix-ex || 'sample-subject')" }.

Using variables in your templates reduces the amount of repeated code, thus decreasing the chances of a mistake creeping in.
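
One way to keep the quoting inside each "val" consistent is to generate those strings with a small helper. The following plain JavaScript sketch builds a template object with the same shape as the one in the Solution; prefixedIri and buildTriplesTemplate are hypothetical helper names, not part of the TDE API, and the resulting object would still need to be converted with xdmp.toJSON and inserted as a template.

```javascript
// Hypothetical helper: builds the XQuery expression for an IRI made of a
// template variable plus a single-quoted local name.
function prefixedIri(varName, localName) {
  return "sem:iri($" + varName + " || '" + localName + "')";
}

// Hypothetical helper: assembles a triples template as a plain object.
function buildTriplesTemplate(prefixVar, namespace) {
  return {
    template: {
      context: "/match",
      collections: ["source1", "source2"],
      // The variable's val is itself XQuery, so the namespace is a quoted
      // string literal.
      vars: [{ name: prefixVar, val: "'" + namespace + "'" }],
      triples: [
        {
          subject: { val: prefixedIri(prefixVar, "sample-subject") },
          predicate: { val: prefixedIri(prefixVar, "sample-predicate") },
          object: { val: prefixedIri(prefixVar, "sample-object") },
        },
      ],
    },
  };
}

const triplesTemplate = buildTriplesTemplate("prefix-ex", "http://example.org#");
console.log(JSON.stringify(triplesTemplate, null, 2));
```

Because the namespace appears in exactly one place, changing it means touching a single argument rather than three expressions.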
