Chapter 10. Searching with the Optic API

The Optic API, introduced in MarkLogic 9, implements common relational operations over data extracted from documents. This chapter illustrates how to accomplish some common tasks, which should help in transitioning to MarkLogic from a relational background.

Paging Over Results

Problem

An Optic API query returns a large result set. Get a stable set of results a page at a time.

Solution

Applies to MarkLogic versions 9 and higher

In its simplest form:

const op = require('/MarkLogic/optic');
const pageSize = ...;
const pageNumber = ...;

op.fromTriples([...])
  .offsetLimit(op.param('offset'), pageSize)
  .result(null, {offset: pageSize * (pageNumber - 1)})

Expanded out to a main module, which allows the caller to specify the page and a timestamp, the recipe looks like this:

const op = require('/MarkLogic/optic');

const pageNumber =
  parseInt(xdmp.getRequestField('page', '1'), 10);
const pageSize = 1000;

let timestamp = xdmp.getRequestField('timestamp');
if (timestamp === null) {
  timestamp = xdmp.requestTimestamp();
}

const response = {
  timestamp: timestamp,
  page: pageNumber,
  results: xdmp.invokeFunction(
    function() {
      return op.fromTriples([...])
        .offsetLimit(op.param('offset'), pageSize)
        .result(null, {offset: pageSize * (pageNumber - 1)});
    },
    { timestamp: timestamp }
  )
}

response

Required Privilege

  • xdmp:timestamp

Discussion

Sometimes your result set will be bigger than you want to return in a single request. Paging solves this problem by having the caller request successive pages until all results have been returned. This means that no individual response is too big, but all results are returned. One of the challenges with paging is the risk that the underlying data set may change, with the result that a row might be skipped or repeated. In this recipe, we’re working through a large set of triples by calling op.fromTriples, but the same principles apply if calling op.fromLexicons, op.fromLiterals, or op.fromView.

This recipe prevents the changing data set problem using point-in-time queries. If you aren’t familiar with how timestamps are managed in MarkLogic, read “Understanding Point-In-Time Queries” in the Application Developer’s Guide.

By using point-in-time queries, we can ask for a batch of results in one request, process them, then ask for the next batch, knowing that the list will not change in between. Using the main module version of the recipe, the caller is able to specify the page and the timestamp. The timestamp would not be sent with the first request, but the response indicates the timestamp at which the query was run. Subsequent calls can include that timestamp to ensure stable results. Point-in-time queries work the same for Optic API queries as they do for others; the differences are in how the data set is gathered and paged.

Note that the REST API provides its own ways to manage timestamps. For example, take a look at the POST /v1/rows endpoint, paying attention to the timestamp parameter and the ML-Effective-Timestamp header.

As with any point-in-time query, one caveat is that the caller should finish before MarkLogic’s merge timestamp catches up to the request timestamp. In practice, this is unlikely to be a problem; if it becomes one, you may need to take control of the merge timestamp to ensure the results remain available.

The offsetLimit call has a reference to op.param('offset'). This could have been written with the offset value in place; however, writing it with a parameter allows MarkLogic to cache and reuse the query. MarkLogic analyzes the query and builds up a plan. By parameterizing it, this plan can be reused, enabling faster execution.
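The value bound to the offset parameter comes straight from the page request field, so it may be worth guarding against bad input before computing it. The following is a minimal sketch in plain JavaScript; pageOffset is a hypothetical helper name, not part of the Optic API, and falling back to the first page is an assumption about the behavior you want.

```javascript
// Hypothetical helper (not part of the Optic API): turn the raw "page"
// request field into the value bound to op.param('offset').
function pageOffset(rawPage, pageSize) {
  const page = Number.parseInt(rawPage, 10);
  // Assumption: missing, non-numeric, or < 1 input falls back to page 1,
  // which keeps the offset from becoming NaN or negative.
  const safePage = Number.isInteger(page) && page >= 1 ? page : 1;
  return pageSize * (safePage - 1);
}
```

In the recipe, the result of a helper like this would replace the inline pageSize * (pageNumber - 1) expression passed to result.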

The caller will need to determine when all results have been provided by watching for an empty result set. While some MarkLogic searches provide an estimate of the total number of results, estimating rows is harder than estimating search results because the pipeline of operations can produce more or fewer output rows than input rows. Even if an estimate were available, it would not be an exact count, so iterating until the result set is empty would be necessary regardless.
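From the caller's side, the loop might look like the sketch below. It is plain JavaScript rather than MarkLogic code: fetchPage is an assumed caller-supplied function that requests one page from the main module and returns the parsed response, with results available as an array. makeFakeFetchPage is an in-memory stand-in used only to exercise the loop.

```javascript
// Hypothetical client-side loop. fetchPage(page, timestamp) is an assumed
// caller-supplied function returning { timestamp, page, results }.
function fetchAllRows(fetchPage) {
  const rows = [];
  let timestamp = null; // not sent with the first request
  for (let page = 1; ; page++) {
    const response = fetchPage(page, timestamp);
    // Reuse the timestamp reported by the first response for later pages.
    if (timestamp === null) {
      timestamp = response.timestamp;
    }
    // An empty page signals that all results have been returned.
    if (response.results.length === 0) {
      break;
    }
    for (const row of response.results) {
      rows.push(row);
    }
  }
  return rows;
}

// In-memory stand-in for the server, for illustration only.
function makeFakeFetchPage(allRows, pageSize) {
  return function (page, timestamp) {
    const start = pageSize * (page - 1);
    return {
      timestamp: timestamp === null ? 'ts-1' : timestamp,
      page: page,
      results: allRows.slice(start, start + pageSize)
    };
  };
}

const collected = fetchAllRows(makeFakeFetchPage(['a', 'b', 'c', 'd', 'e'], 2));
```

Note that the loop always makes one extra request, the one that comes back empty. A real caller would replace the fake with an HTTP request to the main module.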

See Also

Group By

Problem

Your data has properties for names and counts; you want the sums of the counts grouped by name.

Solution

Applies to MarkLogic versions 9 and higher

Assuming you want sums of a property called amount grouped by a property called name:

const op = require('/MarkLogic/optic');

op.fromLexicons({
  name: cts.jsonPropertyReference('name'),
  amount: cts.jsonPropertyReference('amount', ["type=int"])
})
  .groupBy('name', [
    op.sum('totalamount', 'amount')
  ])
  .result();

To see this run, you can create some sample docs to play with:

declareUpdate();

for (let index=0; index < 1000; index++) {
  xdmp.documentInsert(
    '/values' + index + '.json',
    {
      "name": "val" + (xdmp.random(2) + 1),
      "amount": xdmp.random(2) + 1
    }
  )
}

The response I got is below. Yours might be different due to the xdmp.random calls.

{
  "name": "val1",
  "totalamount": 712
},
{
  "name": "val2",
  "totalamount": 659
},
{
  "name": "val3",
  "totalamount": 649
}

Required Indexes

  • range index on name

  • range index on amount

Discussion

MarkLogic provides group-by with the Optic API, which performs relational operations over data extracted from documents.

Optic can work with range indexes, using op.fromLexicons as shown above, but it can also calculate group-bys using op.fromTriples or op.fromView, to work with triples or rows, respectively.

Optic’s groupBy takes a field that you want to group by, along with instructions on how to aggregate the related values. For the sum aggregator, we provide a new name for the aggregated values, along with the source field.
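To make the grouping semantics concrete, here is a plain JavaScript sketch of what a group-by with a sum aggregate computes over an array of row objects. This is an illustration only: groupBySum is a hypothetical function, and it says nothing about how Optic actually evaluates the plan against the indexes.

```javascript
// Plain JavaScript illustration of group-by-with-sum semantics.
// groupBySum is hypothetical; it is not an Optic function.
function groupBySum(rows, keyCol, sumName, valueCol) {
  const totals = new Map(); // group key -> running total
  for (const row of rows) {
    const key = row[keyCol];
    totals.set(key, (totals.get(key) || 0) + row[valueCol]);
  }
  // Emit one output row per distinct key, with the total under sumName.
  return Array.from(totals, ([key, total]) => ({
    [keyCol]: key,
    [sumName]: total
  }));
}

const sampleTotals = groupBySum(
  [
    { name: 'val1', amount: 2 },
    { name: 'val2', amount: 1 },
    { name: 'val1', amount: 3 }
  ],
  'name', 'totalamount', 'amount'
);
```

The arguments mirror the recipe: 'name' is the grouping field, 'totalamount' is the new name for the aggregated values, and 'amount' is the source field.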

The groupBy function documentation lists the aggregation functions you can use. If you need an aggregation function other than the ones provided, op.uda allows you to call a user-defined function.

See Also

Extract Content from Retrieved Documents

Problem

You have used Template Driven Extraction (TDE) to extract data from documents into the row index, but there are additional pieces of information you want to return with a particular Optic query.

Solution

Applies to MarkLogic versions 9 and higher

Consider a schema called content with a view called books. The view has columns title, author, and price. The documents also have an element called summary that you decided not to put in the view, as it won’t be used very often. For this query, however, you want to include it in the results.

const op = require('/MarkLogic/optic');

const docId = op.fragmentIdCol('docId');
op.fromView('content', 'books', null, docId)
  .where(op.gt(op.col('price'), 20))
  .joinDoc('doc', docId)
  .select([
    'title', 'author', 'price',
    op.as('summary', op.xpath(op.col('doc'),
      '/doc/summary/fn:string()'))
  ])
  .result();

Required Index

  • TDE-extracted rows

Discussion

The Optic API is useful for executing relational operations on values; however, sometimes we need to pull in additional data from the source documents. It makes sense to pull scalar values into the row index, where we can do calculations and aggregations on them, but content with substructure, such as XML with markup, doesn’t really benefit. The example presented here is relatively simple, but supplementing the results with content from documents can be done with much more complex queries, involving joins, aggregates, or other operations.

The op.fromView call identifies the schema and view that we’ll draw data from. We also use op.fragmentIdCol to let Optic know that we want to work with the source documents. In this case, I’ve used the name “docId,” but the name isn’t meaningful as we won’t be using it once we get our results.

Before we actually join the documents to our rows, we should filter down to just those rows we need, in order to avoid reading more documents from disk than is necessary. In the recipe above, only books with prices greater than 20 are included.

Having told Optic what data to work with, .joinDoc tells Optic to use the docId to include the document content in a column called doc.

Finally, select specifies what columns to return in the result set. The as clause tells Optic to make a summary column based on an XPath statement, run against the document associated with the current row.

When using this technique, we’ll see columns from the row index, as well as the specified additional data from the documents, in this case the summary. It’s important to remember that to get the summary, MarkLogic had to load the entire document, then use XPath to select just a part of it. This is similar to any other search where we return just part of a search result. To ensure this performs well, use a where clause to reduce the number of documents that need to be retrieved. Consider using paging if you still get a lot of results. If the documents are large, consider adding the additional data to the row index instead of retrieving it separately.

See Also

Select Documents Based on Criteria in Joined Documents

Problem

Select documents based on criteria that are located in different documents. You have Template Driven Extraction (TDE) templates that extract data from both document types into views.

Solution

Applies to MarkLogic versions 9 and higher

This recipe assumes that you have two types of documents: one describing factories, and one describing events (problems) that happen at those factories. A TDE template has extracted a factory ID and the state where the factory is located into the factories view of the manufacturing schema. Another template has extracted columns from the events documents: factoryFKey (factory foreign key), severity (of the event), and status (Active or Resolved). These columns are in the events view.

const op = require('/MarkLogic/optic');

const docId = op.fragmentIdCol('docId');

op.fromView('manufacturing', 'events')
  .where(
    op.and(
      op.eq(op.col('severity'), 1),
      op.eq(op.col('status'), 'Active')
    ))
  .groupBy('factoryFKey')
  .joinInner(
    op.fromView('manufacturing', 'factories', null, docId),
    op.on('factoryFKey', 'factoryID')
  )
  .joinDoc('doc', docId)
  .select('doc')
  .result();

Required Indexes

  • row index with data from TDE extracted into two related views (in this case, factories and events)

Discussion

The desired result from this query is the set of factory documents that have an active severity 1 event. The severity and status are in the events view, so we need to join the views to get to the information we want.

The query starts with the events view, as that is where the search criteria are found. The where clause limits the rows that we need to work with prior to doing the join, which is based on the factory ID columns. This early scoping is important for performance.

Once the selection is made and the join between the views is done, we join with the documents. Notice that as part of the joinInner, the op.fromView that connects to the factories view refers to docId. This is how the joinDoc call knows which document to retrieve.
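The shape of this pipeline can be sketched in plain JavaScript over in-memory arrays. This is only an illustration of the logic, not of how Optic executes it; factoriesWithActiveSev1 is a hypothetical function, and the comment on the groupBy step reflects my understanding that grouping with no aggregates leaves one row per distinct key, so a factory with several matching events appears once.

```javascript
// Plain JavaScript illustration of the where / groupBy / joinInner steps.
// Hypothetical function; row objects stand in for view rows.
function factoriesWithActiveSev1(events, factories) {
  // where: keep only severity 1 events that are still Active
  const matching = events.filter(
    (e) => e.severity === 1 && e.status === 'Active'
  );
  // groupBy('factoryFKey') with no aggregates: one entry per distinct key
  const keys = new Set(matching.map((e) => e.factoryFKey));
  // joinInner on factoryFKey = factoryID
  return factories.filter((f) => keys.has(f.factoryID));
}

const flagged = factoriesWithActiveSev1(
  [
    { factoryFKey: 1, severity: 1, status: 'Active' },
    { factoryFKey: 1, severity: 1, status: 'Active' },
    { factoryFKey: 2, severity: 2, status: 'Active' },
    { factoryFKey: 3, severity: 1, status: 'Resolved' }
  ],
  [
    { factoryID: 1, state: 'PA' },
    { factoryID: 2, state: 'OH' },
    { factoryID: 3, state: 'NY' }
  ]
);
```

In the sample data, only factory 1 has an active severity 1 event, and it is returned once even though two events match.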

The final step is to select the doc column and return just that. Optic will return a Sequence, which your code can then iterate over.

See Also
