Chapter 11. Input Transformations

Data modeling is an iterative process with MarkLogic—we load data as is and then revise it over time. Sooner or later, however, we’ll need to modify the structure of the data. The recipes in this chapter show how to change documents, using tools like MarkLogic Content Pump (MLCP), REST API transforms, or CORB2, an open source bulk processing tool.

Changing Date Format

Problem

You’re loading documents, but the data have dates in a nonstandard format. Each document has multiple dates. You want to fix them during ingest.

Solution

Applies to MarkLogic versions 8 and higher

Here’s the code for an MLCP transform that will fix dates in a specified list of JSON properties in newly ingested JSON documents. This code can also be used with REST input transforms or CORB2 jobs.

Save this content in date-transform.sjs. Add it to your modules database, as described in the documentation.

// Recurse through a JSON document, applying
// function f to any JSON property whose property
// name is in the array keys.
function applyToProperty(obj, keys, f) {
  for (var i in obj) {
    if (!obj.hasOwnProperty(i)) {
      continue;
    }
    else if (typeof obj[i] === 'object') {
      applyToProperty(obj[i], keys, f);
    }
    else if (keys.indexOf(i) !== -1) {
      obj[i] = f.call(this, obj[i]);
    }
  }
}

function fixDate(value) {
  return new Date(value).toISOString().substring(0, 10);
}

// This is the MLCP transform function. Fix any date with the
// property name(s) specified by context.transform_param.
// Property names must be separated by a semicolon.
function fixDateByProp(content, context)
{
  var propNames = (context.transform_param == undefined)
                 ? "UNDEFINED" : context.transform_param;
  propNames = propNames.split(';');

  var docType = xdmp.nodeKind(content.value);
  if (docType == 'document' &&
      content.value.documentFormat == 'JSON') {
    // Convert input to mutable object and add new property
    var newDoc = content.value.toObject();
    applyToProperty(newDoc, propNames, fixDate);

    // Convert result back into a document
    content.value = xdmp.unquote(xdmp.quote(newDoc));
  }
  return content;
};

exports.fixDateByProp = fixDateByProp;

Calling the transform with MLCP looks like this:

$ mlcp.sh import -mode local -host srchost -port 8000 \
    -username user -password password \
    -input_file_path /space/mlcp-test/data \
    -transform_module /example/date-transform.sjs \
    -transform_param "date;pub-date"

Discussion

MLCP allows a call to specify one string parameter that will be sent to a transform. The transform function uses that parameter to specify where the target dates are. The parameter takes a semicolon-separated list of JSON properties.

The code accepts multiple names of properties or elements, and each instance of those will be found. For example, if a JSON document has three date properties, then all three will have their values updated by the fixDate() function.

JavaScript can handle a variety of input formats by simply passing them to the Date constructor. “2017-01-01”, “1/1/2017”, “1/1/17”, “January 1, 2017”, and several other formats will all be properly interpreted. The toISOString function then outputs the date in the format needed for MarkLogic dateTime indexes. The substring call drops the time information, leaving just the date.
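Because applyToProperty() and fixDate() use no MarkLogic APIs, you can exercise them in plain Node.js. The sketch below reuses the recipe's helpers on a hypothetical nested document; note that ISO-style inputs are parsed as UTC and are fully deterministic, while formats like “1/1/2017” are parsed in the server's local time zone, which can shift the date near midnight.

```javascript
// The recipe's helpers, copied verbatim so they can run outside MarkLogic.
function applyToProperty(obj, keys, f) {
  for (var i in obj) {
    if (!obj.hasOwnProperty(i)) {
      continue;
    }
    else if (typeof obj[i] === 'object') {
      applyToProperty(obj[i], keys, f);
    }
    else if (keys.indexOf(i) !== -1) {
      obj[i] = f.call(this, obj[i]);
    }
  }
}

function fixDate(value) {
  return new Date(value).toISOString().substring(0, 10);
}

// A hypothetical document with target properties at different depths.
var doc = {
  title: 'example',
  date: '2017-03-05T08:00:00Z',
  nested: { 'pub-date': '2016-12-31T23:00:00Z' }
};

// Same semicolon-separated parameter the MLCP call would pass.
applyToProperty(doc, 'date;pub-date'.split(';'), fixDate);

console.log(doc.date);               // 2017-03-05
console.log(doc.nested['pub-date']); // 2016-12-31
```

Both target properties are rewritten, including the one nested a level down; the unrelated title property is untouched.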

This function will look through an entire JSON document, looking for any property that matches its target list. If you know exactly where the properties are that you want to change, it is faster to simply change them directly. The benefit of this approach is flexibility.

This recipe follows a good practice for making code testable: the real work is done in library functions that are called by the MLCP transform function. You can write unit tests for applyToProperty() and fixDate(). The function that MLCP actually calls just handles the inputs from MLCP and sends them to the functions that do the real work. Because it’s set up this way, you can use the same functions to support a REST API input transformation or a CORB2 transform.

Converting Binaries to Base64 Strings and Back

Problem

Your XML or JSON data has a base64 string in it, representing an image or other binary content. The base64 string is stored in an XML element or JSON property. You want to store this data efficiently and be able to access it easily.

Solution

Applies to MarkLogic versions 7 and higher

This recipe pulls the base64 string out of the XML or JSON and stores it as a new binary document.

xquery version "1.0-ml";

(: Assumes a document at an example URI, shaped like:
     <root>
       <filename>HD-keyhole-300px.png</filename>
       <base64>iVBORw0KGgoAAAAN...</base64>
     </root>
:)
let $xml := fn:doc('/example/HD-keyhole-300px.xml')/root
return (
  xdmp:document-insert(
    $xml/filename/fn:string(),
    binary{
      xs:hexBinary(xs:base64Binary($xml/base64/fn:string()))
    }
  ),
  xdmp:node-delete($xml/base64)
)

This code can be used as part of a REST API or MLCP import transformation, or with a CORB2 job.

Discussion

There are good reasons to go to the trouble to extract the base64 content and store it separately as a binary.

Base64 is, by nature, large. That makes for large strings in XML or JSON documents. The example above has a string with a length of 2,560 characters. Whenever we update a document in MarkLogic, the MVCC approach means we make a copy of that document. The older versions will be removed by the merging process, but in the meantime, they’ll take up space without contributing much value. If we separate the binary content, then it won’t need to be copied over when the XML or JSON document gets updated.

Another impact of base64 strings is that all text is normally included in the indexes. Again, there’s no value in indexing values like this—they’re simply large values that no one’s going to search for. It is possible to configure the database to exclude values based on an element or property name, but simply removing them simplifies the configuration.

Finally, let’s think about how we’d make use of the binary content. Most likely, we’d want to serve it up as an image, in its binary form. Doing so is simpler if the content has been extracted, converted back to a binary form, and stored that way. Retrieving it then becomes a simple matter of loading the binary content from disk and returning it to the client, rather than converting it from base64 to binary at runtime.

Ingesting an Aggregate JSON File with Many Documents Inside

Problem

Your input data consists of a single, large JSON file that contains an array of objects. Each object represents a separate entity. You want to represent each entity as a document within MarkLogic.

Solution

Applies to MarkLogic versions 8 and higher

Since the JSON objects represent different entities, they should be stored as separate documents in MarkLogic. To accomplish this, we can split the input file during load.

declareUpdate();

// Insert URL of zipped JSON file
let url = "REPLACE WITH URL";
// Insert the name of the JSON file within the .zip
let zipFile = "REPLACE WITH FILENAME";
let zip = fn.head(xdmp.documentGet(url));

let idx = 0;
let results =
  fn.head(xdmp.zipGet(zip, zipFile)).xpath("./results");
for (let rec of results) {
  xdmp.documentInsert("/content/rec-" + idx++ + ".json", rec);
}

Discussion

The basic unit of storage in MarkLogic is a document. When we have an input document that describes many entities, we’ll want to split those into one document per entity. It’s not unusual to find aggregate documents in this form.

MarkLogic Content Pump (MLCP) provides ways to split two types of aggregate documents: XML documents where each child of the root element will become its own document, and line-delimited JSON documents, in which each line is a separate bit of JSON. If you have a document that looks like either of these, your best bet is to use MLCP. This recipe covers a different case where you’ve got one large JSON document that contains an array, and you want to split each item in the array into a document in MarkLogic.

The sample .zip file I worked with is a large JSON file that starts off like this:

{
  "meta": {
    "last_updated": "2017-08-29",
    "terms": "https://open.fda.gov/terms/",
    "results": {
      "skip": 0,
      "total": 253355,
      "limit": 100000
    },
    "license": "https://open.fda.gov/license/",
    "disclaimer": "Do not rely on openFDA to make decisions..."
  },
  "results": [
    {

Note that xdmp.documentGet (xdmp:document-get in XQuery) will accept a URL that starts with http:// or file://—that is, it will pull a file down from the web or read one from the local filesystem.

In either case, the fn.head(xdmp.zipGet(zip, zipFile)).xpath("./results") expression reads the JSON document from the zip file, then applies an XPath expression to select the results property. The results property holds an array, each item of which is an object that we want to store in a separate document. We loop on the sequence of results, inserting each as a new document.

Be aware that if your JSON document is very large, you may get a timeout from attempting to insert all the documents in a single transaction. You can try increasing the timeout on the request (xdmp.setRequestTimeLimit / xdmp:set-request-time-limit). If that doesn’t work, the best approach is to do the orchestration (splitting the JSON document and inserting the results) from outside MarkLogic, using the Data Movement SDK.
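However the inserts are ultimately performed (via the Data Movement SDK or otherwise), the orchestration step reduces to chunking the results into batches, each small enough to commit in its own transaction. A minimal, MarkLogic-free sketch of that chunking logic, with hypothetical sizes:

```javascript
// Split an array of records into fixed-size batches; each batch could
// then be inserted in its own transaction.
function toBatches(records, batchSize) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical example: 10 records in batches of 4 yields sizes 4, 4, 2.
const batches = toBatches(Array.from({ length: 10 }, (_, i) => i), 4);
console.log(batches.map(b => b.length)); // [ 4, 4, 2 ]
```

Keeping batch size bounded is what prevents the single-transaction timeout described above, whatever tool drives the inserts.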

An alternative approach is to use the code above to split the JSON, but perform the document inserts in separate transactions using xdmp.spawn. This may seem appealing, but it comes with risks. These spawned tasks will be placed in the Task Server queue to be executed asynchronously. If MarkLogic went down (accidentally or due to a deliberate restart), any tasks remaining in the queue would be lost. Worse, it would be difficult to determine whether the tasks had finished, except by checking for the presence of the expected documents. If the same process is managed outside of MarkLogic, and MarkLogic went down, the external program could report that fact. (And if the external program went down, it would be clear that an error had occurred and that not all documents had been processed.) Perhaps the biggest advantage of working externally is that xdmp.spawn can only put tasks on the queue of the host that it’s running on; if operating as part of a cluster, it won’t be able to take advantage of that extra power. DMSDK makes it much simpler to spread work across a cluster.
