Chapter 12. Tokenization

MarkLogic has a default set of rules it uses to tokenize content; that is, to break a stream of text into words, punctuation, and symbols. This default tokenization works well for normal text, but in some cases we might wish to change it. By doing so, we can alter how content is represented in the indexes.

Tokenizing Social Security Numbers

Problem

You want to search across Social Security Numbers from different sources, which may have been recorded with or without dashes. In the United States, each citizen has a Social Security Number (SSN), which is used as a unique identifier when interacting with the federal government. These numbers take the form of NNN-NN-NNNN, where each N is a digit.

Solution

Applies to MarkLogic versions 7 and higher

We’ll solve this problem using custom tokenization.

To develop this recipe, I used documents that looked like these two:

<doc>
  <name>Alpha</name>
  <ssn>111-22-3333</ssn>
</doc>
<doc>
  <name>Alpha</name>
  <ssn>123456789</ssn>
</doc>

The first step is to create a field with paths that target the elements (or JSON properties) that hold the SSNs. A field may have more than one path, so add a path for each element that has an SSN.

xquery version "1.0-ml";

import module namespace admin =
  "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $db-id := xdmp:database("Documents")
let $field-name := "SSN"
let $paths := (
  "/doc/ssn"
)
return
  admin:save-configuration(
    admin:database-set-field-value-searches(
      admin:database-add-field(
        admin:get-configuration(),
        $db-id,
        admin:database-path-field(
          $field-name,
          admin:database-field-path($paths, 1.0)
        )
      ),
      $db-id,
      $field-name,
      fn:true()
    )
  )

The next step is to override the default tokenization of the field. We’ll have the tokenizer remove the “-” symbol.

xquery version "1.0-ml";
import module namespace admin =
  "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

admin:save-configuration(
  admin:database-add-field-tokenizer-override(
    admin:get-configuration(),
    xdmp:database("Documents"),
    "SSN",
    (admin:database-tokenizer-override("-", "remove"))
  )
)

With this in place, Social Security Numbers will be tokenized as one series of digits, rather than as three sets separated by dashes. This tokenization will apply to field query strings as well.

Discussion

When multiple sources represent the same value in different formats, the usual approach is the Envelope Pattern: choose one canonical format and add a copy of the value, in that format, to each document. For instance, we might create something like this:

<envelope>
 <canonical>
   <social-security>111223333</social-security>
 </canonical>
 <doc>
   <name>Alpha</name>
   <ssn>111-22-3333</ssn>
 </doc>
</envelope>

For each document that has some representation of a Social Security Number, we’d add a <social-security> element with the data in the no-dash format. This is a widespread technique and the preferred approach for most such problems.
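As a sketch of how such an envelope might be built (assuming the <doc> structure shown earlier; the local:envelope function name is illustrative, not part of any MarkLogic API):

xquery version "1.0-ml";

(: Illustrative sketch: wrap a <doc> in an envelope whose canonical
   element holds the SSN with the dashes stripped out. :)
declare function local:envelope($doc as element(doc)) as element(envelope)
{
  <envelope>
    <canonical>
      <social-security>{
        fn:replace(fn:string($doc/ssn), "-", "")
      }</social-security>
    </canonical>
    { $doc }
  </envelope>
};

local:envelope(
  <doc>
    <name>Alpha</name>
    <ssn>111-22-3333</ssn>
  </doc>
)

A transform like this would typically run at ingest time, so that every document carries the canonical element before it is indexed.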

In this case, however, there is a benefit to using custom tokenization: the query itself can use either the with-dash or no-dash representation. To see this, let’s run the following code in Query Console.

xdmp:describe(cts:tokenize("111-22-3333")),
xdmp:describe(cts:tokenize("111-22-3333", (), "SSN")),
xdmp:describe(cts:tokenize("111223333", (), "SSN"))

The first line tokenizes "111-22-3333" with the default rules; the second tokenizes the same string in the context of the "SSN" field. The third line tokenizes "111223333" in the context of the "SSN" field. The results show that within the "SSN" field, both query strings are tokenized identically. Under the default tokenization, the dashes are treated as punctuation separating three digit-based words.

(cts:word("111"), cts:punctuation("-"), cts:word("22"),
 cts:punctuation("-"), cts:word("3333")),
cts:word("111223333"),
cts:word("111223333")

This means that either of the following searches will find the target number, whether it appears in a document with dashes or without:

cts:search(
  fn:doc(),
  cts:field-value-query("SSN", "111223333")
)
cts:search(
  fn:doc(),
  cts:field-value-query("SSN", "111-22-3333")
)

Recall that MarkLogic's default tokenization rules break a stream of text into words, punctuation, and symbols. Those rules can be overridden, but only in the context of a field.

As we build the field, we want to be precise about what it covers. Note that the path used for the field is /doc/ssn. We could have used a more general path, such as //ssn, but the narrower path is better for performance: anything covered by the field must be evaluated against the alternative tokenization.
