MarkLogic has a default set of rules it uses to tokenize content; that is, to break a stream of text into words, punctuation, and symbols. This default tokenization works well for normal text, but in some cases we might wish to change it. By doing so, we can alter how content is represented in the indexes.
In the United States, each citizen has a Social Security Number (SSN), which is used as a unique identifier when interacting with the federal government. These numbers take the form NNN-NN-NNNN, where each N is a digit. You want to search across SSNs from different sources, some of which recorded the numbers with dashes and some without.
Applies to MarkLogic versions 7 and higher
We’ll solve this problem using custom tokenization.
To develop this recipe, I used documents that looked like these two:
<doc>
  <name>Alpha</name>
  <ssn>111-22-3333</ssn>
</doc>

<doc>
  <name>Alpha</name>
  <ssn>123456789</ssn>
</doc>
The first step is to create a field with paths that target the elements (or JSON properties) that hold the SSNs. A field may have more than one path, so add a path for each element that has an SSN.
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $db-id := xdmp:database("Documents")
let $field-name := "SSN"
let $paths := ("/doc/ssn")
return
  admin:save-configuration(
    admin:database-set-field-value-searches(
      admin:database-add-field(
        admin:get-configuration(),
        $db-id,
        admin:database-path-field(
          $field-name,
          admin:database-field-path($paths, 1.0)
        )
      ),
      $db-id,
      $field-name,
      fn:true()
    )
  )
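If SSNs live under more than one element, each path can be added to the $paths sequence and turned into its own field path. A minimal sketch, assuming a second path /doc/person/ssn-number that is purely hypothetical and should be replaced with a path from your own schema:

```
let $field-name := "SSN"
let $paths := ("/doc/ssn", "/doc/person/ssn-number") (: second path is hypothetical :)
return
  admin:database-path-field(
    $field-name,
    (: build one field-path element per path, all with equal weight :)
    for $path in $paths
    return admin:database-field-path($path, 1.0)
  )
```

The resulting element can be passed to admin:database-add-field exactly as in the code above.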
The next step is to override the default tokenization of the field. We’ll have the tokenizer remove the “-” symbol.
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

admin:save-configuration(
  admin:database-add-field-tokenizer-override(
    admin:get-configuration(),
    xdmp:database("Documents"),
    "SSN",
    (admin:database-tokenizer-override("-", "remove"))
  )
)
With this in place, Social Security Numbers will be tokenized as one series of digits, rather than as three sets separated by dashes. This tokenization will apply to field query strings as well.
When faced with multiple input formats that represent an element differently, the usual approach is the Envelope Pattern, in which we determine one format for the element and make a copy of it for each document, using the selected format. For instance, we might create something like this:
<envelope>
  <canonical>
    <social-security>111223333</social-security>
  </canonical>
  <doc>
    <name>Alpha</name>
    <ssn>111-22-3333</ssn>
  </doc>
</envelope>
For each document that has some representation of a Social Security Number, we’d add a <social-security> element with the data in the no-dash format. This is a widespread technique and the preferred approach for most such problems.
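A minimal sketch of an envelope transform, assuming the canonical form is produced simply by stripping hyphens; the function name local:envelope is hypothetical:

```
(: wrap a source document in an envelope with a canonical, no-dash SSN :)
declare function local:envelope($doc as element(doc)) as element(envelope)
{
  <envelope>
    <canonical>
      <social-security>{ fn:replace($doc/ssn, "-", "") }</social-security>
    </canonical>
    { $doc }
  </envelope>
};
```

In practice this would typically run during ingest, or across existing content with a tool such as CoRB or a Data Hub flow.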
In this case, however, there is a benefit to using custom tokenization: the query itself can use either the with-dash or no-dash representation. To see this, let’s run the following code in Query Console.
xdmp:describe(cts:tokenize("111-22-3333")),
xdmp:describe(cts:tokenize("111-22-3333", (), "SSN")),
xdmp:describe(cts:tokenize("111223333", (), "SSN"))
The first line tokenizes "111-22-3333" with the default rules; the second tokenizes the same string in the context of the "SSN" field. The third line tokenizes "111223333" in the context of the "SSN" field. The results show that within the "SSN" field, both query strings are viewed the same way. In the default tokenization, the dashes are treated as punctuation separating the digit-based words.
(cts:word("111"), cts:punctuation("-"), cts:word("22"), cts:punctuation("-"), cts:word("3333")),
cts:word("111223333"),
cts:word("111223333")
This means that either of the following searches will find the target number, whether it appears in a document with dashes or without:
cts:search(
  fn:doc(),
  cts:field-value-query("SSN", "111223333")
)

cts:search(
  fn:doc(),
  cts:field-value-query("SSN", "111-22-3333")
)
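Word queries against the field behave the same way, because the query string of a cts:field-word-query is tokenized using the field's overrides. A sketch, not verified against every MarkLogic version:

```
cts:search(
  fn:doc(),
  (: the dashed query string is tokenized as "111223333" within the field :)
  cts:field-word-query("SSN", "111-22-3333")
)
```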
As noted at the start, MarkLogic's default tokenization works well for normal text, but in some cases we want to change it. MarkLogic allows us to do that only in the context of a field.
As we build the field, we want to be precise about what the field covers. Note that the path used for the field is /doc/ssn. We could have used a more general path, like //ssn. This is a matter of performance: anything covered by the field will need to be evaluated for the alternative tokenization.