Normalizing a string and performing Unicode comparisons

We want to make a filename or URL based on an article title. To do this, we'll have to limit the size to an appropriate number of characters, strip out improper characters, and format the string in a consistent way. We also want the result to remain valid UTF-8.

How to do it…

Let's normalize a string by executing the following steps:

  1. Use std.uni.normalize to get Unicode characters into a consistent format.
  2. Use std.string.toLower to convert everything to lowercase for consistency.
  3. Use std.regex to strip out all but a small set of characters.
  4. Use std.string.squeeze to collapse consecutive whitespace.
  5. Use std.array.replace to change spaces into dashes.
  6. Use std.range.take to get the right number of characters, then convert the result back to string.

The code is as follows:

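void main() {
  import std.uni, std.string, std.conv, std.range, std.regex;
  import std.stdio;

  string title = "The D Programming Language: Easy Speed!";
  title = normalize(title);  // consistent Unicode form (NFC by default)
  title = title.toLower();   // lowercase for consistency
  // strip everything except a-z, 0-9, space, and dash
  title = std.regex.replaceAll(title, regex(`[^a-z0-9 \-]`), "");
  title = title.squeeze(" ")  // collapse consecutive spaces
    .replace(" ", "-")        // spaces become dashes
    .take(32).to!string;      // limit the length, then back to string
  writeln("The title is: ", title);
}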
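
The same steps can also be wrapped in a reusable function. The following is a minimal sketch rather than the recipe's own code: slugify and its maxLen parameter are hypothetical names, and the sketch swaps std.string.squeeze for a second regular expression, on the assumption that squeeze may not be available in recent Phobos releases.

```d
import std.array : replace;
import std.conv : to;
import std.range : take;
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.uni : normalize, toLower;

/// Hypothetical helper: turns a title into a slug of at most maxLen code points.
string slugify(string title, size_t maxLen = 32)
{
    title = normalize(title);                              // NFC by default
    title = title.toLower();                               // lowercase for consistency
    title = replaceAll(title, regex(`[^a-z0-9 \-]`), "");  // keep only a small safe set
    title = replaceAll(title, regex(` +`), " ");           // collapse runs of spaces
    return title.replace(" ", "-")                         // spaces become dashes
                .take(maxLen)                              // lazy, counts code points
                .to!string;                                // back to a string
}

void main()
{
    // prints: the-d-programming-language-easy-
    writeln(slugify("The D Programming Language: Easy Speed!"));
}
```

Note that the result ends with a dash because the cut at 32 characters falls right after a word boundary; whether to trim it is a design choice left out of this sketch.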

How it works…

D has excellent facilities to work with strings. Functions from all over Phobos are useful while performing string operations. The std.string module contains functions such as toLower, indexOf, strip, and other functions that are specific to string processing.
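
As a quick illustration of the std.string functions just mentioned, here is a small sketch. The tidy helper is a hypothetical name introduced only for this example.

```d
import std.stdio : writeln;
import std.string : indexOf, strip, toLower;

/// Hypothetical helper: trims surrounding whitespace and lowercases the rest.
string tidy(string s)
{
    return s.strip().toLower();
}

void main()
{
    string raw = "  Easy Speed!  ";
    writeln(tidy(raw));                 // easy speed!
    writeln(raw.strip().indexOf("Speed")); // 5
    writeln(raw.indexOf('?'));          // -1, because '?' does not occur
}
```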

D's strings are essentially arrays of immutable characters. This means array functions work too. The replace, replicate, join, and other functions from std.array work, as well as the built-in concatenation and append operators (a ~ b and a ~= b).
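
The following sketch shows strings being handled with plain array tools. The underscored helper is a hypothetical name used only for illustration.

```d
import std.array : join, replace, split;
import std.stdio : writeln;

/// Hypothetical helper: splits on spaces and rejoins with underscores.
string underscored(string s)
{
    return s.split(" ").join("_");
}

void main()
{
    string s = "easy speed";
    writeln(s.replace(" ", "-"));   // easy-speed
    writeln(underscored(s));        // easy_speed

    string greeting = "D: " ~ s;    // concatenation builds a new string
    greeting ~= "!";                // append in place
    writeln(greeting);              // D: easy speed!
    writeln(s[0 .. 4]);             // easy (ordinary array slicing)
}
```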

Where things might be surprising is if you use strings as ranges with std.algorithm. As strings are arrays, you might expect them to be full random access ranges, and thus usable in all the algorithms, including sort. However, this is not the case. Why? D and Phobos try to reach a happy compromise between Unicode strings that work correctly in all cases and top performance.

The D language itself works at a low level. A string is simply an array of immutable UTF-8 code units (bytes). D also offers wstring, for UTF-16 with 2 bytes per code unit, which allows one index to cover the characters of most written languages. UTF-16 is also the default Unicode string encoding in the Windows operating system. Finally, there is dstring, an array of immutable dchars, which are UTF-32 code units. In the current Unicode specification, a single dchar can hold any code point, which, after calling normalize on the string, brings us as close as we can get to one index being one character. This comes at the cost of up to four times the memory consumption of a UTF-8 string. D gives you the choice of using the string type appropriate for your own use case.
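
To make the size differences concrete, here is a small sketch comparing the three string types for the same text. The bytesUsed helper is a hypothetical name, not part of Phobos.

```d
import std.stdio : writeln;

/// Hypothetical helper: bytes occupied by the code units of any string type.
size_t bytesUsed(S)(S s)
{
    return s.length * typeof(s[0]).sizeof;
}

void main()
{
    // "héllo", with the accented letter written as an escape
    string  s = "h\u00E9llo";    // UTF-8
    wstring w = "h\u00E9llo"w;   // UTF-16
    dstring d = "h\u00E9llo"d;   // UTF-32

    // .length counts code units, not characters
    writeln(s.length, " ", w.length, " ", d.length);             // 6 5 5
    writeln(bytesUsed(s), " ", bytesUsed(w), " ", bytesUsed(d)); // 6 10 20
}
```

For mostly ASCII text like this, the dstring comes close to the four-fold memory cost, while text dominated by multibyte characters narrows the gap.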

However, if a string is an array of characters, why doesn't take return a string? First, range functions such as take return new types that do their calculations lazily. However, there is a function, std.array.array, that eagerly evaluates a range result, converting it from a lazy type back to an array type. If we try this here, we would find the result is dchar[] instead of string.
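
The types involved can be checked at compile time. This is a minimal sketch; the prefix helper is a hypothetical name introduced for the example.

```d
import std.array : array;
import std.conv : to;
import std.range : take;
import std.stdio : writeln;

/// Hypothetical helper: the first n code points of s, as a string again.
string prefix(string s, size_t n)
{
    auto lazyPrefix = s.take(n);   // lazy wrapper; nothing is copied yet

    // take does not hand back a string...
    static assert(!is(typeof(lazyPrefix) == string));
    // ...and evaluating it eagerly yields decoded code points.
    static assert(is(typeof(lazyPrefix.array) == dchar[]));

    return lazyPrefix.to!string;   // re-encode as UTF-8
}

void main()
{
    writeln(prefix("the-d-programming-language", 5)); // the-d
}
```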

This is because the Phobos library builds on top of the D language's flexibility to choose a generally sound trade-off between speed and correctness. It avoids expensive full normalization, which translates various compatible forms (for example, combining characters) into a single representation, though this is available upon request through the std.uni.normalize function. However, it does perform UTF decoding, yielding dchars. This insulates user code from the complexity of explicit multibyte code point decoding: each code point arrives as a single item, instead of the one to four code units it occupies as raw chars.
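
The difference between raw code units and decoded code points can be seen with two foreach loops. This is a small sketch; countCodeUnits and countCodePoints are hypothetical helper names.

```d
import std.range.primitives : ElementType;
import std.stdio : writeln;

// Phobos ranges see a string as a sequence of decoded dchar code points.
static assert(is(ElementType!string == dchar));

/// Hypothetical helper: iterates the raw UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n;
    return n;
}

/// Hypothetical helper: iterates decoded code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;   // the language decodes UTF-8 on the fly
    return n;
}

void main()
{
    string s = "\u00E9t\u00E9";    // "été"
    writeln(countCodeUnits(s));    // 5
    writeln(countCodePoints(s));   // 3
}
```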

The downside of this choice is that decoding takes computation. While the string is being decoded, random access is impossible because each code point occupies a variable number of UTF-8 code units. You can't jump ahead because you don't know how far to jump!
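
To see why jumping ahead is not possible, the following sketch walks a string and records how many code units each code point occupies. It uses std.utf.stride; the unitWidths helper is a hypothetical name.

```d
import std.stdio : writeln;
import std.utf : stride;

/// Hypothetical helper: the width, in UTF-8 code units, of each code point in s.
size_t[] unitWidths(string s)
{
    size_t[] widths;
    for (size_t i = 0; i < s.length; )
    {
        immutable n = stride(s, i);   // length of the sequence starting at i
        widths ~= n;
        i += n;                       // the next start is only known now
    }
    return widths;
}

void main()
{
    // 'a', e with acute accent, euro sign, and a code point outside the BMP
    writeln(unitWidths("a\u00E9\u20AC\U0001F600")); // [1, 2, 3, 4]
}
```

Finding the start of the n-th code point therefore means walking over all the ones before it.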

The std.algorithm.sort function doesn't work on strings without an additional step: either converting to dstring or casting to an array of bytes (std.string.representation is the idiomatic function for performing this cast). This is a good thing because sorting the raw code units of a string probably doesn't do what you want anyway. It would break apart multibyte characters, yielding an invalid string!
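
Both routes can be sketched as follows. The helper names sortCodePoints and sortBytes are hypothetical and exist only for this illustration.

```d
import std.algorithm.sorting : sort;
import std.conv : to;
import std.stdio : writefln, writeln;
import std.string : representation;

/// Hypothetical helper: sorts by code point via a mutable UTF-32 copy.
string sortCodePoints(string s)
{
    dchar[] points = s.to!dstring.dup;   // random access, one index per code point
    sort(points);
    return points.to!string;
}

/// Hypothetical helper: sorts the raw UTF-8 bytes of a mutable copy.
ubyte[] sortBytes(string s)
{
    ubyte[] bytes = s.dup.representation;
    sort(bytes);
    return bytes;
}

void main()
{
    string s = "d\u00E9ba";   // 'd', e with acute accent, 'b', 'a'

    // A narrow string is not a random access range, so this is rejected.
    static assert(!__traits(compiles, sort(s.dup)));

    writeln(sortCodePoints(s) == "abd\u00E9");   // true
    // The two bytes of the accented letter (C3 A9) end up swapped,
    // so the sorted bytes are no longer valid UTF-8.
    writefln("%(%02X %)", sortBytes(s));         // 61 62 64 A9 C3
}
```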

However, since many algorithms do not require random access, this is mostly a net win. Things that work efficiently also work correctly, and things that don't work correctly can be made to work. It's your choice whether you want to work on the bytes or convert to dstring.

Getting back to the example, once we perform the algorithm, we can simply convert the final result back to string with std.conv.to.
