
11. A Unicode Search Tool


Every so often, I have to identify or research some Unicode characters. There’s a tool called uni in the Perl 5 distribution App::Uni, developed by Audrey Tang and Ricardo Signes.

Let’s reimplement its basic functionality in a few lines of Raku code and use that as an occasion to talk about Unicode support in Raku.

If you give it one character on the command line, it prints out a description of that character:
$ uni њ
њ - U+0045a - CYRILLIC SMALL LETTER NJE

If you give it a longer string instead, it searches the list of Unicode character names and prints out the same information for each character whose name matches the search string:
[Figure: example terminal output of uni searching the Unicode character names for a string]

Each line corresponds to what Unicode calls a “code point,” which is usually a character on its own, but occasionally also something like U+00300 (COMBINING GRAVE ACCENT), which, combined with a (U+00061, LATIN SMALL LETTER A), makes the character à.

Raku offers a method uniname in both the classes Str and Int that produces the Unicode code point name for a given character, either in its direct character form or in the form of its code point number. With that, the first part of uni’s desired functionality looks like this:
#!/usr/bin/env raku
use v6;
sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}
multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}
Let’s look at it in action:
$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE

The chr method turns a code point number into the character, and ord is the reverse: it turns a character into its code point number.
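
A minimal sketch of these conversions, together with uniname on both types:
say 'ø'.ord;          # 248
say 0xF8.chr;         # ø
say 'ø'.uniname;      # LATIN SMALL LETTER O WITH STROKE
say 0xF8.uniname;     # LATIN SMALL LETTER O WITH STROKE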

The second part, searching in all Unicode character names, works by brute force enumerating all possible characters and searching through their uniname:
multi sub MAIN($search is copy) {
    $search.=uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}

Since all character names are in uppercase, the search term is first converted to uppercase with $search.=uc, which is short for $search = $search.uc. By default, parameters are read-only, which is why the declaration here uses is copy to make the parameter writable.
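
As a small aside, here is a sketch (not part of the uni tool) of how is copy behaves: the sub gets a writable copy, while the caller’s variable stays untouched:
sub shout($word is copy) {
    $word .= uc;          # modifies only the local copy
    return $word;
}
my $greeting = 'hello';
say shout($greeting);     # HELLO
say $greeting;            # hello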

Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:
multi sub MAIN($search is copy) {
    $search.=uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}
To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:
multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}
Str.ords returns the list of code points that make up the string. With this multi-candidate of sub MAIN in place, we can do something like
$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E
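
For reference, here is a quick sketch of what Str.ords returns on its own, both as decimal code point numbers and converted to hex:
say 'øre'.ords;              # (248 114 101)
say 'øre'.ords».base(16);    # (F8 72 65)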

11.1 Code Points, Grapheme Clusters, and Bytes

As alluded to above, not all code points are fully fledged characters on their own. Or put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls such a sequence of one base character and potentially several combining characters a grapheme cluster.

Strings in Raku are based on these grapheme clusters. If you get a list of characters in a string with $str.comb, or extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb, or similar operation:
my $s = "ø\c[COMBINING TILDE]";
say $s;         # Output: ø̃
say $s.chars;   # Output: 1
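
The same grapheme-based behavior shows up in the other string operations mentioned above; a small sketch:
my $str = "ye\c[COMBINING ACUTE ACCENT]s";
say $str;                # yés
say $str.chars;          # 3
say $str.comb;           # (y é s)
say $str.substr(0, 2);   # yé -- the accent stays attached to its base character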

The Uni type is akin to a string and represents a sequence of code points. It is useful in edge cases but doesn’t support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.
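
A minimal sketch of the difference between the grapheme level (Str) and the code point level (Uni), using the list method to inspect the individual code points:
my $str = "é";
say $str.chars;        # 1           -- a single grapheme
say $str.NFC.list;     # (233)       -- one composed code point, U+00E9
say $str.NFD.list;     # (101 769)   -- LATIN SMALL LETTER E + COMBINING ACUTE ACCENT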

Below the Uni level, you can also represent strings as bytes by choosing an encoding. If you want to get from the string to the byte level, call the encode method:
my $bytes = 'Raku'.encode('UTF-8');   # utf8:0x<52 61 6B 75>

UTF-8 is the default encoding and also the one Raku assumes when reading source files. The result is something that does the Blob role: you can access individual bytes with positional indexing, such as $bytes[0]. The decode method helps you convert a Blob back to a Str.

If you print out a Blob with say(), you get a string representation of the bytes in hexadecimal. Accessing individual bytes produces an integer and thus will typically be printed in decimal.
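
Here is a short sketch tying these pieces together; the byte values are simply the UTF-8 encoding of the string:
my $blob = 'Füße'.encode;     # UTF-8 is the default
say $blob;                    # utf8:0x<46 C3 BC C3 9F 65>
say $blob.elems;              # 6    -- ü and ß take two bytes each
say $blob[1];                 # 195  -- a single byte, as a decimal integer
say $blob.decode;             # Füße -- back to a Str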

If you want to print out the raw bytes of a blob, you can use the write method of an I/O handle:
$*OUT.write('Raku'.encode('UTF-8'));

11.2 Numbers

Number literals in Raku aren’t limited to the Arabic digits we are so used to in the English-speaking part of the world. All Unicode code points that have the Decimal_Number (short Nd) property are allowed, so you can, for example, use Eastern Arabic numerals or digits from many other scripts:
say ٤٢;             # 42
The same holds true for string-to-number conversions:
say "٤٢".Int;       # 42
For other numeric code points, you can use the unival method to obtain their numeric value:
say "c[TIBETAN DIGIT HALF ZERO]".unival;

which produces the output -0.5 and also illustrates how to use a code point by name inside a string literal, with the \c[...] escape.
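
A couple more unival examples (a quick sketch; the values come straight from the Unicode database):
say "\c[VULGAR FRACTION ONE QUARTER]".unival;   # 0.25
say "\c[DEVANAGARI DIGIT NINE]".unival;         # 9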

11.3 Other Unicode Properties

The uniprop method in type Str returns the general category by default:
say "ø".uniprop;                            # Ll
say "c[TIBETAN DIGIT HALF ZERO]".uniprop;  # No
The return value requires some Unicode knowledge to make sense of; Unicode’s Technical Report 44 has the gory details. Ll stands for Letter_Lowercase; No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop method (or uniprop-bool, if you’re only interested in a boolean result) for other properties as well:
say "a".uniprop-bool('ASCII_Hex_Digit');   # True
say "ü".uniprop-bool('Numeric_Type');      # False
say ".".uniprop("Word_Break");             # MidNumLet

11.4 Collation

Sorting strings starts to become complicated when you’re not limited to ASCII characters. Raku’s sort method uses the cmp infix operator, which does a pretty standard lexicographic comparison based on code point numbers.
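
A quick sketch of that code point based ordering:
say 'a' cmp 'b';     # Less
say 'z' cmp 'ä';     # Less -- ä is U+00E4 and sorts after all ASCII letters
say <a ä z>.sort;    # (a z ä)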

If you need to use a more sophisticated collation algorithm, Rakudo 2017.09 and newer offer the Unicode Collation Algorithm through the collate method:
my @list = <a ö ä Ä o ø>;
say @list.sort;                      # (a o Ä ä ö ø)
say @list.collate;                   # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate;                   # (a Ä ä o ö ø)

The default sort considers any character with diacritics to be larger than ASCII characters, because that’s how they appear in the code point list. On the other hand, collate knows that characters with diacritics belong directly after their base character, which is not perfect in every language but usually a good compromise.

For Latin-based scripts, the primary sorting criterion is alphabetical, the secondary is diacritics, and the tertiary is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it doesn’t force lowercase characters to come before uppercase characters anymore.
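
Assuming the secondary (diacritics) level can be switched off the same way (the $*COLLATION object exposes primary, secondary, tertiary, and quaternary settings), disabling it as well should leave only the alphabetical comparison; a speculative sketch:
# Assumption: with secondary and tertiary disabled, only the primary
# (alphabetical) level remains, so neither diacritics nor case matter.
$*COLLATION.set(:secondary(False), :tertiary(False));
say <a ö ä Ä o>.collate;   # the umlauts now sort together with their base letters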

At the time of writing, language-specific collation has not yet been implemented in Raku.

11.5 Summary

Raku takes languages other than English very seriously and goes to great lengths to facilitate working with them and the characters they use.

This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of the Unicode database through built-in methods.
