Every so often, I have to identify or research some Unicode characters. There’s a tool called uni in the Perl 5 distribution App::Uni, developed by Audrey Tang and Ricardo Signes.
Let’s reimplement its basic functionality in a few lines of Raku code and use that as an occasion to talk about Unicode support in Raku.
If you give it one character on the command line, it prints out a description of that character:
$ uni њ
њ - U+0045a - CYRILLIC SMALL LETTER NJE
If you give it a longer string instead, it searches the list of Unicode character names and prints out the same information for each character whose name matches the search string:
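For example, searching for NJE might produce output along these lines (the exact result depends on the installed Unicode data; the two character names shown are the real names of U+040A and U+045A):

```
$ uni NJE
Њ - U+0040a - CYRILLIC CAPITAL LETTER NJE
њ - U+0045a - CYRILLIC SMALL LETTER NJE
```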
Each line corresponds to what Unicode calls a “code point,” which is usually a character on its own but occasionally also something like U+0300, COMBINING GRAVE ACCENT, which, combined with U+0061, LATIN SMALL LETTER A, makes the character à.
Raku offers a uniname method in both the Str and Int classes that produces the Unicode code point name for a given character, either from the character itself or from its code point number. With that, the first part of uni’s desired functionality looks like this:
sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}
multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}
Let’s look at it in action:
$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
The chr method turns a code point number into the corresponding character, and ord is the reverse: it turns a character into its code point number.
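As a quick sketch of the two methods (the code point of њ is U+045A, or 1114 in decimal):

```
say 0x45A.chr;  # њ
say 'њ'.ord;    # 1114
```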
The second part, searching all Unicode character names, works by brute force: it enumerates all possible code points and searches through their uniname:
multi sub MAIN($search is copy) {
    $search .= uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}
Since all character names are in uppercase, the search term is first converted to uppercase with $search.=uc, which is short for $search = $search.uc. By default, parameters are read-only, which is why the declaration here uses is copy to give the sub a writable copy.
Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:
multi sub MAIN($search is copy) {
    $search .= uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}
To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:
multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}
Str.ords returns the list of code points that make up the string. With this multi candidate of sub MAIN in place, we can do something like this:
$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E
11.1 Code Points, Grapheme Clusters, and Bytes
As alluded to in the preceding, not all code points are fully fledged characters on their own. Put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls such a sequence of one base character and potentially several combining characters a grapheme cluster.
Strings in Raku are based on these grapheme clusters. If you get a list of characters in a string with $str.comb, extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb, or similar operation:
my $s = "ø\c[COMBINING TILDE]";
say $s;        # Output: ø̃
say $s.chars;  # Output: 1
The Uni type is akin to a string and represents a sequence of codepoints. It is useful in edge cases but doesn’t support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.
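To illustrate (a small sketch; the exact gist that say produces for a Uni value may vary slightly between Rakudo versions), the precomposed character ä is a single code point under NFC but decomposes into a base character and a combining diaeresis under NFD:

```
say "ä".NFC;  # NFC:0x<00e4>
say "ä".NFD;  # NFD:0x<0061 0308>
```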
Below the Uni level, you can also represent strings as bytes by choosing an encoding. If you want to get from the string to the byte level, call the encode method:
my $bytes = 'Raku'.encode('UTF-8'); # utf8:0x<52 61 6B 75>
UTF-8 is the default encoding and also the one Raku assumes when reading source files. The result is something that does the Blob role: you can access individual bytes with positional indexing, such as $bytes[0]. The decode method helps you convert a Blob to a Str.
If you print out a Blob with say(), you get a string representation of the bytes in hexadecimal. Accessing individual bytes produces an integer and thus will typically be printed in decimal.
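For instance (a minimal sketch; the exact gist of the blob may vary by Rakudo version):

```
my $bytes = 'ø'.encode('UTF-8');
say $bytes;                  # utf8:0x<C3 B8>
say $bytes[0];               # 195, the first byte as a decimal integer
say $bytes.decode('UTF-8');  # ø
```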
If you want to print out the raw bytes of a blob, you can use the write method of an I/O handle:
$*OUT.write('Raku'.encode('UTF-8'));
11.2 Numbers
Number literals in Raku aren’t limited to the Arabic digits we are so used to in the English-speaking part of the world. All Unicode code points that have the Decimal_Number (short Nd) property are allowed, so you can, for example, use Eastern Arabic numerals or digits from many other scripts.
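A short sketch of what that allows (the digits ٤٢ are Eastern Arabic and ४२ are Devanagari, both spelling 42; this example is an illustration, not from the original text):

```
say ٤٢;        # 42, written with Eastern Arabic digits
say ४२;        # 42, written with Devanagari digits
say "٤٢" + 0;  # 42, string-to-number conversion works too
```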
The same holds true for string-to-number conversions. For other numeric code points, you can use the unival method to obtain their numeric value:
say "\c[TIBETAN DIGIT HALF ZERO]".unival;
which produces the output -0.5 and also illustrates how to use a code point by name inside a string literal.
11.3 Other Unicode Properties
The uniprop method in type Str returns the general category by default:
say "ø".uniprop;                           # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop; # No
Making sense of the return value requires some Unicode knowledge; Unicode’s Technical Report 44 has the gory details.
Ll stands for Letter_Lowercase; No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop method (or uniprop-bool, if you’re only interested in a boolean result) for other properties as well:
say "a".uniprop-bool('ASCII_Hex_Digit'); # True
say "ü".uniprop-bool('Numeric_Type'); # False
say ".".uniprop("Word_Break"); # MidNumLet
11.4 Collation
Sorting strings starts to become complicated when you’re not limited to ASCII characters. Raku’s sort method uses the cmp infix operator, which does a pretty standard lexicographic comparison based on the codepoint number.
If you need a more sophisticated collation algorithm, Rakudo 2017.09 and newer offer the Unicode Collation Algorithm through the collate method:
my @list = <a ö ä Ä o ø>;
say @list.sort; # (a o Ä ä ö ø)
say @list.collate; # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate; # (a Ä ä o ö ø)
The default sort considers any character with diacritics to be larger than ASCII characters, because that’s how they appear in the code point list. On the other hand, collate knows that characters with diacritics belong directly after their base character, which is not perfect in every language but usually a good compromise.
For Latin-based scripts, the primary sorting criterion is alphabetical, the secondary is diacritics, and the tertiary is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it no longer forces lowercase characters to come before uppercase ones.
At the time of writing, language-specific collation has not yet been implemented in Raku.
11.5 Summary
Raku takes languages other than English very seriously and goes to great lengths to facilitate working with them and the characters they use.
This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of the Unicode database through built-in methods.