Unicode provides a unique number for every character, regardless of the computing platform, program, or programming language. This matters because without a standard such as Unicode, computers would continue to use different character encodings, many of which conflict when used together.
Unicode support was introduced to Perl with Perl 5.6. Although it still does not conform completely to the Unicode specification, Unicode support has matured significantly under Perl 5.8. You can now use Unicode reliably with file I/O and with regular expressions. With regular expressions, the pattern adapts to the data and automatically switches to the correct Unicode character scheme.
Perl’s Unicode implementation falls into the following categories:
There is currently no way in Perl to mark data that’s read from or written to a file as being of type Unicode (utf8). Future versions of Perl will support such a feature.
Whether a pattern matches Unicode characters is determined when the pattern is compiled, based on whether the pattern itself contains Unicode characters, and not when matching happens at runtime. This will be changed so that the decision is made at runtime.
use utf8
The utf8 module is still needed to enable a few Unicode features. The utf8 pragma, as implemented by the utf8 module, implements tables used for Unicode support. You must load the utf8 pragma explicitly to enable recognition of UTF-8 encoded literals and identifiers in the source text.
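As a rough illustration of what the pragma changes, the following sketch compares the length of the same literal with and without it. It assumes the script file itself is saved as UTF-8; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;    # tell Perl that this source file is UTF-8 encoded

# With the pragma in effect, the literal is read as four characters.
my $with_pragma = length "café";

my $without_pragma;
{
    no utf8;    # inside this block, literals are read as raw bytes
    # The same literal is now five bytes, because "é" takes two bytes in UTF-8.
    $without_pragma = length "café";
}

print "with use utf8: $with_pragma characters\n";
print "with no utf8:  $without_pragma bytes\n";
```

The pragma only affects how Perl reads the source text; it does not change how data read from files or other input is treated.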
As of 5.6.0, Perl uses logically wide characters to represent strings internally. This internal representation uses the UTF-8 encoding. Future versions of Perl will work with characters rather than bytes. This was a purposeful decision made so Perl 5.6 could transition from byte semantics to character semantics in programs. Perl will make the decision to switch to character semantics if it finds that the input data has characters on which it can safely operate with UTF-8. You can disable character semantics by using the bytes pragma, as explained in Chapter 8. Character semantics have the following effects:
Strings and patterns may contain characters that have an ordinal value larger than 255.
Identifiers within a Perl program may contain Unicode alphanumeric characters.
Regular expressions match characters and not bytes.
Character classes in regular expressions match characters and not bytes.
Named Unicode properties and block ranges may be used as character classes with the \p and \P constructs.
\X matches any extended Unicode sequence.
tr// matches characters instead of bytes.
Case translation operators use the Unicode case translation tables when provided character input.
Most operators that deal with positions or lengths in a string switch to using character positions.
pack( ) and unpack( ) do not change.
Bit operators work on characters.
scalar reverse( ) reverses characters and not bytes.
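Several of the effects listed above can be observed in a short script. This is a minimal sketch rather than a complete demonstration: the strings are built with \x{...} escapes so the file stays plain ASCII, the variable names are hypothetical, and the behavior shown assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Three characters with ordinal values above 255: Greek alpha, beta, gamma.
my $str = "\x{3b1}\x{3b2}\x{3b3}";

# Lengths and positions count characters, not bytes.
my $char_len = length $str;
my $second   = ord substr($str, 1, 1);

# The bytes pragma switches back to byte semantics within its scope.
my $byte_len = do { use bytes; length $str };

# scalar reverse( ) reverses characters.
my $reversed = scalar reverse $str;

# Case translation uses the Unicode case tables.
my $upper = uc $str;

# \p matches a named property; \P matches its complement.
my $all_greek = $str =~ /^\p{Greek}+$/ ? 1 : 0;
my $no_latin  = $str =~ /^\P{Latin}+$/ ? 1 : 0;

# \X treats a base character plus its combining marks as one unit.
my $combined = "e\x{301}x";    # "e", combining acute accent, "x"
my @clusters = $combined =~ /(\X)/g;

# tr// translates whole characters: alpha becomes omega.
(my $swapped = $str) =~ tr/\x{3b1}/\x{3c9}/;

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
printf "second character: U+%04X\n", $second;
printf "first after reverse: U+%04X\n", ord $reversed;
printf "first after uc: U+%04X\n", ord $upper;
printf "first after tr: U+%04X\n", ord $swapped;
printf "all Greek: %d, no Latin: %d, clusters: %d\n",
    $all_greek, $no_latin, scalar @clusters;
```

Only code points and counts are printed, which avoids wide-character warnings on an output handle that has not been set up for UTF-8.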