Unicode provides a unique number for every character, regardless of the computing platform, program, or programming language. This matters because without a standard such as Unicode, computers would continue to use different encoding schemes for characters, many of which would conflict if they were used together.
Unicode support was introduced to Perl with Perl 5.6. Although Perl still does not fully conform to the Unicode specification, its Unicode support matured significantly in Perl 5.8. You can now use Unicode reliably with file I/O and with regular expressions. With regular expressions, the pattern adapts to the data and automatically switches to the Unicode character scheme when it is presented with Unicode data.
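The following is a minimal sketch of Unicode file I/O combined with a regular expression, assuming Perl 5.8 or later. The file name unicode_demo.txt and the sample text are hypothetical and chosen only for illustration. The :encoding(UTF-8) layer is one way to tell Perl how the file is encoded.

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this illustration.
my $file = "unicode_demo.txt";

# Write a string containing characters above 255 through a UTF-8 layer.
open(my $out, ">:encoding(UTF-8)", $file) or die "Cannot write $file: $!";
print $out "\x{263A} caf\x{E9} \x{3B1}\x{3B2}\x{3B3}\n";
close($out);

# Read it back through the same layer so Perl sees characters, not bytes.
open(my $in, "<:encoding(UTF-8)", $file) or die "Cannot read $file: $!";
my $line = <$in>;
close($in);
chomp $line;

# length( ) counts characters: U+263A is a single character here,
# even though it occupies three bytes in the file.
my $chars = length $line;

# The pattern matches on characters; \p{Greek} picks out the Greek letters.
my ($greek) = $line =~ /(\p{Greek}+)/;

# Remove the temporary file.
unlink $file;
```

Without the encoding layer, the same read would return the raw bytes of the file, and both the length and the match would operate on bytes rather than characters.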
Perl's Unicode implementation has the following features:
Strings and patterns may contain characters that have an ordinal value larger than 255.
Identifiers within a Perl program may contain Unicode alphanumeric characters.
Regular expressions match characters and not bytes.
Character classes in regular expressions match characters and not bytes.
Named Unicode properties and block ranges may be used as character classes with the \p and \P constructs.
\X matches any extended Unicode sequence: a base character followed by any combining mark characters that apply to it.
tr// matches characters instead of bytes.
Case translation operators use the Unicode case translation tables when provided character input.
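Most operators that deal with positions or lengths in a string, such as length( ), substr( ), index( ), and pos( ), switch to using character positions.
The pack( ) and unpack( ) templates c and C do not change, since they are often used for byte-oriented formats. The U template converts between Unicode characters and code points.
The bit string operators (&, |, ^, and ~) can operate on character data.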
scalar reverse( ) reverses characters and not bytes.
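Several of the behaviors listed above can be sketched in a few lines, assuming Perl 5.8 or later. The variable names and sample strings are hypothetical; the example uses three Greek letters because their code points are all above 255.

```perl
use strict;
use warnings;

# A string of characters above 255: Greek small alpha, beta, gamma.
my $greek = "\x{3B1}\x{3B2}\x{3B3}";

# length( ), substr( ), and index( ) work in character positions.
my $len = length $greek;                  # number of characters
my $mid = substr($greek, 1, 1);           # second character (beta)
my $pos = index($greek, "\x{3B3}");       # character offset of gamma

# Case translation uses the Unicode case tables.
my $upper = uc $greek;                    # capital Alpha, Beta, Gamma

# Named properties with \p match whole characters (Ll = lowercase letter).
my $all_lower = ($greek =~ /^\p{Ll}+$/) ? 1 : 0;

# \X matches a base character together with its combining marks.
my $combined = "e\x{301}";                # "e" plus COMBINING ACUTE ACCENT
my @clusters = $combined =~ /(\X)/g;

# tr/// translates characters, including ranges above 255.
(my $swapped = $greek) =~ tr/\x{3B1}-\x{3B3}/\x{391}-\x{393}/;

# scalar reverse( ) reverses characters, not bytes.
my $reversed = scalar reverse $greek;

# The U template of pack( ) builds a character from a code point.
my $smiley = pack("U", 0x263A);
```

Had these operators worked on bytes, the length of the three-letter string would have depended on its encoding, and reversing it would have scrambled the bytes of each multibyte character.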
Copyright © 2002 O'Reilly & Associates. All rights reserved.