Recipe 1.10 Treating a Unicode String as Octets
1.10.1 Problem
You have a Unicode string but want Perl to treat it as octets (e.g., to
calculate its length in bytes or for purposes of I/O).
1.10.2 Solution
The use bytes pragma makes all Perl operations in its lexical scope treat
the string as a group of octets. Use it when your code is calling Perl's
character-aware functions directly:
$ff = "\x{FB00}"; # ff ligature
$chars = length($ff); # length is one character
{
    use bytes; # force byte semantics
    $octets = length($ff); # length is three octets (EF AC 80)
}
$chars = length($ff); # back to character semantics
Alternatively, the Encode module lets you convert a Unicode string to a
string of octets, and back again. Use it when the character-aware code
isn't in your lexical scope:
use Encode qw(encode_utf8);
sub somefunc; # defined elsewhere
$ff = "\x{FB00}"; # ff ligature
$ff_oct = encode_utf8($ff); # convert to octets
$chars = somefunc($ff); # work with character string
$octets = somefunc($ff_oct); # work with octet string
1.10.3 Discussion
As explained in this chapter's Introduction, Perl knows about two
types of string: those made of simple uninterpreted octets, and those
made of Unicode characters whose UTF-8 representation may require
more than one octet. Each individual string has a flag associated
with it, identifying the string as either UTF-8 or octets. Perl's I/O
and string operations (such as length) check this
flag and give character or octet semantics accordingly.
Sometimes you need to work with bytes and not characters. For
example, many protocols have a Content-Length
header that specifies the size of the body of a message in octets.
You can't simply use Perl's length function to
calculate the size, because if the string you're calling
length on is marked as UTF-8, you'll get the size
in characters.
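As a minimal sketch of the Content-Length case, the following builds a header from an octet-encoded copy of a message body. The body text and the variable names ($body, $payload, $header) are hypothetical and chosen only for illustration; encode_utf8 comes from the standard Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical message body: "price: " is 7 characters, the EURO SIGN
# (U+20AC) is one character but three UTF-8 octets, and "5" is one more.
my $body = "price: \x{20AC}5";

# length on the character string counts characters, which would be the
# wrong value for a header that must report octets.
my $char_count = length($body);

# Encode first, then measure: the octet copy has one element per byte.
my $payload = encode_utf8($body);
my $header  = "Content-Length: " . length($payload);
```

Measuring the encoded copy also means the number in the header matches the bytes you would actually write to the socket, provided you send $payload rather than $body.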
The use bytes pragma makes all Perl functions in
its lexical scope use octet semantics for strings instead of
character semantics. Under this pragma, length
always returns the number of octets, and read
always reports the number of octets read. However, because the
use bytes pragma is lexically
scoped, you can't use it to change the behavior of code in another
scope (e.g., someone else's function).
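The scoping limit can be seen in a short sketch. Here count_units is a hypothetical helper standing in for someone else's function; it is compiled outside the use bytes block, so the pragma should not reach it.

```perl
use strict;
use warnings;

# Hypothetical helper, compiled outside any "use bytes" scope.
sub count_units { return length($_[0]) }

my $ff = "\x{FB00}";    # ff ligature: one character, three UTF-8 octets

my ($direct, $via_sub);
{
    use bytes;
    $direct  = length($ff);         # compiled inside the block: octet semantics
    $via_sub = count_units($ff);    # length was compiled outside: character semantics
}
```

In this sketch $direct counts octets while $via_sub still counts characters, even though both calls are made from inside the block.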
To get octet semantics in code outside your scope, you need to create an
octet-encoded copy of the UTF-8 string, for example with encode_utf8. In
memory, of course, the same byte sequence is used for both strings. The
difference is that the copy of your UTF-8 string has the UTF-8 flag
cleared. Functions acting on the octet copy will give octet semantics,
regardless of the scope they're in.
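A small sketch of the octet copy, reusing the ff ligature from the Solution. The built-in utf8::is_utf8 function is used here only to peek at the internal flag for illustration; ordinary code rarely needs to inspect it.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $ff     = "\x{FB00}";          # character string
my $ff_oct = encode_utf8($ff);    # octet copy of the same data

# Inspect the internal flag on each string (normalized to 1 or 0).
my $flag_chars  = utf8::is_utf8($ff)     ? 1 : 0;
my $flag_octets = utf8::is_utf8($ff_oct) ? 1 : 0;

# Dump the octet copy as hex to show the underlying byte sequence.
my $hex = unpack("H*", $ff_oct);

# The same length call gives different answers for the two strings,
# with no pragma in effect.
my $len_chars  = length($ff);
my $len_octets = length($ff_oct);
```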
There is also a no bytes pragma, which restores character semantics
inside a use bytes scope, and a decode_utf8 function, which turns
octet-encoded strings into UTF-8 encoded strings. However, these are
less useful because not all octet strings are valid UTF-8 strings,
whereas all UTF-8 strings are valid octet strings.
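The asymmetry can be sketched with Encode's general decode function. This example uses decode('UTF-8', ...) with the FB_CROAK check rather than decode_utf8, as an assumption about how you might want failures reported: with no check argument, Encode's decoders substitute a replacement character for malformed input instead of dying. The byte strings are hypothetical test data.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8 FB_CROAK LEAVE_SRC);

my $valid   = encode_utf8("\x{FB00}");   # well-formed UTF-8: EF AC 80
my $invalid = "\xFF\xFE";                # these octets never occur in UTF-8

# Decoding well-formed octets yields the original character string.
# LEAVE_SRC asks Encode not to modify the source string while checking.
my $chars = decode('UTF-8', $valid, FB_CROAK | LEAVE_SRC);

# Decoding arbitrary octets can fail; FB_CROAK turns that into an
# exception, which eval traps here.
my $ok  = eval { decode('UTF-8', $invalid, FB_CROAK | LEAVE_SRC); 1 };
my $err = $@;
```

After this runs, $chars holds the single ligature character, while $ok is false and $err carries Encode's error message for the malformed input.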
1.10.4 See Also
The documentation for the bytes pragma; the
documentation for the standard Encode module