Recipe 1.10 Treating a Unicode String as Octets

1.10.1 Problem

You have a Unicode string but want Perl to treat it as octets (e.g., to calculate its length or for purposes of I/O).

1.10.2 Solution

The use bytes pragma makes all Perl operations in its lexical scope treat the string as a group of octets. Use it when your code is calling Perl's character-aware functions directly:

$ff = "\x{FB00}";             # ff ligature
$chars = length($ff);         # length is one character
{
  use bytes;                  # force byte semantics
  $octets = length($ff);      # length is two octets
}
$chars = length($ff);         # back to character semantics

Alternatively, the Encode module lets you convert a Unicode string to a string of octets, and back again. Use it when the character-aware code isn't in your lexical scope:

use Encode qw(encode_utf8);

sub somefunc;                 # defined elsewhere

$ff = "\x{FB00}";             # ff ligature
$ff_oct = encode_utf8($ff);   # convert to octets

$chars = somefunc($ff);       # work with character string
$octets = somefunc($ff_oct);  # work with octet string

1.10.3 Discussion

As explained in this chapter's Introduction, Perl knows about two types of string: those made of simple uninterpreted octets, and those made of Unicode characters whose UTF-8 representation may require more than one octet. Each individual string has a flag associated with it, identifying the string as either UTF-8 or octets. Perl's I/O and string operations (such as length) check this flag and give character or octet semantics accordingly.

Sometimes you need to work with bytes and not characters. For example, many protocols have a Content-Length header that specifies the size of the body of a message in octets. You can't simply use Perl's length function to calculate the size, because if the string you're calling length on is marked as UTF-8, you'll get the size in characters.

The use bytes pragma makes all Perl functions in its lexical scope use octet semantics for strings instead of character semantics. Under this pragma, length always returns the number of octets, and read always reports the number of octets read. However, because the use bytes pragma is lexically scoped, you can't use it to change the behavior of code in another scope (e.g., someone else's function).

For this you need to create an octet-encoded copy of the UTF-8 string. In memory, of course, the same byte sequence is used for both strings. The difference is that the copy of your UTF-8 string has the UTF-8 flag cleared. Functions acting on the octet copy will give octet semantics, regardless of the scope they're in.

There is also a no bytes pragma, which forces character semantics, and a decode_utf8 function, which turns octet-encoded strings into UTF-8 encoded strings. However, these functions are less useful because not all octet strings are valid UTF-8 strings, whereas all UTF-8 strings are valid octet strings.

1.10.4 See Also

The documentation for the bytes pragma; the documentation for the standard Encode module

[ Team LiB ]