
Conversion Modules

Perl provides a variety of modules that you can use to convert from one data format to another. In Table D-7 we list what we think are the most useful conversion modules available from CPAN. All of them should also be available via ActivePerl's PPM, except possibly Convert::Recode, which requires the use of the GNU recode program; we'll describe that one shortly.

Table D-7. Perl conversion modules

CPAN module        Description / CPAN address

Convert::EBCDIC    Written by Chris Leach; converts between EBCDIC and ASCII format.
                   http://www.cpan.org/authors/id/CXL

Convert::Recode    Written by Ed Avis, built upon work from Gisle Aas; creates a Perl front end to the GNU recode library (described in the next section).
                   http://www.cpan.org/authors/id/E/ED/EDAVIS

Convert::SciEng    Written by Colin Kuskie; converts numbers with scientific- and engineering-style suffixes.
                   http://www.cpan.org/authors/id/COLINK

Convert::Translit  Written by Genji Schmeder; converts between 8-bit character sets.
                   http://www.cpan.org/authors/id/GENJISCH

Convert::Units     Written by Robert Rothenberg; converts unit measurements, such as meters, to other units, such as inches.
                   http://www.cpan.org/authors/id/R/RR/RRWO

Convert::UU        Written by Andreas J. König; used for uuencode and uudecode work.
                   http://www.cpan.org/authors/id/ANDK

Convert::Recode and GNU recode

The Convert::Recode module provides a front end to the GNU recode library, which is a powerhouse of conversion operations. You can download this library, which was written by François Pinard, from:

http://www.gnu.org/software/recode/recode.html
ftp://ftp.gnu.org/gnu/recode

The recode library converts between more than 300 different character sets, depending on what your operating system supports. Once you've installed recode, the following command lists the sets you have access to:

$ recode -l

On SuSE 7.3 Linux, we had 281 character sets, from arabic7 to MacGreek.

You can install this library as follows:

  1. Once you have the tarball downloaded, unpack as follows:

    $ gzip -d recode-3.6.tar.gz
    $ tar xvf recode-3.6.tar
    $ cd recode-3.6
  2. Before configuring and building, take a look at the README and INSTALL files:

    $ vi README INSTALL
    $ ./configure
    $ make
  3. Where a Perl module would use make test, recode uses make check:

    $ make check
    ...
    ============================
    All 95 tests were successful
    ============================
    ...
  4. Now we can install:

    $ make install
  5. Once the recode install completes, we're ready to install Perl's Convert::Recode module, which is a standard Perl install.

Convert::Recode is unusual in that you name the conversion functions yourself. Identify the two character sets you wish to convert between, such as ascii and ebcdic, and decide the conversion direction. Then join the two character set names with a _to_ string and import the resulting function name from Convert::Recode. For example:

use Convert::Recode qw(ascii_to_ebcdic);

or:

use Convert::Recode qw(ebcdic_to_ascii);

We've created two short programs: recodeAscEbc.pl in Example D-3, and recodeEbcAsc.pl in Example D-4. We're going to use these to:

  • Convert feedRecode.txt into an EBCDIC equivalent, ebcdicRecode.txt

  • Then re-convert this back into an ASCII file called outRecode.txt

Example D-3. ASCII to EBCDIC: recodeAscEbc.pl
#!perl -w
  
use Convert::Recode qw(ascii_to_ebcdic);
  
while (<>) {
   print ascii_to_ebcdic($_);
}

Example D-4. EBCDIC to ASCII: recodeEbcAsc.pl
#!perl -w
  
use Convert::Recode qw(ebcdic_to_ascii);
  
while (<>) {
   print ebcdic_to_ascii($_);
}

The original feedRecode.txt file looks like this:

To sit in solemn silence, 
In a dull dank dock,
In a pestilential prison, 
With a life long lock,
Awaiting the sensation of a short sharp shock,
From a cheap and chippy chopper,
On a big black block

The execution run, which converts this file from ASCII into EBCDIC and then back again, looks like this:

$ perl recodeAscEbc.pl feedRecode.txt > ebcdicRecode.txt
$ perl recodeEbcAsc.pl ebcdicRecode.txt > outRecode.txt

This conversion run is displayed in the ASCII-based vi editor in Figure D-4.

Figure D-4. Convert::Recode at work
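
One way to confirm that the round trip lost nothing is to compare the final file with the original. The following is a minimal sketch using the core File::Compare module. The helper name round_trip_ok is our own invention for illustration and is not part of Convert::Recode; the filenames in the usage line are the ones from the run above.

```perl
#!perl -w

use strict;
use File::Compare qw(compare);

# Hypothetical helper: true if the two files are byte-for-byte identical.
# File::Compare::compare returns 0 (same), 1 (different), or -1 (error).
sub round_trip_ok {
   my ($original, $converted_back) = @_;

   my $result = compare($original, $converted_back);
   die "Cannot compare $original and $converted_back: $!\n" if $result < 0;

   return $result == 0;
}

# Only run the check when two filenames are supplied on the command line.
if (@ARGV == 2) {
   print round_trip_ok(@ARGV) ? "Round trip is lossless\n"
                              : "Files differ after the round trip\n";
}
```

Saved under an assumed name such as checkRoundTrip.pl, it would be run as perl checkRoundTrip.pl feedRecode.txt outRecode.txt. A byte-for-byte match is a reasonable expectation for plain text like this sample, but it is not guaranteed for every character in every pair of character sets.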

Text Conversion Modules

Perl comes with a number of text-based conversion modules bundled into it. These are listed in Table D-8.

Table D-8. Perl-bundled text processing modules

Module            Description

Text::Abbrev      Written by the Perl 5 porters; given a list of strings, abbrev returns a hash that maps every unambiguous abbreviation to its original string (see Example D-5).

Text::ParseWords  Written by Hal Pomeranz; parses text into token arrays or arrays of arrays (see Example D-6).

Text::Soundex     Written by Mike Stok; a Perl implementation of the Soundex algorithm as described by Donald Knuth (see Example D-7).

Text::Tabs        Written by David Muir Sharnoff; does what the Unix utilities expand and unexpand do. Given a line with tabs, expand replaces each tab with the spaces needed to reach the next tab stop. The unexpand function puts tabs back into a line wherever doing so saves bytes.

Text::Wrap        Written by David Muir Sharnoff; this line wrapper forms simple paragraphs from munged lines.

Let's take a look at some of these modules in action.
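
Text::Tabs and Text::Wrap are the two modules in Table D-8 without a numbered example, so here is a minimal sketch of both. The helper names detab, entab, and reflow are hypothetical wrappers of our own. The module functions they call (expand, unexpand, and wrap) and the package variables $Text::Tabs::tabstop and $Text::Wrap::columns come from the modules themselves. We set the variables by their full package names because local applied to an imported alias may not reach the module's own copy.

```perl
#!perl -w

use strict;
use Text::Tabs qw(expand unexpand);
use Text::Wrap qw(wrap);

# Hypothetical helper: replace tabs with spaces, with a tab stop
# every $width columns.
sub detab {
   my ($width, @lines) = @_;
   local $Text::Tabs::tabstop = $width;
   return expand(@lines);
}

# Hypothetical helper: put tabs back where they save bytes.
sub entab {
   my ($width, @lines) = @_;
   local $Text::Tabs::tabstop = $width;
   return unexpand(@lines);
}

# Hypothetical helper: wrap text so that no line reaches $width columns.
# The two empty strings are the first-line and later-line indents.
sub reflow {
   my ($width, $text) = @_;
   local $Text::Wrap::columns = $width;
   return wrap('', '', $text);
}

my ($spaced) = detab(8, "Ser\tS");     # tab becomes five spaces
my ($tabbed) = entab(8, $spaced);      # and back to a single tab

print reflow(30, "Awaiting the sensation of a short sharp shock, " .
                 "from a cheap and chippy chopper, on a big black block"),
      "\n";
```

Localizing the settings inside each helper keeps a change of tab stop or column width from leaking into other code that uses the same modules.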

Text::Abbrev

Example D-5 takes a list of amino acids, builds an abbreviation hash with abbrev, and then iterates over it to build a second hash that maps each amino acid to its shortest unambiguous abbreviation. It then prints that second hash in sorted order.

Example D-5. Text list abbreviations: textAbbrev.pl
#!perl -w
  
use strict;
use Text::Abbrev('abbrev');
  
# The Stuff of Life
  
my %h1 = abbrev qw(Alanine Cysteine Aspartic_Acid Glutamic_Acid
            Phenylalanine Glycine Histidine Isoleucine Lysine
            Leucine Methionine Asparagine Proline Glutamine
            Arginine Serine Threonine Valine Tryptophan Tyrosine);
my %h2;
  
for my $abb_key (keys %h1) {
  
   # Iterate through the hash, producing all keys and values.
   # Build up a 2nd hash, with the smallest possible abbreviations.
  
   # Have we started filling the 2nd hash yet, with reversed data?
  
   if (defined ($h2{ $h1{$abb_key} } )){
  
      # Yes, we already have an abbreviation.  Is the current one 
      # longer than the new one?  If so, replace it.
  
      if (length($h2{ $h1{$abb_key} } ) > length($abb_key)){
  
         # This abbreviation is shorter, so we replace.
  
         $h2{ $h1{$abb_key} } = $abb_key;
      }
  
   } else {
  
      # Provide our first value, for hash 2. Reverse the sense 
      # of the hash. The value becomes key, the key becomes the value.
  
      $h2{ $h1{$abb_key} } = $abb_key;
   }
}
  
# Now we've built up our reduced hash, print it out.
  
for my $min_key (sort keys %h2) {
   printf("%15s : %15s\n", $min_key, $h2{$min_key});
}

The results are as follows:

$ perl textAbbrev.pl
        Alanine :              Al
       Arginine :              Ar
     Asparagine :          Aspara
  Aspartic_Acid :          Aspart
       Cysteine :               C
  Glutamic_Acid :        Glutamic
      Glutamine :        Glutamin
        Glycine :             Gly
      Histidine :               H
     Isoleucine :               I
        Leucine :              Le
         Lysine :              Ly
     Methionine :               M
  Phenylalanine :              Ph
        Proline :              Pr
         Serine :               S
      Threonine :              Th
     Tryptophan :              Tr
       Tyrosine :              Ty
         Valine :               V
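
The same reduction can be written more compactly. This is a sketch rather than a replacement for Example D-5. The helper name shortest_abbrevs is hypothetical, and the code relies only on abbrev returning a hash that maps each unambiguous abbreviation to its full word.

```perl
#!perl -w

use strict;
use Text::Abbrev qw(abbrev);

# Hypothetical helper: map each word to its shortest unambiguous
# abbreviation within the supplied list.
sub shortest_abbrevs {
   my @words = @_;

   my %word_for = abbrev @words;    # abbreviation => full word
   my %shortest;                    # full word    => shortest abbreviation

   for my $abbr (keys %word_for) {
      my $word = $word_for{$abbr};

      # Keep this abbreviation if it is the first one seen for the
      # word, or shorter than the one we already hold.
      if (!exists $shortest{$word}
          || length($abbr) < length($shortest{$word})) {
         $shortest{$word} = $abbr;
      }
   }
   return %shortest;
}

my %short = shortest_abbrevs(qw(Leucine Lysine Valine));

for my $word (sort keys %short) {
   printf("%15s : %15s\n", $word, $short{$word});
}
```

Each word has exactly one abbreviation of any given length, so the order in which keys returns the abbreviations does not affect the result.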

Text::ParseWords

This time, in Example D-6, we'll split a list of words into separate elements via a regular expression splitting on white space. You may sometimes want to include spaces inside the strings, and we can do this with either quote characters or backslash escapes. We'll then create a tagged list of values, in XML format, to send them further down a potential munge chain.

Example D-6. Text list parsing: textParseWords.pl
#!perl -w
  
use strict;
use Text::ParseWords('quotewords');
  
# We want to keep the spaces within Aspartic Acid, and Glutamic acid.  
# We can do this in two ways, either by using non-escaped quote marks, 
# or escaped space characters.  To cut things down a bit, we'll only 
# use amino acids beginning with "A" or "G".
  
my @amino_acids =
   quotewords('\s+', # Regular Expression to split on white space
              0,
              q{Alanine "Aspartic Acid"   Glutamic\ Acid
                 Glycine         Asparagine    Glutamine Arginine} );
  
print '<?xml version="1.0"?>', "\n";
print '<!DOCTYPE Genetics SYSTEM "genetics.dtd">', "\n";
  
for my $array_element (sort @amino_acids) {
   printf("<Amino_Acid>%s</Amino_Acid>\n", $array_element);
}

This produces the following XML-style output. All the spaces have gone, except the ones we wanted to keep. Mission accomplished:

$ perl textParseWords.pl
<?xml version="1.0"?>
<!DOCTYPE Genetics SYSTEM "genetics.dtd"> 
<Amino_Acid>Alanine</Amino_Acid>
<Amino_Acid>Arginine</Amino_Acid>
<Amino_Acid>Asparagine</Amino_Acid>
<Amino_Acid>Aspartic Acid</Amino_Acid>
<Amino_Acid>Glutamic Acid</Amino_Acid>
<Amino_Acid>Glutamine</Amino_Acid>
<Amino_Acid>Glycine</Amino_Acid>
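
The second argument to quotewords is the keep flag. The following sketch shows its effect on the same kind of input. The wrapper name split_fields is hypothetical; quotewords and shellwords are the functions exported by Text::ParseWords.

```perl
#!perl -w

use strict;
use Text::ParseWords qw(quotewords shellwords);

# Hypothetical wrapper: split a line on white space, honoring quotes
# and backslash escapes.  $keep is passed straight to quotewords.
sub split_fields {
   my ($keep, $line) = @_;
   return quotewords('\s+', $keep, $line);
}

my $line = q{Alanine "Aspartic Acid" Glutamic\ Acid};

# keep = 0: quotes and backslashes are removed from the tokens.
print join('|', split_fields(0, $line)), "\n";
# prints: Alanine|Aspartic Acid|Glutamic Acid

# keep = 1: quotes and backslashes are left in the tokens.
print join('|', split_fields(1, $line)), "\n";
# prints: Alanine|"Aspartic Acid"|Glutamic\ Acid

# shellwords always splits on white space with keep = 0, and it
# ignores leading white space on the line.
print join('|', shellwords("   $line")), "\n";
# prints: Alanine|Aspartic Acid|Glutamic Acid
```

With quotewords, a line that begins with the delimiter yields an empty first field, which is why shellwords can be the more convenient choice for input that may be indented.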

Text::Soundex

In Example D-7 we want to find all the sound-alike amino acids, so that we can put checks into a later munge process and avoid confusing similar-sounding words. John le Carré's spy novel, Tinker, Tailor, Soldier, Spy, shows the same concern: "Tinker," "Tailor," "Soldier," "Poorman," and "Beggarman" (for George Smiley) were the code names for the possible traitorous moles, and "Sailor" was skipped because it sounded too much like "Tailor." (You may notice the similarity between Text::Soundex and Oracle's SOUNDEX function, which is based on the same algorithm as described by Knuth; see the first part of this appendix for more on such algorithms.)

Example D-7. Identifying soundalikes: textSoundex.pl
#!perl -w
  
use strict;
use Text::Soundex('soundex');
  
# Yet More Stuff of Life.  We want to find out the amino acids
# which sound the same.
  
my @amino_array =
      ('Alanine', 'Cysteine', 'Aspartic Acid', 'Glutamic Acid',
       'Phenylalanine', 'Glycine', 'Histidine', 'Isoleucine', 'Lysine',
       'Leucine', 'Methionine', 'Asparagine', 'Proline', 'Glutamine',
       'Arginine', 'Serine', 'Threonine', 'Valine', 'Tryptophan',
       'Tyrosine'
      );
  
# Build up all the Soundex codes, for the array above.
  
my @soundex_codes = soundex @amino_array;
  
# Now we want to build up a hash of amino acids that sound
# like each other.  We'll do this by going through the Soundex codes,
# and add up counters on a temporary hash.
  
my %soundex_count_hash;
  
for my $soundex_element (sort @soundex_codes) {
   $soundex_count_hash{$soundex_element}++;
}
  
# Now if anything in the @soundex_codes list, has at least a double, 
# it is going to have a value of at least 2, in the %soundex_count_hash 
# variable. So now we can go through that, and when we find the double+ 
# values, we'll whizz through the @amino_array, and add to our new
# %doubles_hash.
  
my %doubles_hash;
  
for my $soundex_key (keys %soundex_count_hash) {
  
   if ($soundex_count_hash{$soundex_key} > 1) {
  
      # Ah, we've found a code that had at least 2 ++ operations
      # performed on it, earlier.  Find the amino acids, which
      # produced this code, and add them to the final hash.
  
      for my $amino_element (@amino_array) {
  
         # Regenerate the code for the amino acid and compare.
  
         if ($soundex_key eq soundex $amino_element) {
  
            # The soundex codes are the same.  Hurrah! :-)
  
            $doubles_hash{$amino_element} = $soundex_key;
         }
      }
   }
}
  
# Finally, print out the soundalike list, with soundex codes first.
  
for my $amino_element (sort keys %doubles_hash) {
   printf("%10s : %s\n",$doubles_hash{$amino_element},$amino_element);
}

Here are the results:

$ perl textSoundex.pl
      A216 : Asparagine
      A216 : Aspartic Acid
      G435 : Glutamic Acid
      G435 : Glutamine
      L250 : Leucine
      L250 : Lysine
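
To show what happens inside a Soundex code, and to group the soundalikes in a single pass, here is a minimal pure-Perl sketch. Both simple_soundex and soundalike_groups are hypothetical helpers of our own, not functions from Text::Soundex. The sketch implements the simplified variant in which H and W are treated like vowels, so it may differ from other Soundex implementations on some inputs. It also runs where Text::Soundex is not installed, which may be the case on recent Perl releases that no longer bundle it.

```perl
#!perl -w

use strict;

# Hypothetical helper: a simplified four-character Soundex code.
sub simple_soundex {
   my ($word) = @_;

   my $letters = uc $word;
   $letters =~ tr/A-Z//cd;                  # keep letters only
   return '' if $letters eq '';

   my $first = substr($letters, 0, 1);

   # Map each letter to its class digit; 0 marks the ignored letters.
   (my $digits = $letters) =~
      tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;

   $digits =~ tr/0-6//s;                    # squeeze runs of one digit
   $digits = substr($digits, 1);            # first letter is kept as is
   $digits =~ tr/0//d;                      # drop the ignored letters

   return substr($first . $digits . '000', 0, 4);
}

# Hypothetical helper: return code => [words] for every code that is
# shared by at least two of the supplied words.
sub soundalike_groups {
   my @words = @_;
   my %words_for;

   for my $word (@words) {
      my $code = simple_soundex($word);
      next if $code eq '';
      push @{ $words_for{$code} }, $word;
   }

   return map  { ($_ => $words_for{$_}) }
          grep { @{ $words_for{$_} } > 1 }
          keys %words_for;
}

my %groups = soundalike_groups('Asparagine', 'Aspartic Acid', 'Glycine',
                               'Leucine', 'Lysine');

for my $code (sort keys %groups) {
   printf("%10s : %s\n", $code, join(', ', @{ $groups{$code} }));
}
```

Collecting the words into a hash of arrays keyed by code avoids regenerating each code in a nested loop, which is the main difference from the approach in Example D-7.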