Conversion Modules

Perl provides a variety of modules that you can use to convert from one data format to another. In Table D-7 we list what we think are the most useful conversion modules available from CPAN. All of them should also be available via ActivePerl's PPM, except possibly Convert::Recode, which requires the use of the GNU recode program; we'll describe that one shortly.
Convert::Recode and GNU recode

The Convert::Recode module provides a front end to the GNU recode library, which is a powerhouse of conversion operations. You can download this library, which was written by François Pinard, from:

The recode library converts between more than 300 different character sets, depending on what your operating system supports. The following command tells you which sets you have access to, once you've installed recode:

    $ recode -l

On SuSE 7.3 Linux, we had 281 character sets, from arabic7 to MacGreek. You can install this library as follows:
Convert::Recode is unusual in that you roll your own methods directly from it. Simply identify the two character sets you wish to convert between, such as ascii and ebcdic, and then decide the conversion direction. Once you've decided, join the two character set names with a _to_ string and import the resulting method via Convert::Recode. For example:

    use Convert::Recode qw(ascii_to_ebcdic);

or:

    use Convert::Recode qw(ebcdic_to_ascii);

We've created two short programs: recodeAscEbc.pl in Example D-3, and recodeEbcAsc.pl in Example D-4. We're going to use these to:
ASCII to EBCDIC: recodeAscEbc.pl

    #!perl -w

    use Convert::Recode qw(ascii_to_ebcdic);

    while (<>) {
        print ascii_to_ebcdic($_);
    }

EBCDIC to ASCII: recodeEbcAsc.pl

    #!perl -w

    use Convert::Recode qw(ebcdic_to_ascii);

    while (<>) {
        print ebcdic_to_ascii($_);
    }

The original feedRecode.txt file looks like this:

    To sit in solemn silence,
    In a dull dank dock,
    In a pestilential prison,
    With a life long lock,
    Awaiting the sensation of a short sharp shock,
    From a cheap and chippy chopper,
    On a big black block

The execution run, which converts this file from ASCII into EBCDIC and then back again, looks like this:

    $ perl recodeAscEbc.pl feedRecode.txt > ebcdicRecode.txt
    $ perl recodeEbcAsc.pl ebcdicRecode.txt > outRecode.txt

This conversion run is displayed in the ASCII-based vi editor in Figure D-4.

Figure D-4. Convert::Recode at work

Text Conversion Modules

Perl comes with a number of text-based conversion modules bundled into it. These are listed in Table D-8.
Let's take a look at some of these modules in action.

Text::Abbrev

Example D-5 takes a list of amino acids and creates an abbreviation hash, in which every unambiguous abbreviation is a key and the full name is its value. It then iterates over that hash, building a second hash that maps each name to its shortest possible abbreviation, and displays the result in sorted order.

Text list abbreviations: textAbbrev.pl

    #!perl -w

    use strict;
    use Text::Abbrev('abbrev');

    # The Stuff of Life
    my %h1 = abbrev qw(Alanine Cysteine Aspartic_Acid Glutamic_Acid
                       Phenylalanine Glycine Histidine Isoleucine
                       Lysine Leucine Methionine Asparagine Proline
                       Glutamine Arginine Serine Threonine Valine
                       Tryptophan Tyrosine);
    my %h2;

    for my $abb_key (keys %h1) {

        # Iterate through the hash, producing all keys and values.
        # Build up a 2nd hash, with the smallest possible abbreviations.

        # Have we started filling the 2nd hash yet, with reversed data?
        if (defined ($h2{ $h1{$abb_key} } )){

            # Yes, we already have an abbreviation. Is the current one
            # longer than the new one? If so, replace it.
            if (length($h2{ $h1{$abb_key} } ) > length($abb_key)){

                # This abbreviation is shorter, so we replace.
                $h2{ $h1{$abb_key} } = $abb_key;
            }
        } else {

            # Provide our first value, for hash 2. Reverse the sense
            # of the hash. The value becomes key, the key becomes the value.
            $h2{ $h1{$abb_key} } = $abb_key;
        }
    }

    # Now we've built up our reduced hash, print it out.
    for my $min_key (sort keys %h2) {
        printf("%15s : %15s\n", $min_key, $h2{$min_key});
    }

The results are as follows:

    $ perl textAbbrev.pl
Alanine : Al
Arginine : Ar
Asparagine : Aspara
Aspartic_Acid : Aspart
Cysteine : C
Glutamic_Acid : Glutamic
Glutamine : Glutamin
Glycine : Gly
Histidine : H
Isoleucine : I
Leucine : Le
Lysine : Ly
Methionine : M
Phenylalanine : Ph
Proline : Pr
Serine : S
Threonine : Th
Tryptophan : Tr
Tyrosine : Ty
Valine : V
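The same reduction can be packaged as a small reusable function. The following is only a minimal sketch, assuming Text::Abbrev is installed; the name shortest_abbrevs and the three-word list are our own illustrations and are not part of the module or of Example D-5.

```perl
use strict;
use warnings;
use Text::Abbrev qw(abbrev);

# Hypothetical helper: map each full word to its shortest
# unambiguous abbreviation.
sub shortest_abbrevs {
    my @words = @_;

    # abbrev returns a list of (abbreviation => full word) pairs.
    my %expansion = abbrev @words;

    my %shortest;
    for my $abbr (keys %expansion) {
        my $word = $expansion{$abbr};

        # Keep this abbreviation if it is the first one seen for the
        # word, or if it is shorter than the one stored so far.
        if (!defined $shortest{$word}
            || length($abbr) < length($shortest{$word})) {
            $shortest{$word} = $abbr;
        }
    }
    return %shortest;
}

my %short = shortest_abbrevs(qw(Leucine Lysine Valine));
printf("%15s : %15s\n", $_, $short{$_}) for sort keys %short;
```

Because Leucine and Lysine share their first letter, each needs two characters, while Valine needs only one.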
Text::ParseWords

This time, in Example D-6, we'll split a list of words into separate elements, using a regular expression that splits on white space. You may sometimes want to include spaces inside the strings, and we can do this with either quote characters or backslash escapes. We'll then create a tagged list of values, in XML format, to send them further down a potential munge chain.

Text list parsing: textParseWords.pl

    #!perl -w

    use strict;
    use Text::ParseWords('quotewords');

    # We want to keep the spaces within Aspartic Acid, and Glutamic Acid.
    # We can do this in two ways, either by using non-escaped quote marks,
    # or escaped space characters. To cut things down a bit, we'll only
    # use amino acids beginning with "A" or "G".
    my @amino_acids =
        quotewords('\s+',   # Regular Expression to split on white space
                   0,
                   q{Alanine "Aspartic Acid" Glutamic\ Acid Glycine
                     Asparagine Glutamine Arginine} );

    print '<?xml version="1.0"?>', "\n";
    print '<!DOCTYPE Genetics SYSTEM "genetics.dtd">', "\n";

    for my $array_element (sort @amino_acids) {
        printf("<Amino_Acid>%s</Amino_Acid>\n", $array_element);
    }

This produces the following XML-style output. All the spaces have gone, except the ones we wanted to keep. Mission accomplished:

    $ perl textParseWords.pl
<?xml version="1.0"?>
<!DOCTYPE Genetics SYSTEM "genetics.dtd">
<Amino_Acid>Alanine</Amino_Acid>
<Amino_Acid>Arginine</Amino_Acid>
<Amino_Acid>Asparagine</Amino_Acid>
<Amino_Acid>Aspartic Acid</Amino_Acid>
<Amino_Acid>Glutamic Acid</Amino_Acid>
<Amino_Acid>Glutamine</Amino_Acid>
<Amino_Acid>Glycine</Amino_Acid>
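The second argument to quotewords, set to 0 in Example D-6, controls whether the quote marks and backslashes are kept in the returned words. The short sketch below compares the two settings on a shortened version of the same list; the variable names are ours, chosen only for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = q{Alanine "Aspartic Acid" Glutamic\ Acid};

# False second argument: quotes and backslashes are removed
# from the returned words.
my @stripped = quotewords('\s+', 0, $line);

# True second argument: the same split, but the quotes and
# backslashes are left in place.
my @kept = quotewords('\s+', 1, $line);

print "stripped: [$_]\n" for @stripped;
print "kept:     [$_]\n" for @kept;
```

Both calls return three elements. With the false setting the protected spaces survive but the quoting characters disappear, which is what we want when the words are about to be wrapped in tags.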
Text::Soundex

In Example D-7 we want to find all the sound-alike amino acids. This is so we can put checks into a later munge process and avoid word confusion, as in John le Carré's spy novel, Tinker, Tailor, Soldier, Spy, where "Tinker," "Tailor," "Soldier," and "Poor Man" (for George Smiley) were used as codes for possible traitorous moles. This avoided "Tailor" getting confused with the more usual "Sailor." (You may notice the similarity between Text::Soundex and Oracle's SOUNDEX function, which is based on exactly the same algorithm, as described by Knuth; see the first part of this appendix for more on such algorithms.)

Identifying soundalikes: textSoundex.pl

    #!perl -w

    use strict;
    use Text::Soundex('soundex');

    # Yet More Stuff of Life. We want to find out the amino acids
    # which sound the same.
    my @amino_array = ('Alanine', 'Cysteine', 'Aspartic Acid',
                       'Glutamic Acid', 'Phenylalanine', 'Glycine',
                       'Histidine', 'Isoleucine', 'Lysine', 'Leucine',
                       'Methionine', 'Asparagine', 'Proline', 'Glutamine',
                       'Arginine', 'Serine', 'Threonine', 'Valine',
                       'Tryptophan', 'Tyrosine' );

    # Build up all the Soundex codes, for the array above.
    my @soundex_codes = soundex @amino_array;

    # Now we want to build up a hash of amino acids that sound
    # like each other. We'll do this by going through the Soundex codes,
    # and add up counters on a temporary hash.
    my %soundex_count_hash;

    for my $soundex_element (sort @soundex_codes) {
        $soundex_count_hash{$soundex_element}++;
    }

    # Now if anything in the @soundex_codes list, has at least a double,
    # it is going to have a value of at least 2, in the %soundex_count_hash
    # variable. So now we can go through that, and when we find the double+
    # values, we'll whizz through the @amino_array, and add to our new
    # %doubles_hash.
    my %doubles_hash;

    for my $soundex_key (keys %soundex_count_hash) {

        if ($soundex_count_hash{$soundex_key} > 1) {

            # Ah, we've found a code that had at least 2 ++ operations
            # performed on it, earlier. Find the amino acids, which
            # produced this code, and add them to the final hash.
            for my $amino_element (@amino_array) {

                # Regenerate the code for the amino acid and compare.
                if ($soundex_key eq soundex $amino_element) {

                    # The soundex codes are the same. Hurrah! :-)
                    $doubles_hash{$amino_element} = $soundex_key;
                }
            }
        }
    }

    # Finally, print out the soundalike list, with soundex codes first.
    for my $amino_element (sort keys %doubles_hash) {
        printf("%10s : %s\n",$doubles_hash{$amino_element},$amino_element);
    }

Here are the results:

    $ perl textSoundex.pl
A216 : Asparagine
A216 : Aspartic Acid
G435 : Glutamic Acid
G435 : Glutamine
L250 : Leucine
L250 : Lysine
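The counting pass and the second scan of the array can be collapsed into a single pass by grouping the names under their codes in a hash of arrays. The sketch below illustrates that alternative and is not the program from Example D-7. So that it runs without any extra modules, it uses simple_soundex, a simplified stand-in of our own for the soundex function; in a real script you would call Text::Soundex instead. The name soundalike_groups is likewise hypothetical.

```perl
use strict;
use warnings;

# Simplified stand-in for soundex (our assumption, not the module's
# code): the first letter followed by up to three digits.
sub simple_soundex {
    my ($word) = @_;
    my $letters = uc $word;
    $letters =~ tr/A-Z//cd;               # keep letters only
    return '' unless length $letters;

    my $first = substr($letters, 0, 1);
    (my $digits = $letters)
        =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;

    my $first_digit = substr($digits, 0, 1);
    $digits =~ s/^$first_digit+//;        # drop first letter's code
    $digits =~ tr/0-6//s;                 # squeeze repeated codes
    $digits =~ tr/0//d;                   # vowels, H, W, Y: no code
    return substr($first . $digits . '000', 0, 4);
}

# Hypothetical helper: return code => [names] for every code that
# is shared by at least two names.
sub soundalike_groups {
    my @words = @_;

    my %by_code;
    push @{ $by_code{ simple_soundex($_) } }, $_ for @words;

    my %groups;
    for my $code (keys %by_code) {
        $groups{$code} = $by_code{$code} if @{ $by_code{$code} } > 1;
    }
    return %groups;
}

my %groups = soundalike_groups('Asparagine', 'Aspartic Acid',
                               'Glycine', 'Leucine', 'Lysine');

for my $code (sort keys %groups) {
    printf("%10s : %s\n", $code, $_) for sort @{ $groups{$code} };
}
```

Grouping by code means each name is encoded only once, and names that share a code stay together in the data structure, which is convenient if a later munge step needs to treat them as a set.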