Recipe 1.20 Parsing Comma-Separated Data

1.20.1 Problem

You have a data file containing comma-separated values that you need to read, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format.

1.20.2 Solution

If your data file follows normal Unix quoting and escaping conventions, where quotes within a field are backslash-escaped "like \"this\"", use the standard Text::ParseWords and this simple code:

use Text::ParseWords;
sub parse_csv0 {
    return quotewords("," => 0, $_[0]);
}

However, if quotes within a field are doubled "like ""this""", you could use the following procedure from Mastering Regular Expressions, Second Edition:

sub parse_csv1 {
    my $text = shift;      # record containing comma-separated values
    my @fields  = ( );

    while ($text =~ m{
        # Either some non-quote/non-comma text:
        ( [^"',] + )

         # ...or...
         | 

        # ...a double-quoted field: (with "" allowed inside)

        " # field's opening quote; don't save this
         (   now a field is either
          (?:     [^"]    # non-quotes or
              |
                  ""      # adjacent quote pairs
           ) *  # any number
         )
        " # field's closing quote; unsaved

    }gx)
    {
      if (defined $1) {
          $field = $1;
      } else {
          ($field = $2) =~ s/""/"/g;
      }
      push @fields, $field;
    }
    return @fields;
}

Or use the CPAN Text:CSV module:

use Text::CSV;
sub parse_csv1 {
    my $line = shift;
    my $csv = Text::CSV->new( );              
    return $csv->parse($line) && $csv->fields( );           
}

Or use the CPAN Tie::CSV_File module:

tie @data, "Tie::CSV_File", "data.csv";

for ($i = 0; $i < @data; $i++) {
    printf "Row %d (Line %d) is %s\n", $i, $i+1, "@{$data[$i]}";
    for ($j = 0; $j < @{$data[$i]}; $j++) {
        print "Column $j is <$data[$i][$j]>\n";
    } 
}

1.20.3 Discussion

Comma-separated input is a deceptive and complex format. It sounds simple, but involves a fairly complex escaping system because the fields themselves can contain commas. This makes the pattern-matching solution complex and rules out a simple split /,/. Still worse, quoting and escaping conventions vary between Unix-style files and legacy systems. This incompatibility renders impossible any single algorithm for all CSV data files.

The standard Text::ParseWords module is designed to handle data whose quoting and escaping conventions follow those found in most Unix data files. This makes it eminently suitable for parsing the numerous colon-separated data files found on Unix systems, including disktab(5), gettytab(5), printcap(5), and termcap(5). Pass that module's quotewords function two arguments and the CSV string. The first argument is the separator (here a comma, but often a colon), and the second is a true or false value controlling whether the strings are returned with quotes around them.

In this style of data file, you represent quotation marks inside a field delimited by quotation marks by escaping them with backslashes "like\"this\"". Quotation marks and backslashes are the only characters that have meaning when backslashed. Any other use of a backslash will be left in the output string. The standard Text::ParseWords module's quotewords( ) function can handle such data.

However, it's of no use on data files from legacy systems that represent quotation marks inside such a field by doubling them "like""this""". For those, you'll need one of the other solutions. The first of these is based on the regular expression from Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O'Reilly). It enjoys the advantage of working on any system without requiring installation of modules not found in the standard distribution. In fact, it doesn't use any modules at all. Its slight disadvantage is the risk of sending the unseasoned reader into punctuation shock, despite its copious commenting.

The object-oriented CPAN module Text::CSV demonstrated in the next solution hides that parsing complexity in more easily digestible wrappers. An even more elegant solution is offered by the Tie::CSV_File module from CPAN, in which you are given what appears to be a two-dimensional array. The first dimension represents each line of the file, and the second dimension each column on each row.

Here's how you'd use our two kinds of parse_csv subroutines. The q( ) is just a fancy quote so we didn't have to backslash everything.

$line = q(XYZZY,"","O'Reilly, Inc","Wall, Larry","a \"glug\" bit,",5,"Error, Core 
Dumped");
@fields = parse_csv0($line);
for ($i = 0; $i < @fields; $i++) {
    print "$i : $fields[$i]\n";
}

0 : XYZZY
1 : 
2 : O'Reilly, Inc
3 : Wall, Larry
4 : a "glug" bit,
5 : 5
6 : Error, Core Dumped

If the second argument to quotewords had been 1 instead of 0, the quotes would have been retained, producing this output instead:

0 : XYZZY
1 : ""
2 : "O'Reilly, Inc"
3 : "Wall, Larry"
4 : "a \"glug\" bit,"
5 : 5
6 : "Error, Core Dumped"

The other sort of data file is manipulated the same way, but using our parse_csv1 function instead of parse_csv0. Notice how the embedded quotes are doubled, not escaped.

$line = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);
@fields = parse_csv1($line);
for ($i = 0; $i < @fields; $i++) {
    print "$i : $fields[$i]\n";
}

0 : Ten Thousand
1 : 10000
2 :  2710 
3 : 
4 : 10,000
5 : It's "10 Grand", baby
6 : 10K

1.20.4 See Also

The explanation of regular expression syntax in perlre(1) and Chapter 5 of Programming Perl; the documentation for the standard Text::ParseWords module; the section on "Parsing CSV Files" in Chapter 5 of Mastering Regular Expressions, Second Edition

[ Team LiB ]