Recipe 1.20 Parsing Comma-Separated Data
1.20.1 Problem
You have a data file containing
comma-separated values that you need to read, but these data fields
may have quoted commas or escaped quotes in them. Most spreadsheets
and database programs use comma-separated values as a common
interchange format.
1.20.2 Solution
If your data file follows normal Unix quoting and escaping
conventions, where quotes within a field are backslash-escaped
"like \"this\"", use the standard Text::ParseWords
and this simple code:
use Text::ParseWords;
sub parse_csv0 {
return quotewords("," => 0, $_[0]);
}
However, if quotes within a field are doubled "like
""this""", you could use the following procedure from
Mastering Regular Expressions, Second Edition:
sub parse_csv1 {
my $text = shift; # record containing comma-separated values
my @fields = ( );
while ($text =~ m{
# Either some non-quote/non-comma text:
( [^"',] + )
# ...or...
|
# ...a double-quoted field: (with "" allowed inside)
" # field's opening quote; don't save this
( now a field is either
(?: [^"] # non-quotes or
|
"" # adjacent quote pairs
) * # any number
)
" # field's closing quote; unsaved
}gx)
{
if (defined $1) {
$field = $1;
} else {
($field = $2) =~ s/""/"/g;
}
push @fields, $field;
}
return @fields;
}
Or use the CPAN Text:CSV module:
use Text::CSV;
sub parse_csv1 {
my $line = shift;
my $csv = Text::CSV->new( );
return $csv->parse($line) && $csv->fields( );
}
Or use the CPAN Tie::CSV_File module:
tie @data, "Tie::CSV_File", "data.csv";
for ($i = 0; $i < @data; $i++) {
printf "Row %d (Line %d) is %s\n", $i, $i+1, "@{$data[$i]}";
for ($j = 0; $j < @{$data[$i]}; $j++) {
print "Column $j is <$data[$i][$j]>\n";
}
}
1.20.3 Discussion
Comma-separated input is a deceptive and complex format. It sounds
simple, but involves a fairly complex escaping system because the
fields themselves can contain commas. This makes the pattern-matching
solution complex and rules out a simple split
/,/. Still worse, quoting and escaping conventions
vary between Unix-style files and legacy systems. This
incompatibility renders impossible any single algorithm for all CSV
data files.
The standard Text::ParseWords module is
designed to handle data whose quoting and escaping conventions follow
those found in most Unix data files. This makes it eminently suitable
for parsing the numerous colon-separated data files found on Unix
systems, including disktab(5),
gettytab(5), printcap(5),
and termcap(5). Pass that module's
quotewords function two arguments and the CSV
string. The first argument is the separator (here a comma, but often
a colon), and the second is a true or false value controlling whether
the strings are returned with quotes around
them.
In this style of data file, you represent quotation marks inside a
field delimited by quotation marks by escaping them with backslashes
"like\"this\"". Quotation marks and backslashes
are the only characters that have meaning when backslashed. Any other
use of a backslash will be left in the output string. The standard
Text::ParseWords module's quotewords( ) function
can handle such data.
However, it's of no use on data files from legacy systems that
represent quotation marks inside such a field by doubling them
"like""this""". For those, you'll need one of the
other solutions. The first of these is based on the regular
expression from Mastering Regular Expressions,
Second Edition, by Jeffrey E. F. Friedl (O'Reilly). It
enjoys the advantage of working on any system without requiring
installation of modules not found in the standard distribution. In
fact, it doesn't use any modules at all. Its slight disadvantage is
the risk of sending the unseasoned reader into punctuation shock,
despite its copious commenting.
The
object-oriented CPAN module Text::CSV demonstrated in the next
solution hides that parsing complexity in more easily digestible
wrappers. An even more elegant solution is offered by the
Tie::CSV_File module from CPAN, in which you are given what appears
to be a two-dimensional array. The first dimension represents each
line of the file, and the second dimension each column on each row.
Here's
how you'd use our two kinds of parse_csv
subroutines. The q( ) is just a fancy quote so we
didn't have to backslash everything.
$line = q(XYZZY,"","O'Reilly, Inc","Wall, Larry","a \"glug\" bit,",5,"Error, Core
Dumped");
@fields = parse_csv0($line);
for ($i = 0; $i < @fields; $i++) {
print "$i : $fields[$i]\n";
}
0 : XYZZY
1 :
2 : O'Reilly, Inc
3 : Wall, Larry
4 : a "glug" bit,
5 : 5
6 : Error, Core Dumped
If the second argument to quotewords had been 1
instead of 0, the quotes would have been retained, producing this
output instead:
0 : XYZZY
1 : ""
2 : "O'Reilly, Inc"
3 : "Wall, Larry"
4 : "a \"glug\" bit,"
5 : 5
6 : "Error, Core Dumped"
The other sort of data file is manipulated the same way, but using
our parse_csv1 function instead of
parse_csv0. Notice how the embedded quotes are
doubled, not escaped.
$line = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);
@fields = parse_csv1($line);
for ($i = 0; $i < @fields; $i++) {
print "$i : $fields[$i]\n";
}
0 : Ten Thousand
1 : 10000
2 : 2710
3 :
4 : 10,000
5 : It's "10 Grand", baby
6 : 10K
1.20.4 See Also
The explanation of regular expression syntax in
perlre(1) and Chapter 5 of Programming
Perl; the documentation for the standard Text::ParseWords
module; the section on "Parsing CSV Files" in Chapter 5 of
Mastering Regular Expressions, Second
Edition
|