Electronic Review of Computer Books

Vital Statistics

Title: Data Munging with Perl
Author: David Cross
Publisher: Manning Publications
http://www.browsebooks.com
Copyright: 2001
ISBN: 1-930110-00-6
Pages: 283
Price: $36.95


To Munge or Not To Munge

I have been reading several introductory Perl books recently and thought Data Munging with Perl, by David Cross, looked like a good second Perl book. After all, what the author calls Data Munging -- reading and writing data, converting data from one format to another -- is firmly within the computing mainstream. And the Web has not given us any less data or any fewer data formats to deal with.

But what is "munging"? Perl's interpreted nature and obscure-looking syntax appeal to me. Perhaps it is this same bent that causes me veritable excitement upon reading the assertion in Chapter 2, that "most data munging tasks look like: read input, Munge (or process), write output."

At a time when the predominant development approach consists of objects whose interactions are not known until run time, and when a hot development technique more resembles a new variety of computing buddy system, this assertion has an appealing historical ring to it. This is something COBOL programmers and programmers coding UNIX filters can agree upon.

Cross clearly believes that fiddling with data and converting among formats are still important in the life of the working programmer, and that Perl is the language for the task.

For instance, UNIX-style filters are high on the list of techniques Cross recommends for munging. Other tips are "don't throw anything away" (sometimes it pays to read in more data than you currently need), design your data structures well in the beginning, and "don't do too much processing in the input routine." The last of these means leaving something for the munging (processing) routine to do.
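To make the read, munge, write split concrete, here is a minimal sketch of my own; it is not taken from the book. The tab-separated input, the comma-separated output, and the routine names read_input, munge, and write_output are all assumptions chosen for illustration. In-memory filehandles stand in for STDIN and STDOUT so the example is self-contained.

```perl
use strict;
use warnings;

# Read: keep every field of every record ("don't throw anything away")
# and do no processing here beyond splitting the line.
sub read_input {
    my ($in) = @_;
    my @records;
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;            # skip blank lines
        push @records, [ split /\t/, $line ];
    }
    return \@records;
}

# Munge: the real processing lives here, not in the input routine.
# This hypothetical step normalizes the case of the first field.
sub munge {
    my ($records) = @_;
    my @out;
    for my $rec (@$records) {
        my ($name, @rest) = @$rec;
        push @out, [ ucfirst(lc $name), @rest ];
    }
    return \@out;
}

# Write: emit the processed records as comma-separated lines.
sub write_output {
    my ($out, $records) = @_;
    print {$out} join(',', @$_), "\n" for @$records;
}

# Demonstration with in-memory filehandles instead of STDIN/STDOUT.
my $input = "alice\t30\tNYC\nBOB\t25\tLA\n";
open my $in, '<', \$input or die "cannot open input: $!";
my $output = '';
open my $out, '>', \$output or die "cannot open output: $!";

write_output($out, munge(read_input($in)));
close $out;

print $output;
```

In a real filter the same three routines would be handed \*STDIN and \*STDOUT, so the script could sit in the middle of a shell pipeline.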

Data Munging with Perl's 12 chapters address progressively more "interesting" types of data, with strategies for dealing with each. Each section includes an introductory rationale followed by examples in Perl.

Cross sometimes refines these examples. My favorite case: the section "How Not To Parse HTML" in Chapter 8 (by annihilating everything between "<" and ">") is followed in Chapter 9 by several Perl add-in modules that do the parsing for you.
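A short sketch shows why the annihilate-the-tags approach is a "how not to." The substitution below is my own guess at the kind of pattern meant, not code quoted from the book, and strip_tags_naively is a hypothetical name.

```perl
use strict;
use warnings;

# Naive approach: delete everything from "<" up to the next ">".
sub strip_tags_naively {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# Works on simple markup...
my $simple = strip_tags_naively('<p>Hello, <b>world</b></p>');

# ...but a ">" inside an attribute value ends the "tag" too early,
# so part of the tag leaks into the output.
my $broken = strip_tags_naively('<img alt="a > b" src="x.png">caption');

print "[$simple]\n";   # [Hello, world]
print "[$broken]\n";   # [ b" src="x.png">caption]
```

A real parser module from CPAN (HTML::Parser is one that exists) tracks quoting and comments, which is the kind of work a one-line substitution cannot do.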

Techniques are also presented from simple to complex. Reading data line by line into an array of strings or an array of hashes may work for record-oriented data (Chapter 6), whereas parsing with an extension module (XML::Parser) works better for data with a strict, idiosyncratic structure.
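As an illustration of the array-of-hashes idea, here is a minimal sketch under my own assumptions: pipe-delimited records and the field names name, age, and city are invented for the example and do not come from the book.

```perl
use strict;
use warnings;

# Hypothetical record layout: name|age|city, one record per line.
my @fields = qw(name age city);
my $data   = "alice|30|NYC\nbob|25|LA\n";

open my $in, '<', \$data or die "cannot open data: $!";

my @records;
while (my $line = <$in>) {
    chomp $line;
    my %rec;
    @rec{@fields} = split /\|/, $line;   # hash slice: field name => value
    push @records, \%rec;                # one hash reference per record
}
close $in;

print "$_->{name} lives in $_->{city}\n" for @records;
```

Each record becomes a hash keyed by field name, so later code can say $rec->{city} instead of remembering that the city is column 2.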

Chapter 7 provides a discussion of reading binary data with read() and unpack() that I found useful.
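For readers who have not used the pair, this is roughly how read() and unpack() work together on fixed-length binary records. The record layout (an 8-byte name, a 16-bit code, a 32-bit count) is hypothetical and chosen only for the sketch; it is not an example from the book.

```perl
use strict;
use warnings;

# Hypothetical fixed-length record:
#   A8 = 8-byte space-padded ASCII name
#   n  = 16-bit unsigned integer, big-endian
#   N  = 32-bit unsigned integer, big-endian
my $template = 'A8 n N';
my $rec_len  = length pack($template, '', 0, 0);   # 14 bytes

# Build two records in memory so the example needs no external file.
my $blob = pack($template, 'widget', 7, 100_000)
         . pack($template, 'gadget', 12, 42);

open my $fh, '<', \$blob or die "cannot open blob: $!";
binmode $fh;

my @rows;
my $buf;
while (my $got = read($fh, $buf, $rec_len)) {      # 0 at end of data
    die "short record ($got bytes)" if $got != $rec_len;
    my ($name, $code, $count) = unpack($template, $buf);
    push @rows, { name => $name, code => $code, count => $count };
}
close $fh;

print "$_->{name} $_->{code} $_->{count}\n" for @rows;
```

The same template string drives both pack() and unpack(), which keeps the writer and the reader of the format from drifting apart.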

I also like the explanation of regular expressions in Chapter 4, which starts with simple examples that become more all-encompassing. For example, /regular expression/ matches the literal text "regular expression" and /[a-z]/ matches any single lowercase letter.
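The two patterns just mentioned can be tried directly. The sample strings below are my own, picked to show where each pattern does and does not match.

```perl
use strict;
use warnings;

my @samples = ('a regular expression', 'REGULAR EXPRESSION', 'x', '42');

my %result;
for my $s (@samples) {
    $result{$s} = {
        # Matches only if the exact text appears somewhere in the string.
        literal => ($s =~ /regular expression/ ? 'yes' : 'no'),
        # Matches if the string contains at least one lowercase letter.
        lower   => ($s =~ /[a-z]/ ? 'yes' : 'no'),
    };
    printf "%-22s literal=%-3s lowercase=%s\n",
        $s, $result{$s}{literal}, $result{$s}{lower};
}
```

Note that both matches are case-sensitive by default, which is why the all-capitals sample fails both tests.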

The same goes for the explanation of parsers (page 159): tokens, rules, grammars, top-down and bottom-up parsing. This is lucid stuff for such a short space and for a topic that has so much theory attached to it.

However, Data Munging with Perl suffers from the "same page syndrome."

While writing a Java book, I was beset by the urge (which I now think misguided) to write the "short history of programming languages" in an introductory chapter. You know the stuff: Languages evolved from machine language to assembler, which gave way to high-level languages. My editor averred, "That's fine to include, Doug, just so everyone is on the same page."

This attitude stems from the goal of publishers to sell a book to a cross-over audience such as Intermediate to Advanced. The result is that there are conservatively hundreds of computer books on the market that say the same things.

This syndrome also subjects readers to some strange contradictions. On page 139 of Data Munging, the author introduces ASCII text: what it is and how data stored as text takes more space than the same data stored in binary. But the Perl examples in this book can only be understood by a veteran. If I can read the Perl in this book without help, why would I not know about ASCII already?

But this minor flaw only annoyed me a little, making Data Munging a bit wordier than it might have been. With its narrow focus on a language of current interest, this book does not quite rise to the level of "Software Tools," but it still shows some good Perl programming, and provides convincing evidence of the value of data structures beyond the halls of academia along the way.

-- Doug Nickerson (dougnickerson@yahoo.com)


Quick Rating

Readability: 2.5 stars
Originality: 3.5 stars
Organization: 3 stars
Accuracy: 3 stars
Consistency: 3 stars
Depth: 2.5 stars
Timeliness: 2.5 stars
Editing: 3.5 stars
Design: 3 stars
Overall Value: 3 stars

Explanation of ERCB rating scale:
No stars = unacceptable
1 Star = marginal
2 Stars = average
3 Stars = above average
4 Stars = exceptional


Copyright © 2001 Electronic Review of Computer Books
Created 10/8/2001 / Last modified 10/8/2001 / webmaster@ercb.com