Book Review Archive

Data Munging with Perl


Title: Data Munging with Perl
Author: David Cross
Publisher: Manning
Pages: 283
ISBN: 1930110006
Reviewer: Bob Gattis
Rating: 4.5 out of 5

I asked to review this book for a specific reason. Much of the programming I do now, about 10% of my work time, deals with data conversion: from database records to proprietary formats, from HTML to knowledge base formats, and similar tasks. Over the past two years I have learned enough Perl to maintain and extend some existing scripts created by a subcontractor on a project. I would classify my Perl experience as semi-beginner, since I have to consult my reference texts before tackling a new script. As a side note, Python has been my primary scripting language for the past several years, with Perl as a secondary language.

With that background, I was interested to see whether this book would be a useful reference for a Perl beginner with an understanding of, and some experience in, creating data conversion scripts. I should also point out that I run the ActiveState Win32 version of Perl under Windows 2000, and I was curious whether the author dealt with the differences between the UNIX and Win32 implementations. Cross does deal with this issue, pointing out platform-based differences in some of the modules.

I give the book a rating of 4.5 out of a possible 5. Indeed, I gave the book my highest compliment: I purchased a copy for my Perl reference library!

The book presents techniques for 'munging' data - reading data in, processing it (by reformatting or translating) and then writing the data out. As the author states, these are precisely the types of problems Perl was designed to solve. I found the sample problems and the author's solutions to be very well done. I especially liked the design tips that the author included in chapter 2, which tracked very well with my experience doing data conversion scripts in C++ and Python.
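A rough illustration of that read, process, write shape, sketched here in Python rather than Perl. It is not taken from the book; the two-column CSV input, the fixed-width output layout, and the `munge` name are all assumptions made up for the example:

```python
import csv
import io


def munge(source, sink):
    """Read CSV rows, normalize them, and write fixed-width records."""
    for row in csv.reader(source):
        if not row:  # skip blank lines
            continue
        name, amount = row  # hypothetical two-column layout: name, amount
        # Process: trim and upper-case the name, format the amount to 2 decimals.
        # Write: a 10-character name field followed by an 8-character amount field.
        sink.write(f"{name.strip().upper():<10}{float(amount):>8.2f}\n")


# In-memory files stand in for real input and output files.
source = io.StringIO("alice,3.5\nbob,12\n")
sink = io.StringIO()
munge(source, sink)
print(sink.getvalue(), end="")
```

Keeping the three stages in one small function that takes file-like objects makes it easy to point the same conversion at a different source or destination.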

Data formats presented are:

text and binary data
variable and fixed width record formats
CSV and other delimited formats
databases with DBI
HTML and LWP module
XML
unstructured data
parsers for custom formats

Data transformations presented include:

filtering
single and multi-key sorting
pattern matching
regular expressions
text transformations
numerical formats
date manipulation

I especially enjoyed the sorting discussions of the Schwartzian Transform and the Orcish Manoeuvre. Cross explains how to circumvent sorting overhead by using these two sorting idioms, both closely associated with Perl. Each reduces overhead by caching the result of the most expensive operation in a Perl sort, the computation of the sort key, so that each comparison becomes a much cheaper "data lookup".
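For readers who have not met these two idioms, here is a minimal sketch of the caching idea behind each, written in Python rather than Perl and not taken from the book. The record format and the `expensive_key` function are hypothetical. The first function follows the decorate, sort, undecorate shape of the Schwartzian Transform; the second imitates the Orcish Manoeuvre (in Perl, an "or-cache" built with `||=` inside the sort block) by filling a dictionary from within the comparison function:

```python
from functools import cmp_to_key


def expensive_key(record):
    """Hypothetical costly key extraction: the number after the colon."""
    expensive_key.calls += 1  # count calls to show the effect of caching
    return int(record.split(":")[1])


expensive_key.calls = 0


def schwartzian_sort(records):
    # Decorate: compute each key exactly once and pair it with its record.
    decorated = [(expensive_key(r), r) for r in records]
    # Sort on the precomputed key only.
    decorated.sort(key=lambda pair: pair[0])
    # Undecorate: drop the keys and keep the records.
    return [r for _, r in decorated]


def orcish_sort(records):
    cache = {}

    def cached_key(r):
        # "Or-cache": use the stored key, computing it only on first sight.
        if r not in cache:
            cache[r] = expensive_key(r)
        return cache[r]

    def compare(a, b):
        ka, kb = cached_key(a), cached_key(b)
        return (ka > kb) - (ka < kb)  # -1, 0, or 1, like Perl's <=>

    return sorted(records, key=cmp_to_key(compare))


records = ["carol:30", "alice:5", "bob:12", "alice:5"]
print(schwartzian_sort(records))
print(orcish_sort(records))
```

The first version computes one key per list element up front, while the second computes one key per distinct value, on demand, as the comparisons happen. In everyday Python, `sorted(records, key=expensive_key)` already computes each key once, so the sketch is only meant to make the mechanism visible.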

I also liked the explanation of regular expressions in Chapter 4, which starts with simple examples and builds toward more comprehensive ones.

The book also includes a very useful Perl module reference appendix. The author has documented a few modules you're most likely to use, as well as a few methods you're most likely to need. This is a great time-saver for someone looking for a quick "data conversion" fix to a problem.

In summary, this book is probably not for advanced Perl programmers. It would serve as a good reference for an intermediate

Last Updated: 27 May 2002
