[ ERCB Home |
New |
Feature |
Brief |
DDJ |
Letters |
Links
]
10 lbs. of Data in a 5-lb. Bag
Review by Andrew Schulman
Copyright (C) Dr. Dobb's Journal, February, 1992
About a year ago, I purchased a highly regarded textbook on text compression
and was pleased to find, among the mathematical formulas, some source code
for a data-compression program. While typing in the code, I marvelled at
how something so seemingly abstract as information theory could yield something
as tangible as reduced hard-disk consumption, shorter transmission times,
and lower online fees. The program was fairly small, and my speculations
on the almost unreasonable effectiveness of information theory were put
on hold while I compiled and ran the program.
Well, the resulting data-compression program turned out to have an interesting
property: Much as a failed alchemist might turn gold into lead, it made
all files larger; the accompanying decompression program (yeah, I typed
that in too) made them smaller. You can probably appreciate the fact that
I have since learned to be more cautious when purchasing books on this subject.
Data compression really does have a crucial theoretical foundation, making
a pleasant contrast to the ad hoc mess that exists in most other areas of
computing. But as the compression program that makes everything larger shows,
this theoretical foundation is not always exploited in the most practical
or productive way.
Mark Nelson to the Rescue! Nelson's The Data Compression Book now
stands as the best all-around book on this subject, providing both the crucial
mathematical foundations of information theory and genuine, working C code
that actually compresses. Nelson, vice president of development for Greenleaf
Software, takes the reader through a variety of data compression techniques,
following their historical progression. First, the book covers statistics-based
techniques, showing the evolution from Huffman's 1952 landmark "method
for the construction of minimum-redundancy codes," to the need for
adaptive coding, to arithmetic coding, which answered the need for coding
with a nonintegral number of bits. The next three chapters deal with dictionary-based
compression, which began with Ziv and Lempel's universal algorithm paper
from 1977, and some variety of which is used in most contemporary compression
software, such as PKZIP, LHArc, and the mighty Stacker. Next are two chapters
on "lossy" compression, without dedicated hardware, of sound and
graphics. Each technique is demonstrated with a detailed walkthrough of
working C code; the code is also provided on the two accompanying disks.
One of the key points to emerge from the book is the importance of the model
used for compression. Data compression, Nelson points out, consists not
of mere "coding," that is, of deciding how best to represent symbols
based on the probability of their appearance in a stream of data, but, equally
important, of "modeling," which is what determines the probability
in the first place. A good example of what Nelson means by "model"
is the difference between viewing a file as a collection of characters on
the one hand, and viewing it as a collection of pairs or triplets of characters
on the other.
As discussed in Kas Thomas's article, "Entropy" (DDJ, February
1991), the minimum number of bits required to represent a symbol, based
on the probability of its appearance, is known as its entropy. But Nelson
makes the important point that
...unlike the thermodynamic measure of entropy, we can use no
absolute number for the information content of a given message. The problem
is that when we calculate entropy, we use a number that gives us the probability
of a given symbol. The probability figure we use is actually the probability
for a given model, not an absolute number. If we change the model, the probability
will change with it.... This seemingly unstable notion of a character's
probability proves troublesome to many people. They prefer that a character
have a fixed "true" probability.
But, in dictionary-based compression in particular, the model chosen is
all important. Another interesting point is that the history of data compression
techniques relates closely to the hardware available at any given time.
Abstract algorithms were discovered at just about the time it became practical
to implement them, given typical memory availability and CPU speeds. The
genesis of dictionary-based techniques in 1977 is a good example.
While walking through the entire history of data compression, Nelson takes
the reader all the way to the present, covering the despised Unisys LZW
patent, ARC, PKZIP, Yoshi's LHArc, Stac Electronics' QIC-122 compression
standard, and so on. While some of this material has appeared previously
in Nelson's DDJ articles, almost all of it is new.
For the many programs that come with the book, Nelson has built an extremely
nice foundation library, into which he plugs the different data compression
and decompression routines. At a time when you see so much repetitive code
in programming books, with each program differing by only a few lines from
the preceding one (books on Windows programming are particularly bad in
this way), it's nice to see Nelson's book practicing "compression"
at this conceptual level. This is truly "minimum redundancy coding!"
I was a little surprised, though, by the book's lack of emphasis on speed
of compression and decompression. This seems an important subject. The run
time for some of the programs could have been halved simply by using the
C setvbuf() function to read and write more data at a time. Also, the BIT-IO
module seems somewhat inefficient; running the compression programs under
a profiler indicates that almost all time is spent here, rather than in
the compression routines themselves. BIT-IO uses a "rack" in which
bits are stored until a complete byte can be written out to a file; I suspect
that writing out a 4-byte long instead would improve the performance. But
it is probably unfair to complain about this, in what is really a wonderfully
enjoyable book on programming. Even someone who isn't a C programmer will
probably get something from this inside look at how programs such as PKZIP
and Stacker operate.
Silicon Dreams
Nelson notes that "data compression is perhaps the fundamental expression
of Information Theory." There is something paradoxical about information
theory, because meaningless gibberish has greater entropy, that is, is worth
more bits, than the relatively structured gibberish that I'm writing now.
Information theory often comes under attack for neglecting "meaning."
For example, the brilliant art critic Rudolf Arnheim, in his Entropy
and Art, complains of "the information theorist, who persists in
ignoring structure." But the proof of the pudding is in the eating:
Completely random banging on keys does not compress as much as the text
file I'm writing. The existence of data compression programs, and of efficient
programs that not only detect but correct errors, shows that Claude Shannon
was, to put it mildly, on to something.
There are, of course, many good books on the subject of information theory.
Shannon's original paper from 1948, together with an essay by Warren Weaver,
is still in print as The Mathematical Theory of Communication, and
is amazingly readable and relevant after over 40 years.
A well-known popular account is John Pierce's Introduction to Information
Theory: Symbols, Signals, and Noise. A really popular account is Jeremy
Campbell's Grammatical Man: Information, Entropy, Language, and Life.
The problem with these popular surveys is that they bite off more than they
can chew: Much as critics of information theory demand that it somehow take
into account "meaning," the authors of these popular works on
information theory try to make it explain, in the course of 300 pages, life,
the universe, and everything. This in spite of Shannon's own warning, in
a 1956 article titled "The Bandwagon" (quoted in Campbell's Grammatical
Man), that information theory "has perhaps ballooned to an importance
beyond its actual accomplishments.... Seldom do more than a handful of nature's
secrets give way at one time."
Robert Lucky's recent popular explanation of information theory, Silicon
Dreams, does a wonderful job precisely because--like information theory
itself -- it does not try to explain everything. The mere fact that data
can be compressed is, when you think about it, mysterious enough without
also dragging in any possible connection between this and the ultimate heat
death of the universe. In contrast to other books on information theory
for a general audience, Lucky sticks solidly to subjects such as Shannon's
theorem, the nature of text, text input and output, speech, speech processing,
graphics, and graphics processing.
Those, in fact, are the key chapters in Lucky's book. For example, there
is a solid 50-page chapter on text, which includes in-depth looks at the
statistical properties of English and at text compression. The chapter on
text input has a wonderful discussion of the nature of typing and of keyboards.
(The persistence of the QWERTY keyboard against innovations such as Dvorak's
might serve as a lesson for anyone who wants to replace an intrenched mess
with a brand-spanking-new innovation.) Another 50-page chapter discusses
speech processing, including coding, synthesis, and recognition. A chapter
on picture processing looks in depth at the storage, generation, processing,
and coding of graphics.
Most impressive in all this is that Lucky, executive director of research
at AT&T Bell Laboratories, has managed to discuss all this in a book that
could easily be understood by one's non-computer spouse. Basically, Silicon
Dreams is a very readable survey, for a general audience, of signal
processing, the Fourier transform, Lempel-Ziv, Huffman coding, linear predictive
coding, and other topics you probably wouldn't expect to find discussed
outside a for-professionals-only text. Anyone who wants to understand the
theoretical foundation that makes possible software such as PKZIP and Stacker
will want to read Silicon Dreams. Even professionals who already
know about some of these topics will benefit from reading the book, because
it will probably teach them some things they didn't already know, and it
will certainly help put what they do know into a wider context.
Lucky's chapters on sound and graphics--like the code-intensive chapters
on the subject in Nelson's data compression books -- are important to anyone
interested in multimedia. I particularly recommend Lucky's account of the
failure of Picturephone, which seems to me to say something about multimedia
today.
The Entropy Problem
As I've said, Lucky concentrates on the mysterious-enough hard facts of
information theory, and leaves alone the more nebulous questions in which
others have become mired. This includes the ultimate question, which is
what, if anything, the entropy of information theory has to do with the
entropy of Clausius's second law of thermodynamics. Of course, they may
have nothing to do with each other; it is reported that John von Neumann
suggested that Shannon call his measure "entropy," because "no
one knows what entropy is, so in a debate you will always have the advantage."
In addition, Lucky notes that the striking similarity of the formulas involved
may reflect little more than the fact that "any analysis of order and
disorder would result in similar logarithmic equations." In any case,
he writes, "such philosophical arguments might be made appropriately
late at night over a bottle of wine." I've decided to take his advice
literally. It's late at night, and I've been into the wine, so I feel fully
equipped to talk about our final book this month, Maxwell's Demon.
This is a collection of 25 papers on the link between information theory
and the second law of thermodynamics, as they relate to a mythical being,
Maxwell's demon, who uses observation alone to defy the second law. The
general idea is that, if the type of entropy one might read about in a book
on data compression is the same as the entropy that Clausius said tends
to a maximum, then the demon is consistent with the second law.
Some of the flavor of these papers comes out in a remark in one of them,
Charles Bennett's "Notes on the history of reversible computation,"
which refers to "the realization that one bit of information is somehow
equivalent to k ln 2 units of entropy, or about 2.3 X 10{24} cal/Kelvin."
This is a remarkable realization!
Other papers included in Maxwell's Demon are Szilard's 1929 classic
"On the decrease of entropy in a thermodynamic system by the intervention
of intelligent beings," Landauer's "Computation: a fundamental
physical view" and "Irreversibility and heat generation in the
computing process," and Bennett's "The thermodynamics of computation."
Surprisingly, one viewpoint not well represented here is that of Edward
Fredkin, who is concerned, not with the physical basis of computation, but
with -- get this -- the computational basis of physical reality. The world,
in other words, is a computer, and reality is digital, not analog. (I wonder
what size hard disk it takes. Will it run Stacker?) Fredkin's reversible
logic gates and billiard-ball computer are discussed in some of the papers
in Maxwell's Demon, but for his wider views you might enjoy the biographical
sketch of Fredkin (who was reportedly the model for the character Professor
Steven Falken in the 1983 movie War Games) in Robert Wright's Three Scientists
and Their Gods. It goes very nicely with the Chianti.
The Data Compression Book
Mark Nelson
Redwood City, California: M&T Books, 1991
527 pages (two disks), $36.95
ISBN 1-55851-216-0
Silicon Dreams: Information, Man, and Machine
Robert W. Lucky
New York, NY: St. Martin's Press, 1989
411 pages, $13.95
ISBN 0-312-05517-X
Maxwell's Demon: Entropy, Information, Computing
Harvey S. Leff and Andrew F. Rex (editors)
Princeton, NJ: Princeton University Press, 1990
348 pages, $24.95
ISBN 0-691-08727-X
Electronic Review of Computer Books
Created 5/1/96 / Last modified 6/15/96 / webmaster@ercb.com