In an etymology-based world, "optimization" books would show
how to make the "best" code for a given algorithm, and "best"
would cover a range of concepts: clarity, portability, connectivity, and
the like. In such a world, the title of Michael Abrash's book, Zen of
Code Optimization would be Zen of Code Acceleration because it's about
making programs go faster--period. That's a narrow way to define optimizing,
but there are enough generalist books out already. Speed specialists will
know that this book is for them as soon as they read the first sentence
in the Preface: "This book is the diary of a personal passion, my quest
for ways to write the fastest possible software for IBM-compatible computers
in C, C++, and assembly language."
In fact, the approximate ratios are 5 percent C++, 25 percent C, and 70
percent assembly language. No surprise there: To really control the cycles,
programmers have to use assembly language. But Zen of Code Optimization
introduces this proposition in a sober fashion: Chapter 1 is an example
of a C program that really can't be improved by just dropping in some inline
assembly code. The first step--and Zen never stints on the warnings that
this must be the first step--is improve the algorithm by timing it, timing
the alternatives, stopping and thinking, deciding if any performance improvement
would really be worth the effort, and then--only then--rewriting the critical
routine in assembly language.
But how can you make your routine faster if you can't figure out how fast
the routine is running, before and after their changes? Zen solves this
problem in Chapter 3 by introducing a timing program called "the Zen
Timer," which comes on the 3.5-inch diskette packaged with the book.
Certainly, a profiler will do a better job of describing what routines are
in use and where apparent bottlenecks lie. However, a profiler is not a
precise instrument. Typically, profilers depend on the computer's system
clock to interrupt them, which happens 18.2 times a second--a relatively
infrequent occurrence on a computer that's doing several million instructions
per second. Not only that, the profiler and the clock-interrupt-routine
themselves consume cycles, so the act of measurement affects the thing being
measured. To truly time something, you have to run a routine several million
times with varying clock speeds in a loop.
The Zen Timer presents no such difficulties. It works by reprogramming the
8253 (or equivalent) timer chip that comes standard with every "IBM
compatible" computer. The 8253 is incrementing an internal counter
approximately 1,000,000 times per second. The Zen Timer retrieves the 8253's
counter value, then it masks all interrupts and executes whatever routine
has to be timed. As soon as that's over, it gets the 8523's current count
(the 8523 keeps incrementing independently of what's happening on the CPU).
It subtracts the count value that it got before entering the loop, and lo!
The result is a timing of the routine's speed to the nearest microsecond,
give or take a bit (some caveats apply).
As far as I can tell though, Zen of Code Optimization does not mention
a potentially irritating detail: The Zen Timer does not work from Windows
or from the Windows DOS Box. So the question arises, if you have to go outside
Windows when you use the Zen Timer to test a routine, are the results valid
when we put the same routine in a Windows application? In any ordinary situation,
yes. There are a few instructions which run differently in Windows' Enhanced
386 mode: POP ES; MOV DS,AX; or anything else that causes a segment identifier
to change. Still, such instructions are too rare to make the Zen Timer's
results meaningless. This utility is the best thing about the book.
Unfortunately the x86 instruction set doesn't provide any way to work directly with only the upper half of a 32-bit register. The next best solution is to rotate the register to give you access in the lower 16 bits to the half you need at any particular time....Zen dismisses using ROR for this, because "shifts and rotates are among the worst performing instructions of the 486, taking two to three cycles to execute." To the rescue comes BSWAP, which "executes in just 1 cycle." The discussion culminates in Example 4, which shows that you can use the top half of the ECX register as a loop counter and the bottom half as a "skip value," swapping the top half with the bottom half when you need to directly refer to the loop-counter half.
MOV BX,offset Mem ADD word ptr [bx+4],55(b)
MOV CX,offset Mem ADD word ptr [BX+4],55Example 2: These two instructions run in three cycles.
MOV AL,5 ;"Loading" AL, which is one half of AX MOV DX,AX ;Using AX, the whole register, as the sourceExample 3: (a) CX is the destination; (b) there is no execution penalty because a penalty is already being applied.
DEC CL is slower than DEC CL SUB CX,5 SUB AX,5(b)
MOV BL,BL is NOT slower than MOV BL,BL MOV [BX],BX MOV [BX],CXExample 4: You can use the top half of the ECX register as a loop counter.
mov cx,[InitialValue] bswap ecx ;Put skip value in upper half of ECX mov cx,64h ;Put loop count in CX looptop:... bswap ecx ;Make skip value word accessible in CX add bx,cx ;Skip BX ahead inc cx ;Set next skip value bswap ecx ;Put loop count in CX dec cx ;Count down loop jnz looptop ;The loop will repeat 64h times.Example 5: This code is faster than using BSWAP.
mov cx,[InitialValue] and ecx,000ffffh ;Clear the upper part of ECX to 0 or ecx,00630000h ;Put 63h directly in the upper part of ECX looptop:... add bx,cx ;Skip BX ahead inc cx ;Set next skip value sub ecx,00010000h ;Count down loop jnc looptop ;The loop will repeat 64h times.