Notice how port 0 and port 1 both have double-speed ALUs (arithmetic logical
units). This is a significant aspect of IA-32 optimizations because it means
that each ALU can actually perform two operations in a single clock cycle. For
example, it is possible to perform up to four additions or subtractions during
a single clock cycle (two in each double-speed ALU). On the other hand, non-
SIMD floating-point operations are pretty much guaranteed to take at least
one cycle because there is only one unit that actually performs floating-point
operations (and another unit that moves data between memory and the FPU
stack).
Figure 2.4 can help shed light on instruction ordering and algorithms used by
NetBurst-aware compilers, because it provides a rationale for certain otherwise-
obscure phenomena that we’ll be seeing later on in compiler-generated code
sequences.
Most modern IA-32 compiler back ends can be thought of as NetBurst-
aware, in the sense that they take the NetBurst architecture into consideration
during the code generation process. This is going to be evident in many of the
code samples presented throughout this book.


Branch Prediction

One significant problem with the pipelined approach described earlier has to
do with the execution of branches. The problem is that processors that have a
deep pipeline must always know which instruction is going to be executed
next. Normally, the processor simply fetches the next instruction from memory
whenever there is room for it, but what happens when there is a conditional
branch in the code?
Conditional branches are a problem because often their outcome is not
known at the time the next instruction must be fetched. One option would be
to simply hold off on fetching and processing further instructions until it
becomes known whether or not the branch is taken. This
would have a detrimental impact on performance because the processor only
performs at full capacity when the pipeline is full. Refilling the pipeline takes
a significant number of clock cycles, depending on the length of the pipeline
and on other factors.
The solution to these problems is to try and predict the result of each condi-
tional branch. Based on this prediction the processor fills the pipeline with
instructions that are located either right after the branch instruction (when the
branch is not expected to be taken) or from the branch’s target address (when
the branch is expected to be taken). A missed prediction is usually expensive
and requires that the entire pipeline be emptied.
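For example (a hypothetical fragment, not taken from real compiler output), consider a forward conditional branch. If the processor predicts that the branch will not be taken, it keeps fetching and executing the fall-through instructions that follow it; if the branch turns out to be taken after all, that speculative work is discarded and the pipeline must be refilled from the branch’s target address:

         cmp   eax, 0
         jz    SkipWork       ; conditional forward branch
         add   ebx, eax       ; fall-through path: fetched and executed
         inc   ecx            ; speculatively if the branch is predicted not taken
    SkipWork:
         mov   edx, ebx       ; both paths continue here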
The general prediction strategy is that backward branches that jump to an
earlier instruction are always expected to be taken, because those are typically
used in loops, where for every iteration there will be a jump, and the only time
the branch is not taken is when the loop terminates.
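As an illustration (again a hypothetical fragment), here is the shape of a typical counted loop. The JNZ at the bottom is a backward branch: it is taken on every iteration except the last one, so predicting it as taken is correct almost all of the time:

         mov   ecx, 100       ; iteration count
    LoopTop:
         dec   ecx            ; loop body would go here
         jnz   LoopTop        ; backward branch, predicted taken; falls
                              ; through only when ECX reaches zero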

