Additionally, the decompiled output might be structured somewhat differ-
ently from the original source code because of compiler optimizations. In this
chapter, I will demonstrate several limitations imposed on native code decom-
pilers by modern compilers and show how precious information is often elim-
inated from executable binaries.
Typical Decompiler Architecture
In terms of its basic architecture, a decompiler is somewhat similar to a com-
piler, except that, well... it works in the reverse order. The front end, which is
the component that parses the source code in a conventional compiler, decodes
low-level language instructions and translates them into some kind of inter-
mediate representation. This intermediate representation is gradually
improved by eliminating as much useless detail as possible, while emphasiz-
ing valuable details as they are gathered in order to improve the quality of the
decompiled output. Finally, the back end takes this improved intermediate
representation of the program and uses it to produce a high-level language
representation. The following sections describe each of these stages in detail
and attempt to demonstrate this gradual transition from low-level assembly
language code to a high-level language representation.
Intermediate Representations
The first step in decompilation is to translate each individual low-level instruc-
tion into an intermediate representation that provides a higher-level view of
the program. Intermediate representation is usually just a generic instruction
set that can represent everything about the code.
Intermediate representations are different from typical low-level instruction
sets. For example, intermediate representations typically have an infinite num-
ber of registers available (this is also true in most compilers). Additionally,
even though the instructions have support for basic operations such as addi-
tion or subtraction, there usually aren’t individual instructions that perform
these operations. Instead, instructions use expression trees (see the next sec-
tion) as operands. This makes such intermediate representations extremely
flexible because they can describe anything from assembly-language-like
single-operation-per-instruction type code to a higher-level representation
where a single instruction includes complex arithmetic or logical expressions.
Some decompilers such as dcc [Cifuentes2] have more than one intermedi-
ate representation, one for providing a low-level representation of the pro-
gram in the early stages of the process and another for representing a
higher-level view of the program later on. Others use a single representation
for the entire process and just gradually eliminate low-level detail from the
code while adding high-level detail as the process progresses.
Decompilation 459