of this process. The auditory system has been shown to
have masking properties. Masking describes the process
in which one signal becomes inaudible in the presence
of another signal. In other words, under certain circum-
stances it is possible to make quantization noise inaudi-
ble while the decoded audio signal is present. Masking
can happen both in time and in frequency. Understand-
ing the principles of masking has taken many decades of
research using both physiological and physical measure-
ments. Masking data are obtained through psychoacous-
tical studies, in which subjects are exposed to test signals
and asked about their ability to detect changes (increase
in frequency, audibility, etc.). Most of the understanding
of masking is based on simple tones and noise. Because
complex signals can be viewed as composites of time-
varying tones, a common approach has been to derive the
masked threshold by analyzing the signal tone by tone in
specific frequency bands, related to the frequency bands
used by the human auditory system to analyze sounds.
These bands, called critical bands, are spaced nonuni-
formly with increasing bandwidth for higher frequencies.
In each critical band, the signal and its corresponding
masking function are calculated, and the masked thresh-
old is derived as a superposition over the complete fre-
quency band. It should be noted that the actual procedure
is much more complicated, taking into account several
interactions and characteristics of the signals (e.g., if the
signal is noiselike or tonelike). Figure 11 gives an example
of the power spectrum of a signal and its corresponding
masked threshold. In this figure, as long as the quantiza-
tion noise remains below the solid line, it will be inaudible.
In general the model that is used to derive the masked
thresholds is referred to as the psychoacoustic or percep-
tual model. Building a good perceptual model and match-
ing it properly to a given coder structure is a very complex
task. It also should be noted that for a given coder struc-
ture (or coder standard such as MPEG 1, Layer 3), it is
possible to improve the perceptual model while still being
compliant to the standard. This also explains the quality
differences in various encoders that all support the same
standard.
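The per-band procedure described above can be sketched in code. The snippet below derives a crude masked threshold from a signal's power spectrum: FFT bins are grouped into critical bands using Traunmüller's approximation of the Bark scale, and the threshold in each band is placed a fixed offset below the band's peak level. The 15 dB offset is an illustrative assumption, and the inter-band spreading and tonality analysis of real perceptual models are omitted for brevity.

```python
import numpy as np

def bark(f):
    # Traunmüller's approximation of the Bark (critical-band) scale.
    return 26.81 * f / (1960.0 + f) - 0.53

def masked_threshold(signal, fs, offset_db=15.0):
    """Crude per-band masked threshold: in each critical band, the
    threshold sits `offset_db` below the band's peak power.  Real
    perceptual models add spreading across bands and a tonality
    estimate; both are omitted here for brevity."""
    n = len(signal)
    power = np.abs(np.fft.rfft(signal * np.hanning(n))) ** 2
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    power_db = 10.0 * np.log10(power + 1e-12)
    band = np.floor(bark(freqs)).astype(int)  # critical-band index per bin
    thresh = np.empty_like(power_db)
    for b in np.unique(band):
        sel = band == b
        thresh[sel] = power_db[sel].max() - offset_db
    return freqs, power_db, thresh

fs = 16000
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t)  # a 1 kHz masker tone
freqs, power_db, thresh = masked_threshold(x, fs)
```

The resulting `thresh` plays the role of the solid stepped line in Figure 11: quantization noise kept below it in every band would, under this simplified model, be judged inaudible.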
To achieve redundancy removal, and at the same time
take advantage of the frequency-domain masking proper-
ties, it is beneficial to perform a spectral decomposition
of the signal by means of a filter bank or transform.

Figure 11: Example of signal frequency power spectrum and
its corresponding masked threshold (solid stepped line).

Most
modern audio coders are based on some form of lapped
transform, which not only provides computational effi-
ciency, but also allows perfect reconstruction. In other
words the transform and its inverse will produce a de-
layed version of the original time signal. The transform
sizes and overlaps will be chosen in such a way that the
signal is critically sampled (i.e., the number of frequency
components is the same as the number of time samples).
The size of the transform will determine the spectral res-
olution. A larger transform will provide better spectral
resolution at the expense of decreased time resolution. A
common solution is to make the transform size adapt to
the signal. For stationary signals, a large transform size
is chosen, whereas for nonstationary signals or onsets a
smaller transform size is chosen. Typically sizes vary from
256 to 2,048 samples for sampling rates in the range 20
to 40 kHz. This process is called window switching, and
care is taken to make sure that the whole process is invert-
ible. A commonly used transform is the modified discrete
cosine transform (MDCT). It uses a transform of length
2M samples, which advances M samples between adjacent
transforms. It is critically sampled and only M coefficients
are generated for each 2M set of input samples. Compu-
tationally efficient implementations have contributed to
the widespread use of the MDCT in many audio coding
standards.
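The MDCT's critical sampling and perfect reconstruction can be demonstrated with a direct (matrix) implementation. The sketch below uses the sine window, which satisfies the Princen-Bradley condition so that time-domain aliasing cancels between overlapped frames; the 2/M scaling on the inverse follows the usual windowed-TDAC convention. A real coder would use a fast FFT-based MDCT rather than this O(M²) version, which is for clarity only.

```python
import numpy as np

def mdct_matrices(M):
    # Cosine kernel shared by the forward and inverse transforms,
    # plus the sine window (satisfies the Princen-Bradley condition).
    n = np.arange(2 * M)
    k = np.arange(M)
    C = np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))
    w = np.sin(np.pi / (2 * M) * (n + 0.5))
    return C, w

def mdct_analysis(x, M):
    """Window 50%-overlapping frames of 2M samples; return M MDCT
    coefficients per frame (critical sampling: M outputs per M-sample hop)."""
    C, w = mdct_matrices(M)
    return [C @ (x[i:i + 2 * M] * w) for i in range(0, len(x) - 2 * M + 1, M)]

def mdct_synthesis(coeffs, M):
    """Inverse-transform each frame, window again, and overlap-add by M;
    time-domain aliasing cancels between adjacent frames."""
    C, w = mdct_matrices(M)
    y = np.zeros(M * (len(coeffs) + 1))
    for i, X in enumerate(coeffs):
        y[i * M:i * M + 2 * M] += (2.0 / M) * (C.T @ X) * w
    return y

M = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(8 * M)
y = mdct_synthesis(mdct_analysis(x, M), M)
# Reconstruction is exact except for the first and last M samples,
# which lack an overlapping partner frame.
```

Note that each 2M-sample frame yields only M coefficients, yet the signal is recovered exactly where frames overlap, which is precisely the critical-sampling property described above.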
Figure 12 shows a generic block diagram of an au-
dio encoder incorporating the filter bank and the percep-
tual model. The resulting spectral components (MDCT
coefficients) are quantized in such a way that the result-
ing quantization noise is inaudible. This is accomplished
by using the masking level obtained from the perceptual
model. The amount of noise is controlled by the resolu-
tion of the quantizer. By choosing a different quantizer
step size, the amount of noise can be adjusted. Typi-
cally the same quantizer is used for a set of coefficients,
and the corresponding step size for that set is transmit-
ted to the decoder. It should be noted that at this point,
due to quantization, the decoded signal will be different
from the original (i.e., lossy coding). To accomplish per-
ceptually lossless coding, we need to make sure that the
quantization noise remains below the masked threshold.
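A minimal sketch of this quantization step, using a uniform (mid-tread) quantizer: the encoder transmits only integer indices plus the step size, and in a coder the step size for each band would be chosen so the resulting error power stays below that band's masked threshold. The coefficient values and step size here are illustrative, not taken from any standard.

```python
import numpy as np

def quantize(coeffs, step):
    # Mid-tread uniform quantizer: round to the nearest multiple of `step`.
    # The decoder needs only these integer indices and the step size.
    return np.rint(coeffs / step).astype(int)

def dequantize(indices, step):
    return indices * step

# Hypothetical coefficients for one band; a larger `step` means
# fewer levels, fewer bits, and more quantization noise.
rng = np.random.default_rng(1)
coeffs = rng.standard_normal(16)
step = 0.25
idx = quantize(coeffs, step)
rec = dequantize(idx, step)
# Rounding bounds the per-coefficient error by step/2.
err = np.abs(rec - coeffs)
```

Because rounding bounds the error by half the step size, lowering `step` until the noise power in a band falls below its masked threshold is the essence of the noise-allocation loop.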
The redundancy removal is accomplished by encoding the
quantizer indices with lossless coding techniques (e.g.,
Huffman coding). To avoid confusion it should be clear
that this is a lossless coding technique on quantizer in-
dices; hence the overall operation is still a lossy codingAnalysis
Filter
BankAnalysis
Filter
BankQuantizerQuantizer Noiseless
CodingNoiseless
Coding BitstreamFormatterBitstream
FormatterPerceptual
ModelPerceptual
ModelNoise
AllocationNoise
Allocationaudio
inputIRRELEVANCY REMOVAL REDUNDANCY
REMOVAL
Figure 12: Block diagram of generic audio encoder.
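The gain available from the noiseless-coding stage can be illustrated by comparing the empirical entropy of the quantizer indices against a fixed-length binary code. Quantized transform coefficients cluster around zero, so a Huffman or arithmetic coder approaches the lower entropy rate; the rounded Laplacian index distribution below is an assumption chosen to mimic that clustering.

```python
import numpy as np
from collections import Counter

def entropy_bits(symbols):
    """Empirical first-order entropy in bits/symbol -- the rate an
    ideal entropy coder (Huffman, arithmetic) approaches for
    independent, identically distributed symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Quantizer indices of transform coefficients cluster around zero
# (modeled here, for illustration, as a rounded Laplacian).
rng = np.random.default_rng(2)
idx = np.rint(rng.laplace(scale=1.0, size=4096)).astype(int)
h = entropy_bits(idx)
# Bits/symbol of a fixed-length code covering the same index range.
fixed = float(np.ceil(np.log2(idx.max() - idx.min() + 1)))
```

The entropy `h` falls well below the fixed-length rate `fixed`, which is the redundancy that the noiseless-coding block removes without altering the indices themselves.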