The Internet Encyclopedia (Volume 3)



Kroon WL040/Bidgolio-Vol I WL040-Sample.cls June 20, 2003 13:9 Char Count= 0


310 SPEECH AND AUDIO COMPRESSION

Assessment of speech is even more involved, because
speech is used in two-way communication and most material
is context-sensitive. Mean opinion score (MOS) testing, in
which panels of listeners rate the quality of short
sentences on a 5-point scale, is the most common method.
Besides speech quality, one can also test speech
intelligibility, although this is usually not an issue for
telephony-bandwidth speech at bit rates of 4 kb/s and higher.
It should be appreciated that none of the testing
methods described above can fully predict how people
will experience the quality in a real-world scenario, which
involves talking to people whose voices they know or lis-
tening to music they like.
Tests with human subjects are expensive and time-consuming,
and one would like to use objective measures that could
predict subjective quality based on a comparison between the
original and the processed version. For lossy compression,
simple objective measures such as segmental signal-to-noise
ratio (SNR) measurements are meaningless, because they do not
reflect whether the coding errors are audible. A more
effective approach is to include models that mimic our
auditory system and use the resulting model output to predict
subjective quality. Two standards based on such an approach
have recently been recommended: ITU-R PEAQ for the assessment
of audio compression techniques and ITU-T P.862 for the
assessment of telephony-quality speech compression
techniques. Although these methods have been shown to be
quite accurate for some scenarios, they should always be used
with caution and with a clear understanding of their
shortcomings.
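
To make the segmental SNR measure mentioned above concrete, the following is a minimal Python sketch of how such a number might be computed. The function name, the 160-sample frame length (20 ms at 8 kHz), and the clamping limits of -10 and 35 dB are assumptions chosen for illustration, not values taken from this chapter. The measure only compares waveforms sample by sample, so it cannot tell audible errors from inaudible ones, which is why it says little about the quality of lossy coders.

```python
import math


def segmental_snr_db(reference, processed, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Average the per-frame SNR (in dB) between two equal-length signals.

    frame_len, floor_db and ceil_db are illustrative assumptions.
    """
    if len(reference) != len(processed):
        raise ValueError("signals must have the same length")

    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = processed[start:start + frame_len]
        signal_energy = sum(s * s for s in ref)
        noise_energy = sum((s - d) ** 2 for s, d in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # skip silent frames, which carry no SNR information
        if noise_energy == 0.0:
            snr = ceil_db  # identical frame: report the upper limit
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        # Clamp so that a few extreme frames do not dominate the average.
        frame_snrs.append(min(max(snr, floor_db), ceil_db))

    if not frame_snrs:
        raise ValueError("no non-silent frames to evaluate")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
    attenuated = [0.9 * s for s in tone]
    print(segmental_snr_db(tone, attenuated))
```

Note that a simple 10% level change, which a listener would hardly notice, already limits the result to about 20 dB, whereas a perceptual coder can score far lower and still sound transparent.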

SPEECH CODING TECHNIQUES
Efficient digital representations of speech signals have been
a topic of research since the 1940s, but only since the
early 1990s have many applications become technically
and economically feasible. Digital cellular telephony has
been one of the main applications for speech coding, and
many digitization choices were made to be compatible
with wired digital networks. For example, the speech signals
are sampled at 8 kHz (thereby limiting the signal bandwidth
to 4 kHz), are single channel (mono), and use 8 to 16
bits/sample. The communication application puts a constraint
on the delay introduced by the compression operation. Not
only is it difficult to have a natural two-way conversation
with delays exceeding 250 ms, but long delays also make
echoes more noticeable. Such echoes are introduced, for
example, by the acoustic coupling between loudspeaker and
microphone (e.g., in a speakerphone) at either end of the
communication link. For conferencing applications,
which involve more than two parties, each party will hear
the combined signal of all other participants. Because
this combining of the signals needs to be done in the PCM
domain, it is necessary to decompress the signals, digitally
combine them, and compress them again. The delay introduced
by compression is then compounded, thereby aggravating the
problems mentioned above. Most
compression algorithms introduce delay because they an-
alyze the signal in blocks or frames with a duration of 10
to 30 ms. Analysis in frames is necessary to better charac-
terize the signal behavior and its variations.
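
The conferencing step described above (decompress, combine in the PCM domain, compress again) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the frames are assumed to be already decoded to 16-bit linear PCM, and all frames are assumed to cover the same time interval. The decoding before this step and the re-encoding after it are exactly where the extra codec delay accumulates.

```python
def mix_for_each_party(decoded_frames):
    """Return, for every party, the sum of all *other* parties' PCM frames.

    decoded_frames maps a party name to a list of 16-bit PCM samples.
    The names and the dict layout are illustrative assumptions.
    """
    lengths = {len(frame) for frame in decoded_frames.values()}
    if len(lengths) > 1:
        raise ValueError("all frames must have the same number of samples")
    n = lengths.pop() if lengths else 0

    # Sum everybody once, then subtract each party's own contribution.
    total = [sum(frame[i] for frame in decoded_frames.values())
             for i in range(n)]

    mixes = {}
    for party, frame in decoded_frames.items():
        mixes[party] = [
            # Saturate to the 16-bit range instead of wrapping around.
            max(-32768, min(32767, total[i] - frame[i]))
            for i in range(n)
        ]
    return mixes


if __name__ == "__main__":
    frames = {"a": [100, -200], "b": [1000, 30000], "c": [5, 10000]}
    for party, mix in mix_for_each_party(frames).items():
        print(party, mix)
```

In a real bridge each mixed frame would then be passed to a speech encoder, so every participant's signal goes through two encode and decode stages before it is heard.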
For communication applications it is also important to
put constraints on the complexity of the compression
operation because each end-point needs both an encoder
and a decoder. This is even more relevant for wireless
applications, where the end-point is battery-powered and
high complexity will reduce battery life. Complexity is
defined in terms of computational load (MIPS) and memory
usage (RAM and ROM). For most speech and audio coding
algorithms there is an asymmetry in complexity, and the
encoder can be several times more complex than the decoder.
Most speech coders are based on the lossy compression
paradigm, taking advantage of the properties of the audi-
tory system and the properties of the speech production
mechanism. The latter can be taken advantage of by using
so-called parametric coders. With these coders the speech
signal is modeled by a limited number of parameters,
which are usually related to the physical speech production
mechanism. The parameters are obtained by analyzing the
speech signal and are quantized before transmission.
The decoder will use these parameters to reconstruct a
rendering of the original signal. When the input and out-
put waveforms are compared, the resemblance may be
weak but the signals may sound very similar. Using para-
metric approaches, it is feasible to achieve reasonable
quality with very low bit rates (2 to 4 kb/s). The quality
is limited by the accuracy of the model. This is illustrated
in Figure 5. To avoid this limit on quality, a more com-
mon approach is to use waveform-approximating coders.
These coders maintain the waveform of the original sig-
nal as much as possible while taking advantage of the
properties of both the speech production and auditory
mechanisms. The resulting quality is better, at the expense
of higher bit rates. At low bit rates, however, the quality
of a waveform coder will be lower than that of a parametric
coder operating at the same rate (see Figure 5).
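
As a rough illustration of how a frame of speech can be reduced to "a limited number of parameters", the sketch below fits a linear predictor to one frame using the autocorrelation method and the Levinson-Durbin recursion. Linear prediction is assumed here as an example of a parametric description; the text above does not name a specific model, and the function names, the predictor order of 10, and the 160-sample frame are likewise assumptions. Quantization of the parameters is left out.

```python
import math


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], with r[lag] = sum of frame[i] * frame[i - lag]."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a predictor x[n] ~ sum_k a[k] * x[n-k].

    Returns (coefficients a[1..order], final prediction error energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence): keep remaining zeros
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


if __name__ == "__main__":
    # A decaying 500 Hz resonance sampled at 8 kHz stands in for speech.
    frame = [math.sin(2 * math.pi * 500 * n / 8000) * 0.99 ** n
             for n in range(160)]
    coeffs, err = levinson_durbin(autocorrelation(frame, 10), 10)
    print(len(frame), "samples ->", len(coeffs), "coefficients")
```

The point of the sketch is the data reduction: a 160-sample frame is summarized by ten coefficients plus an energy term, and a decoder that rebuilds a signal from such parameters need not reproduce the original waveform sample by sample.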
Speech as produced by humans has certain proper-
ties that can be taken advantage of for compression. It
has limited energy above 8 kHz, and it has a limited
dynamic range. This allows sampling at rates between 8 and
16 kHz and PCM quantization with 12 to 16 bits/sample. Using
nonuniform quantization (e.g., the quantizer step sizes are
small for small input values and large for large input
values), it is possible to quantize telephone speech with
8 bits per sample. Figure 6 shows a waveform and its
corresponding spectrogram. From this figure one can see that
the envelope of the amplitudes changes slowly as a function
of time. The spectrogram shows that certain frequencies are
stronger than others. These emphasized frequencies are
called formants

Figure 5: Quality vs. bit rate curves for waveform-approximating
and parametric (model-based) coders. Horizontal axis: bit rate,
1 to 64 kb/s; vertical axis: speech quality.