The Internet Encyclopedia (Volume 3)





SPEECH CODING TECHNIQUES

Figure 6: Time waveform of the utterance “Why were you away a year, Roy?” spoken by
a female talker (top) and corresponding spectrogram (bottom). The spectrogram shows the
power spectrum as a function of time, where the gray level indicates the power level, from
low (white) to high (black).

and are caused by resonant frequencies in the vocal tract
in response to the glottal excitation signal. Changing the
shape of the mouth cavity over time changes the reso-
nant frequencies. The relative position of these formants
defines the consonants and vowels that we perceive. One
can also see a more harmonic component, which is due to
the periodic excitation provided by the vocal cords. This
is referred to as the pitch of a sound and varies between
100 and 250 Hz for males and 200 and 400 Hz for females.
All these features are produced by the human articula-
tory system, which changes slowly in time, because it is a
biological system driven by slowly moving muscles. As a
result it was found very effective to represent the formants
by a slowly varying adaptive digital filter, the so-called
linear or short-term prediction filter. With only 10 predictor
coefficients updated once every 20 ms it is possible to
accurately reproduce the evolution of speech formants.
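
The short-term predictor described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular coder: it assumes an 8 kHz sampling rate (so a 20 ms frame holds 160 samples), it uses the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain the coefficients, and all function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients a[1..order].

    The prediction is s_hat[n] = sum_k a[k] * s[n - k].
    Returns (coefficients, remaining prediction error energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:               # silent frame: nothing left to predict
            break
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err              # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a[1:], err


def short_term_residual(frame, coeffs):
    """Filter the frame with A(z) = 1 - sum_k a[k] z^-k (zero history)."""
    out = []
    for n in range(len(frame)):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n >= k)
        out.append(frame[n] - pred)
    return out
```

With a 160-sample frame, `levinson_durbin(autocorrelation(frame, 10), 10)` would yield the 10 coefficients for that frame, and `short_term_residual` would produce the corresponding residual. Practical coders usually window the frame and carry filter history across frame boundaries; both are omitted here for brevity.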
Figure 7 shows a short segment of a speech signal
and the corresponding signal after filtering with an adaptive
linear prediction filter, called the residual signal.
Because this signal is better behaved (less correlated from
sample to sample), it is usually easier to quantize with fewer
bits per sample. The predictor coefficients
have to be quantized before transmission, and
numerous methods are available. The most effective
methods require only 1,500 to 2,000 bits/s to transmit 10
coefficients every 10 to 20 ms. By recognizing the peri-
odicity in the signal for voiced sounds (e.g., vowels) it
is possible to further improve predictor efficiency. A so-called
long-term or pitch predictor is able to predict the
periodic component, resulting in an even more noiselike
residual signal. The long-term predictor consists of a
variable delay line with a few filter coefficients. The delay
and coefficients are updated once every 5 to 10 ms.
The most typical configuration is forward adaptive: the encoder
computes the delay and coefficients from the input signal and
sends them as side information, which requires about 1,500 to
2,000 bits per second. Figure 8 shows the signals after
short-term and long-term prediction, respectively.
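
A long-term predictor of the kind described above might be sketched as follows. This is a simplified illustration, not a definitive implementation: it uses a single filter coefficient (the text mentions "a few"), it assumes an 8 kHz sampling rate so that the 100 to 400 Hz pitch range quoted earlier corresponds to delays of 20 to 80 samples, and the function names and default arguments are hypothetical.

```python
def find_pitch_predictor(signal, start, length, min_lag=20, max_lag=80):
    """Pick the delay and one-tap gain for signal[start:start + length].

    The chosen delay maximizes the normalized correlation between the
    subframe and the same signal delayed by `lag` samples.
    """
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        if start - lag < 0:          # not enough past samples for this lag
            break
        cross = 0.0
        energy = 0.0
        for n in range(start, start + length):
            past = signal[n - lag]
            cross += signal[n] * past
            energy += past * past
        if energy > 0.0 and cross > 0.0:
            score = cross * cross / energy
            if score > best_score:
                best_lag, best_gain, best_score = lag, cross / energy, score
    return best_lag, best_gain


def long_term_residual(signal, start, length, lag, gain):
    """Subtract the delayed, scaled signal from the subframe."""
    return [signal[n] - gain * signal[n - lag]
            for n in range(start, start + length)]
```

In a complete coder this search would normally run on the short-term residual rather than on the speech signal itself, once per 5 to 10 ms subframe (40 to 80 samples at the assumed 8 kHz rate), and the resulting delay and gain would be quantized and transmitted.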
The signal shown at the bottom of Figure 8 resembles
a noiselike signal, with a reduced correlation and reduced
dynamic range. As a result it can be quantized more
efficiently.
At this point we have all the components needed for
a linear predictive coder. A block diagram is shown in
Figure 9. The signal is filtered with the short-term filter
A(z) and long-term filter P(z), and the remaining residual
signal is quantized. The predictor parameters and quan-
tized residual signal are transmitted or stored. The de-
coder, after decoding the quantized residual signal, filters
it through the inverse long-term and short-term predic-
tion filters. Note that without quantization of the resid-
ual signal the decoder can exactly reproduce the original
signal. For quantization of the residual signal many tech-
niques exist. The simplest quantizers are scalar uniform
or nonuniform PCM quantizers. For acceptable results at
least 4 to 5 bits per sample are needed. Even refinements
such as adaptive quantization, in which the quantizer step
sizes are adjusted over time, will not reduce the number of
bits per sample significantly. More efficient quantization
can be obtained through the use of vector quantization
(VQ), in which multiple samples are quantized simulta-
neously.
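
The encoder and decoder filtering and the two quantizer types mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: only the short-term filter A(z) is shown, the residual is quantized open loop exactly as in the description above, the uniform quantizer is a midrise design with a fixed range, and the function names, the `max_abs` parameter, and the codebook layout are hypothetical.

```python
import math


def analysis_filter(signal, coeffs):
    """Encoder side: e[n] = s[n] - sum_k a[k] * s[n - k]."""
    out = []
    for n in range(len(signal)):
        pred = sum(a * signal[n - k]
                   for k, a in enumerate(coeffs, start=1) if n >= k)
        out.append(signal[n] - pred)
    return out


def synthesis_filter(residual, coeffs):
    """Decoder side (inverse filter): s[n] = e[n] + sum_k a[k] * s[n - k]."""
    out = []
    for n in range(len(residual)):
        pred = sum(a * out[n - k]
                   for k, a in enumerate(coeffs, start=1) if n >= k)
        out.append(residual[n] + pred)
    return out


def uniform_quantize(x, bits, max_abs):
    """Uniform scalar quantizer; returns the reconstructed value."""
    levels = 2 ** bits
    step = 2.0 * max_abs / levels
    index = int(math.floor(x / step))
    index = max(-levels // 2, min(levels // 2 - 1, index))  # clip to range
    return (index + 0.5) * step


def vq_encode(vector, codebook):
    """Vector quantizer: index of the nearest codeword (squared error)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2
                                 for v, c in zip(vector, codebook[i])))
```

Passing the output of `analysis_filter` straight into `synthesis_filter` with the same coefficients returns the input up to floating-point rounding, which is the exact-reconstruction property noted above. For a sense of scale, at an assumed 8 kHz sampling rate, 4 to 5 bits per sample corresponds to 32 to 40 kbit/s for the residual alone, which is why quantizing several samples jointly with a codebook is attractive.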