The Internet Encyclopedia (Volume 3)


Speech and Audio Compression


Peter Kroon, Agere Systems

Introduction
Compression for Packet Networks
Speech and Audio Quality Assessment
Speech Coding Techniques
Speech Coding Standards
Audio Coding Techniques
Audio Coding Standards

Applications
Internet Telephony
Audio Streaming
Conclusion
Glossary
Cross References
Further Reading

INTRODUCTION
Audible signals such as speech and music are acous-
tic analog waveforms, pressure changes that propagate
through a medium such as air or water. The waveforms
are created by a vibrating source such as a loudspeaker
or musical instrument and detected by a receptor such as
a microphone diaphragm or eardrum. An example of a
simple waveform is a pure tone, a periodic signal that re-
peats many times per second. The number of repetitions
per second is its frequency and is measured in hertz (Hz).
Audible tones are typically in a range from 20 to 20,000
Hz, which is referred to as the bandwidth of audible sig-
nals. The tone will create a sound pressure displacement
that is related to its amplitude. Signals with high ampli-
tude will sound louder than signals with low amplitude,
and the range from soft to loud is called the dynamic range.
Complex sounds (e.g., the sound of a piano, or speech)
consist of combinations of many tones of different fre-
quencies and amplitudes that vary over time.
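This idea of building complex sounds from pure tones can be illustrated with a short sketch (Python, not part of the original chapter; the component frequencies and amplitudes below are made-up values for illustration):

```python
import math

def tone(freq_hz, amp, t):
    """Sample value of a pure tone with the given frequency and amplitude at time t (seconds)."""
    return amp * math.sin(2 * math.pi * freq_hz * t)

def complex_sound(components, t):
    """A complex sound as a sum of pure tones; components is a list of (freq_hz, amp) pairs."""
    return sum(tone(f, a, t) for f, a in components)

# A crude stand-in for a richer timbre: a 440 Hz fundamental plus two overtones.
components = [(440.0, 1.0), (880.0, 0.5), (1320.0, 0.25)]
sample = complex_sound(components, t=0.001)
```

In a real instrument sound, the amplitudes of the components also change over time, which is what gives each instrument its characteristic attack and decay.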
Using a microphone one can capture an acoustic wave-
form and convert it into an electric signal or waveform.
This signal can be converted back to an acoustic signal by
using a loudspeaker. To represent this analog waveform
as a digital signal, it is necessary to find a numerical rep-
resentation that preserves its characteristics. The process
of converting an analog signal to a digital signal is usually
referred to as digitization. Digital representation of audio
and speech signals has many advantages. It is easier to
combine with other media such as video and text, and
it is easier to make the information secure by applying
encryption. Digital representations also allow procedures
to protect against impairments when transmitting the sig-
nals over error-prone communication links. The main dis-
advantage is that straightforward digitization of analog
signals results in data rates that require much more ca-
pacity of the physical channel than the original analog
signal.
Before we provide some examples of this dilemma, let
us first take a look at the principles of digitization. To
digitize an analog audio signal it is necessary to sample
the signal at discrete instants of time, at a rate of at least
twice the highest frequency present in the signal (this is
the sampling or Nyquist theorem). The frequency at which
the signal is sampled is referred to as the sampling fre-
quency. Typical sampling frequencies for speech

signals are between 8 and 16 kHz, whereas for music sig-
nals ranges between 16 and 48 kHz are more common. To
get a digital representation, the sample values need to be
a discrete set of numbers represented by a binary code.
This process is referred to as quantization. In contrast
to sampling, which allows one to perfectly reconstruct
the original analog waveform, quantization will introduce
errors that will remain after the analog signal is recon-
structed. The quantization error (defined as the difference
between the analog sample value and the discrete value)
can be made smaller by using more bits per sample. For
example, an 8-bit number allows 2^8 = 256
different values, whereas a 16-bit number allows 65,536
different values. For speech signals, between 8 and 16 bits
per sample are adequate, whereas for high-quality music
signals between 16 and 24 bits per sample are commonly
used.
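The relationship between bits per sample and quantization error can be made concrete with a small sketch (Python, not from the original text; it uses a simple uniform mid-point quantizer as an illustration, which is one of several possible quantizer designs):

```python
import math

def quantize(x, bits):
    """Uniformly quantize a sample x in [-1.0, 1.0) using the given number of bits,
    then map the integer code back to the analog range (mid-point reconstruction)."""
    levels = 2 ** bits                       # e.g. 8 bits -> 256 quantization levels
    step = 2.0 / levels                      # width of one quantization step
    code = int(math.floor(x / step))         # integer code for this sample
    code = max(-levels // 2, min(levels // 2 - 1, code))  # clip to the valid code range
    return (code + 0.5) * step               # reconstructed value

x = 0.3                              # an example analog sample value
err_8 = abs(x - quantize(x, 8))      # quantization error with 8 bits per sample
err_16 = abs(x - quantize(x, 16))    # quantization error with 16 bits per sample
```

Running this shows that the 16-bit error is far smaller than the 8-bit error: each extra bit halves the quantization step, and with it the worst-case error.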
The process of sampling and quantization described
above is referred to as pulse coded modulation (PCM).
The total bit rate per second for a PCM signal is given by
(sampling rate)×(number of bits per sample)×(number
of audio channels).
For a stereo signal on a compact disc this means
44,100 × 16 × 2 = 1,411,200 bits per second (about 1.411 Mb/s).
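The bit-rate formula above is simple enough to express directly (Python, added here as an illustrative sketch rather than part of the original chapter):

```python
def pcm_bit_rate(sampling_rate_hz, bits_per_sample, channels):
    """Total bit rate of a PCM stream in bits per second:
    (sampling rate) x (bits per sample) x (number of audio channels)."""
    return sampling_rate_hz * bits_per_sample * channels

# Stereo compact disc audio: 44.1 kHz sampling, 16 bits per sample, 2 channels.
cd_rate = pcm_bit_rate(44_100, 16, 2)   # 1,411,200 bits per second
```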
As is illustrated in Figure 1, the typical bit rates needed
for various signals can be quite high. For storage and
transmission purposes these rates quickly become pro-
hibitive. Although disc-based storage has become cheaper
and high-speed Internet connections are more common-
place, it is still difficult to stream compact disc data
directly at about 1.4 Mb/s or to store hundreds of uncom-
pressed CDs on a hard disk.
The main goal of audio and speech compression is to
find a more efficient digital representation of these sig-
nals. Because most signals start as simple sampled PCM
signals, it is useful to use the resulting relative reduction in
bit rate as a measure of efficiency. It should be pointed out
that a reduction in bit rate is not always the main objec-
tive. For example, one could increase the bit rate to make
the signal more robust against transmission errors. In that
case the generic term coding is more appropriate. In this
chapter we will use both terms interchangeably. Figure 2
shows the generic compression operation. The encoder
takes the PCM input signal and generates the compressed
bit stream. The bit stream is either transmitted to the
