of the telephony system are used. Hence, no
echo cancellation is performed by the telephony
system even though long distance calls are made.
That forces VoIP to include echo cancellation
among the services supplied.
Voice Activity Detection
In the ordinary telephone network a two-way
simultaneous link is set up between sender and
receiver. This link carries voice at a rate of
64 kb/s in both directions. Usually only one of
the parties is active at any one time; even the
active party has breaks and pauses in a normal
speech pattern. Hence, the utilization of this
two-way link is less than 40 % most of the time.
VoIP can exploit this fact with voice activity
detection (VAD), which improves transmission
performance because less bandwidth is required
to obtain a given speech quality. A generic
outline of the VAD algorithm is depicted in
Figure 5: the algorithm measures the magnitude
(in dB) of the voice signal, decides when the
voice is inactive, and then stops the
transmission of packets in that direction
for the moment. To be on the safe side, when
cutting the transmission the algorithm waits a
fixed amount of time, the hang-over time, after
it detects a drop in the voice magnitude before
it completely stops transmitting voice sample
packets. The hang-over time is on the order of
hundreds of ms (typically 150–250 ms). Another
problem is to distinguish between voice and
background noise; to calibrate itself, the
VAD is disabled at the beginning of new calls.
However, even after calibration it can be
difficult to detect when a new voice spurt begins.
The algorithm cuts off the beginning of each
new voice spurt while it waits until it is sure
that it is a new voice spurt and not, for example,
a noise peak. This phenomenon is called front-end
speech clipping, and is usually not noticeable
to the listener.
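The behaviour described above can be sketched as a small per-frame state machine. This is a minimal illustration, not the algorithm of any particular VoIP product: the 6 dB margin above the noise floor, the two-frame onset confirmation, the frame-based hang-over count (10 frames would correspond to 200 ms if frames were 20 ms long) and all function and parameter names are assumptions made here. Only the 150–250 ms hang-over range comes from the text, and the noise floor is passed in rather than calibrated at call start.

```python
def vad_decisions(levels_db, noise_floor_db, margin_db=6.0,
                  hangover_frames=10, onset_frames=2):
    """Return one boolean per frame: True if the frame would be transmitted.

    levels_db       -- per-frame signal magnitude in dB (hypothetical input)
    noise_floor_db  -- assumed to be known already (calibration not shown)
    hangover_frames -- quiet frames still sent after speech stops
    onset_frames    -- loud frames needed before a new spurt is accepted
    """
    threshold = noise_floor_db + margin_db
    sending = False
    quiet_run = 0  # consecutive quiet frames while transmitting
    loud_run = 0   # consecutive loud frames while silent
    decisions = []
    for level in levels_db:
        active = level > threshold
        if sending:
            quiet_run = 0 if active else quiet_run + 1
            if quiet_run > hangover_frames:
                # Hang-over time has expired: stop transmitting.
                sending = False
                loud_run = 0
        else:
            loud_run = loud_run + 1 if active else 0
            if loud_run >= onset_frames:
                # Spurt confirmed; the earlier loud frames were not sent,
                # which is the front-end speech clipping.
                sending = True
                quiet_run = 0
        decisions.append(sending)
    return decisions


# Hypothetical frame levels: silence, a voice spurt, a long pause,
# a single noise peak, silence again.
levels = [-60] * 3 + [-20] * 5 + [-60] * 15 + [-20] * 1 + [-60] * 3
sent = vad_decisions(levels, noise_floor_db=-60)
print(f"transmitted {sum(sent)} of {len(sent)} frames")
```

In this sketch a larger `onset_frames` rejects more noise peaks at the cost of more front-end clipping, while a larger `hangover_frames` avoids cutting the ends of words at the cost of bandwidth.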
Standards
Interoperability among VoIP products has been a
major stumbling block to widespread acceptance
of the technology. The ITU’s H.323 umbrella
standard, shown in Figure 6, was the first to be
proposed for VoIP interoperability, but it proved
complex and difficult to implement. As a result,
other, less unwieldy standards were proposed in
its place, and until recently we have seen little
consensus on which VoIP standards would be the
most widely implemented. Even though the H.323
standard is the dominating standard at present,
most vendors foresee a coexistence of several
standards in the arena for quite some time. The
most supported standard is H.323 version 2, but
versions 3 and 4 are rapidly catching up. (It
should be pointed out that H.323 version 1 is not
forward compatible with the later versions of
H.323.) Other supported standards are SIP
(Session Initiation Protocol) from the IETF, the
Media Gateway Control Protocol (MGCP) and H.248.
SIP is an application-layer signalling protocol
that specifies call control for multiparty sessions,
IP phones or multimedia distribution. Unlike
H.323, which is based on binary encoding, SIP
is a text-based protocol that is usually easier to
implement. Further information regarding SIP
can be found in [6,7].
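To illustrate what text-based means in practice, the sketch below parses a SIP request using nothing but string operations. The sample message is hypothetical and trimmed for brevity (it is not a complete, valid INVITE), and the function name is an assumption made here. Header folding and repeated headers are deliberately ignored.

```python
# Hypothetical, shortened SIP request; the addresses are placeholders.
SAMPLE_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP pc33.example.com;branch=z9hG4bK776asdhds\r\n"
    "To: Bob <sip:bob@example.com>\r\n"
    "From: Alice <sip:alice@example.com>;tag=1928301774\r\n"
    "Call-ID: a84b4c76e66710@pc33.example.com\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_request(text):
    """Split a SIP request into method, URI, version, headers and body."""
    # An empty line separates the header section from the body.
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, request URI, protocol version.
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, uri, version, headers, body


method, uri, version, headers, body = parse_sip_request(SAMPLE_INVITE)
print(method, uri, version)
print(headers["cseq"])
```

A binary-encoded protocol such as H.323 would need a dedicated decoder for the same task, which is one reason SIP is usually considered easier to implement.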
MGCP is designed as a simple mechanism whose
main function is to control the gateways while
relying on external call control intelligence for
more complex functions. With the MGCP model,
the gateway focuses on the audio signal
translation function while a call agent, external
to the gateway, handles the signalling and call
processing functions. By separating the internal
gateway functions from the external signalling
function, the effort of implementing, upgrading
and maintaining the gateway is reduced to a
minimum. This increases
the likelihood of widespread use of this technol-
Figure 5 The voice activity detection (VAD)
algorithm, used to decrease the required
bandwidth for VoIP calls. [Plot of magnitude (dB)
versus time, marking the noise floor, the
hang-over period and front-end speech clipping]