Common Onsets and Offsets
It is often the case that although sounds from different sources occur at about the same time, one sound may come on or go off at a slightly different time than another. When this happens, all of the spectral-temporal characteristics of one sound begin and end at a different time than those of the other sound. Thus, the common onset or offset of these spectral-temporal cues could be used for sound source segregation.
Asynchronous onsets, and in some cases offsets, have been shown to provide powerful cues for sound source segregation (Yost & Sheft, 1993). In some cases, onset cues can reinforce other cues that might be used for sound source segregation. As described above in the section on pitch, a harmonic sequence can produce a complex pitch equal to the fundamental frequency of the complex. If two complexes with different fundamentals are mixed, in most conditions listeners do not perceive the two pitches corresponding to the original fundamental frequencies. The spectral characteristics of the complex formed by mixing the two harmonic sequences appear to be analyzed as a whole (synthetically). However, if one harmonic complex is turned on slightly before the other (by about 50 ms), listeners often perceive the two pitches, even though for most of the stimulus (perhaps a second) the two harmonic complexes occur together (Darwin, 1981).
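As a rough illustration of this manipulation (not the exact stimuli of Darwin, 1981), the following NumPy sketch mixes two harmonic complexes either synchronously or with a 50-ms onset asynchrony; the fundamental frequencies, harmonic counts, and durations are hypothetical values chosen for the example.

```python
import numpy as np

FS = 44100  # sample rate in Hz; all parameter values here are illustrative

def harmonic_complex(f0, n_harmonics, duration, fs=FS):
    """Equal-amplitude harmonic complex with fundamental f0."""
    t = np.arange(int(duration * fs)) / fs
    return sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, n_harmonics + 1))

# Two complexes with different (hypothetical) fundamentals.
a = harmonic_complex(100.0, 10, 1.0)  # F0 = 100 Hz, 1-s duration
b = harmonic_complex(125.0, 10, 1.0)  # F0 = 125 Hz, 1-s duration

# Synchronous mixture: onsets coincide, and the mixture tends to be
# analyzed as a whole rather than heard as two pitches.
sync_mix = a + b

# Asynchronous mixture: "a" leads by 50 ms (the asynchrony cited in the
# text), which helps listeners hear the two pitches separately.
lead = int(0.050 * FS)
async_mix = (np.concatenate([a, np.zeros(lead)]) +
             np.concatenate([np.zeros(lead), b]))
```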
Common Modulation
Most everyday sound sources impart a slow amplitude and frequency modulation (change) to the overall spectral-temporal properties of the sound they produce. Each sound source produces a different pattern of modulation, and these modulation patterns may allow for sound source segregation (Yost & Sheft, 1993). When a person speaks, the vocal cords open and close in a nearly periodic manner that determines the pitch of the voice (see the Fowler chapter in this volume). However, the frequency of these glottal openings varies slightly (frequency modulation, voicing vibrato), and the amplitude of air released by each opening also varies randomly over a small range (amplitude modulation, voicing jitter). Each person has a different pattern of vibrato and jitter. Speech sounds can be artificially generated (via computer) with constant glottal frequency and amplitude. If two such constant speech sounds are generated and mixed, it is often difficult to segregate the mixture into the two different speech signals. However, if random variation is introduced into the computer-generated glottal openings and closings (random vibrato and jitter), segregation can occur (McAdams, 1984).
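In the same spirit, a minimal sketch of this kind of manipulation might impose slow random frequency and amplitude modulation on a harmonic complex. The function name, modulation depths, and modulation rates below are assumptions chosen for illustration, not the synthesis parameters McAdams (1984) actually used.

```python
import numpy as np

FS = 44100

def modulated_voice(f0, n_harmonics, duration, fm_depth=0.01, am_depth=0.05,
                    rate=5.0, fs=FS, rng=None):
    """Harmonic complex with slow random frequency modulation ("vibrato")
    and amplitude modulation ("jitter"), in the chapter's terminology.
    Depths and rates are illustrative placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(duration * fs)
    t = np.arange(n) / fs
    # Slow random modulator: coarse Gaussian samples interpolated to
    # full rate, then normalized to the range [-1, 1].
    coarse = rng.standard_normal(int(duration * rate) + 2)
    mod = np.interp(t, np.linspace(0, duration, coarse.size), coarse)
    mod /= np.max(np.abs(mod))
    inst_f0 = f0 * (1.0 + fm_depth * mod)         # frequency modulation
    phase = 2 * np.pi * np.cumsum(inst_f0) / fs   # integrate to phase
    env = 1.0 + am_depth * mod                    # amplitude modulation
    return env * sum(np.sin(k * phase) for k in range(1, n_harmonics + 1))

# Constant-source version (no modulation) versus a modulated version;
# mixing two of each lets one compare the segregation conditions.
flat = modulated_voice(120.0, 8, 1.0, fm_depth=0.0, am_depth=0.0)
lively = modulated_voice(120.0, 8, 1.0)  # independent random pattern per call
```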
Thus, common amplitude and frequency modulation may be possible cues for sound source segregation. However, frequency modulation per se is probably not a cue used for sound source segregation (Carlyon, 1991); amplitude modulation, in contrast, most likely is a useful cue. Two experimental procedures have been studied extensively to investigate the role of amplitude modulation in auditory processing: comodulation masking release (CMR) and modulation detection interference (MDI).
In a typical CMR experiment (Hall, Haggard, & Fernandes, 1984; Yost & Sheft, 1993), listeners are asked to detect a tonal signal spectrally centered in a narrow band of noise (the target band). In one condition, detection of the signal is compared with a case in which another narrow band of noise (the flanking band) is simultaneously added in a different region of the spectrum. If the target and flanking bands are completely independent, the addition of the flanking band has little effect on signal threshold in the target band. This is consistent with the critical-band view of auditory processing: the flanking band falls outside the critical band centered on the target band and therefore should have little influence on signal detection within the target band. However, if the target and flanking bands are dependent in that they share the same pattern of amplitude modulation (they are comodulated), then the signal threshold for the target band is lowered by 10–15 dB. This improvement in signal threshold due to comodulation is referred to as CMR, and results from a typical experiment are shown in Figure 5.15.
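A simplified sketch of the two conditions is given below. It approximates each narrow band of noise by a slowly varying random envelope applied to a sinusoidal carrier, a convenience for illustration rather than the noise-band generation used by Hall et al. (1984); all frequencies, bandwidths, and levels are hypothetical.

```python
import numpy as np

FS = 44100

def lowpass_noise(duration, cutoff_hz, fs=FS, rng=None):
    """Crude low-pass noise envelope: coarse Gaussian samples interpolated
    to full rate and rectified (illustrative, not Hall et al.'s method)."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(duration * fs)
    coarse = rng.standard_normal(int(duration * cutoff_hz) + 2)
    env = np.interp(np.arange(n) / fs,
                    np.linspace(0, duration, coarse.size), coarse)
    return np.abs(env)

rng = np.random.default_rng(0)
dur, f_target, f_flank = 0.5, 1000.0, 2000.0  # hypothetical frequencies
t = np.arange(int(dur * FS)) / FS

env = lowpass_noise(dur, 20.0, rng=rng)
target_band = env * np.sin(2 * np.pi * f_target * t)

# Comodulated flanking band: same envelope, different spectral region.
comod_flank = env * np.sin(2 * np.pi * f_flank * t)

# Independent flanking band: its own, uncorrelated envelope.
indep_flank = lowpass_noise(dur, 20.0, rng=rng) * np.sin(2 * np.pi * f_flank * t)

# Tonal signal at the center of the target band; detection thresholds are
# measured in the comodulated versus independent conditions.
signal = 0.1 * np.sin(2 * np.pi * f_target * t)
comod_stimulus = target_band + comod_flank + signal  # CMR condition
indep_stimulus = target_band + indep_flank + signal  # reference condition
```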
The CMR results suggest that common modulation increases the listener's ability to detect the signal. One explanation is based on the assumption that comodulation groups the flanking and target bands into one perceived sound source, which contains more information than a single band does. Independent (non-comodulated) bands of noise would not come from a single sound source and therefore would not be grouped together. The additional information in the combined (grouped) sound might aid signal detection. For instance, it might make the low-amplitude valleys of the modulated noises more salient, increasing the auditory system's ability to detect the tone occurring in those valleys. In addition, adding the signal changes the correlation between the signal-plus-masker stimuli and the masker-alone stimuli, and the combined stimulus may increase this correlation, aiding signal detection.
In an MDI condition (Yost, 1992b; Yost & Sheft, 1993), listeners are asked to discriminate between two amplitude-modulated tonal carrier signals (the probe stimuli) on the basis of the depth of the amplitude modulation. Threshold performance is typically a 3% change in the depth of amplitude modulation. If a tone of a different frequency and not