The first major effort to encode speech electronically was Homer Dudley's vocoder (``voice coder'') [60] developed starting in October of 1928 at AT&T Bell Laboratories [223].
On analysis, the outputs of ten analog bandpass filters (spanning
250-3000 Hz) were rectified and lowpass-filtered to obtain amplitude
envelopes for each band. In parallel, the fundamental frequency $F_0$
was measured, and a voiced/unvoiced decision was made (unvoiced
segments were indicated by $F_0=0$). On synthesis, a ``buzz source''
(relaxation oscillator) at pitch $F_0$ (for voiced speech) or a ``hiss
source'' (for unvoiced speech) was used to drive a set of ten
matching bandpass filters, whose outputs were summed to produce the
reconstructed voice. While the voice quality had a quite noticeable
``unpleasant electrical accent'' [223], the bandwidth required to
transmit $F_0$ and the bandpass-filter gain envelopes was much less than
that required to transmit the original speech.
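
The analysis and synthesis steps described above can be sketched in a few lines of Python. This is only an illustrative digital approximation under stated assumptions, not a model of Dudley's analog hardware: the geometric spacing of the ten band centers, the filter Q, the 25 Hz envelope-smoothing cutoff, the impulse-train ``buzz'', the uniform-noise ``hiss'', and the use of a zero fundamental-frequency value to mark unvoiced samples are all choices made for the example, and every function name is hypothetical.

```python
import math
import random

def bandpass_biquad(x, fs, f_center, q):
    """Second-order digital bandpass with unity peak gain (stand-in for an analog band filter)."""
    w0 = 2.0 * math.pi * f_center / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = (b0 * xn + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def envelope(x, fs, f_cut):
    """Full-wave rectify, then smooth with a one-pole lowpass (unity gain at DC)."""
    p = math.exp(-2.0 * math.pi * f_cut / fs)
    env, state = [], 0.0
    for xn in x:
        state = (1.0 - p) * abs(xn) + p * state
        env.append(state)
    return env

def analyze(x, fs, centers, q=4.0, f_cut=25.0):
    """Return one amplitude envelope per band."""
    return [envelope(bandpass_biquad(x, fs, fc, q), fs, f_cut) for fc in centers]

def excitation(f0_track, fs, seed=0):
    """Buzz (impulse train at F0) where F0 > 0; hiss (white noise) where F0 == 0 (assumed convention)."""
    rng = random.Random(seed)
    out, phase = [], 0.0
    for f0 in f0_track:
        if f0 > 0.0:
            phase += f0 / fs
            if phase >= 1.0:
                phase -= 1.0
                out.append(1.0)
            else:
                out.append(0.0)
        else:
            out.append(rng.uniform(-1.0, 1.0))
    return out

def synthesize(envs, f0_track, fs, centers, q=4.0):
    """Drive matching bandpass filters with the source, scale by the envelopes, and sum."""
    src = excitation(f0_track, fs)
    out = [0.0] * len(src)
    for env, fc in zip(envs, centers):
        band = bandpass_biquad(src, fs, fc, q)
        for n in range(len(out)):
            out[n] += env[n] * band[n]
    return out

# Usage: analyze half a second of a 1 kHz tone, then resynthesize with a 100 Hz buzz.
fs = 16000
centers = [250.0 * (3000.0 / 250.0) ** (k / 9.0) for k in range(10)]  # 250-3000 Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs // 2)]
envs = analyze(tone, fs, centers)
f0_track = [100.0] * len(tone)
resynth = synthesize(envs, f0_track, fs, centers)
```

Only the ten envelopes and the fundamental-frequency track would need to be transmitted; both vary slowly compared with the speech waveform itself, which is the source of the bandwidth saving.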
A manually controlled version of the vocoder synthesis engine, called the Voder (Voice Operation Demonstrator [65]), was constructed and demonstrated at the 1939 World's Fairs in New York and San Francisco [60]. Pitch was controlled by a foot pedal, and ten fingers controlled the bandpass gains. Buzz/hiss selection was by means of a wrist bar. Three additional keys controlled transient excitation of selected filters to achieve stop-consonant sounds [65]. ``Performing speech'' on the Voder required on the order of a year's training before intelligible speech could reliably be produced. The Voder was a versatile performing instrument having intriguing possibilities beyond voice synthesis.
The vocoder synthesis model can be considered a
source-filter model for speech which includes a
non-parametric spectral model of the vocal tract given by the
output of a fixed bandpass-filter-bank over time. Related efforts
included the formant vocoder (Munson and Montgomery 1950)--a
type of parametric spectral model--which encoded the fundamental
frequency $F_0$ and the
amplitude and center frequency of the first three spectral
formants. See [152, pp. 2452-3] for an overview
and references.
The phase vocoder, developed for speech coding by Flanagan and Golden [66], extended the vocoder to include the starting phase of each filter-bank channel output signal. (After time zero, the phase in each channel is given by the starting phase plus the integral of the instantaneous frequency in that channel.) Unlike the hardware implementations of the channel vocoder, the phase vocoder was implemented in software on top of a Short-Time Fourier Transform (STFT), and it used additive synthesis for reconstructing the signal from its amplitude and ``phase derivative'' (instantaneous frequency) spectrum [66]. Time scale modification and frequency shifting were early applications of the phase vocoder [66].
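
The parenthetical remark about channel phase can be written compactly. The notation here is an assumption introduced for illustration: $\phi_k(t)$ denotes the phase of channel $k$, $\omega_k(t)$ its instantaneous radian frequency, and $A_k(t)$ its amplitude envelope.

\[
\phi_k(t) = \phi_k(0) + \int_0^t \omega_k(\tau)\,d\tau,
\qquad
\hat{x}(t) = \sum_k A_k(t)\cos\bigl(\phi_k(t)\bigr)
\]

The second expression is the additive-synthesis reconstruction from the amplitude and instantaneous-frequency data, summed over all filter-bank channels.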
The phase vocoder can also be considered an early subband coder [262]. Since the mid-1970s, subband coders have typically been implemented using the STFT [66,195,10]. In the field of perceptual audio compression, additional compression has been obtained using undersampled filter banks that provide aliasing cancellation [264], the first example being the Princen-Bradley filter bank [198].
Since the phase vocoder, sound synthesis from vocoder analysis data has traditionally been called ``additive synthesis'' in the computer-music field. Additive synthesis was one of the first computer-music synthesis methods [168,215], and it has been a mainstay ever since. Additive synthesis was historically implemented using a sum of sinusoidal oscillators modulated by amplitude and frequency envelopes over time [168], and later using the inverse FFT [30,221] when the number of components is large.
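
A minimal sketch of the oscillator-bank form of additive synthesis is given below, assuming per-sample amplitude and frequency envelopes supplied as plain Python lists. The function name `additive_synth`, its arguments, and the three-harmonic test tone are hypothetical choices for this example; the inverse-FFT method mentioned above is not shown.

```python
import math

def additive_synth(amp_envs, freq_envs, fs, start_phases=None):
    """Sum sinusoidal oscillators driven by amplitude and frequency (Hz) envelopes."""
    num_partials = len(amp_envs)
    num_samples = len(amp_envs[0])
    if start_phases is None:
        start_phases = [0.0] * num_partials
    out = [0.0] * num_samples
    for k in range(num_partials):
        phase = start_phases[k]
        for n in range(num_samples):
            out[n] += amp_envs[k][n] * math.cos(phase)
            # Running sum approximates the integral of instantaneous frequency.
            phase += 2.0 * math.pi * freq_envs[k][n] / fs
    return out

# Usage: three harmonics of 220 Hz with 1/k amplitudes and a shared exponential decay.
fs = 8000
num_samples = fs  # one second
f0 = 220.0
amp_envs = [[(1.0 / (k + 1)) * math.exp(-3.0 * n / fs) for n in range(num_samples)]
            for k in range(3)]
freq_envs = [[(k + 1) * f0] * num_samples for k in range(3)]
signal = additive_synth(amp_envs, freq_envs, fs)
```

The cost of this direct form grows with the number of partials times the number of samples, which is one reason an inverse-FFT formulation becomes attractive when many components are needed.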
