Voice Synthesis

Unquestionably, the most extensive prior work in the 20th century relevant to virtual acoustic musical instruments occurred within the field of speech synthesis [139,142,363,408,335,106,243].^A.11 This research was driven by both academic interest and the potential practical benefits of speech compression to conserve telephone bandwidth. It was clear at an early point that the bandwidth of a telephone channel (nominally 200-3200 Hz) was far greater than the ``information rate'' of speech. It was reasoned, therefore, that instead of encoding the speech waveform, it should be possible to encode the more slowly varying parameters of a good synthesis model for speech.

Before the 20th century, there were several efforts to simulate the voice mechanically, going back at least to 1779 [140].

Dudley's Vocoder

The first major effort to encode speech electronically was Homer Dudley's vocoder (``voice coder'') [119] developed starting in October of 1928 at AT&T Bell Laboratories [414]. A manually controlled version of the vocoder synthesis engine, called the Voder (Voice Operation Demonstrator [140]), was constructed and demonstrated at the 1939 World's Fairs in New York and San Francisco [119]. Pitch was controlled by a foot pedal, and ten fingers controlled the bandpass gains. Buzz/hiss selection was by means of a wrist bar. Three additional keys controlled transient excitation of selected filters to achieve stop-consonant sounds [140]. ``Performing speech'' on the Voder required on the order of a year's training before intelligible speech could reliably be produced. The Voder was a very interesting performable instrument!

The vocoder and Voder can be considered based on a source-filter model for speech which includes a non-parametric spectral model of the vocal tract given by the output of a fixed bandpass-filter-bank over time. Later efforts included the formant vocoder (Munson and Montgomery 1950), a type of parametric spectral model, which encoded the amplitude and center-frequency of the first three spectral formants. See [306, p. 2452-3] for an overview and references.

While we have now digressed into the realm of spectral models, as opposed to physical models, it seems worthwhile to point out that the early efforts toward speech synthesis were involved with essentially all of the mainstream sound modeling methods in use today (both spectral and physical domains).

Vocal Tract Analog Models

There is one speech-synthesis thread that clearly classifies under computational physical modeling, and that is the topic of vocal tract analog models. In these models, the vocal tract is regarded as a piecewise cylindrical acoustic tube. The first mechanical analogue of an acoustic-tube model appears to be a hand-manipulated leather tube built by Wolfgang von Kempelen in 1791, reproduced with improvements by Sir Charles Wheatstone [140]. In electrical vocal-tract analog models, the piecewise cylindrical acoustic tube is modeled as a cascade of electrical transmission line segments, with each cylindrical segment being modeled as a transmission line at some fixed characteristic impedance. An early model employing four cylindrical sections was developed by Hugh K. Dunn in the late 1940s [120]. An even earlier model based on two cylinders joined by a conical section was published by T. Chiba and M. Kajiyama in 1941 [120]. Cylinder cross-sectional areas $A_i$ were determined based on X-ray images of the vocal tract, and the corresponding characteristic impedances were proportional to $1/A_i$. An impedance-based, lumped-parameter approximation to the transmission-line sections was used in order that analog LC ladders could be used to implement the model electronically. By the 1950s, LC vocal-tract analog models included a side-branch for nasal simulation [131].

The theory of transmission lines is credited to applied mathematician Oliver Heaviside (1850-1925), who worked out the telegrapher's equations (sometime after 1874) as an application of Maxwell's equations, which he simplified (sometime after 1880) from the original 20 equations of Maxwell to the modern vector formulation.^A.12 Additionally, Heaviside is credited with introducing complex numbers into circuit analysis, essentially inventing Laplace-transform methods for solving circuits (sometime between 1880 and 1887), and coining the terms `impedance' (1886), `admittance' (1887), `electret', `conductance' (1885), and `permeability' (1885). A little later, Lord Rayleigh worked out the theory of waveguides (1897), including multiple propagating modes and the cut-off phenomenon.^A.13

Singing Kelly-Lochbaum Vocal Tract

In 1962, John L. Kelly and Carol C. Lochbaum published a software version of a digitized vocal-tract analog model [245,246]. This may be the first instance of a sampled traveling-wave model of the vocal tract, as opposed to a lumped-parameter transmission-line model. In other words, Kelly and Lochbaum apparently returned to the original acoustic tube model (a sequence of cylinders), obtained d'Alembert's traveling-wave solution in each section, and applied Nyquist's sampling theorem to digitize the system. This sampled, bandlimited approach to digitization contrasts with the use of bilinear transforms as in wave digital filters; an advantage is that the frequency axis is not warped, but it is prone to aliasing when the parameters vary over time (or if nonlinearities are present).

At the junction of two cylindrical tube sections, i.e., at area discontinuities, lossless scattering occurs.^A.14 As mentioned in §A.5.4, reflection/transmission at impedance discontinuities was well formulated in classical network theory [34,35], and in transmission-line theory.
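As a minimal sketch of the scattering just described, the following Python fragment computes the reflection coefficient at a junction between two cylindrical sections and applies it to the incoming pressure waves. It assumes pressure traveling waves and characteristic impedance proportional to the reciprocal of cross-sectional area; the function names and the sign convention for `k` are illustrative choices, not taken from the Kelly-Lochbaum papers.

```python
def reflection_coefficient(area_left, area_right):
    """Pressure-wave reflection coefficient seen from the left section.

    Assumes characteristic impedance R = rho*c/A, so that
    k = (R_right - R_left)/(R_right + R_left) = (A_left - A_right)/(A_left + A_right).
    """
    return (area_left - area_right) / (area_left + area_right)


def scatter(p_in_left, p_in_right, k):
    """One lossless two-port scattering junction for pressure waves.

    p_in_left  : wave arriving at the junction from the left section
    p_in_right : wave arriving at the junction from the right section
    Returns (p_out_left, p_out_right), the waves leaving the junction.
    """
    p_out_right = (1.0 + k) * p_in_left - k * p_in_right
    p_out_left = k * p_in_left + (1.0 - k) * p_in_right
    return p_out_left, p_out_right


def wave_power(p, area):
    """Power carried by a pressure wave, up to the constant factor 1/(rho*c)."""
    return area * p * p


# Example: a tube narrowing from area 2 to area 1 (arbitrary units).
k = reflection_coefficient(2.0, 1.0)
out_left, out_right = scatter(1.0, 0.5, k)
power_in = wave_power(1.0, 2.0) + wave_power(0.5, 1.0)
power_out = wave_power(out_left, 2.0) + wave_power(out_right, 1.0)
```

In this example `power_in` and `power_out` agree, which is what ``lossless'' means at the junction: signal power is redistributed between the two sections but none is dissipated.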

The Kelly-Lochbaum model can be regarded as a kind of ladder filter [297] or, more precisely, using later terminology, a digital waveguide filter [433]. Ladder and lattice digital filters can be used to realize arbitrary transfer functions [297], and they enjoy low sensitivity to round-off error, guaranteed stability under coefficient interpolation, and freedom from overflow oscillations and limit cycles under general conditions. Ladder/lattice filters remain important options when designing fixed-point implementations of digital filters, e.g., in VLSI. In the context of wave digital filters, the Kelly-Lochbaum model may be viewed as a digitized unit element filter [136], reminiscent of waveguide filters used in microwave engineering. In more recent terminology, it may be called a digital waveguide model of the vocal tract in which the digital waveguides are degenerated to single-sample width [433,442,452].

In 1961, Kelly and Lochbaum collaborated with Max Mathews to create what was most likely the first digital physical-modeling synthesis example by any method.^A.15 The voice was computed on an IBM 704 computer using speech-vowel data from Gunnar Fant's recent book [132]. Interestingly, Fant's vocal-tract shape data were obtained (via X-rays) for Russian vowels, not English, but they were close enough to be understandable. Arthur C. Clarke, visiting John Pierce at Bell Labs, heard this demo, and he later used it in ``2001: A Space Odyssey'': the HAL9000 computer slowly sang its ``first song'' (``Bicycle Built for Two'') as its disassembly by astronaut Dave Bowman neared completion.^A.16

Perhaps due in part to J. L. Kelly's untimely death afterward, research on vocal-tract analog models tapered off thereafter, although there was some additional work [306]. Perhaps the main reason for the demise of this research thread was that spectral models (both nonparametric models such as the vocoder, and parametric source-filter models such as linear predictive coding (discussed below)) proved to be more effective when the application was simply speech coding at acceptably low bit rates and high fidelity levels. In telephone speech-coding applications, there was no requirement that a physical voice model be retained for purposes of expressive musical performance. In fact, it was desired to automate and minimize the ``performance expertise'' required to operate the voice production model. One could go so far as to say that the musical expressivity of voice synthesis models reached its peak in the 1939 Voder and related (manual) systems (§A.6.1).

In computer music, the Kelly-Lochbaum vocal tract model was revived for singing-voice synthesis in the thesis research of Perry Cook [87].^A.17 In addition to the basic vocal tract model with side branch for the nasal tract, Cook included neck radiation (e.g., for `b'), and damping extensions. Additional work on incorporating damping within the tube sections was carried out by Amir et al. [15]. Other extensions include sparse acoustic tube modeling [149] and extension to piecewise conical acoustic tubes [507]. The digital waveguide modeling framework [430,431,433] can be viewed as an adaptation of extremely sparse acoustic-tube models for artificial reverberation, vibrating strings, and wind-instrument bores.

Linear Predictive Coding of Speech

Approximately a decade after the Kelly-Lochbaum voice model was developed, Linear Predictive Coding (LPC) of speech began [20,296,297]. The linear-prediction voice model is best classified as a parametric, spectral, source-filter model, in which the short-time spectrum is decomposed into a flat excitation spectrum multiplied by a smooth spectral envelope capturing primarily vocal formants (resonances).
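To make the decomposition concrete, here is a minimal sketch of one standard way to compute a linear predictor: the autocorrelation method solved by the Levinson-Durbin recursion. The text above does not say which estimation method was used in the cited work, so this choice, and all function names, are assumptions made for illustration.

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one signal frame x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Returns (a, k, err) where
      a   : coefficients of the prediction-error filter
            A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
      k   : reflection coefficients, one per recursion stage
      err : final prediction-error energy
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err
        k.append(km)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        err *= 1.0 - km * km
    return a, k, err


# Example: autocorrelation of a first-order process with pole at 0.5.
a, k, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

Under these assumptions, the smooth spectral envelope is the magnitude response of the all-pole filter with denominator A(z), and the residual left after filtering the frame by A(z) plays the role of the approximately flat excitation.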

LPC has been used quite often as a spectral transformation technique in computer music, as well as for general-purpose audio spectral envelopes [381], and it remains much used for low-bit-rate speech coding in the variant known as Codebook Excited Linear Prediction (CELP) [337].^A.18 When applying LPC to audio at high sampling rates, it is important to carry out some kind of auditory frequency warping, such as according to mel, Bark, or ERB frequency scales [182,459,482].

Interestingly, it was recognized from the beginning that the all-pole LPC vocal-tract model could be interpreted as a modified piecewise-cylindrical acoustic-tube model [20,297], and this interpretation was most explicit when the vocal-tract filters (computed by LPC in direct form) were realized as ladder filters [297]. The physical interpretation is not really valid, however, unless the vocal-tract filter parameters are estimated jointly with a realistic glottal pulse shape. LPC demands that the vocal tract be driven by a flat spectrum, either an impulse (or low-pitched impulse train) or white noise, which is not physically accurate. When the glottal pulse shape (and lip radiation characteristic) are ``factored out'', it becomes possible to convert LPC coefficients into vocal-tract shape parameters (area ratios). Approximate results can be obtained by assuming a simple roll-off characteristic for the glottal pulse spectrum (e.g., -12 dB/octave) and lip-radiation frequency response (nominally +6 dB/octave), and compensating for the net -6 dB/octave with a simple preemphasis characteristic (e.g., +6 dB/octave) [297]. More accurate glottal pulse estimation in terms of parameters of the derivative-glottal-wave models by Liljencrants, Fant, and Klatt [133,257] (still assuming +6 dB/octave for lip radiation) was carried out in the thesis research of Vicky Lu [290], and further extension of that work appears in [250,213,251].
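The two steps mentioned above, preemphasis followed by conversion to area ratios, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of [297]: the first-order difference filter and its coefficient 0.95 are a common but hypothetical choice of preemphasis, the reflection coefficients are taken as given (e.g., from a ladder-filter realization of the LPC filter), and the sign convention k = (A_i - A_{i+1})/(A_i + A_{i+1}) is assumed. Sign and section-ordering conventions differ across the literature.

```python
def preemphasize(x, alpha=0.95):
    """First-order difference y[n] = x[n] - alpha*x[n-1].

    For alpha near 1 this rises at roughly +6 dB/octave over most of the
    band, approximately cancelling a net -6 dB/octave spectral tilt.
    The value 0.95 is an illustrative assumption.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def areas_from_reflection(k, first_area=1.0):
    """Convert reflection coefficients to relative tube-section areas.

    Assumes k[i] = (A[i] - A[i+1]) / (A[i] + A[i+1]), which inverts to
    A[i+1] = A[i] * (1 - k[i]) / (1 + k[i]). Only area ratios are
    determined, so the first area is an arbitrary reference.
    """
    areas = [first_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return areas


# Example: one junction with k = 1/3 halves the area.
example_areas = areas_from_reflection([1.0 / 3.0])
```

Because each reflection coefficient of a stable all-pole filter has magnitude less than one, every area ratio computed this way is positive, which is what allows the tube interpretation in the first place.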

Formant Synthesis Models

A formant synthesizer is a source-filter model in which the source models the glottal pulse train and the filter models the formant resonances of the vocal tract. Constrained linear prediction can be used to estimate the parameters of formant synthesis models, but more generally, formant peak parameters may be estimated directly from the short-time spectrum (e.g., [255]). The filter in a formant synthesizer is typically implemented using cascade or parallel second-order filter sections, one per formant. Most modern rule-based text-to-speech systems descended from software based on this type of synthesis model [255,256,257].
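A minimal sketch of the cascade form, assuming an impulse train as a stand-in for the glottal source and one two-pole resonator per formant: each resonator is parameterized by center frequency and bandwidth and normalized to unity gain at DC, a common choice for cascade formant synthesizers. The function names and the example formant values are illustrative assumptions.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Pole radius r = exp(-pi*B/fs) sets the bandwidth, pole angle
    2*pi*F/fs sets the center frequency, and a = 1 - b - c gives
    unity gain at DC.
    """
    r = math.exp(-math.pi * bandwidth_hz / fs)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    c = -r * r
    a = 1.0 - b - c
    return a, b, c


def resonate(x, coeffs):
    """Run one second-order resonator section over the signal x."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out


def formant_synth(f0, formants, fs, n):
    """Impulse train at pitch f0 through a cascade of formant resonators.

    formants is a list of (center frequency in Hz, bandwidth in Hz) pairs.
    """
    period = round(fs / f0)
    x = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        x = resonate(x, resonator_coeffs(freq_hz, bandwidth_hz, fs))
    return x


# Example: 0.1 s of a vowel-like sound with two hypothetical formants.
vowel = formant_synth(100.0, [(500.0, 100.0), (1500.0, 120.0)], 8000.0, 800)
```

A parallel realization would instead sum the outputs of the sections, each with its own gain, which gives independent control of formant amplitudes at the cost of having to manage the spectral zeros that appear between formants.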

Another type of formant-synthesis method, developed specifically for singing-voice synthesis, is called the FOF method [386]. It can be considered an extension of the VOSIM voice synthesis algorithm [219]. In the FOF method, the formant filters are implemented in the time domain as parallel second-order sections; thus, the vocal-tract impulse response is modeled as a sum of three or so exponentially decaying sinusoids. Instead of driving this filter with a glottal pulse wave, a simple impulse is used, thereby greatly reducing computational cost. A convolution of an impulse response with an impulse train is simply a periodic superposition of the impulse response. In the VOSIM algorithm, the impulse response was trimmed to one period in length, thereby avoiding overlap and further reducing computations.

The FOF method also tapers the beginning of the impulse-response using a rising half-cycle of a sinusoid. This qualitatively reduces the ``buzziness'' of the sound, and compensates for having replaced the glottal pulse with an impulse. In practice, however, the synthetic signal is matched to the desired signal in the frequency domain, and the details of the onset taper are adjusted to optimize audio quality more generally, including to broaden the formant resonances.
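The time-domain view can be sketched in a few lines: build one tapered, exponentially decaying sinusoid per formant, then superpose copies of it once per pitch period. This is an illustration of the idea, not the implementation of [386]; the raised-cosine form of the onset taper, the parameter names, and the example values are assumptions.

```python
import math


def fof_grain(freq_hz, bandwidth_hz, attack_s, dur_s, fs):
    """One formant 'grain': a decaying sinusoid with a smooth onset.

    The decay rate pi*bandwidth_hz gives a resonance about bandwidth_hz
    wide. During the first attack_s seconds the envelope is multiplied by
    0.5*(1 - cos(pi*t/attack_s)), which rises smoothly from 0 to 1.
    """
    grain = []
    for n in range(round(dur_s * fs)):
        t = n / fs
        env = math.exp(-math.pi * bandwidth_hz * t)
        if t < attack_s:
            env *= 0.5 * (1.0 - math.cos(math.pi * t / attack_s))
        grain.append(env * math.sin(2.0 * math.pi * freq_hz * t))
    return grain


def periodic_superposition(grain, period, n_out):
    """Add a copy of grain every `period` samples.

    This equals convolving the grain with an impulse train, but costs
    only one addition per grain sample per pitch period.
    """
    out = [0.0] * n_out
    for start in range(0, n_out, period):
        for i, v in enumerate(grain):
            if start + i >= n_out:
                break
            out[start + i] += v
    return out


# Example: one 600 Hz formant, 5 ms grains, 100 Hz pitch at fs = 8000 Hz.
grain = fof_grain(600.0, 80.0, 0.001, 0.005, 8000)
voiced = periodic_superposition(grain, 80, 160)
```

A fuller voice would sum three or so such grains with different frequencies, bandwidths, and amplitudes before superposition. When the grain is no longer than one pitch period, as in this example, successive copies do not overlap, which is the situation the VOSIM trimming guarantees.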

One of the difficulties of formant synthesis methods is that formant parameter estimation is not always easy [408]. The problem is particularly difficult when the fundamental frequency is so high that the formants are not adequately ``sampled'' by the harmonic frequencies, such as in high-pitched female voice samples. Formant ambiguities due to insufficient spectral sampling can often be resolved by incorporating additional physical constraints to the extent they are known.

Formant synthesis is an effective combination of physical and spectral modeling approaches. It is a physical model in that there is an explicit division between glottal-flow wave generation and the formant-resonance filter, despite the fact that a physical model is rarely used for either the glottal waveform or the formant resonator. On the other hand, it is a spectral modeling method in that its parameters are estimated by explicitly matching short-time audio spectra of desired sounds. It is usually most effective for any synthesis model, physical or otherwise, to be optimized in the ``audio perception'' domain to the extent it is known how to do this [312,165]. For an illustrative example, see, e.g., [201].
