Formant Synthesis Models

A formant synthesizer is a source-filter model in which the source models the glottal pulse train and the filter models the formant resonances of the vocal tract. Constrained linear prediction can be used to estimate the parameters of formant synthesis models, but more generally, formant peak parameters may be estimated directly from the short-time spectrum (e.g., [255]). The filter in a formant synthesizer is typically implemented using cascade or parallel second-order filter sections, one per formant. Most modern rule-based text-to-speech systems are descended from software based on this type of synthesis model [255,256,257].
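
The cascade form mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any cited system: the function names, the 8 kHz sampling rate, the three formant frequencies and bandwidths, and the use of a bare impulse train as a stand-in for the glottal source are all chosen here for illustration only. Each section is a two-pole resonator whose pole radius is set by the formant bandwidth and whose pole angle is set by the formant center frequency.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients for y[n] = b0*x[n] - a1*y[n-1] - a2*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs)      # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs     # pole angle from center frequency
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    b0 = 1.0 + a1 + a2                       # normalize to unity gain at DC
    return b0, a1, a2

def resonate(x, coeffs):
    """Run one second-order section over the list x."""
    b0, a1, a2 = coeffs
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn - a1 * y1 - a2 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

def cascade_formant_filter(x, formants, fs):
    """Pass x through one resonator per (freq_hz, bw_hz) pair, in series."""
    for freq_hz, bw_hz in formants:
        x = resonate(x, resonator_coeffs(freq_hz, bw_hz, fs))
    return x

def impulse_train(n_samples, period):
    """Crude source: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

FS = 8000                                                      # assumed sampling rate
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]    # illustrative values
source = impulse_train(800, 80)                                # 100 Hz pitch at 8 kHz
voiced = cascade_formant_filter(source, FORMANTS, FS)
```

A parallel form would instead feed the same source to every resonator and sum the scaled outputs, which requires choosing a gain per formant.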

Another type of formant-synthesis method, developed specifically for singing-voice synthesis, is called the FOF method [386]. It can be considered an extension of the VOSIM voice synthesis algorithm [219]. In the FOF method, the formant filters are implemented in the time domain as parallel second-order sections; thus, the vocal-tract impulse response is modeled as a sum of three or so exponentially decaying sinusoids. Instead of a glottal pulse wave, a simple impulse train drives this filter, thereby greatly reducing computational cost. The convolution of an impulse response with an impulse train is simply a periodic superposition of the impulse response. In the VOSIM algorithm, the impulse response was trimmed to one period in length, thereby avoiding overlap and further reducing computations.
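
The equivalence between convolving with an impulse train and periodically superimposing the impulse response can be checked numerically. The following Python sketch assumes hypothetical formant values, amplitudes, and an 8 kHz sampling rate, and it omits the FOF onset taper; it is meant only to illustrate the idea, not to reproduce the FOF or VOSIM algorithms as published.

```python
import math

def formant_impulse_response(formants, fs, length):
    """Sum of exponentially decaying sinusoids, one per (freq_hz, bw_hz, amp)."""
    h = []
    for n in range(length):
        t = n / fs
        h.append(sum(
            amp * math.exp(-math.pi * bw_hz * t) * math.sin(2.0 * math.pi * freq_hz * t)
            for freq_hz, bw_hz, amp in formants))
    return h

def periodic_superposition(h, period, n_samples):
    """Add one copy of h starting at every multiple of `period` samples."""
    y = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for k, hk in enumerate(h):
            if start + k >= n_samples:
                break
            y[start + k] += hk
    return y

def convolve_with_impulse_train(h, period, n_samples):
    """Direct convolution of h with a unit impulse train, for comparison."""
    x = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    y = []
    for n in range(n_samples):
        y.append(sum(h[k] * x[n - k] for k in range(min(len(h), n + 1))))
    return y

FS = 8000        # assumed sampling rate
PERIOD = 80      # 100 Hz pitch at 8 kHz
FORMANTS = [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]

h = formant_impulse_response(FORMANTS, FS, 200)
fof_like = periodic_superposition(h, PERIOD, 400)        # copies overlap
reference = convolve_with_impulse_train(h, PERIOD, 400)  # same signal, more work
vosim_like = periodic_superposition(h[:PERIOD], PERIOD, 400)  # trimmed, no overlap
```

The superposition costs one addition per impulse-response sample per period, whereas the direct convolution multiplies by mostly zero-valued input samples. Trimming the response to one period removes the overlap entirely, so each output period is just a copy of the trimmed response.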

The FOF method also tapers the beginning of the impulse response using a rising half-cycle of a sinusoid. This qualitatively reduces the "buzziness" of the sound and compensates for having replaced the glottal pulse with an impulse. In practice, however, the synthetic signal is matched to the desired signal in the frequency domain, and the details of the onset taper are adjusted to optimize audio quality more generally, which can include broadening the formant resonances.
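
One way to picture the onset taper is as an envelope that rises from 0 to 1 over a short attack time and then leaves the decaying sinusoid untouched. The sketch below assumes a raised-cosine shape, 0.5*(1 - cos(pi*t/attack)), as the "rising half-cycle"; the exact taper, the 2 ms attack time, and the function names are assumptions made for illustration and are not taken from the text above.

```python
import math

def onset_taper(t, attack_s):
    """Rise smoothly from 0 at t = 0 to 1 at t = attack_s, then hold at 1."""
    if t >= attack_s:
        return 1.0
    return 0.5 * (1.0 - math.cos(math.pi * t / attack_s))

def decaying_sinusoid(freq_hz, bw_hz, fs, length):
    """Untapered single-formant impulse response."""
    return [math.exp(-math.pi * bw_hz * n / fs) * math.sin(2.0 * math.pi * freq_hz * n / fs)
            for n in range(length)]

def tapered_grain(freq_hz, bw_hz, attack_s, fs, length):
    """Same response with the onset taper applied sample by sample."""
    plain = decaying_sinusoid(freq_hz, bw_hz, fs, length)
    return [onset_taper(n / fs, attack_s) * v for n, v in enumerate(plain)]

FS = 8000          # assumed sampling rate
ATTACK_S = 0.002   # assumed 2 ms attack, i.e. 16 samples at 8 kHz
plain = decaying_sinusoid(700.0, 80.0, FS, 200)
grain = tapered_grain(700.0, 80.0, ATTACK_S, FS, 200)
```

Shortening or lengthening `ATTACK_S` changes how abruptly each grain starts, which is the kind of adjustment the paragraph above describes being tuned against the target spectrum.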

One of the difficulties of formant synthesis methods is that formant parameter estimation is not always easy [408]. The problem is particularly difficult when the fundamental frequency $F_0$ is so high that the formants are not adequately "sampled" by the harmonic frequencies, such as in high-pitched female voice samples. Formant ambiguities due to insufficient spectral sampling can often be resolved by incorporating additional physical constraints to the extent they are known.
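
The sampling problem can be made concrete with a little arithmetic: a voiced spectrum has energy only at multiples of $F_0$, so a formant is observed only where a harmonic happens to land near its peak. The sketch below uses hypothetical numbers (a 700 Hz formant with a 130 Hz bandwidth, and pitches of 100 Hz and 440 Hz) purely for illustration.

```python
def harmonics_in_band(f0_hz, center_hz, bw_hz, max_hz):
    """Harmonics of f0_hz (up to max_hz) inside the formant's bandwidth."""
    lo = center_hz - bw_hz / 2.0
    hi = center_hz + bw_hz / 2.0
    n_harm = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, n_harm + 1) if lo <= k * f0_hz <= hi]

def nearest_harmonic_offset(f0_hz, center_hz):
    """Distance in Hz from the formant center to the closest harmonic."""
    k = max(1, round(center_hz / f0_hz))
    return abs(k * f0_hz - center_hz)

# Low pitch: a harmonic falls directly on the formant peak.
print(harmonics_in_band(100.0, 700.0, 130.0, 4000.0))   # [700.0]
print(nearest_harmonic_offset(100.0, 700.0))            # 0.0

# High pitch: harmonics at 440 and 880 Hz straddle the formant and miss its band.
print(harmonics_in_band(440.0, 700.0, 130.0, 4000.0))   # []
print(nearest_harmonic_offset(440.0, 700.0))            # 180.0
```

In the high-pitch case the peak at 700 Hz must be inferred from two samples on its skirts, which is why additional constraints help.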

Formant synthesis is an effective combination of physical and spectral modeling approaches. It is a physical model in that there is an explicit division between glottal-flow wave generation and the formant-resonance filter, even though a physical model is rarely used for either the glottal waveform or the formant resonator. On the other hand, it is a spectral modeling method in that its parameters are estimated by explicitly matching short-time audio spectra of desired sounds. It is usually most effective for any synthesis model, physical or otherwise, to be optimized in the "audio perception" domain to the extent it is known how to do this [312,165]. For an illustrative example, see [201].
