Spectral Modeling Synthesis

As introduced in §10.4, Spectral Modeling Synthesis (SMS) generally refers to any parametric recreation of a short-time Fourier transform, i.e., something besides simply inverting an STFT, or a nonparametrically modified STFT (such as standard FFT convolution).^G.11 A primary example of SMS is sinusoidal modeling, and its various extensions, as described further below.

Short-Term Fourier Analysis, Modification, and Resynthesis

The Fourier duality of the overlap-add and filter-bank-summation short-time Fourier transform (discussed in Chapter 9) appeared in the late 1970s [7,9]. This unification of downsampled filter-banks and FFT processors spawned considerable literature in STFT processing [158,8,219,192,98,191]. The phase vocoder is normally regarded as a fixed bandpass filter bank. The STFT, in contrast, is usually regarded as a time-ordered sequence of overlapping FFTs (the ``overlap-add'' interpretation). Generally speaking, sound reconstruction by STFT during this period was nonparametric. A relatively exotic example was signal reconstruction from STFT magnitude data (magnitude-only reconstruction) [103,192,219,20].

In the speech-modeling world, parametric sinusoidal modeling of the STFT apparently originated in the context of the magnitude-only reconstruction problem [221].

Since the phase vocoder was in use for measuring amplitude and frequency envelopes for additive synthesis no later than 1977,^G.12 it is natural to expect that parametric ``inverse FFT synthesis'' from sinusoidal parameters would have begun by that time. Instead, however, traditional banks of sinusoidal (and more general wavetable) oscillators remained in wide use for many more years. Inverse FFT synthesis of sound was apparently first published in 1980 [35]. Thus, parametric reductions of STFT data (in the form of instantaneous amplitude and frequency envelopes of vocoder filter-channel data) were in use in the 1970s, but we were not yet resynthesizing sound by STFT using spectral buffers synthesized from parameters.

Sinusoidal Modeling Systems

With the phase vocoder, the instantaneous amplitude and frequency are normally computed only for each ``channel filter''. A consequence of using a fixed-frequency filter bank is that the frequency of each sinusoid is not normally allowed to vary outside the bandwidth of its channel bandpass filter, unless one is willing to combine channel signals in some fashion, which requires extra work. Ordinarily, the bandpass center frequencies are harmonically spaced, i.e., they are integer multiples of a base frequency. So, for example, when analyzing a piano tone, the intrinsic progressive sharpening of its partial overtones leads to some sinusoids falling ``in the cracks'' between adjacent filter channels. This is not an insurmountable condition, since the adjacent bins can be combined in a straightforward manner to provide accurate amplitude and frequency envelopes, but it is inconvenient and outside the original scope of the phase vocoder. (Recall that the phase vocoder was developed originally for speech, which is fundamentally periodic, ignoring ``jitter'', when voiced at a constant pitch.) Moreover, it is relatively unwieldy to work with the instantaneous amplitude and frequency signals from all of the filter-bank channels. For these reasons, the phase vocoder has largely been replaced by sinusoidal modeling in the context of analysis for additive synthesis of inharmonic sounds, except in constrained computational environments (such as real-time systems).

In sinusoidal modeling, the fixed, uniform filter-bank of the vocoder is replaced by a sparse, peak-adaptive filter bank, implemented by following magnitude peaks in a sequence of FFTs. The efficiency of the split-radix, Cooley-Tukey FFT makes it computationally feasible to implement an enormous number of bandpass filters in a fine-grained analysis filter bank, from which the sparse, adaptive analysis filter bank is derived. An early paper in this area is included as Appendix H.

Thus, modern sinusoidal models can be regarded as ``pruned phase vocoders'' in that they follow only the peaks of the short-time spectrum rather than the instantaneous amplitude and frequency from every channel of a uniform filter bank. Peak-tracking in a sliding short-time Fourier transform has a long history going back at least to 1957 [210,281]. Sinusoidal modeling based on the STFT of speech was introduced by Quatieri and McAulay [221,169,222,174,191,223]. STFT sinusoidal modeling in computer music began with the development of a pruned phase vocoder for piano tones [271,246] (processing details included in Appendix H).
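The peak-following step can be illustrated for a single frame. The following is a minimal sketch, not the processing described in Appendix H: it assumes a Hann window, a zero-padded FFT, a simple relative threshold, and quadratic (parabolic) interpolation of the dB magnitude around each local maximum. The function name `find_spectral_peaks` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def find_spectral_peaks(frame, fs, zero_pad=4, rel_thresh_db=-20.0):
    """Return (frequency_hz, level_db) pairs for the magnitude peaks of one frame.

    Assumptions of this sketch: Hann window, zero-padding factor `zero_pad`,
    and a threshold `rel_thresh_db` relative to the largest spectral sample.
    """
    n = len(frame)
    nfft = zero_pad * n
    spectrum = np.fft.rfft(frame * np.hanning(n), nfft)
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    floor_db = mag_db.max() + rel_thresh_db

    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > a and b >= c and b > floor_db:
            # Vertex of the parabola through the three samples around the maximum
            p = 0.5 * (a - c) / (a - 2.0 * b + c)   # offset in bins, within [-0.5, 0.5]
            freq_hz = (k + p) * fs / nfft
            level_db = b - 0.25 * (a - c) * p
            peaks.append((freq_hz, level_db))
    return peaks
```

Applying this to successive overlapping frames and linking nearby peaks from frame to frame would yield amplitude and frequency trajectories. That linking step is omitted here.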

Inverse FFT Synthesis

As mentioned in the introduction to additive synthesis above (§G.8), typical systems originally used an explicit sum of sinusoidal oscillators [166,186,232,271]. For large numbers of sinusoidal components, it is more efficient to use the inverse FFT [239,143,142,139]. See §G.8.1 for further discussion.
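As a rough illustration of the idea, the sketch below fills a spectral buffer from sinusoidal parameters and inverts it with a single FFT. It is deliberately simplified and is not the method of the cited references: it assumes every partial lies exactly on a bin center, so that each partial occupies one spectral sample, and it produces one unwindowed frame with no overlap-add. The function name `ifft_synth_frame` is hypothetical.

```python
import numpy as np

def ifft_synth_frame(bins, amps, phases, nfft):
    """Synthesize one frame of a sum of bin-centered sinusoids by inverse FFT.

    Simplifying assumption: each partial sits exactly on bin k with
    0 < k < nfft/2, so it contributes a single complex spectral sample.
    """
    spectrum = np.zeros(nfft // 2 + 1, dtype=complex)
    for k, amp, phase in zip(bins, amps, phases):
        # A real sinusoid amp*cos(2*pi*k*n/nfft + phase) has DFT value
        # (amp*nfft/2)*exp(j*phase) at bin k (and its conjugate at bin nfft-k).
        spectrum[k] += 0.5 * amp * nfft * np.exp(1j * phase)
    return np.fft.irfft(spectrum, nfft)
```

The cost of the inverse FFT does not depend on the number of partials, which is the source of the efficiency gain for large numbers of sinusoids. Handling frequencies between bin centers requires writing more than one spectral sample per partial, which this sketch does not attempt.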

Sines+Noise Synthesis

In the late 1980s, Serra and Smith combined sinusoidal modeling with noise modeling to enable more efficient synthesis of the noise-like components of sounds (§10.4.3) [246,249,250]. In this extension, the output of the sinusoidal model is subtracted from the original signal, leaving a residual signal. Assuming that the residual is a random signal, it is modeled as filtered white noise, where the magnitude envelope of its short-time spectrum becomes the filter characteristic through which white noise is passed during resynthesis.
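The residual model can be sketched for one frame as follows. This is an illustrative simplification under stated assumptions, not the specific procedure of the cited work: the magnitude envelope is taken to be the RMS magnitude in a small number of equal-width bands (a piecewise-constant envelope), the resynthesized noise uses uniformly random phases, and level normalization and overlap-add between frames are omitted. All function names are hypothetical.

```python
import numpy as np

def noise_envelope(residual_frame, n_bands=16):
    """Coarse magnitude envelope of a residual frame: RMS magnitude per band."""
    window = np.hanning(len(residual_frame))
    mag = np.abs(np.fft.rfft(residual_frame * window))
    return np.array([np.sqrt(np.mean(band ** 2))
                     for band in np.array_split(mag, n_bands)])

def expand_envelope(envelope, n_bins):
    """Expand the per-band envelope to one magnitude value per FFT bin."""
    sizes = [len(idx) for idx in np.array_split(np.arange(n_bins), len(envelope))]
    return np.repeat(envelope, sizes)

def synth_noise_frame(envelope, frame_len, rng):
    """Filtered-noise frame: envelope magnitudes combined with random phases."""
    n_bins = frame_len // 2 + 1
    mag = expand_envelope(envelope, n_bins)
    phase = rng.uniform(-np.pi, np.pi, n_bins)
    phase[0] = 0.0    # DC must be real for a real output signal
    phase[-1] = 0.0   # likewise the Nyquist bin (frame_len assumed even)
    return np.fft.irfft(mag * np.exp(1j * phase), frame_len)
```

Here the residual frame would be obtained by subtracting the sinusoidal resynthesis from the original frame. Only the envelope values need to be stored per frame, not the residual waveform itself.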

Multiresolution Sinusoidal Modeling

Prior to the late 1990s, both vocoders and sinusoidal models were focused on modeling single-pitched, monophonic sound sources, such as a single saxophone note. Scott Levine showed that by going to multiresolution sinusoidal modeling (§10.4.4;§7.3.3), it becomes possible to encode general polyphonic sound sources with a single unified system [149,147,148]. ``Multiresolution'' refers to the use of a non-uniform filter bank, such as a wavelet or ``constant Q'' filter bank, in the underlying spectrum analysis.
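To make the notion of a ``constant Q'' filter bank concrete, the sketch below computes a generic constant-Q channel layout: geometrically spaced center frequencies, bandwidths proportional to center frequency, and analysis window lengths that shrink as frequency rises. It illustrates the general idea only and is not the filter bank of the cited multiresolution system. The function name and parameters are hypothetical.

```python
import numpy as np

def constant_q_layout(f_min, f_max, bins_per_octave, fs):
    """Center frequencies, bandwidths, and window lengths of a constant-Q bank."""
    n_bins = int(np.floor(bins_per_octave * np.log2(f_max / f_min))) + 1
    k = np.arange(n_bins)
    fc = f_min * 2.0 ** (k / bins_per_octave)          # geometric spacing
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # same Q for every channel
    bandwidth = fc / q                                 # grows with frequency
    win_len = np.ceil(q * fs / fc).astype(int)         # Q periods of fc per window
    return fc, bandwidth, win_len
```

Low-frequency channels get long windows (fine frequency resolution), while high-frequency channels get short windows (fine time resolution), in contrast to the single window length of a uniform STFT filter bank.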

Transient Models

Another improvement to sines+noise modeling in the late 1990s was explicit transient modeling [6,149,147,144,148,290,282]. These methods address the principal remaining deficiency in sines+noise modeling, preserving crisp ``attacks'', ``clicks'', and the like, without having to use hundreds or thousands of sinusoids to accurately resynthesize the transient.^G.13

The transient segment is generally ``spliced'' to the steady-state sinusoidal (or sines+noise) segment by using phase-matched sinusoids at the transition point. This is usually the only time phase is needed for the sinusoidal components.

To summarize sines+noise+transient modeling of sound, we can recap as follows:

sinusoids efficiently model tonal signal components
filtered-noise efficiently models what is left after removing the tonal components from a steady-state spectrum
transients should be handled separately to avoid the need for many sinusoids

So, although sinusoids are sufficiently general thanks to Fourier's theorem, the combination of sines, filtered-noise, and transient segments can provide a much more compact basis for audio signals. Such compact building-blocks for sound are useful for audio coding and manipulation.

Time-Frequency Reassignment

A relatively recent topic in sinusoidal modeling is time-frequency reassignment, in which STFT phase information is used to provide nonlinearly enhanced time-frequency resolution in STFT displays [12,73,81]. The basic idea is to refine the spectrogram (§7.2) by assigning spectral energy in each bin to its ``center of gravity'' frequency instead of the bin center frequency. This has the effect of significantly sharpening the appearance of spectrograms for certain classes of signals, such as quasi-sinusoidal sums. Time reassignment is defined analogously to frequency reassignment.
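Frequency reassignment for a single frame can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the cited references: it uses a periodic Hann window, the frame-local convention X_h[k] = sum_n x[n] h[n] exp(-j 2 pi k n / N), and the commonly used form in which the reassigned frequency is the bin frequency minus Im(X_dh / X_h), where X_dh is the same transform computed with the time derivative of the window. The function name is hypothetical.

```python
import numpy as np

def reassigned_bin_frequencies(frame):
    """Reassigned frequency (in fractional bins) for every FFT bin of one frame.

    Returns (k_hat, magnitude). Values of k_hat in bins with negligible
    magnitude are not meaningful.
    """
    n_fft = len(frame)
    n = np.arange(n_fft)
    h = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft)        # periodic Hann window
    dh = (np.pi / n_fft) * np.sin(2.0 * np.pi * n / n_fft)  # dh/dn, per sample

    x_h = np.fft.fft(frame * h)
    x_dh = np.fft.fft(frame * dh)

    power = np.maximum(np.abs(x_h) ** 2, 1e-300)             # guard against 0/0
    # Im(X_dh / X_h) is the offset (rad/sample) of the bin center from the
    # local "center of gravity" frequency under the convention above.
    offset = np.imag(x_dh * np.conj(x_h)) / power
    k_hat = np.arange(n_fft) - offset * n_fft / (2.0 * np.pi)
    return k_hat, np.abs(x_h)
```

For a single stationary sinusoid, every bin within the window's main lobe is reassigned to approximately the same frequency, which is what sharpens the display. Sign conventions for the correction term differ between formulations, so the sign used here should be checked against whichever STFT definition is in use.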

Perceptual Audio Compression

It often happens that the model which is most natural from a conceptual (and manipulation) point of view is also the most effective from a compression point of view. This is because, in the ``right'' signal model for a natural sound, the model's parameters tend to vary quite slowly compared with the audio rate. As an example, physical models of the human voice and musical instruments have led to expressive synthesis algorithms which can also represent high-quality sound at much lower bit rates (such as MIDI event rates) than normally obtained by encoding the sound directly [46,259,262,154].

The sines+noise+transients spectral model follows a natural perceptual decomposition of sound into three qualitatively different components: ``tones'', ``noises'', and ``attacks''. This compact representation for sound is useful for both musical manipulations and data compression. It has been used, for example, to create an audio compression format comparable in quality to MPEG-AAC [24,25,16] (at 32 kbps), yet it can be time-scaled or frequency-shifted without introducing objectionable artifacts [149].