Spectral Modeling Synthesis (SMS) generally refers to any parametric creation of a short-time Fourier transform, i.e., something besides simply inverting an STFT, or a nonparametrically modified STFT (such as standard FFT convolution). A primary example of SMS is sinusoidal modeling, and its various extensions, as described further below.
The Fourier duality of the overlap-add and filter-bank-summation interpretations of the short-time Fourier transform (discussed in Chapter 8) appeared in the late 1970s [7,9]. This unification of downsampled filter-banks and FFT processors spawned considerable literature in STFT processing [149,8,208,181,94,180]. The phase vocoder is normally regarded as a fixed band-pass filter bank. The STFT, in contrast, is usually regarded as a time-ordered sequence of overlapping FFTs (the ``overlap-add'' interpretation). Generally speaking, sound reconstruction by STFT during this period was nonparametric. A relatively exotic example was signal reconstruction from STFT magnitude data [99,181,208].
Since the phase vocoder was in use for measuring amplitude and frequency envelopes for additive synthesis no later than 1977, it is natural to expect that parametric ``inverse FFT synthesis'' from sinusoidal parameters would have begun by that time. Instead, however, traditional banks of sinusoidal (and more general wavetable) oscillators remained in wide use for many more years. Inverse FFT synthesis of sound was apparently first published in 1980. Thus, parametric reductions of STFT data (in the form of instantaneous amplitude and frequency envelopes of vocoder filter-channel data) were in use in the 1970s, but we were not yet resynthesizing sound by STFT using spectral buffers synthesized from parameters.
Sinusoidal Modeling Systems
With the phase vocoder, the instantaneous amplitude and frequency are normally computed only for each ``channel filter''. A consequence of using a fixed-frequency filter bank is that the frequency of each sinusoid is not normally allowed to vary outside the bandwidth of its channel band-pass filter, unless one is willing to combine channel signals in some fashion, which requires extra work. Ordinarily, the band-pass center frequencies are harmonically spaced, i.e., they are integer multiples of a base frequency. So, for example, when analyzing a piano tone, the intrinsic progressive sharpening of its partial overtones leads to some sinusoids falling ``in the cracks'' between adjacent filter channels. This is not an insurmountable condition, since the adjacent bins can be combined in a straightforward manner to provide accurate amplitude and frequency envelopes, but it is inconvenient and outside the original scope of the phase vocoder (which, recall, was developed originally for speech, a signal that is fundamentally periodic (ignoring ``jitter'') when voiced at a constant pitch). Moreover, it is relatively unwieldy to work with the instantaneous amplitude and frequency signals from all of the filter-bank channels. For these reasons, the phase vocoder has largely been replaced by sinusoidal modeling in the context of analysis for additive synthesis of inharmonic sounds, except in constrained computational environments (such as real-time systems). In sinusoidal modeling, the fixed, uniform filter-bank of the vocoder is replaced by a sparse, peak-adaptive filter bank, implemented by following magnitude peaks in a sequence of FFTs. The efficiency of the FFT makes it computationally feasible to implement an enormous number of bandpass filters in a fine-grained analysis filter bank, from which the sparse, adaptive analysis filter bank is derived. An early paper in this area is included as Appendix I.
Thus, modern sinusoidal models can be regarded as ``pruned phase vocoders'' in that they follow only the peaks of the short-time spectrum rather than the instantaneous amplitude and frequency from every channel of a uniform filter bank. Peak-tracking in a sliding short-time Fourier transform has a long history going back approximately half a century [199,262]. Sinusoidal modeling based on the STFT of speech was introduced by Quatieri and McAulay [209,159,210,164,180,211]. STFT sinusoidal modeling in computer music began with the development of a pruned phase vocoder for piano tones [255,228] (processing details included in Appendix I).
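The peak-following step can be illustrated for a single frame. The following is a minimal sketch, not the processing described in Appendix I: it assumes a Hann window, a naive DFT in place of an FFT, and quadratic (parabolic) interpolation of the dB magnitude around each local maximum. The function names, the test frequencies, and the 20 dB floor are all hypothetical choices made for illustration.

```python
import cmath
import math

def hann(n_samples):
    """Periodic Hann window of length n_samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_samples) for n in range(n_samples)]

def magnitude_db(frame):
    """Naive O(N^2) DFT magnitude in dB for bins 0..N/2 (an FFT would be used in practice)."""
    n_samples = len(frame)
    mags = []
    for k in range(n_samples // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples) for n, x in enumerate(frame))
        mags.append(20 * math.log10(abs(acc) + 1e-12))
    return mags

def find_peaks(mags_db, sample_rate, n_fft, floor_db):
    """Return (frequency_hz, level_db) for each local maximum above floor_db,
    refined by fitting a parabola through the peak bin and its two neighbors."""
    peaks = []
    for k in range(1, len(mags_db) - 1):
        a, b, c = mags_db[k - 1], mags_db[k], mags_db[k + 1]
        if b > a and b >= c and b > floor_db:
            offset = 0.5 * (a - c) / (a - 2 * b + c)  # peak location in bins relative to k
            level = b - 0.25 * (a - c) * offset       # interpolated peak height in dB
            peaks.append(((k + offset) * sample_rate / n_fft, level))
    return peaks

# Hypothetical test frame: two sinusoids, one of which falls between bin centers.
sample_rate = 8000
n_fft = 256
window = hann(n_fft)
frame = [w * (math.sin(2 * math.pi * 440 * n / sample_rate)
              + 0.5 * math.sin(2 * math.pi * 1000 * n / sample_rate))
         for n, w in enumerate(window)]
peaks = find_peaks(magnitude_db(frame), sample_rate, n_fft, floor_db=20.0)
```

A complete sinusoidal model would repeat this on every frame and link peaks across frames into amplitude and frequency trajectories; that tracking step is omitted here.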
Inverse FFT Synthesis
As mentioned in the introduction to additive synthesis above (§H.8), early systems used an explicit sum of sinusoidal oscillators [157,175,219,255]. For large numbers of sinusoidal components, it is more efficient to use the inverse FFT [225,135,134,132]. See §H.8.1 for further discussion.
In the late 1980s, Serra and Smith combined sinusoidal modeling with noise modeling to enable more efficient synthesis of the noise-like components of sounds [228,231,232]. In this extension, the output of the sinusoidal model is subtracted from the original signal, leaving a residual signal. Assuming that the residual is a random signal, it is modeled as filtered white noise, where the magnitude envelope of its short-time spectrum becomes the filter characteristic through which white noise is passed during resynthesis.
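The residual-modeling idea can be sketched for one frame as follows. This is an illustration under stated assumptions, not the published algorithm: the sinusoidal component is assumed to be known exactly, the magnitude envelope is approximated by averaging the residual spectrum over equal-width bands, and ``white noise through a filter'' is imitated by giving each bin the envelope magnitude and a random phase. Windowing and overlap-add are omitted, and all names are hypothetical.

```python
import cmath
import math
import random

def dft(frame):
    """Naive O(N^2) DFT (an FFT would be used in practice)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(frame))
            for k in range(n)]

def noise_envelope(residual_frame, n_bands):
    """Mean spectral magnitude of the residual in n_bands equal-width bands
    covering bins 0..N/2-1. Assumes N/2 is divisible by n_bands."""
    half = len(residual_frame) // 2
    mags = [abs(v) for v in dft(residual_frame)[:half]]
    width = half // n_bands
    return [sum(mags[b * width:(b + 1) * width]) / width for b in range(n_bands)]

def synth_noise_frame(envelope, n_fft, rng):
    """Build a Hermitian-symmetric spectrum with the envelope's magnitudes and
    random phases, then inverse-DFT it to a real frame of length n_fft."""
    half = n_fft // 2
    width = half // len(envelope)
    spectrum = [0j] * n_fft
    spectrum[0] = complex(envelope[0], 0.0)  # DC must be real; the Nyquist bin is left at zero
    for k in range(1, half):
        value = cmath.rect(envelope[k // width], rng.uniform(-math.pi, math.pi))
        spectrum[k] = value
        spectrum[n_fft - k] = value.conjugate()
    return [sum(v * cmath.exp(2j * math.pi * k * m / n_fft)
                for k, v in enumerate(spectrum)).real / n_fft
            for m in range(n_fft)]

# Hypothetical frame: one sinusoid (taken as the sinusoidal model's output) plus noise.
rng = random.Random(0)
n_fft = 64
sine = [0.8 * math.sin(2 * math.pi * 5 * m / n_fft) for m in range(n_fft)]
original = [s + rng.gauss(0.0, 0.1) for s in sine]
residual = [x - s for x, s in zip(original, sine)]  # original minus sinusoidal model
envelope = noise_envelope(residual, n_bands=8)
resynth = synth_noise_frame(envelope, n_fft, rng)
```

The resynthesized frame shares only the band-averaged magnitude envelope with the residual, not its waveform, which is the point of treating the residual as a random signal.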
Multiresolution Sinusoidal Modeling
Prior to the late 1990s, both vocoders and sinusoidal models were focused on modeling single-pitched, monophonic sound sources, such as a single saxophone note. Scott Levine showed that by going to multiresolution sinusoidal modeling (see §9.9), it becomes possible to encode general polyphonic sound sources with a single unified system [140,138,139]. ``Multiresolution'' refers to the use of a non-uniform filter bank, such as a wavelet or ``constant Q'' filter bank, in the underlying spectrum analysis. See Fig. 9.16 for an example time-frequency resolution grid.
A collection of sound examples illustrating sines+noise+transients modeling (and various subsets thereof), as well as some audio effects made possible by such representations, can be found online at
A relatively recent topic in sinusoidal modeling is time-frequency reassignment, in which STFT phase information is used to provide nonlinearly enhanced time-frequency resolution in STFT displays [12,71,78]. The basic idea is to refine the spectrogram (§6.2) by assigning spectral energy in each bin to its ``center of gravity'' frequency instead of the bin center frequency. This has the effect of significantly sharpening the appearance of spectrograms for certain classes of signals, such as quasi-sinusoidal sums. Time reassignment is analogous to frequency reassignment.
It often happens that the model which is most natural from a conceptual (and manipulative) point of view is also the most effective from a compression point of view. This is because, in the ``right'' signal model for a natural sound, the model's parameters tend to vary quite slowly compared with the audio rate. As an example, physical models of the human voice and musical instruments have led to expressive synthesis algorithms which can also represent high-quality sound at much lower bit rates (such as MIDI event rates) than normally obtained by encoding the sound directly [46,242,246,145].
The sines+noise+transients spectral model follows a natural perceptual decomposition of sound into three qualitatively different components: ``tones'', ``noises'', and ``attacks''. This compact representation for sound is useful for both musical manipulations and data compression. It has been used, for example, to create an audio compression format comparable in quality to MPEG-AAC [23,24,16] (at 32 kbits/s), yet it can be time-scaled or frequency-shifted without introducing objectionable artifacts.
Sinusoidal models naturally support time-scale modification (TSM) (and frequency-scaling, its Fourier dual), because the original signal is replaced by oscillator amplitude and frequency envelopes which are easily time-scaled without causing unnatural artifacts. When amplitude or frequency envelopes are rescaled in time, the oscillators are allowed to run continuously under them, thereby avoiding artifacts. Similarly, in sines-plus-noise synthesis, the time-varying noise-filter is a time-frequency envelope that can be smoothly rescaled along either dimension.
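As a minimal sketch of this idea for a single partial, the envelopes below are stored as piecewise-linear breakpoint lists and read at a rescaled time, while the oscillator phase is accumulated sample by sample so that it runs continuously under them. The breakpoint format, function names, and numerical values are assumptions made for illustration.

```python
import math

def interp(breakpoints, t):
    """Piecewise-linear interpolation of (time_s, value) breakpoints, clamped at both ends."""
    if t <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return breakpoints[-1][1]

def synthesize_partial(amp_env, freq_env, stretch, sample_rate):
    """Render one sinusoidal partial with the time axis of its envelopes scaled by
    `stretch`. The frequency values themselves are unchanged, so pitch is preserved."""
    duration = max(amp_env[-1][0], freq_env[-1][0]) * stretch
    n_samples = int(round(duration * sample_rate))
    phase = 0.0
    out = []
    for i in range(n_samples):
        t = i / sample_rate / stretch  # position on the original (unstretched) time axis
        out.append(interp(amp_env, t) * math.sin(phase))
        phase += 2 * math.pi * interp(freq_env, t) / sample_rate  # continuous phase
    return out

# Hypothetical envelopes: a fast attack, a slow decay, and a slight upward glide.
amp_env = [(0.0, 0.0), (0.01, 1.0), (0.1, 0.0)]
freq_env = [(0.0, 440.0), (0.1, 445.0)]
original = synthesize_partial(amp_env, freq_env, 1.0, 8000)
stretched = synthesize_partial(amp_env, freq_env, 2.0, 8000)
```

Frequency scaling, the dual operation, would instead leave the time axis alone and multiply the values read from `freq_env` by a constant.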
When time-stretching, say, a sines+noise model, transients are ``smeared out'' over time. In sines+noise+transients models, transients may be time-shifted instead. However, while this sounds very good when it works, it can be difficult to detect and preserve each and every transient in the signal. In fact, it can become difficult to define exactly what is meant by the term ``transient''. For example, an abrupt cello note onset is naturally defined as a starting transient for the note. If the note onset is then played in a more and more legato style, when is there no longer a starting transient? One idea is to define ``transientness'' as a variable between 0 and 1, so that the degree of TSM ``stretchiness'' for that time interval can be multiplied by (1-transientness).
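One possible reading of this idea is sketched below. It assumes that ``stretchiness'' means the amount by which the stretch factor exceeds 1, so that a fully transient frame (transientness 1) is played at its original speed and a fully steady frame (transientness 0) receives the full stretch. This interpretation, the per-frame hop model, and all names are assumptions, not something specified in the text.

```python
def local_stretch(global_stretch, transientness):
    """Stretch factor for one frame: the excess over 1 is scaled by (1 - transientness)."""
    return 1.0 + (global_stretch - 1.0) * (1.0 - transientness)

def output_frame_times(hop_s, global_stretch, transientness_per_frame):
    """Map each analysis frame to an output start time by accumulating locally
    stretched hops. Returns (list of output start times, total output duration)."""
    times = []
    t = 0.0
    for transientness in transientness_per_frame:
        times.append(t)
        t += hop_s * local_stretch(global_stretch, transientness)
    return times, t

# Hypothetical 10 ms frames: steady, full transient, half transient, steady.
times, total = output_frame_times(0.01, 2.0, [0.0, 1.0, 0.5, 0.0])
```

Note that the total output duration is then shorter than the nominal stretch would give, so a practical system might compensate by stretching the steady frames slightly more.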
Audio demonstrations of TSM and frequency-scaling based on the sines+noise+transients model of Scott Levine may be found online at
See §9.9 for additional details.
A 74-page summary of sinusoidal modeling of sound, including sines+noise modeling is given in . An update on the activities in Xavier Serra's lab in Barcelona is given in a dedicated chapter of the DAFX Book . Additional references related to sinusoidal modeling include [163,80,116,160,155,161,162,81,274,57,224,137,30].
Perceptual audio coding
Speech Synthesis Examples