## Spectral Modeling Synthesis

As introduced in §10.4, *Spectral Modeling Synthesis* (SMS)
generally refers to any *parametric* recreation of a short-time
Fourier transform, *i.e.*, something besides simply inverting an STFT, or
a nonparametrically modified STFT (such as standard FFT
convolution).^{G.11} A
primary example of SMS is *sinusoidal modeling*, and its various
extensions, as described further below.

### Short-Term Fourier Analysis, Modification, and Resynthesis

The Fourier duality of the overlap-add and filter-bank-summation
short-time Fourier transform (discussed in Chapter 9) appeared in
the late 1970s [7,9]. This
unification of downsampled filter-banks and FFT processors spawned
considerable literature in STFT processing
[158,8,219,192,98,191].
While the phase vocoder is normally regarded as a fixed bandpass
filter bank, the STFT is usually regarded as a
time-ordered sequence of overlapping FFTs (the ``overlap-add''
interpretation). Generally speaking, sound reconstruction by STFT
during this period was nonparametric. A relatively exotic example was
signal reconstruction from STFT magnitude data (*magnitude-only
reconstruction*)
[103,192,219,20].

In the speech-modeling world, parametric sinusoidal modeling of the STFT apparently originated in the context of the magnitude-only reconstruction problem [221].

Since the phase vocoder was in use for measuring amplitude and
frequency envelopes for additive synthesis no later than
1977,^{G.12} it is natural to expect that parametric ``inverse FFT synthesis'' from
sinusoidal parameters would have begun by that time. Instead,
however, traditional banks of sinusoidal (and more general wavetable)
oscillators remained in wide use for many more years. Inverse FFT
synthesis of sound was apparently first published in
1980 [35]. Thus, parametric reductions of STFT data (in
the form of instantaneous amplitude and frequency envelopes of vocoder
filter-channel data) were in use in the 1970s, but we were not yet
resynthesizing sound by STFT using spectral buffers synthesized from
parameters.

### Sinusoidal Modeling Systems

With the phase vocoder, the instantaneous amplitude and frequency are
normally computed only for each ``channel filter''. A consequence of
using a fixed-frequency filter bank is that the frequency of each
sinusoid is not normally allowed to vary outside the bandwidth of its
channel bandpass filter, unless one is willing to combine channel
signals in some fashion which requires extra work. Ordinarily, the
bandpass center frequencies are harmonically spaced, *i.e.*, they are
integer multiples of a base frequency. So, for example, when analyzing
a piano tone, the intrinsic progressive sharpening of its partial
overtones leads to some sinusoids falling ``in the cracks'' between
adjacent filter channels. This is not an insurmountable problem,
since the adjacent bins can be combined in a straightforward manner to
provide accurate amplitude and frequency envelopes,
but it is inconvenient and outside the original scope of the phase
vocoder (which, recall, was developed originally for speech, which is
fundamentally periodic (ignoring ``jitter'') when voiced at a constant
pitch). Moreover, it is relatively unwieldy to work with the
instantaneous amplitude and frequency signals from all of the
filter-bank channels. For these reasons, the phase vocoder has
been largely replaced by sinusoidal modeling in the
context of analysis for additive synthesis of inharmonic sounds,
except in constrained computational environments (such as real-time
systems). In sinusoidal modeling, the fixed, uniform filter-bank of
the vocoder is replaced by a *sparse, peak-adaptive* filter bank,
implemented by following magnitude peaks in a sequence of FFTs. The
efficiency of the split-radix Cooley-Tukey FFT
makes it computationally feasible to implement
an enormous number of bandpass filters in a fine-grained analysis
filter bank, from which the sparse, adaptive analysis filter bank is
derived. An early paper in this area is included as Appendix H.

Thus, modern sinusoidal models can be regarded as ``pruned phase vocoders'' in that they follow only the peaks of the short-time spectrum rather than the instantaneous amplitude and frequency from every channel of a uniform filter bank. Peak-tracking in a sliding short-time Fourier transform has a long history going back at least to 1957 [210,281]. Sinusoidal modeling based on the STFT of speech was introduced by Quatieri and McAulay [221,169,222,174,191,223]. STFT sinusoidal modeling in computer music began with the development of a pruned phase vocoder for piano tones [271,246] (processing details included in Appendix H).
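The peak-picking step at the heart of this ``pruning'' can be sketched in a minimal (hypothetical) NumPy example: magnitude peaks in one FFT frame are located as local maxima above a threshold, and each peak frequency is then refined by parabolic interpolation of the log magnitude (a common refinement, not necessarily the exact procedure of Appendix H):

```python
import numpy as np

fs = 8000.0
N = 1024
n = np.arange(N)
# toy "piano-like" tone: two inharmonically spaced partials
x = np.cos(2*np.pi*440.0*n/fs) + 0.6*np.cos(2*np.pi*893.0*n/fs)

mag = np.abs(np.fft.rfft(np.hanning(N)*x))
thresh = 0.1*mag.max()
# spectral peaks = local magnitude maxima above the threshold
peaks = [k for k in range(1, len(mag)-1)
         if mag[k-1] < mag[k] > mag[k+1] and mag[k] > thresh]

freqs = []
for k in peaks:
    a, b, c = np.log(mag[k-1]), np.log(mag[k]), np.log(mag[k+1])
    p = 0.5*(a - c)/(a - 2*b + c)   # fractional-bin offset of the peak
    freqs.append((k + p)*fs/N)      # refined frequency in Hz
```

Tracking such peaks from frame to frame yields the sparse, adaptive filter bank described above.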

### Inverse FFT Synthesis

As mentioned in the introduction to additive synthesis above (§G.8), typical systems originally used an explicit sum of sinusoidal oscillators [166,186,232,271]. For large numbers of sinusoidal components, it is more efficient to use the inverse FFT [239,143,142,139]. See §G.8.1 for further discussion.
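The core trick can be illustrated with a minimal sketch, assuming (for simplicity) bin-centered frequencies so that each partial occupies a single FFT bin; practical systems instead write the window transform's main lobe around each peak frequency and overlap-add successive frames:

```python
import numpy as np

N = 512
# hypothetical partials: (bin index, amplitude, phase)
partials = [(10, 1.0, 0.3), (23, 0.5, -1.2)]

X = np.zeros(N, dtype=complex)
for k, a, phi in partials:
    X[k] = 0.5*a*N*np.exp(1j*phi)   # positive-frequency component
    X[N-k] = np.conj(X[k])          # Hermitian symmetry => real output
y = np.fft.ifft(X).real             # one inverse FFT synthesizes all partials

# reference: explicit bank of sinusoidal oscillators
n = np.arange(N)
ref = sum(a*np.cos(2*np.pi*k*n/N + phi) for k, a, phi in partials)
```

Here a single inverse FFT replaces the entire oscillator bank, which is where the efficiency gain for many partials comes from.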

### Sines+Noise Synthesis

In the late 1980s, Serra and Smith combined sinusoidal modeling with
noise modeling to enable more efficient synthesis of the noise-like
components of sounds (§10.4.3)
[246,249,250]. In this extension, the
output of the sinusoidal model is subtracted from the original signal,
leaving a residual signal. Assuming that the residual is a random
signal, it is modeled as *filtered white noise*, where the
magnitude envelope of its short-time spectrum becomes the filter
characteristic through which white noise is passed during resynthesis.

### Multiresolution Sinusoidal Modeling

Prior to the late 1990s, both vocoders and sinusoidal models were
focused on modeling single-pitched, monophonic sound sources, such as
a single saxophone note. Scott Levine showed that by going to
*multiresolution sinusoidal modeling*
(§10.4.4;§7.3.3), it becomes
possible to encode general polyphonic sound sources with a single
unified system
[149,147,148].
``Multiresolution'' refers to the use of a non-uniform filter bank,
such as a wavelet or ``constant Q'' filter bank, in the underlying
spectrum analysis.

### Transient Models

Another improvement to sines+noise modeling in the late 1990s was
explicit *transient modeling*
[6,149,147,144,148,290,282].
These methods address the principal remaining deficiency in
sines+noise modeling, preserving crisp ``attacks'', ``clicks'', and
the like, without having to use hundreds or thousands of sinusoids to
accurately resynthesize the transient.^{G.13}

The transient segment is generally ``spliced'' to the steady-state
sinusoidal (or sines+noise) segment by using *phase-matched
sinusoids* at the transition point. This is usually the only time
phase is needed for the sinusoidal components.

To summarize sines+noise+transients modeling of sound:

- sinusoids efficiently model tonal signal components
- filtered noise efficiently models what is left after removing the tonal components from a steady-state spectrum
- transients should be handled separately to avoid the need for many sinusoids

### Time-Frequency Reassignment

A relatively recent topic in sinusoidal modeling is
*time-frequency reassignment*,
in which STFT phase information is
used to provide nonlinearly enhanced time-frequency resolution in STFT
displays [12,73,81]. The
basic idea is to refine the spectrogram (§7.2) by
assigning spectral energy in each bin to its ``center of gravity''
frequency instead of the bin center frequency. This has the effect of
significantly sharpening the appearance of spectrograms for certain
classes of signals, such as quasi-sinusoidal sums. Time reassignment
is analogous, assigning spectral energy within each frame to its
center-of-gravity time rather than the frame center time.
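One simple frequency-reassignment estimator can be sketched as follows (a toy illustration; the cited references instead use time- and frequency-weighted analysis windows): for a sinusoid dominating a bin, the phase advance per sample between two analysis frames offset by one sample gives the reassigned frequency, which is far more accurate than the bin center frequency:

```python
import numpy as np

fs = 8000.0
f0 = 1230.0                      # deliberately off bin center
N = 1024
n = np.arange(N)
w = np.hanning(N)

# two analysis frames offset by one sample
X1 = np.fft.rfft(w * np.cos(2*np.pi*f0*n/fs))
X2 = np.fft.rfft(w * np.cos(2*np.pi*f0*(n+1)/fs))

k = int(np.argmax(np.abs(X1)))   # peak bin
f_bin = k*fs/N                   # bin center frequency (coarse)
# reassigned frequency: phase advance per sample at the peak bin
f_reassigned = np.angle(X2[k]/X1[k]) * fs / (2*np.pi)
```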

### Perceptual Audio Compression

It often happens that the model which is most natural from a conceptual (and manipulation) point of view is also the most effective from a compression point of view. This is because, in the ``right'' signal model for a natural sound, the model's parameters tend to vary quite slowly compared with the audio rate. As an example, physical models of the human voice and musical instruments have led to expressive synthesis algorithms which can also represent high-quality sound at much lower bit rates (such as MIDI event rates) than normally obtained by encoding the sound directly [46,259,262,154].

The sines+noise+transients spectral model follows a natural perceptual decomposition of sound into three qualitatively different components: ``tones'', ``noises'', and ``attacks''. This compact representation for sound is useful for both musical manipulations and data compression. It has been used, for example, to create an audio compression format comparable in quality to MPEG-AAC [24,25,16] (at 32 kbps), yet it can be time-scaled or frequency-shifted without introducing objectionable artifacts [149].

### Further Reading

A 74-page summary of sinusoidal modeling of sound, including sines+noise modeling, is given in [223]. An update on the activities in Xavier Serra's lab in Barcelona is given in a dedicated chapter of the DAFX Book [10]. Scott Levine's most recent summary/review is [146]. Additional references related to sinusoidal modeling include [173,83,122,170,164,171,172,84,295,58,237,145,31].
