## Spectral Modeling Synthesis

Spectral Modeling Synthesis (SMS) generally refers to
any *parametric* creation of a short-time Fourier transform, *i.e.*,
something besides simply inverting an STFT, or a nonparametrically
modified STFT (such as standard FFT convolution). A primary example
of SMS is *sinusoidal modeling*, and its various extensions, as
described further below.

### Short Time Fourier Analysis, Modification, and Resynthesis

The Fourier duality of the overlap-add and filter-bank-summation short-time Fourier transform (discussed in Chapter 8) appeared in the late 1970s [7,9]. This unification of downsampled filter banks and FFT processors spawned a considerable literature in STFT processing [149,8,208,181,94,180]. While the phase vocoder is normally regarded as a fixed band-pass filter bank, the STFT is usually regarded as a time-ordered sequence of overlapping FFTs (the ``overlap-add'' interpretation). Generally speaking, sound reconstruction by STFT during this period was nonparametric. A relatively exotic example was signal reconstruction from STFT magnitude data [99,181,208].

In the speech modeling world, parametric sinusoidal modeling of the STFT apparently originated in the context of the magnitude-only reconstruction problem [209].

Since the phase vocoder was in use for measuring amplitude and
frequency envelopes for additive synthesis no later than
1977,^{H.8} it is natural to expect that parametric ``inverse FFT synthesis'' from
sinusoidal parameters would have begun by that time. Instead,
however, traditional banks of sinusoidal (and more general wavetable)
oscillators remained in wide use for many more years. Inverse FFT
synthesis of sound was apparently first published in
1980 [34]. Thus, parametric reductions of STFT data (in
the form of instantaneous amplitude and frequency envelopes of vocoder
filter-channel data) were in use in the 1970s, but we were not yet
resynthesizing sound by STFT using spectral buffers synthesized from
parameters.

### Sinusoidal Modeling Systems

With the phase vocoder, the instantaneous amplitude and frequency are
normally computed only for each ``channel filter''. A consequence of
using a fixed-frequency filter bank is that the frequency of each
sinusoid is not normally allowed to vary outside the bandwidth of its
channel band-pass filter, unless one is willing to combine channel
signals in some fashion which requires extra work. Ordinarily, the
band-pass center frequencies are harmonically spaced, *i.e.*, they are
integer multiples of a base frequency. For example, when analyzing
a piano tone, the intrinsic progressive sharpening of its partial
overtones leads to some sinusoids falling ``in the cracks'' between
adjacent filter channels. This is not insurmountable, since adjacent
bins can be combined in a straightforward manner to provide accurate
amplitude and frequency envelopes, but doing so is inconvenient and
outside the original scope of the phase
vocoder (which, recall, was developed originally for speech, which is
fundamentally periodic (ignoring ``jitter'') when voiced at a constant
pitch). Moreover, it is relatively unwieldy to work with the
instantaneous amplitude and frequency signals from all of the
filter-bank channels. For these reasons, the phase vocoder has
largely been replaced by sinusoidal modeling in the
context of analysis for additive synthesis of inharmonic sounds,
except in constrained computational environments (such as real-time
systems). In sinusoidal modeling, the fixed, uniform filter-bank of
the vocoder is replaced by a *sparse, peak-adaptive* filter bank,
implemented by following magnitude peaks in a sequence of FFTs. The
efficiency of the FFT makes it computationally feasible to implement
an enormous number of bandpass filters in a fine-grained analysis
filter bank, from which the sparse, adaptive analysis filter bank is
derived. An early paper in this area is included as Appendix I.

Thus, modern sinusoidal models can be regarded as ``pruned phase vocoders'' in that they follow only the peaks of the short-time spectrum rather than the instantaneous amplitude and frequency from every channel of a uniform filter bank. Peak-tracking in a sliding short-time Fourier transform has a long history going back approximately half a century [199,262]. Sinusoidal modeling based on the STFT of speech was introduced by Quatieri and McAulay [209,159,210,164,180,211]. STFT sinusoidal modeling in computer music began with the development of a pruned phase vocoder for piano tones [255,228] (processing details included in Appendix I).
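As a concrete illustration of the per-frame step in such a ``pruned phase vocoder,'' the following sketch (in Python with NumPy; not from the original text, and the parameter choices are illustrative) picks magnitude peaks in one FFT frame and refines each peak frequency by quadratic (parabolic) interpolation on the dB magnitude:

```python
import numpy as np

def find_peaks(frame, fs, threshold_db=-20.0):
    """Locate sinusoidal peaks in one frame via the FFT magnitude.

    Returns a list of (frequency_hz, magnitude) estimates, refining each
    peak location by parabolic interpolation on the dB magnitude.
    The magnitude is in raw FFT units (window gain not compensated).
    """
    N = len(frame)
    X = np.fft.rfft(frame * np.hanning(N))
    mag_db = 20 * np.log10(np.abs(X) + 1e-12)
    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        # Keep local maxima within threshold_db of the strongest peak.
        if b > a and b > c and b > mag_db.max() + threshold_db:
            # Parabolic interpolation of the peak location (|p| <= 0.5 bin).
            p = 0.5 * (a - c) / (a - 2 * b + c)
            peak_db = b - 0.25 * (a - c) * p
            peaks.append(((k + p) * fs / N, 10 ** (peak_db / 20)))
    return peaks

# Two sinusoids whose frequencies fall between bin centers.
fs, N = 44100, 4096
t = np.arange(N) / fs
x = 1.0 * np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1000.0 * t)
freqs = sorted(f for f, a in find_peaks(x, fs))
```

Tracking then links peaks across successive frames to form the sparse, peak-adaptive ``filter bank'' described above.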

### Inverse FFT Synthesis

As mentioned in the introduction to additive synthesis above (§H.8), early additive synthesis systems used an explicit sum of sinusoidal oscillators [157,175,219,255]. For large numbers of sinusoidal components, it is more efficient to use the inverse FFT [225,135,134,132]. See §H.8.1 for further discussion.
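The core idea can be sketched as follows (a simplified illustration, not any particular published system): write each partial's complex amplitude into a spectral buffer and invert. Here frequencies are constrained to bin centers for clarity; practical inverse-FFT synthesizers instead write a window-transform mainlobe around each peak so that arbitrary frequencies are supported, and overlap-add successive frames.

```python
import numpy as np

def ifft_synth_frame(N, partials):
    """Synthesize one length-N frame from (bin, amplitude, phase) triples
    by writing complex amplitudes into a spectral buffer and inverting.

    With X[k] = (N/2) * A * exp(j*phase), irfft yields
    A * cos(2*pi*k*n/N + phase); the conjugate half is implied by irfft.
    """
    X = np.zeros(N // 2 + 1, dtype=complex)
    for k, amp, phase in partials:
        X[k] = 0.5 * N * amp * np.exp(1j * phase)
    return np.fft.irfft(X, N)

# One frame containing two exact-bin sinusoids.
N = 512
frame = ifft_synth_frame(N, [(10, 1.0, 0.0), (25, 0.5, np.pi / 4)])

# The equivalent explicit oscillator sum, for comparison.
n = np.arange(N)
direct = (1.0 * np.cos(2 * np.pi * 10 * n / N)
          + 0.5 * np.cos(2 * np.pi * 25 * n / N + np.pi / 4))
```

One inverse FFT replaces the entire oscillator bank for the frame, which is the source of the efficiency gain for large numbers of partials.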

### Sines+Noise Synthesis

In the late 1980s, Serra and Smith combined sinusoidal modeling with
noise modeling to enable more efficient synthesis of the noise-like
components of sounds [228,231,232]. In
this extension, the output of the sinusoidal model is subtracted from
the original signal, leaving a residual signal. Assuming that the
residual is a random signal, it is modeled as *filtered white
noise*, where the magnitude envelope of its short-time spectrum
becomes the filter characteristic through which white noise is passed
during resynthesis.
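A minimal sketch of the resynthesis side, under the assumptions just stated (the residual is noise-like, and a coarse piecewise-linear magnitude envelope suffices as the filter; the function name and envelope resolution are illustrative):

```python
import numpy as np

def noise_frame_from_residual(residual, n_env=16, rng=None):
    """Model one residual frame as filtered white noise.

    The short-time magnitude spectrum is reduced to a coarse envelope
    (n_env knots), which then shapes a random-phase spectrum on
    resynthesis -- i.e., white noise passed through a filter whose
    magnitude response is the envelope.
    """
    rng = rng or np.random.default_rng(0)
    N = len(residual)
    mag = np.abs(np.fft.rfft(residual * np.hanning(N)))
    # Coarse piecewise-linear magnitude envelope (the "filter").
    bins = np.arange(len(mag))
    knots = np.linspace(0, len(mag) - 1, n_env)
    env = np.interp(bins, knots, np.interp(knots, bins, mag))
    # Random phases turn the envelope into spectrally shaped noise.
    phases = rng.uniform(0, 2 * np.pi, len(mag))
    return np.fft.irfft(env * np.exp(1j * phases), N)

rng = np.random.default_rng(1)
residual = rng.standard_normal(1024)
out = noise_frame_from_residual(residual)
```

Because only the envelope is stored per frame, the noise component is represented far more compactly than the residual waveform itself.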

### Multiresolution Sinusoidal Modeling

Prior to the late 1990s, both vocoders and sinusoidal models were
focused on modeling single-pitched, monophonic sound sources, such as
a single saxophone note. Scott Levine showed that by going
to *multiresolution sinusoidal modeling*
(see §9.9), it becomes possible to encode general
polyphonic sound sources with a single unified system
[140,138,139].
``Multiresolution'' refers to the use of a non-uniform filter bank,
such as a wavelet or ``constant Q'' filter bank, in the underlying
spectrum analysis. See Fig.9.16 for an example
time-frequency resolution grid.

### S+N+T Sound Examples

A collection of sound examples illustrating sines+noise+transients
modeling (and various subsets thereof) as well as some audio effects
made possible by such representations, can be found online at
`http://ccrma.stanford.edu/~jos/pdf/SMS.pdf`.

### Time-Frequency Reassignment

A relatively recent topic in sinusoidal modeling is
*time-frequency reassignment*, in which STFT phase information is
used to provide nonlinearly enhanced time-frequency resolution in STFT
displays [12,71,78]. The
basic idea is to refine the spectrogram (§6.2) by
assigning spectral energy in each bin to its ``center of gravity''
frequency instead of the bin center frequency. This has the effect of
significantly sharpening the appearance of spectrograms for certain
classes of signals, such as quasi-sinusoidal sums. Time reassignment
is analogous: spectral energy is moved to its center-of-gravity
*time*, computed from the local group delay, rather than being plotted
at the frame's center time.
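A simple way to see the frequency-reassignment idea (a sketch, not the Auger-Flandrin formulation with auxiliary windows): the per-bin phase increment between two DFTs taken one sample apart gives the instantaneous frequency to which that bin's energy should be reassigned, rather than the bin-center frequency.

```python
import numpy as np

fs, N = 8000, 1024
f0 = 1003.0                      # deliberately between bin centers
t = np.arange(N + 1) / fs
x = np.cos(2 * np.pi * f0 * t)
w = np.hanning(N)

# Two DFTs one sample apart; the per-bin phase increment (in radians
# per sample) converts to the instantaneous frequency for that bin.
X1 = np.fft.rfft(w * x[:N])
X2 = np.fft.rfft(w * x[1:N + 1])
inst_freq = np.angle(X2 * np.conj(X1)) * fs / (2 * np.pi)

k = int(np.argmax(np.abs(X1)))   # magnitude peak bin
bin_center = k * fs / N          # 1000.0 Hz here (bin spacing 7.8125 Hz)
reassigned = inst_freq[k]        # close to the true 1003 Hz
```

The reassigned frequency recovers the sinusoid's true frequency far more accurately than the bin center, which is what sharpens reassigned spectrograms of quasi-sinusoidal signals.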

### Perceptual Audio Compression

It often happens that the model which is most natural from a conceptual (and manipulative) point of view is also the most effective from a compression point of view. This is because, in the ``right'' signal model for a natural sound, the model's parameters tend to vary quite slowly compared with the audio rate. As an example, physical models of the human voice and musical instruments have led to expressive synthesis algorithms which can also represent high-quality sound at much lower bit rates (such as MIDI event rates) than normally obtained by encoding the sound directly [46,242,246,145].

The sines+noise+transients spectral model follows a natural perceptual decomposition of sound into three qualitatively different components: ``tones'', ``noises'', and ``attacks''. This compact representation for sound is useful for both musical manipulations and data compression. It has been used, for example, to create an audio compression format comparable in quality to MPEG-AAC [23,24,16] (at 32 kbits/s), yet it can be time-scaled or frequency-shifted without introducing objectionable artifacts [140].

### Time-Scale Modification of Sinusoidal Models

Sinusoidal models naturally support *time-scale modification*
(TSM) (and *frequency-scaling*, its Fourier dual), because the
original signal is replaced by oscillator amplitude and
frequency *envelopes* which are easily time-scaled without
causing unnatural artifacts. When amplitude or frequency envelopes
are rescaled in time, the oscillators are allowed to run continuously
under them, thereby avoiding artifacts. Similarly, in
sines-plus-noise synthesis, the time-varying noise-filter is a
time-frequency envelope that can be smoothly rescaled along either
dimension.
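The envelope-domain view of TSM can be sketched for a single partial as follows (an illustrative function, not from the original text): the amplitude and frequency envelopes are interpolated onto a stretched time grid, and the oscillator runs continuously under them, accumulating phase from the unscaled frequencies so that pitch is preserved.

```python
import numpy as np

def stretch_partial(amp_env, freq_env, frame_rate, fs, factor):
    """Time-stretch one sinusoidal partial without changing its pitch.

    The frame-rate amplitude and frequency envelopes are interpolated
    onto a time grid `factor` times longer; phase is accumulated from
    the (unscaled) frequency envelope, so the oscillator runs
    continuously and no waveform-level artifacts are introduced.
    """
    n_frames = len(amp_env)
    dur = n_frames / frame_rate
    n_out = int(round(dur * factor * fs))
    t_env = np.arange(n_frames) / frame_rate
    t_out = np.arange(n_out) / fs / factor   # map output time back
    amp = np.interp(t_out, t_env, amp_env)
    freq = np.interp(t_out, t_env, freq_env)  # frequencies unscaled
    phase = 2 * np.pi * np.cumsum(freq) / fs  # continuous phase
    return amp * np.cos(phase)

# A constant 440 Hz partial stretched to twice its length keeps its pitch.
fs, frame_rate = 8000, 100
amp_env = np.ones(50)              # 0.5 s of envelope frames
freq_env = np.full(50, 440.0)
y = stretch_partial(amp_env, freq_env, frame_rate, fs, 2.0)
```

The output is twice as long, yet its spectral peak remains at 440 Hz, illustrating why envelope-domain TSM avoids the pitch shifts of naive resampling.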

When time-stretching, say, a sines+noise model, transients are ``smeared out'' over time. In sines+noise+transients models, transients may be time-shifted instead. However, while this sounds very good when it works, it can be difficult to detect and preserve each and every transient in the signal. In fact, it can become difficult to define exactly what is meant by the term ``transient''. For example, an abrupt cello note onset is naturally defined as a starting transient for the note. If the note onset is then played in a more and more legato style, when is there no longer a starting transient? One idea is to define ``transientness'' as a variable between 0 and 1, so that the degree of TSM ``stretchiness'' for that time interval can be multiplied by (1-transientness).
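The transientness weighting just described can be written directly, if one interprets ``stretchiness'' as the amount of stretch beyond unity (that interpretation, and the function name, are assumptions for illustration):

```python
def local_stretch(global_factor, transientness):
    """Blend the TSM stretch factor toward 1 (no stretching) as
    transientness approaches 1; transientness is in [0, 1]."""
    return 1.0 + (global_factor - 1.0) * (1.0 - transientness)
```

With `transientness = 1` a transient interval is left unstretched (and merely time-shifted), while `transientness = 0` applies the full global stretch factor.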

Audio demonstrations of TSM and frequency-scaling based on the
sines+noise+transients model of Scott Levine may be found online at
`http://ccrma.stanford.edu/~jos/pdf/SMS.pdf`.

See §9.9 for additional details.

### Further Reading

A 74-page summary of sinusoidal modeling of sound, including sines+noise modeling is given in [211]. An update on the activities in Xavier Serra's lab in Barcelona is given in a dedicated chapter of the DAFX Book [280]. Additional references related to sinusoidal modeling include [163,80,116,160,155,161,162,81,274,57,224,137,30].
