Spectrum Analysis of Sinusoids

Sinusoidal components are fundamental building blocks of sound. Any sound that can be described as a ``tone'' is naturally and efficiently modeled as a sum of windowed sinusoids over short, ``stationary'' time segments (e.g., on the order of 20 ms or more in the case of voice). Over longer time segments, tonal sounds are modeled efficiently by modulated sinusoids, where the amplitude and frequency modulations are relatively slow. This is the model used in additive synthesis (discussed further below). Of course, thanks to Fourier's theorem, every sound can be expressed as a sum of sinusoids having fixed amplitude and frequency, but this is a highly inefficient model for non-tonal and changing sounds. Perhaps more fundamentally from an audio modeling point of view, the ear is quite sensitive to peaks in the short-time spectrum of a sound, and a spectral peak is naturally modeled as a sinusoidal component which has been shaped by some kind of ``window function'' or ``amplitude envelope'' in the time domain.

Because spectral peaks are so relatively dominant in hearing, sinusoidal models are ``unreasonably effective'' in capturing the tonal aspects of sound in a compact, easy-to-manipulate form. Computing a sinusoidal model entails fitting the parameters of a sinusoid (amplitude, frequency, and sometimes phase) to each peak in the spectrum of each time-segment. In typical sinusoidal modeling systems, the sinusoidal parameters are linearly interpolated from one time segment to the next, and this usually provides a perceptually smooth variation over time. (Higher order interpolation has also been used.) Modeling sound as a superposition of modulated sinusoids in this way is generally called additive synthesis [233].

Additive synthesis is not the only sound modeling method that requires sinusoidal parameter estimation for its calibration to desired signals. For the same reason that additive synthesis is so effective, we routinely calibrate any model for sound production by matching the short-time Fourier transform, and in this matching process, spectral peaks are heavily weighted (especially at low frequencies in the audio range). Furthermore, when the model parameters are few, as in the physical model of a known musical instrument, the model parameters can be determined entirely by the amplitudes and frequencies of the sinusoidal peaks. In such cases, sinusoidal parameter estimation suffices to calibrate non-sinusoidal models.

Pitch detection is another application in which spectral peaks are ``explained'' as harmonics of some estimated fundamental frequency. The harmonic assumption is an example of a signal modeling constraint. Model constraints provide powerful means for imposing prior knowledge about the source of the sound being modeled.

Another application of sinusoidal modeling is source separation. In this case, spectral peaks are measured and tracked over time, and the ones that ``move together'' are grouped together as separate sound sources. By analyzing the different groups separately, polyphonic pitch detection and even automatic transcription can be addressed.

A less ambitious application related to source separation may be called ``selected source modification.'' In this technique, spectral peaks are grouped, as in source separation, but instead of actually separating them, they are merely processed differently. For example, all the peaks associated with a particular voice can be given a gain boost. This technique can be very effective for modifying one track in a mix--e.g., making the vocals louder or softer relative to the background music.

For purely tonal sounds, such as freely vibrating strings or the human voice (in between consonants), forming a sinusoidal model gives the nice side effect of noise reduction. For example, almost all low-level ``hiss'' in a magnetic tape recording is left behind by a sinusoidal model. The lack of noise between spectral peaks in a sound is another example of a model constraint. It is a strong suppressor of noise since the noise is entirely eliminated in between spectral peaks. Thus, sinusoidal models can be used for signal restoration.

Spectrum of a Sinusoid

A sinusoid is any signal of the form

$\displaystyle x(t) = A\cos(\omega_0 t + \phi), \quad t\in{\bf R}$ (6.1)

where $ A$ is the amplitude (in arbitrary units), $ \phi\in[-\pi,\pi)$ is the phase in radians, and $ \omega_0$ is the frequency in radians per second. Time $ t$ is a real number that varies continuously from minus infinity to infinity in the ideal sinusoid. All three parameters $ (A,\omega_0,\phi)$ are real numbers. In addition to radian frequency $ \omega$ , it is useful to define $ \omega =
2\pi f$ , where $ f$ is the frequency in Hertz (Hz).6.1

By Euler's identity, $ e^{j\theta} = \cos(\theta) +
j\sin(\theta)$ , we can write

\begin{eqnarray*}
x(t) &=& A \frac{e^{j(\omega_0 t + \phi)} + e^{-j(\omega_0 t + \phi)}}{2}\\
&=& \left(\frac{A}{2}e^{j \phi}\right) e^{j\omega_0 t}
+ \left(\frac{A}{2}e^{-j \phi}\right) e^{-j\omega_0 t}\\
&\isdef & a e^{j\omega_0 t} + \overline{a} e^{-j\omega_0 t}
\end{eqnarray*}

where $ \overline{a}$ denotes the complex conjugate of $ a$ . Thus, we can build a real sinusoid $ x(t)$ as a linear combination of positive- and negative-frequency complex sinusoidal components:

$\displaystyle x(t) = a s_{\omega_0}(t) + \overline{a} s_{-\omega_0}(t) \protect$ (6.2)

where

$\displaystyle s_{\omega_0}(t) \isdef e^{j\omega_0 t} \isdef e^{j2\pi f_0 t}, \qquad a\isdef \frac{A}{2}e^{j\phi}.$ (6.3)

The spectrum of $ x(t)$ is given by its Fourier transform (see §2.2):

\begin{eqnarray*}
X(\omega) &\isdef & \int_{-\infty}^{\infty} x(t) e^{-j\omega t} dt\nonumber \\ [5pt]
&=& \int_{-\infty}^{\infty} \left[a s_{\omega_0}(t) + \overline{a} s_{-\omega_0}(t)
\right] e^{-j\omega t} dt.
\end{eqnarray*}

In this case, $ x(t)$ is given by (5.2) and we have

$\displaystyle X(\omega) = a S_{\omega_0}(\omega) + \overline{a} S_{-\omega_0}(\omega). \protect$ (6.4)

We see that, since the Fourier transform is a linear operator, we need only work with the unit-amplitude, zero-phase, positive-frequency sinusoid $ s_{\omega_0}(t)\isdef e^{j\omega_0t}$ . For $ \omega_0>0$ , $ as_{\omega_0}(t)$ may be called the analytic signal corresponding to $ x(t)$ .6.2

It remains to find the Fourier transform of $ s_{\omega_0}(t)$ :

\begin{eqnarray*}
S_{\omega_0}(\omega)
&=& \int_{-\infty}^{\infty} s_{\omega_0}(t) e^{-j\omega t} dt\\ [5pt]
&\isdef & \int_{-\infty}^{\infty} e^{j\omega_0 t} e^{-j\omega t} dt\\ [5pt]
&=& \int_{-\infty}^{\infty} e^{j(\omega_0-\omega) t} dt\\ [5pt]
&=& \left.\frac{1}{j(\omega_0-\omega)} e^{j(\omega_0-\omega) t}
\right\vert _{-\infty}^\infty\\ [5pt]
&=& \lim_{\Delta\to\infty} 2\frac{\sin[(\omega_0-\omega)\Delta]}{\omega_0-\omega}\\ [5pt]
&=& 2\pi\delta(\omega_0-\omega) = \delta(f_0-f),
\end{eqnarray*}

where $ \delta(\omega)$ is the delta function or impulse at frequency $ \omega_0$ (see Fig.5.4 for a plot, and §B.10 for a mathematical introduction). Since the delta function is even ( $ \delta(-\omega) = \delta(\omega)$ ), we can also write $ S_{\omega_0}(\omega) = 2\pi\delta(\omega-\omega_0) =
\delta(f-f_0)$ . It is shown in §B.13 that the sinc limit above approaches a delta function $ \delta(f_0-f)$ . However, we will only use the Discrete Fourier Transform (DFT) in any practical applications, and in that case, the result is easy to show [264].

The inverse Fourier transform is easy to evaluate by the sifting property6.3of delta functions:

$\displaystyle s_{\omega_0}(t) = \frac{1}{2\pi}\int_{-\infty}^\infty S_{\omega_0}(\omega) e^{j\omega t} d\omega = \int_{-\infty}^\infty \delta(\omega-\omega_0) e^{j\omega t} d\omega = e^{j\omega_0 t}$ (6.6)

Substituting into (5.4), the spectrum of our original sinusoid $ x(t)$ is given by

$\displaystyle X(\omega) = 2\pi\left[a \delta(\omega-\omega_0) + \overline{a}\delta(\omega+\omega_0)\right]$ (6.7)

which is a pair of impulses, one at frequency $ \omega=\omega_0$ having complex amplitude $ 2\pi a = A \pi e^{j\phi}$ , summed with another at frequency $ \omega=-\omega_0$ with complex amplitude $ 2\pi\overline{a} = A\pi
e^{-j\phi}$ .


Spectrum of Sampled Complex Sinusoid

In the discrete-time case, we replace $ t$ by $ nT$ where $ n$ ranges over the integers and $ T$ is the sampling period in seconds. Thus, for the positive-frequency component of the sinusoid of the previous section, we obtain

$\displaystyle s_{\omega_0}(n) \isdef e^{j\omega_0 n T}.$ (6.8)

It is common notational practice in signal processing to use normalized radian frequency

$\displaystyle {\tilde \omega}\isdef \omega T \;\in[-\pi,\pi).$ (6.9)

Thus, our sampled complex sinusoid becomes

$\displaystyle s_{\omega_0}(n) \isdef e^{j{\tilde \omega}_0 n}.$ (6.10)

It is not difficult to convert between normalized and unnormalized frequency. The use of a tilde (` $ \tilde{\null}$ ') will explicitly indicate normalization, but it may be left off as well, so that $ \omega$ may denote either normalized or unnormalized frequency.6.4

The spectrum of infinitely long discrete-time signals is given by the Discrete Time Fourier Transform (DTFT) (discussed in §2.1):

$\displaystyle S_{\omega_0}(\omega) \isdef \sum_{n=-\infty}^{\infty} s_{\omega_0}(n) e^{-j\omega n} = 2\pi\delta(\omega-\omega_0) = \delta(f-f_0)$ (6.11)

where now $ \delta(\omega)$ is an impulse defined for $ \omega\in[-\pi,\pi)$ or $ f\in\left[-\frac{1}{2},\frac{1}{2}\right)$ , and $ \omega$ denotes normalized radian frequency. (Treatments of the DTFT invariably use normalized frequency.)


Spectrum of a Windowed Sinusoid

Ideal sinusoids are infinite in duration. In practice, however, we must work with finite-length signals. Therefore, only a finite segment of a sinusoid can be processed at any one time, as if we were looking at the sinusoid through a ``window'' in time. For audio signals, the ear also processes sinusoids in short time windows (much less than 1 second long); thus, audio spectrum analysis is generally carried out using analysis windows comparable to the time-window inherent in hearing. Finally, nothing in nature produces true sinusoids. However, natural oscillations can often be modeled as sinusoidal over some finite time. Thus, it is useful to consider short-time spectrum analysis, in which the time-domain signal is analyzed in short time segments (windows). The short-time spectrum then evolves at some rate over time. The shorter our analysis window, the faster the short-time spectrum can change. Very long analysis windows, on the other hand, presuppose a fixed spectral content over the duration of the window.

The easiest windowing operation is simply truncating the sinusoid on the left and right at some finite times. This can be modeled mathematically as a multiplication of the sinusoid by the so-called rectangular window:

$\displaystyle w_R(n) \isdef \left\{\begin{array}{ll} 1, & \left\vert n\right\vert\leq\frac{M-1}{2} \\ [5pt] 0, & \hbox{otherwise} \\ \end{array} \right.$ (6.12)

where $ M$ is the length of the window (assumed odd for simplicity). The windowed sinusoid is then

$\displaystyle s_R(n) \isdef s_{\omega_0}(n)w_R(n)$ (6.13)

Figure 5.1: A typical analysis window used in short-time spectrum analysis.
\includegraphics[width=4in]{eps/generalWindow}

However, as we will see, it is often advantageous to taper the window more gracefully to zero on the left and right. Figure 5.1 illustrates the general shape of a more typical window function. Note that it is nonnegative and symmetric about time 0. Such functions are loosely called ``zero-phase'' since their Fourier transforms are real, as shown in §2.3. A more precise adjective is zero-centered for such windows.

In some cases, such as when analyzing real time systems, it is appropriate to require our analysis windows be causal. A window $ w(n)$ is said to be causal if it is zero for all $ n<0$ . Figure 5.2 depicts the causal version of the window shown in Fig.5.1. A length $ M$ zero-centered window can be made causal by shifting (delaying) it in time by half its length $ (M-1)/2$ . The shift theorem2.3.4) tells us this introduces a linear phase term in the frequency domain. That is, the window Fourier transform has gone from being real to being multiplied by $ \exp[-j\omega(M-1)/2]$ .

Figure: Causal version of the zero-centered window in Fig.5.1.
\includegraphics[width=3.5in]{eps/causalWindow}


Effect of Windowing

Let's look at a simple example of windowing to demonstrate what happens when we turn an infinite-duration signal into a finite-duration signal through windowing.

We begin with a sampled complex sinusoid:

$\displaystyle s_{\omega_0}(n) = e^{j \omega_0 n T }, \quad n \in {\bf Z}$ (6.14)

A portion of the real part, $ \cos(\omega_0 nT)$ , is plotted in Fig.5.3. The imaginary part, $ \sin(\omega_0 nT)$ , is of course identical but for a 90-degree phase-shift to the right.

Figure: A portion of the real part of the sinusoid $ s_{\omega _0}(n)$ .
\includegraphics[width=0.6\twidth]{eps/infDurSin}

The Fourier transform of this infinite-duration signal is a delta function at $ \omega=\omega_0$ . I.e., $ S_{\omega_0}(\omega) = 2\pi\delta(\omega-\omega_0) =
\delta(f-f_0)$ , as indicated in Fig.5.4.

Figure: Spectrum (DTFT) of an infinite-duration sinusoid at frequency $ f_0$ Hz.
\includegraphics{eps/infDurSinSpec}

The windowed signal is

$\displaystyle s_R(n) = w(n)e^{j \omega_0 n T}, \quad n \in {\bf Z}$ (6.15)

as shown in Fig.5.5. (Note carefully the difference between $ w$ and $ \omega$ .)

Figure 5.5: Windowed sinusoid real part.
\includegraphics[width=4in,height=2in]{eps/windowedSin}

The convolution theorem2.3.5) tells us that our multiplication in the time domain results in a convolution in the frequency domain. Hence, we will obtain the convolution of $ \delta(\omega-\omega_0)$ with the Fourier transform of the window $ W(\omega)$ . This is easy since the delta function is the identity element under convolution ( $ \delta \ast W = W$ ). However, since our delta function is at frequency $ \omega=\omega_0$ , the convolution shifts the window transform out to that frequency:

$\displaystyle S_R(\omega) = W(\omega)\ast 2\pi\delta(\omega - \omega_0) = 2\pi W(\omega-\omega_0)$ (6.16)

This is shown in Fig.5.6.

Figure: Fourier transform of the windowed sinusoid in Fig.5.5: Top: Real Fourier transform amplitude. Bottom: Fourier transform magnitude in decibels (dB).
\includegraphics[width=\twidth]{eps/windowedSinSpec}

From comparing Fig.5.6 with the ideal sinusoidal spectrum in Fig.5.4 (an impulse at frequency $ \omega_0$ ), we can make some observations:

  • Windowing in the time domain resulted in a ``smearing'' or ``smoothing'' in the frequency domain. In particular, the infinitely thin delta function has been replaced by the ``main lobe'' of the window transform. We need to be aware of this if we are trying to resolve sinusoids which are close together in frequency.

  • Windowing also introduced side lobes. This is important when we are trying to resolve low amplitude sinusoids in the presence of higher amplitude signals.

  • A sinusoid at amplitude $ A$ , frequency $ \omega_0$ , and phase $ \phi$ manifests (in practical spectrum analysis) as a window transform shifted out to frequency $ \omega_0$ , and scaled by $ A e^{j\phi}$ .

As a result of the last point above, the ideal window transform is an impulse in the frequency domain. Since this cannot be achieved in practice, we try to find spectrum-analysis windows which approximate this ideal in some optimal sense. In particular, we want side-lobes that are as close to zero as possible, and we want the main lobe to be as tall and narrow as possible. (Since absolute scalings are normally arbitrary in signal processing, ``tall'' can be defined as the ratio of main-lobe amplitude to side-lobe amplitude--or main-lobe energy to side-lobe energy, etc.) There are many alternative formulations for ``approximating an impulse'', and each such formulation leads to a particular spectrum-analysis window which is optimal in that sense. In addition to these windows, there are many more which arise in other applications. Many commonly used window types are summarized in Chapter 3.

Frequency Resolution

The frequency resolution of a spectrum-analysis window is determined by its main-lobe width (Chapter 3) in the frequency domain, where a typical main lobe is illustrated in Fig.5.6 (top). For maximum frequency resolution, we desire the narrowest possible main-lobe width, which calls for the rectangular window3.1), the transform of which is shown in Fig.3.3. When we cannot be fooled by the large side-lobes of the rectangular window transform (e.g., when the sinusoids under analysis are known to be well separated in frequency), the rectangular window truly is the optimal window for the estimation of frequency, amplitude, and phase of a sinusoid in the presence of stationary noise [230,120,121].

The rectangular window has only one parameter (aside from amplitude)--its length. The next section looks at the effect of an increased window length on our ability to resolve two sinusoids.

Two Cosines (``In-Phase'' Case)

Figure 5.7 shows a spectrum analysis of two cosines

$\displaystyle x(n) = \cos(\omega_1 n) + \cos(\omega_2 n), \quad n=0,1,\ldots,M-1,$ (6.17)

where $ \omega_1 = \pi/2$ and $ \omega_2 = \omega_1 + \Delta\omega$ , and the frequency separation $ \Delta \omega = \omega_2-\omega_1$ is $ 2\pi/40$ radians per sample. The zero-padded Fourier analysis uses rectangular windows of lengths $ M=20$ , $ 30$ , $ 40$ , and $ 80$ ( $ \Delta\omega =
\frac{1}{2}\Omega_M,
\frac{3}{4}\Omega_M, \Omega_M, 2\Omega_M$ , where $ \Omega_M\isdef 2\pi/M$ ). The length $ N=1024$ FFT output is divided by $ M$ so that the ideal height of each spectral peak is $ \max_{\omega_k}\{\vert X(\omega_k)\vert\}=1/2$ .

Figure: DTFT of two closely spaced in-phase sinusoids, various rectangular-window lengths $ M$ .
\includegraphics[width=\twidth]{eps/resolvedSines}

The longest window ($ M=80$ ) resolves the sinusoids very well, while the shortest case ($ M=20$ ) does not resolve them at all (only one ``lump'' appears in the spectrum analysis). In difference-frequency cycles, the analysis windows are two cycles and half a cycle in these cases, respectively. It can be debated whether or not the other two cases are resolved, and we will return to them shortly.


One Sine and One Cosine ``Phase Quadrature'' Case

Figure 5.8 shows a similar spectrum analysis of two sinusoids

$\displaystyle x(n) = \sin(\omega_1 n) + \cos(\omega_2 n), \quad n=0,1,\ldots,M-1,$ (6.18)

using the same frequency separation and window lengths. However, now the sinusoids are 90 degrees out of phase (one sine and one cosine). Curiously, the top-left case ( $ M=20=\hbox{1/2 difference-frequency
cycle}$ ) now appears to be resolved! However, closer inspection (see Fig.5.9) reveals that the ``resolved'' spectral peaks are significantly far away from the sinusoidal frequencies. Another curious observation is that the lower-left case ( $ M=40=\hbox{1
difference-frequency cycle}$ ) appears worse off than it did in Fig.5.7, and worse than the shorter-window analysis at the top right of Fig.5.8. Only the well resolved case at the lower right (spanning two full cycles of the difference frequency) appears unaffected by the relative phase of the two sinusoids under analysis.

Figure: DTFT of two closely spaced sinusoids in phase quadrature, various window lengths $ M$ .
\includegraphics[width=\twidth]{eps/resolvedSinesB}

Figure 5.9 shows the same plots as in Fig.5.8, but overlaid. From this we can see that the peak locations are biased in under-resolved cases, both in amplitude and frequency.

Figure: Overlay of the plots in Fig.5.8.
\includegraphics[width=\textwidth ]{eps/resolvedSinesC2C}

The preceding figures suggest that, for a rectangular window of length $ M$ , two sinusoids are well resolved when they are separated in frequency by

$\displaystyle \zbox {\Delta\omega\geq 2\Omega_M} \qquad \left(\Omega_M \isdef \frac{2\pi}{M}\right),$ (6.19)

where the frequency-separation $ \Delta \omega = \omega_2-\omega_1$ is in radians per sample. In cycles per sample, the inequality becomes

$\displaystyle \zbox {\Delta {\tilde f}\geq \frac{2}{M}},$ (6.20)

where the $ {\tilde f}\isdef f/f_s = fT$ denotes normalized frequency in cycles per sample. In Hz, we have

$\displaystyle \Delta f\geq 2\frac{f_s}{M}.$ (6.21)

or

$\displaystyle \zbox {M \geq 2\frac{f_s}{\Delta f}.}$ (6.22)

Note that $ f_s/f$ is the number of samples in one period of a sinusoid at frequency $ f$ Hz, sampled at $ f_s$ Hz. Therefore, we have derived a rule of thumb for frequency resolution that requires at least two full cycles of the difference-frequency under the rectangular window.

A more detailed study [1] reveals that $ 1.44$ cycles of the difference-frequency is sufficient to enable fully accurate peak-frequency measurement under the rectangular window by means of finding FFT peaks. In §5.5.2 below, additional minimum duration specifications for resolving closely spaced sinusoids are given for other window types as well.

In principle, we can resolve arbitrarily small frequency separations, provided

  • there is no noise, and
  • we are sure we are looking at the sum of two ideal sinusoids under the window.
One method for doing this is described in §5.7.2. However, in practice, there is almost always some noise and/or interference from other signals, so we normally prefer to require sinusoidal frequency separation by on the order of one main-lobe width or more.

The rectangular window provides an abrupt transition at its edge. While it remains the optimal window for sinusoidal peak estimation, it is by no means optimal in all spectrum analysis and/or signal processing applications involving spectral processing. As discussed in Chapter 3, windows with a more gradual transition to zero have lower side-lobe levels, and this is beneficial for spectral displays and various signal processing applications based on FFT methods. We will encounter such applications in later chapters.


Resolving Sinusoids

We saw in §5.4.1 that our ability to resolve two closely spaced sinusoids is determined by the main-lobe width of the window transform we are using. We will now study this relationship in more detail.

For starters, let's define main-lobe bandwidth very simply (and somewhat crudely) as the distance between the first zero-crossings on either side of the main lobe, as shown in Fig.5.10 for a rectangular-window transform. Let $ B_w$ denote this width in Hz. In normalized radian frequency units, as used in the frequency axis of Fig.5.10, $ B_w$ Hz translates to $ 2\pi B_w / f_s$ radians per sample, where $ f_s$ denotes the sampling rate in Hz.

Figure 5.10: Window transform with main-lobe width marked.
\includegraphics[width=\twidth]{eps/rectWinMLM}

For the length-$ M$ unit-amplitude rectangular window defined in §3.1, the DTFT is given analytically by

$\displaystyle W_R(\omega) = M\cdot\hbox{asinc}_M(\omega) \isdef \frac{ \sin \left( M\omega T/2 \right)}{\sin(\omega T/2)} = \frac{ \sin \left( M\pi f T \right)}{\sin(\pi f T)}, \protect$ (6.23)

where $ f$ is frequency in Hz, and $ T$ is the sampling interval in seconds ($ T=1/f_s$ ). The main lobe of the rectangular-window transform is thus ``two side lobes wide,'' or

$\displaystyle \zbox {B_w = 2\frac{f_s}{M}}\quad \hbox{(\hbox{Hz})}$ (6.24)

as can be seen in Fig.5.10.

Recall from §3.1.1 that the side-lobe width in a rectangular-window transform ($ f_s/M$ Hz) is given in radians per sample by

$\displaystyle \zbox {\Omega_M \isdef \frac{2\pi}{M}.}$ (6.25)

As Fig.5.10 illustrates, the rectangular-window transform main-lobe width is $ 2\Omega_M$ radians per sample (two side-lobe widths). Table 5.1 lists the main-lobe widths for a variety of window types (which are defined and discussed further in Chapter 3).


Table 5.1: Main-lobe bandwidth for various windows.
Window Type Main-Lobe Width $ K\Omega_M$ (rad/sample)
Rectangular $ 2\Omega_M$
Hamming $ 4\Omega_M$
Hann $ 4\Omega_M$
Generalized Hamming $ 4\Omega_M$
Blackman $ 6\Omega_M$
$ L$ -term Blackman-Harris $ 2L\Omega_M$
Kaiser depends on $ \beta $
Chebyshev depends on ripple spec


Other Definitions of Main Lobe Width

Our simple definition of main-lobe band-width (distance between zero-crossings) works pretty well for windows in the Blackman-Harris family, which includes the first six entries in Table 5.1. (See §3.3 for more about the Blackman-Harris window family.) However, some windows have smooth transforms, such as the Hann-Poisson (Fig.3.21), or infinite-duration Gaussian window (§3.11). In particular, the true Gaussian window has a Gaussian Fourier transform, and therefore no zero crossings at all in either the time or frequency domains. In such cases, the main-lobe width is often defined using the second central moment.6.5

A practical engineering definition of main-lobe width is the minimum distance about the center such that the window-transform magnitude does not exceed the specified side-lobe level anywhere outside this interval. Such a definition always gives a smaller main-lobe width than does a definition based on zero crossings.

Figure 5.11: Lowpass filter design parameters.
\includegraphics[width=3in]{eps/windowParameters}

In filter-design terminology, regarding the window as an FIR filter and its transform as a lowpass-filter frequency response [263], as depicted in Fig.5.11, we can say that the side lobes are everything in the stop band, while the main lobe is everything in the pass band plus the transition band of the frequency response. The pass band may be defined as some small interval about the midpoint of the main lobe. The wider the interval chosen, the larger the ``ripple'' in the pass band. The pass band can even be regarded as having zero width, in which case the main lobe consists entirely of transition band. This formulation is quite useful when designing customized windows by means of FIR filter design software, such as in Matlab or Octave (see §4.5.1, §4.10, and §3.13).


Simple Sufficient Condition for Peak Resolution

Recall from §5.4 that the frequency-domain image of a sinusoid ``through a window'' is the window transform scaled by the sinusoid's amplitude and shifted so that the main lobe is centered about the sinusoid's frequency. A spectrum analysis of two sinusoids summed together is therefore, by linearity of the Fourier transform, the sum of two overlapping window transforms, as shown in Fig.5.12 for the rectangular window. A simple sufficient requirement for resolving two sinusoidal peaks spaced $ \Delta f$ Hz apart is to choose a window length long enough so that the main lobes are clearly separated when the sinusoidal frequencies are separated by $ \Delta f$ Hz. For example, we may require that the main lobes of any Blackman-Harris window meet at the first zero crossings in the worst case (narrowest frequency separation); this is shown in Fig.5.12 for the rectangular-window.

Figure: Two length-$ M$ -rectangular-window transforms displaced by $ 2\pi \Delta f T=2\Omega _M=4\pi /M$ rad/sample.
\includegraphics[width=\twidth]{eps/sinesAnn}

To obtain the separation shown in Fig.5.12, we must have $ B_w
\leq \Delta f$ Hz, where $ B_w$ is the main-lobe width in Hz, and $ \Delta f$ is the minimum sinusoidal frequency separation in Hz.

For members of the $ L$ -term Blackman-Harris window family, $ B_w$ can be expressed as $ B_w = 2L f_s/M$ , as indicated by Table 5.1. In normalized radian frequency units, i.e., radians per sample, we have $ 2\pi B_w T = 2L\Omega_M\isdeftext
K\Omega_M$ . For comparison, Table 5.2 lists minimum effective values of $ K$ for each window (denoted $ K^\ast$ ) given by an empirically verified sharper lower bound on the value needed for accurate peak-frequency measurement [1], as discussed further in §5.5.4 below.


Table: Main-lobe width-in-bins $ K$ for various windows.
Window Type $ K$ $ K^\ast$
Rectangular $ 2$ $ 1.44$
Hamming $ 4$ $ 2.22$
Hann $ 4$ $ 2.36$
Generalized Hamming $ 4$ --
Blackman $ 6$ $ 2.02$
$ L$ -term Blackman-Harris $ 2L$  


We make the main-lobe width $ B_w$ smaller by increasing the window length $ M$ . Specifically, requiring $ B_w
\leq \Delta f$ Hz implies

$\displaystyle B_w = K \frac{f_s}{M} \leq \Delta f \quad \Rightarrow \quad M \ge K \frac{f_s}{\Delta f}$ (6.26)

or

$\displaystyle \zbox {M \ge K \frac{f_s}{\left\vert f_2-f_1\right\vert}.} \protect$ (6.27)

Thus, to resolve the frequencies $ f_1$ and $ f_2$ , the window length $ M$ must span at least $ K$ periods of the difference frequency $ f_2-f_1$ , measured in samples, where $ K$ is the effective width of the main lobe in side-lobe widths $ f_M\isdeftext f_s/M$ . Let $ D\isdeftext \lceil f_s/\vert f_2-f_1\vert\rceil$ denote the difference-frequency period in samples, rounded up to the nearest integer. Then an ``$ L$ -term'' Blackman-Harris window of length $ M\ge KD=2LD$ samples may be said to resolve the sinusoidal frequencies $ f_1$ and $ f_2$ . Using Table 5.2, the minimum resolving window length can be determined using the sharper bound as $ M\ge \lceil K^\ast \cdot D\rceil$ .


Periodic Signals

Many signals are periodic in nature, such as short segments of most tonal musical instruments and speech. The sinusoidal components in a periodic signal are constrained to be harmonic, that is, occurring at frequencies that are an integer multiple of the fundamental frequency $ f_1$ .6.6 Physically, any ``driven oscillator,'' such as bowed-string instruments, brasses, woodwinds, flutes, etc., is usually quite periodic in normal steady-state operation, and therefore generates harmonic overtones in steady state. Freely vibrating resonators, on the other hand, such as plucked strings, gongs, and ``tonal percussion'' instruments, are not generally periodic.6.7

Consider a periodic signal with fundamental frequency $ f_1$ Hz. Then the harmonic components occur at integer multiples of $ f_1$ , and so they are spaced in frequency by $ \Delta f = f_1$ . To resolve these harmonics in a spectrum analysis, we require, adapting (5.27),

$\displaystyle \zbox {M \geq K \frac{f_s}{f_1}}$ (6.28)

Note that $ f_s/f_1 \isdef P$ is the fundamental period of the signal in samples. Thus, another way of stating our simple, sufficient resolution requirement on window length $ M$ , for periodic signals with period $ P$ samples, is $ \zbox {M \geq KP}$ , where $ K$ is the main-lobe width in bins (when critically sampled) given in Table 5.2. Chapter 3 discusses other window types and their characteristics.

Specifically, resolving the harmonics of a periodic signal with period $ P$ samples is assured if we have at least

and so on, according to the simple, sufficient criterion of separating the main lobes out to their first zero-crossings. These different lengths can all be regarded as the same ``effective length'' (two periods) for each window type. Thus, for example, when the Blackman window is 6 periods long, its effective length is only 2 periods, as illustrated in Figures 5.13(a) through 5.13(c).

Figure 5.13: Three different window types applied to the same sinusoidal signal, where the window lengths were chosen to provide approximately the same ``resolving power'' for two sinusoids closely spaced in frequency. The nominal effective length in all cases is two sinusoidal periods.
\begin{figure*}\centering
\subfigure[Length $M=100$\ rectangular window]{
\epsfxsize =\twidth
\epsfysize =1.5in \epsfbox{eps/EffectiveLength1.eps}
}
\subfigure[Length $M=200$\ Hamming window]{
\epsfxsize =\twidth
\epsfysize =1.5in \epsfbox{eps/EffectiveLength2.eps}
}
\subfigure[Length $M=300$\ Blackman window]{
\epsfxsize =\twidth
\epsfysize =1.5in \epsfbox{eps/EffectiveLength3.eps}
}\end{figure*}


Tighter Bounds for Minimum Window Length

[This section, adapted from [1], is relatively advanced and may be skipped without loss of continuity.]

Figures 5.14(a) through 5.14(d) show four possible definitions of main-lobe separation that could be considered for purposes of resolving closely spaced sinusoidal peaks.

Figure 5.14: Four alternative main-lobe displacement rules for resolving closely spaced sinusoidal peaks. For peak-frequency measurements based on a few samples at the main-lobe center, (a) is suboptimal, (b) is nearly optimal [1], (c) is analogous to a filter-design specification based on side-lobe level, and (d) is overly conservative, but sufficient and simple to compute for Blackman-Harris windows.
\begin{figure*}\centering
\subfigure[Minimum orthogonal separation]{
\epsfxsize =4.0in
\epsfysize =1.0in \epsfbox{eps/SpecResCases1.eps}
}
\subfigure[Separation to first stationary-point \cite{AbeAndSmith04}]{
\epsfxsize =4.0in
\epsfysize =1.0in \epsfbox{eps/SpecResCases2.eps}
}
\subfigure[Separation to where main-lobe level equals side-lobe level]{
\epsfxsize =4.0in
\epsfysize =1.0in \epsfbox{eps/SpecResCases3.eps}
}
\subfigure[Full main-lobe separation]{
\epsfxsize =4.0in
\epsfysize =1.0in \epsfbox{eps/SpecResCases4.eps}
}\end{figure*}

In Fig.5.14(a), the main lobes sit atop each other's first zero crossing. We may call this the ``minimum orthogonal separation,'' so named because we know from Discrete Fourier Transform theory [264] that $ M$ -sample segments of sinusoids at this frequency-spacing are exactly orthogonal. ($ M$ is the rectangular-window length as before.) At this spacing, the peak of each main lobe is unchanged by the ``interfering'' window transform. However, the slope and higher derivatives at each peak are modified by the presence of the interfering window transform. In practice, we must work over a discrete frequency axis, and we do not, in general, sample exactly at each main-lobe peak. Instead, we usually determine an interpolated peak location based on samples near the true peak location. For example, quadratic interpolation, which is commonly used, requires at least three samples about each peak (as discussed in §5.7 below), and it is therefore sensitive to a nonzero slope at the peak. Thus, while minimum-orthogonal spacing is ideal in the limit as the sampling density along the frequency axis approaches infinity, it is not ideal in practice, even when we know the peak frequency-spacing exactly.6.8

Figure 5.14(b) shows the ``zero-error stationary point'' frequency spacing. In this case, the main-lobe peak of one $ \hbox{asinc}$ sits atop the first local minimum from the main-lobe of the other $ \hbox{asinc}$ . Since the derivative of both $ \hbox{asinc}$ functions is zero at both peak frequencies at this spacing, the peaks do not ``sit on a slope'' which would cause the peak locations to be biased away from the sinusoidal frequencies. We may say that peak-frequency estimates based on samples about the peak will be unbiased, to first order, at this spacing. This minimum spacing, which is easy to compute for Blackman-Harris windows, turns out to be very close to the optimal minimum spacing [1].

Figure 5.14(c) shows the minimum frequency spacing which naturally matches side-lobe level. That is, the main lobes are pulled apart until the main-lobe level equals the worst-case side-lobe level. This spacing is usually not easy to compute, and it is best matched with the Chebyshev window (see §3.10). Note that it is just a little wider than the stationary-point spacing discussed in the previous paragraph.

For ease of comparison, Fig.5.14(d) shows once again the simple, sufficient rule (''full main-lobe separation'') discussed in §5.5.2 above. While overly conservative, it is easily computed for many window types (any window with a known main-lobe width), and so it remains a useful rule-of-thumb for determining minimum window length given the minimum expected frequency spacing.

A table of minimum window lengths for the Kaiser window, as a function of frequency spacing, is given in §3.9.


Summary

We see that when measuring sinusoidal peaks, it is important to know the minimum frequency separation of the peaks, and to choose an FFT window which is long enough to resolve the peaks accurately. Generally speaking, the window must ``see'' at least 1.5 cycles of the minimum difference frequency. The rectangular window ``sees'' its full length. Other windows, which are all tapered in some way (Chapter 3), see an effective duration less than the window length in samples. Further details regarding theoretical and empirical estimates are given in [1].


Sinusoidal Peak Interpolation

In §2.5, we discussed ideal spectral interpolation (zero-padding in the time domain followed by an FFT). Since FFTs are efficient, this is an efficient interpolation method. However, in audio spectral modeling, there is usually a limit on the needed accuracy due to the limitations of audio perception. As a result, a ``perceptually ideal'' spectral interpolation method that is even more efficient is to zero-pad by some small factor (usually less than 5), followed by quadratic interpolation of the spectral magnitude. We call this the quadratically interpolated FFT (QIFFT) method [271,1]. The QIFFT method is usually more efficient than the equivalent degree of ideal interpolation, for a given level of perceptual error tolerance (specific guidelines are given in §5.6.2 below). The QIFFT method can be considered an approximate maximum likelihood method for spectral peak estimation, as we will see.

Quadratic Interpolation of Spectral Peaks

In quadratic interpolation of sinusoidal spectrum-analysis peaks, we replace the main lobe of our window transform by a quadratic polynomial, or ``parabola''. This is valid for any practical window transform in a sufficiently small neighborhood about the peak, because the higher order terms in a Taylor series expansion about the peak converge to zero as the peak is approached.

Note that, as mentioned in §D.1, the Gaussian window transform magnitude is precisely a parabola on a dB scale. As a result, quadratic spectral peak interpolation is exact under the Gaussian window. Of course, we must somehow remove the infinitely long tails of the Gaussian window in practice, but this does not cause much deviation from a parabola, as shown in Fig.3.36.

\begin{psfrags}
% latex2html id marker 16827\psfrag{a} []{ \Large$ \alpha $}\psfrag{b} []{ \Large$ \beta $}\psfrag{g} []{ \Large$ \gamma $\ }\begin{figure}[htbp]
\includegraphics[width=\textwidth ]{eps/parabola}
\caption{Illustration of
parabolic peak interpolation using the three samples nearest the peak.}
\end{figure}
\end{psfrags}

Referring to Fig.5.15, the general formula for a parabola may be written as

$\displaystyle y(x) \mathrel{\stackrel{\Delta}{=}}a(x-p)^2+b$ (6.29)

The center point $ p$ gives us our interpolated peak location (in bins), while the amplitude $ b$ equals the peak amplitude (typically in dB). The curvature $ 2a$ depends on the window used and contains no information about the sinusoid. (It may, however, indicate that the peak being interpolated is not a pure sinusoid.)

At the three samples nearest the peak, we have

\begin{eqnarray*}
y(-1) &=& \alpha \\
y(0) &=& \beta \\
y(1) &=& \gamma
\end{eqnarray*}

where we arbitrarily renumbered the bins about the peak $ -1$ , 0, and 1. Writing the three samples in terms of the interpolating parabola gives

\begin{eqnarray*}
\alpha &=& ap^2 + 2ap + a + b \\
\beta &=& ap^2 + b \\
\gamma &=& ap^2 - 2ap + a + b
\end{eqnarray*}

which implies

\begin{eqnarray*}
\alpha- \gamma &=& 4ap \\
\Rightarrow\quad p &=& \frac{\alpha-\gamma}{4a} \\
\Rightarrow\quad \alpha &=& ap^2 + \left(\frac{\alpha-\gamma}{2}\right)
+a+(\beta-ap^2) \\
\Rightarrow\quad a &=& \frac{1}{2}(\alpha - 2\beta + \gamma) \\
\end{eqnarray*}

Hence, the interpolated peak location is given in bins6.9 (spectral samples) by

$\displaystyle \zbox {p=\frac{1}{2}\frac{\alpha-\gamma}{\alpha-2\beta+\gamma}} \in [-1/2,1/2].$ (6.30)

If $ k^\ast$ denotes the bin number of the largest spectral sample at the peak, then $ k^\ast+p$ is the interpolated peak location in bins. The final interpolated frequency estimate is then $ (k^\ast+p)f_s/N$ Hz, where $ f_s$ denotes the sampling rate and $ N$ is the FFT size.

Using the interpolated peak location, the peak magnitude estimate is

$\displaystyle \zbox {y(p) = \beta - \frac{1}{4}(\alpha-\gamma)p.}$ (6.31)

Phase Interpolation at a Peak

Note that only the spectral magnitude is used to find $ p$ in the parabolic interpolation scheme of the previous section. In some applications, a phase interpolation is also desired.

In principle, phase interpolation is independent of magnitude interpolation, and any interpolation method can be used. There is usually no reason to expect a ``phase peak'' at a magnitude peak, so simple linear interpolation may be used to interpolate the unwrapped phase samples (given a sufficiently large zero-padding factor). Matlab has an unwrap function for unwrapping phase, and §F.4 provides an Octave-compatible version.

If we do expect a phase peak (such as when identifying chirps, as discussed in §10.6), then we may use quadratic interpolation separately on the (unwrapped) phase.

Alternatively, the real and imaginary parts can be interpolated separately to yield a complex peak value estimate.


Matlab for Parabolic Peak Interpolation

Section §F.2 lists Matlab/Octave code for finding quadratically interpolated peaks in the magnitude spectrum as discussed above. At the heart is the qint function, which contains the following:

function [p,y,a] = qint(ym1,y0,yp1)
%QINT - quadratic interpolation of three adjacent samples
%
% [p,y,a] = qint(ym1,y0,yp1)
%
% returns the extremum location p, height y, and half-curvature a
% of a parabolic fit through three points.
% Parabola is given by y(x) = a*(x-p)^2+b,
% where y(-1)=ym1, y(0)=y0, y(1)=yp1.

p = (yp1 - ym1)/(2*(2*y0 - yp1 - ym1));
y = y0 - 0.25*(ym1-yp1)*p;
a = 0.5*(ym1 - 2*y0 + yp1);


Bias of Parabolic Peak Interpolation

Since the true window transform is not a parabola (except for the conceptual case of a Gaussian window transform expressed in dB), there is generally some error in the interpolated peak due to this mismatch. Such a systematic error in an estimated quantity (due to modeling error, not noise), is often called a bias. Parabolic interpolation is unbiased when the peak occurs at a spectral sample (FFT bin frequency), and also when the peak is exactly half-way between spectral samples (due to symmetry of the window transform about its midpoint). For other peak frequencies, quadratic interpolation yields a biased estimate of both peak frequency and peak amplitude. (Phase is essentially unbiased [1].)

Since zero-padding in the time domain gives ideal interpolation in the frequency domain, there is no bias introduced by this type of interpolation. Thus, if enough zero-padding is used so that a spectral sample appears at the peak frequency, simply finding the largest-magnitude spectral sample will give an unbiased peak-frequency estimator. (We will learn in §5.7.2 that this is also the maximum likelihood estimator for the frequency of a sinusoid in additive white Gaussian noise.)

While we could choose our zero-padding factor large enough to yield any desired degree of accuracy in peak frequency measurements, it is more efficient in practice to combine zero-padding with parabolic interpolation (or some other simple, low-order interpolator). In such hybrid schemes, the zero-padding is simply chosen large enough so that the bias due to parabolic interpolation is negligible. In §5.7 below, the Quadratically Interpolated FFT (QIFFT) method is described as one such hybrid scheme.


Optimal Peak-Finding in the Spectrum

Based on the preceding sections, an ``obvious'' method for deducing sinusoidal parameters from data is to find the amplitude, phase, and frequency of each peak in a zero-padded FFT of the data. We have considered so far the following issues:

  1. Make sure the data length (or window length) is long enough so that all sinusoids in the data are resolved.
  2. Use enough zero padding so that the spectrum is heavily oversampled, making the peaks easier to interpolate.
  3. Use quadratic interpolation of the three samples surrounding a dB-magnitude peak in the heavily oversampled spectrum.
  4. Evaluate the fitted parabola at its extremum to obtain the interpolated amplitude and frequency estimates for each sinusoidal component.
  5. Similarly compute a phase estimate at each peak frequency using quadratic or even linear interpolation on the unwrapped phase samples about the peak.
For future reference, we will call this the quadratically interpolated FFT (QIFFT) method [271].

The question naturally arises as to how good is the QIFFT method for spectral peak estimation? Is it optimal in any sense? Are there better methods? Are there faster methods that are almost as good? These are questions that generally fall under the topic of sinusoidal parameter estimation.

We will show that the QIFFT method is a fast, ``approximate maximum-likelihood method.'' When properly configured, it is in fact extremely close to the true maximum-likelihood estimator for a single sinusoid in white noise. It is also close to the maximum likelihood estimator for multiple sinusoids that are well separated in frequency (i.e., side-lobe overlap can be neglected). Finally, the QIFFT method can be considered optimal perceptually in the sense that any errors induced by the suboptimality of the QIFFT method are inaudible when the zero-padding factor is a factor of 5 or more. While a zero-padding factor of 5 is sufficient for all window types, including the rectangular window, less zero-padding is needed with windows having flatter main-lobe peaks, as summarized in Table 5.3.

Minimum Zero-Padding for High-Frequency Peaks


Table: Minimum zero-padding factors $ L_{\hbox {min}}=N_{\hbox {min}}/M$ for keeping peak-frequency bias below approximately $ \Delta $ percent of the sampling-rate divided by the window length [1]. This table is overly conservative for peak-frequencies below 1 kHz. Here, $ N_{\hbox {min}}$ denotes the minimum FFT length, and $ M$ denotes the window length. The zero-padding therefore consists of $ N_{\hbox {min}}-M$ zeros. $ L_{\hbox {min}}$ is calculated using the formulas in [1] and rounded to two significant digits.
Window Type $ \mathbf{\Delta}$ (%) $ \mathbf{L_{\hbox{min}}}$
Rectangular $ 1$ 2.1
Gen. Hamming $ 1$ 1.2
Blackman $ 1$ $ 1.0$
Rectangular $ 0.1$ 4.1
Gen. Hamming $ 0.1$ 2.4
Blackman $ 0.1$ $ 1.8$


Table 5.3 gives zero-padding factors sufficient for keeping the bias below $ 0.01\cdot\Delta\cdot f_s/M$ Hz, where $ f_s$ denotes the sampling rate in Hz, and $ M$ is the window length in samples. For fundamental frequency estimation, $ \Delta $ can be interpreted as the relative frequency error `` $ \Delta f/f$ '' when the window length is one period. In this case, $ f_s/M$ is the fundamental frequency in Hz. More generally, $ f_s/M$ is the bandwidth of each side-lobe in the DTFT of a length $ M$ rectangular, generalized Hamming, or Blackman window (any member of the Blackman-Harris window family, as elaborated in Chapter 3).

Note from Table 5.3 that the Blackman window requires no zero-padding at all when only $ 1$ % accuracy is required in peak-frequency measurement. It should also be understood that a frequency error of $ 0.1$ % is inaudible in most audio applications.6.10


Minimum Zero-Padding for Low-Frequency Peaks

Sharper bounds on the zero-padding factor needed for low-frequency peaks (below roughly 1 kHz) may be obtained based on the measured Just-Noticeable-Difference (JND) in frequency and/or amplitude [276]. In particular, a $ 0.1$ % relative-error spec is good above 1 kHz (being conservative by approximately a factor of 2), but overly conservative at lower frequencies where the JND flattens out. Below 1 kHz, a fixed 1 Hz spec satisfies perceptual requirements and gives smaller minimum zero-padding factors than the $ 0.1$ % relative-error spec.

The following data, extracted from [276, Table I, p. 89] gives frequency JNDs at a presentation level of 60 dB SPL (the most sensitive case measured):

  f =    [    62,    125,    250,    500,   1000,   2000,   4000];
  dfof = [0.0346, 0.0269, 0.0098, 0.0035, 0.0034, 0.0018, 0.0020];
Thus, the frequency JND at 4 kHz was measured to be two tenths of a percent. (These measurements were made by averaging experimental results for five men between the ages of 20 and 30.) Converting relative frequency to absolute frequency in Hz yields (in matlab syntax):
  df = dfof .* f; % = [2.15, 3.36, 2.45, 1.75, 3.40, 3.60, 8.00];
For purposes of computing the minimum zero-padding factor required, we see that the absolute tuning error due to bias can be limited to 1 Hz, based on measurements at 500 Hz (at 60 dB). Doing this for frequencies below 1 kHz yields the results shown in Table 5.4. Note that the Blackman window needs no zero padding below 125 Hz, and the Hamming/Hann window requires no zero padding below 62.5 Hz.


Table: Minimum zero-padding factors $ L_{\hbox {min}}=N_{\hbox {min}}/M$ for keeping peak-frequency bias below approximately 1 Hz (well under 1.75 Hz), assuming the window length $ M$ to span one period of the fundamental frequency.
Window Type $ f$ (Hz) $ \mathbf{L_{\hbox{min}}}$
Rectangular 1000 4.1
  500 3.3
  250 2.6
  125 2.1
  62.5 1.7
Gen. Hamming 1000 2.4
  500 1.9
  250 1.5
  125 1.2
  62.5 1
Blackman 1000 1.8
  500 1.5
  250 1.2
  125 1
  62.5 1



Matlab for Computing Minimum Zero-Padding Factors

The minimum zero-padding factors in the previous two subsections were computed using the matlab function zpfmin listed in §F.2.4. For example, both tables above are included in the output of the following matlab program:

  windows={'rect','hann','hamming','blackman'};
  freqs=[1000,500,250,125,62.5];
  for i=1:length(windows)
    w = sprintf("%s",windows(i))
    for j=1:length(freqs)
      f = freqs(j);
      zpfmin(w,1/f,0.01*f)  % 1 percent spec (large for audio)
      zpfmin(w,1/f,0.001*f) % 0.1 percent spec (good > 1 kHz)
      zpfmin(w,1/f,1)       % 1 Hz spec (good below 1 kHz)
    end
  end

In addition to ``perceptually exact'' detection of spectral peaks, there are times when we need to find spectral parameters as accurately as possible, irrespective of perception. For example, one can estimate the stiffness of a piano string by measuring the stretched overtone-frequencies in the spectrum of that string's vibration. Additionally, we may have measurement noise, in which case we want our measurements to be minimally influenced by this noise. The following sections discuss optimal estimation of spectral-peak parameters due to sinusoids in the presence of noise.


Least Squares Sinusoidal Parameter Estimation

There are many ways to define ``optimal'' in signal modeling. Perhaps the most elementary case is least squares estimation. Every estimator tries to measure one or more parameters of some underlying signal model. In the case of sinusoidal parameter estimation, the simplest model consists of a single complex sinusoidal component in additive white noise:

$\displaystyle x(n) \isdef {\cal A}e^{j\omega_0 n} + v(n) \protect$ (6.32)

where $ {\cal A}= A e^{j\phi}$ is the complex amplitude of the sinusoid, and $ v(n)$ is white noise (defined in §C.3). Given measurements of $ x(n)$ , $ n=0,1,2,\ldots,N-1$ , we wish to estimate the parameters $ (A,\phi,\omega_0 )$ of this sinusoid. In the method of least squares, we minimize the sum of squared errors between the data and our model. That is, we minimize

$\displaystyle J(\underline{\theta}) \isdef \sum_{n=0}^{N-1}\left\vert x(n)-{\hat x}(n)\right\vert^2 \protect$ (6.33)

with respect to the parameter vector

$\displaystyle \underline{\theta}\isdef \left[\begin{array}{c} A \\ [2pt] \phi \\ [2pt] \omega_0 \end{array}\right],$ (6.34)

where $ {\hat x}(n)$ denotes our signal model:

$\displaystyle {\hat x}(n)\isdef \hat{{\cal A}}e^{j\hat{\omega}_0n}$ (6.35)

Note that the error signal $ x(n)-\hat{{\cal A}}e^{j\hat{\omega}_0n}$ is linear in $ \hat{{\cal A}}$ but nonlinear in the parameter $ \hat{\omega}_0$ . More significantly, $ J(\underline{\theta})$ is non-convex with respect to variations in $ \hat{\omega}_0$ . Non-convexity can make an optimization based on gradient descent very difficult, while convex optimization problems can generally be solved quite efficiently [22,86].

Sinusoidal Amplitude Estimation

If the sinusoidal frequency $ \omega_0$ and phase $ \phi$ happen to be known, we obtain a simple linear least squares problem for the amplitude $ A$ . That is, the error signal

$\displaystyle x(n)-\hat{{\cal A}}e^{j\hat{\omega}_0n} = x(n)-{\hat A}e^{j(\omega_0 n+\phi)}$ (6.36)

becomes linear in the unknown parameter $ {\hat A}$ . As a result, the sum of squared errors

$\displaystyle J({\hat A}) \isdef \sum_{n=0}^{N-1}\left\vert x(n)-{\hat A}e^{j(\omega_0 n+\phi)}\right\vert^2 \protect$ (6.37)

becomes a simple quadratic (parabola) over the real line.6.11 Quadratic forms in any number of dimensions are easy to minimize. For example, the ``bottom of the bowl'' can be reached in one step of Newton's method. From another point of view, the optimal parameter $ {\hat A}$ can be obtained as the coefficient of orthogonal projection of the data $ x(n)$ onto the space spanned by all values of $ {\hat A}$ in the linear model $ {\hat A}
e^{j(\omega_0 n+\phi)}$ .

Yet a third way to minimize (5.37) is the method taught in elementary calculus: differentiate $ J({\hat A})$ with respect to $ {\hat A}$ , equate it to zero, and solve for $ {\hat A}$ . In preparation for this, it is helpful to write (5.37) as

\begin{eqnarray*}
J({\hat A}) &\isdef & \sum_{n=0}^{N-1}
\left[x(n)-{\hat A}e^{j(\omega_0 n+\phi)}\right]
\left[\overline{x(n)}-\overline{{\hat A}} e^{-j(\omega_0 n+\phi)}\right]\\
&=&
\sum_{n=0}^{N-1}
\left[
\left\vert x(n)\right\vert^2
-
x(n)\overline{{\hat A}} e^{-j(\omega_0 n+\phi)}
-
\overline{x(n)}{\hat A}e^{j(\omega_0 n+\phi)}
+
{\hat A}^2
\right]
\\
&=& \left\Vert\,x\,\right\Vert _2^2 - 2\mbox{re}\left\{\sum_{n=0}^{N-1} x(n)\overline{{\hat A}}
e^{-j(\omega_0 n+\phi)}\right\}
+ N {\hat A}^2.
\end{eqnarray*}

Differentiating with respect to $ {\hat A}$ and equating to zero yields

$\displaystyle 0 = \frac{d J({\hat A})}{d{\hat A}} = - 2$re$\displaystyle \left\{\sum_{n=0}^{N-1} x(n) e^{-j(\omega_0 n+\phi)}\right\} + 2N{\hat A}.$ (6.38)

Solving this for $ {\hat A}$ gives the optimal least-squares amplitude estimate

$\displaystyle {\hat A}= \frac{1}{N}$re$\displaystyle \left\{\sum_{n=0}^{N-1} x(n) e^{-j(\omega_0 n+\phi)}\right\} = \frac{1}{N}$re$\displaystyle \left\{\hbox{\sc DTFT}_{\omega_0 }\left[e^{-j\phi} x\right]\right\}. \protect$ (6.39)

That is, the optimal least-squares amplitude estimate may be found by the following steps:
  1. Multiply the data $ x(n)$ by $ e^{-j\phi}$ to zero the known phase $ \phi$ .
  2. Take the DFT of the $ N$ samples of $ x$ , suitably zero padded to approximate the DTFT, and evaluate it at the known frequency $ \omega_0$ .
  3. Discard any imaginary part since it can only contain noise, by (5.39).
  4. Divide by $ N$ to obtain a properly normalized coefficient of projection [264] onto the sinusoid

    $\displaystyle s_{\omega_0 }(n)\isdef e^{j\omega_0 n}.$ (6.40)


Sinusoidal Amplitude and Phase Estimation

The form of the optimal estimator (5.39) immediately suggests the following generalization for the case of unknown amplitude and phase:

$\displaystyle \hat{{\cal A}}= {\hat A}e^{j\hat{\phi}} = \frac{1}{N}\sum_{n=0}^{N-1} x(n) e^{-j\omega_0 n} = \frac{1}{N}\hbox{\sc DTFT}_{\omega_0 }(x) \protect$ (6.41)

That is, $ \hat{{\cal A}}$ is given by the complex coefficient of projection [264] of $ x$ onto the complex sinusoid $ e^{j\omega_0 n}$ at the known frequency $ \omega_0$ . This can be shown by generalizing the previous derivation, but here we will derive it using the more enlightened orthogonality principle [114].

The orthogonality principle for linear least squares estimation states that the projection error must be orthogonal to the model. That is, if $ {\hat x}$ is our optimal signal model (viewed now as an $ N$ -vector in $ {\bf R}^N$ ), then we must have [264]

\begin{eqnarray*}
x-{\hat x}&\perp& {\hat x}\\
\Rightarrow\quad
\left<x-{\hat x},{\hat x}\right> &=& 0\\
\Rightarrow\quad \left<x,{\hat x}\right> &=& \left<{\hat x},{\hat x}\right>\\
\Rightarrow\quad \sum_{n=0}^{N-1}\overline{x(n)} {\hat A}e^{j(\omega_0 n+\hat{\phi})}&=& N {\hat A}^2 \\
\Rightarrow\quad \sum_{n=0}^{N-1}x(n) {\hat A}e^{-j(\omega_0 n+\hat{\phi})}&=& N {\hat A}^2 \\
\Rightarrow\quad \hbox{\sc DTFT}_{\omega_0 }(x)&=&
N \frac{{\hat A}^2}{{\hat A}e^{-j\hat{\phi}}} = N {\hat A}e^{j\hat{\phi}}
\end{eqnarray*}

Thus, the complex coefficient of projection of $ x$ onto $ e^{j{\hat \omega}n}$ is given by

$\displaystyle \hat{{\cal A}}= {\hat A}e^{j\hat{\phi}} = \frac{1}{N} \hbox{\sc DTFT}_{\omega_0 }(x).$ (6.42)

The optimality of $ \hat{{\cal A}}$ in the least squares sense follows from the least-squares optimality of orthogonal projection [114,121,252]. From a geometrical point of view, referring to Fig.5.16, we say that the minimum distance from a vector $ x\in{\bf R}^N$ to some lower-dimensional subspace $ {\bf R}^M$ , where $ M<N$ (here $ M=1$ for one complex sinusoid) may be found by ``dropping a perpendicular'' from $ x$ to the subspace. The point $ {\hat x}$ at the foot of the perpendicular is the point within the subspace closest to $ x$ in Euclidean distance.

Figure: Geometric interpretation of the orthogonal projection of a vector $ x$ in 3D space onto a 2D plane ($ N=3$ , $ M=2$ ). The orthogonal projection $ {\hat x}$ minimizes the Euclidean distance $ \left\Vert\,x-{\hat x}\,\right\Vert$ .
\includegraphics{eps/orthproj}


Sinusoidal Frequency Estimation

The form of the least-squares estimator (5.41) in the known-frequency case immediately suggests the following frequency estimator for the unknown-frequency case:

$\displaystyle \hat{\omega}_0^\ast = \arg\{\max_{\hat{\omega}_0} \left\vert\hbox{\sc DTFT}_{\hat{\omega}_0}(x)\right\vert\}. \protect$ (6.43)

That is, the sinusoidal frequency estimate is defined as that frequency which maximizes the DTFT magnitude. Given this frequency, the least-squares sinusoidal amplitude and phase estimates are given by (5.41) evaluated at that frequency.

It can be shown [121] that (5.43) is in fact the optimal least-squares estimator for a single sinusoid in white noise. It is also the maximum likelihood estimator for a single sinusoid in Gaussian white noise, as discussed in the next section.

In summary,

$\textstyle \parbox{0.8\textwidth}{the least squares estimate for the sinusoidal parameters
of amplitude $A$, phase $\phi$, and frequency $\omega_0 $\ are obtained from
the complex amplitude and location of the magnitude peak in the DTFT
of the zero-padded observation data $x(n)$.}$

In practice, of course, the DTFT is implemented as an interpolated FFT, as described in the previous sections (e.g., QIFFT method).


Maximum Likelihood Sinusoid Estimation

The maximum likelihood estimator (MLE) is widely used in practical signal modeling [121]. A full treatment of maximum likelihood estimators (and statistical estimators in general) lies beyond the scope of this book. However, we will show that the MLE is equivalent to the least squares estimator for a wide class of problems, including well resolved sinusoids in white noise.

Consider again the signal model of (5.32) consisting of a complex sinusoid in additive white (complex) noise:

$\displaystyle x(n) \isdef {\cal A}e^{j\omega_0 n} + v(n) \protect$ (6.44)

Again, $ {\cal A}= A e^{j\phi}$ is the complex amplitude of the sinusoid, and $ v(n)$ is white noise. In addition to assuming $ v$ is white, we add the assumption that it is Gaussian distributed6.12 with zero mean; that is, we assume that its probability density function (see Appendix C) is given by6.13

$\displaystyle p_v(\nu) = \frac{1}{\pi \sigma_v^2} e^{-\frac{\vert\nu\vert^2}{\sigma_v^2}}.$ (6.46)

We express the zero-mean Gaussian assumption by writing

$\displaystyle v(n) \sim {\cal N}(0,\sigma_v^2).$ (6.47)

The parameter $ \sigma_v^2$ is called the variance of the random process $ v(n)$ , and $ \sigma_v$ is called the standard deviation.

It turns out that when Gaussian random variables $ v(n)$ are uncorrelated (i.e., when $ v(n)$ is white noise), they are also independent. This means that the probability of observing particular values of $ v(n)$ and $ v(m)$ is given by the product of their respective probabilities [121]. We will now use this fact to compute an explicit probability for observing any data sequence $ x(n)$ in (5.44).

Since the sinusoidal part of our signal model, $ {\cal A}e^{j\omega_0
n}$ , is deterministic; i.e., it does not including any random components; it may be treated as the time-varying mean of a Gaussian random process $ x(n)$ . That is, our signal model (5.44) can be rewritten as

$\displaystyle x(n) \sim {\cal N}({\cal A}e^{j\omega_0 n},\sigma_v^2)$ (6.48)

and the probability density function for the whole set of observations $ x(n)$ , $ n=0,1,2,\ldots,N-1$ is given by

$\displaystyle p(x) = p[x(0)] p[x(1)]\cdots p[x(N-1)] = \left(\frac{1}{\pi \sigma_v^2}\right)^N e^{-\frac{1}{\sigma_v^2}\sum_{n=0}^{N-1} \left\vert x(n) - {\cal A}e^{j\omega_0 n}\right\vert^2}$ (6.49)

Thus, given the noise variance $ \sigma_v^2$ and the three sinusoidal parameters $ A,\phi,\omega_0 $ (remember that $ {\cal A}= A e^{j\phi}$ ), we can compute the relative probability of any observed data samples $ x(n)$ .


Likelihood Function

The likelihood function $ l_x(\underline{\theta})$ is defined as the probability density function of $ x$ given $ \underline{\theta}=
[A,\phi,\omega_0 ,\sigma_v^2]^T$ , evaluated at a particular $ x$ , with $ \underline{\theta}$ regarded as a variable.

In other words, the likelihood function $ l_x(\underline{\theta})$ is just the PDF of $ x$ with a particular value of $ x$ plugged in, and any parameters in the PDF (mean and variance in this case) are treated as variables.

$\textstyle \parbox{0.8\textwidth}{The \emph{maximum likelihood estimate} of the parameter vector
$\underline{\theta}$\ is defined as the value of $\underline{\theta}$\ which maximizes the
likelihood function $l_x(\underline{\theta})$\ given a particular set of
observations $x$.}$

For the sinusoidal parameter estimation problem, given a set of observed data samples $ x(n)$ , for $ n=0,1,2,\ldots,N-1$ , the likelihood function is

$\displaystyle l_x(\underline{\theta}) = \frac{1}{\pi^N \sigma_v^{2N}} e^{-\frac{1}{\sigma_v^2}\sum_{n=0}^{N-1} \left\vert x(n) - {\cal A}e^{j\omega_0 n}\right\vert^2}$ (6.50)

and the log likelihood function is

$\displaystyle \log l_x(\underline{\theta}) = -N\log(\pi \sigma_v^2) -\frac{1}{\sigma_v^2}\sum_{n=0}^{N-1}\left\vert x(n) - {\cal A}e^{j\omega_0 n}\right\vert^2.$ (6.51)

We see that the maximum likelihood estimate for the parameters of a sinusoid in Gaussian white noise is the same as the least squares estimate. That is, given $ \sigma_v$ , we must find parameters $ {\cal A}$ , $ \phi$ , and $ \omega_0$ which minimize

$\displaystyle J(\underline{\theta}) = \sum_{n=0}^{N-1} \left\vert x(n) - {\cal A}e^{j\omega_0 n}\right\vert^2$ (6.52)

as we saw before in (5.33).

Multiple Sinusoids in Additive Gaussian White Noise

The preceding analysis can be extended to the case of multiple sinusoids in white noise [120]. When the sinusoids are well resolved, i.e., when window-transform side lobes are negligible at the spacings present in the signal, the optimal estimator reduces to finding multiple interpolated peaks in the spectrum.

One exact special case is when the sinusoid frequencies $ \omega_k$ coincide with the ``DFT frequencies'' $ \omega_k =
2\pi k/N$ , for $ k=0,1,2,\ldots,N-1$ . In this special case, each sinusoidal peak sits atop a zero crossing in the window transform associated with every other peak.

To enhance the ``isolation'' among multiple sinusoidal peaks, it is natural to use a window function which minimizes side lobes. However, this is not optimal for short data records since valuable data are ``down-weighted'' in the analysis. Fundamentally, there is a trade-off between peak estimation error due to overlapping side lobes and that due to widening the main lobe. In a practical sinusoidal modeling system, not all sinusoidal peaks are recovered from the data--only the ``loudest'' peaks are measured. Therefore, in such systems, it is reasonable to assure (by choice of window) that the side-lobe level is well below the ``cut-off level'' in dB for the sinusoidal peaks. This prevents side lobes from being comparable in magnitude to sinusoidal peaks, while keeping the main lobes narrow as possible.

When multiple sinusoids are close together such that the associated main lobes overlap, the maximum likelihood estimator calls for a nonlinear optimization. Conceptually, one must search over the possible superpositions of the window transform at various relative amplitudes, phases, and spacings, in order to best ``explain'' the observed data.

Since the number of sinusoids present is usually not known, the number can be estimated by means of hypothesis testing in a Bayesian framework [21]. The ``null hypothesis'' can be ``no sinusoids,'' meaning ``just white noise.''


Non-White Noise

The noise process $ v(n)$ in (5.44) does not have to be white [120]. In the non-white case, the spectral shape of the noise needs to be estimated and ``divided out'' of the spectrum. That is, a ``pre-whitening filter'' needs to be constructed and applied to the data, so that the noise is made white. Then the previous case can be applied.


Generality of Maximum Likelihood Least Squares

Note that the maximum likelihood estimate coincides with the least squares estimate whenever the signal model is of the form

$\displaystyle x(n) = {\hat x}(n) + v(n)$ (6.53)

where $ v(n)$ is zero-mean Gaussian noise, and $ {\hat x}(n)$ is any deterministic model for the mean of $ x(n)$ . This is an extremely general formulation that can be applied in many situations beyond sinusoids in white noise.


Next Section:
Spectrum Analysis of Noise
Previous Section:
FIR Digital Filter Design