Spectral Envelope Extraction

There are many definitions of spectral envelope. Piecewise-linear (or polynomial-spline) spectral envelopes, applied to the spectral magnitude of an STFT frame, have been used successfully in sines+noise modeling of audio signals (introduced in §10.4). Here we will consider spectral envelopes defined by the following two methods for computing them:

  1. cepstral windowing to lowpass-filter the log-magnitude spectrum (a ``nonparametric method'')

  2. using linear prediction (a ``parametric method'') to capture spectral shape in the amplitude-response of an all-pole filter in a source-filter decomposition of the signal (where the source signal is defined to be spectrally flat)

In the following, $ X_m(\omega_k)$ denotes the $ m$ th spectral frame of the STFT (§7.1), and $ Y_m(\omega_k)$ denotes the spectral envelope of $ X_m(\omega_k)$ .

Cepstral Windowing

The spectral envelope obtained by cepstral windowing is defined as

$\displaystyle Y_m \eqsp \hbox{\sc DFT}[w \cdot \underbrace{\hbox{\sc DFT}^{-1}\log(\vert X_m\vert)}_{\hbox{real cepstrum}}]$ (11.2)

where $ w$ is a lowpass-window in the cepstral domain. A simple but commonly used lowpass-window is given by

$\displaystyle w(n) \eqsp \left\{\begin{array}{ll} 1, & \vert n\vert< n_c \\ [5pt] 0.5, & \vert n\vert=n_c \\ [5pt] 0, & \vert n\vert>n_c, \\ \end{array} \right.$ (11.3)

where $ n_c$ denotes the lowpass ``cut-off'' sample.

The log-magnitude spectrum of $ X_m$ is thus lowpass filtered (the real cepstrum of $ x$ is ``liftered'') to obtain a smooth spectral envelope. For periodic signals, $ n_c$ should be set below the period in samples.
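As a minimal sketch of (11.2) and (11.3), the following Matlab/Octave fragment lifters a synthetic log-magnitude spectrum. The test spectrum (a slow cosine plus a fast ripple), the length N, and the cutoff nc are assumptions chosen only so that the expected result is known exactly; they do not come from the text.

```matlab
N  = 256;                           % DFT length (assumed even)
k  = 0:N-1;                         % frequency-bin index
smooth = 1 + 0.5*cos(2*pi*3*k/N);   % slowly varying "envelope" part
ripple = 0.2*cos(2*pi*40*k/N);      % rapidly varying "fine structure"
logmag = smooth + ripple;           % stands in for log|X_m|

c  = real(ifft(logmag));            % real cepstrum
nc = 10;                            % cutoff sample n_c (assumed value)
n  = [0:N/2, -(N/2-1):-1];          % signed cepstral index of each ifft bin
w  = double(abs(n) < nc) + 0.5*double(abs(n) == nc);  % window of (11.3)
env = real(fft(w .* c));            % liftered log-magnitude spectrum, (11.2)

err = max(abs(env - smooth))        % deviation from the slow component
```

The slow component sits at cepstral index 3 (below nc) and the ripple at index 40 (above nc), so the lifter should return the slow component alone, and err should be at round-off level.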

Cepstral coefficients are typically used in speech recognition to characterize spectral envelopes, capturing primarily the formants (spectral resonances) of speech [227]. In audio applications, a warped frequency axis, such as the ERB scale (Appendix E), Bark scale, or Mel frequency scale is typically preferred. Mel Frequency Cepstral Coefficients (MFCC) appear to remain quite standard in speech-recognition front ends, and they are often used to characterize steady-state spectral timbre in Music Information Retrieval (MIR) applications.

Linear Prediction Spectral Envelope

Linear Prediction (LP) implicitly computes a spectral envelope that is well adapted for audio work, provided the order of the predictor is appropriately chosen. Due to the error minimized by LP, spectral peaks are emphasized in the envelope, as they are in the auditory system. (The peak-emphasis of LP is quantified in (11.10) below.)

The term ``linear prediction'' refers to the process of predicting a signal sample $ y(n)$ based on $ M$ past samples:

$\displaystyle y(n) \eqsp -a_1 y(n-1) - a_2 y(n-2) - \cdots - a_M y(n-M) + e(n)$ (11.4)

We call $ M$ the order of the linear predictor, and $ \{a_i\}_{i=1}^M$ the prediction coefficients. The prediction error (or ``innovations sequence'' [114]) is denoted $ e(n)$ in (11.4), and it represents all new information entering the signal $ y$ at time $ n$ . Because the information is new, $ e(n)$ is ``unpredictable.'' The predictable component of $ y(n)$ contains no new information.

Taking the z transform of (11.4) yields

$\displaystyle Y(z) \eqsp \frac{E(z)}{A(z)}$ (11.5)

where $ A(z) = 1 + a_1z^{-1}+ \cdots + a_M z^{-M}$ . In signal modeling by linear prediction, we are given the signal $ y(n)$ but not the prediction coefficients $ a_i$ . We must therefore estimate them. Let $ {\hat A}(z) = 1 + {\hat a}_1z^{-1} + \cdots + {\hat a}_M z^{-M}$ denote the polynomial with estimated prediction coefficients $ {\hat a}_i$ . Then we have

$\displaystyle Y(z) \eqsp \frac{{\hat E}(z)}{{\hat A}(z)}$ (11.6)

where $ {\hat E}(z)$ denotes the estimated prediction-error z transform. By minimizing $ \vert\vert\,{\hat E}\,\vert\vert _2$ , we define a minimum-least-squares estimate $ {\hat A}$ . In other words, the linear prediction coefficients $ {\hat a}_i$ are defined as those which minimize the sum of squared prediction errors $ {\hat e}(n)$

$\displaystyle \left\Vert\,{\hat e}\,\right\Vert _2^2 \eqsp \sum_n {\hat e}^2(n)$ (11.7)

over some range of $ n$ , typically an interval over which the signal is stationary (defined in Chapter 6). It turns out that this minimization results in maximally flattening the prediction-error spectrum $ E(z)$ [11,157,162]. That is, the optimal $ {\hat A}(z)$ is a whitening filter (also called an inverse filter). This makes sense in terms of Chapter 6 when one considers that a flat power spectral density corresponds to white noise in the time domain, and only white noise is completely unpredictable from one sample to the next. A non-flat spectrum corresponds to a nonzero correlation between two signal samples separated by some nonzero time interval.
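The least-squares estimate and its whitening property can be sketched in a few lines of Matlab/Octave. This is an illustration under stated assumptions, not a general LP routine: the polynomial A, the frame length N, and the names Ypast, ahat, and flatness are hypothetical, and the innovations sequence is taken to be a single impulse so that the model is exact and the result is deterministic.

```matlab
A = [1, -1.2, 0.72];             % assumed "true" A(z), poles at radius ~0.85
N = 200;  M = 2;                 % frame length and predictor order (assumed)
e = [1, zeros(1,N-1)];           % innovations: one impulse (flat spectrum)
y = filter(1, A, e);             % signal model Y(z) = E(z)/A(z)

% Least squares: predict y(n) from y(n-1) and y(n-2) for n = M+1..N
Ypast = [y(M:N-1).', y(M-1:N-2).'];   % columns hold y(n-1) and y(n-2)
ahat  = -(Ypast \ y(M+1:N).');        % minimizes the sum of squared errors
Ahat  = [1, ahat.'];                  % estimated polynomial

ehat = filter(Ahat, 1, y);            % inverse (whitening) filter output
flatness = max(abs(abs(fft(ehat)) - 1))   % deviation from a flat spectrum
```

Because the model is exact here, ahat should match the assumed coefficients to within round-off, and the magnitude spectrum of ehat should be flat. With real data the prediction error is only approximately white.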

If the prediction-error is successfully whitened, then the signal model can be expressed in the frequency domain as

$\displaystyle S_y(\omega) \eqsp \frac{\sigma^2_e}{\vert A(\omega)\vert^2}$ (11.8)

where $ S_y(\omega)$ denotes the power spectral density of $ y$ (defined in Chapter 6), and $ \sigma_e^2$ denotes the variance of the (white-noise) prediction error $ e(n)$ . Thus, the spectral magnitude envelope may be defined as

$\displaystyle \hbox{EnvelopeLPC}_y(\omega) \eqsp \frac{\sigma_e}{\vert A(\omega)\vert}$ (11.9)

Linear Prediction is Peak Sensitive

By Rayleigh's energy theorem, $ \vert\vert\,{\hat e}\,\vert\vert _2= \vert\vert\,{\hat E}\,\vert\vert _2$ (as shown in §2.3.8). Therefore,

$\displaystyle \begin{array}{rcl}
\sum_{n=-\infty}^{\infty} {\hat e}^2(n) &=& \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\vert{\hat E}\left(e^{j\omega}\right)\right\vert^2 d\omega \\ [5pt]
&\isdef& \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\vert{\hat A}\left(e^{j\omega}\right)Y\left(e^{j\omega}\right)\right\vert^2 d\omega \\ [5pt]
&=& \frac{{\hat\sigma}^2_e}{2\pi}\int_{-\pi}^{\pi}\left\vert\frac{Y\left(e^{j\omega}\right)}{{\hat Y}\left(e^{j\omega}\right)}\right\vert^2 d\omega.
\end{array}$ (11.10)

Here $ {\hat Y}\isdef {\hat\sigma}_e/{\hat A}$ denotes the LP envelope model, as in (11.9), so that $ {\hat A}= {\hat\sigma}_e/{\hat Y}$ . From this ``ratio error'' expression in the frequency domain, we can see that contributions to the error are smallest when $ \vert{\hat Y}(e^{j\omega})\vert>\vert Y(e^{j\omega})\vert$ . Therefore, LP tends to overestimate peaks. LP cannot make $ \vert{\hat Y}\vert$ arbitrarily large because $ A(z)$ is constrained to be monic and minimum-phase. It can be shown that the log-magnitude frequency response of every minimum-phase monic polynomial $ A(z)$ is zero-mean [162]. Therefore, for each peak overestimation, there must be an equal-area ``valley underestimation'' (in a log-magnitude plot over the unit circle).
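The zero-mean property can be checked numerically. In the following Matlab/Octave sketch, the root locations and the FFT size N are arbitrary assumptions. For contrast, the second polynomial moves one root outside the unit circle, in which case the mean log magnitude is no longer zero (by Jensen's formula it equals the log of the modulus of the outside root).

```matlab
r = [0.9*exp(1j*pi/4), 0.9*exp(-1j*pi/4), 0.5];  % roots inside unit circle
A = real(poly(r));                 % monic, minimum-phase polynomial
N = 4096;                          % number of samples around the unit circle
meanlog = mean(log(abs(fft(A, N))))          % approximately 0

% Same polynomial with the real root reflected outside the unit circle:
Anonmin = real(poly([r(1:2), 2]));           % monic, but not minimum phase
meanlog_nonmin = mean(log(abs(fft(Anonmin, N))))   % approximately log(2)
```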

Linear Prediction Methods

The two classic methods for linear prediction are called the autocorrelation method and the covariance method [162,157]. Both methods solve the linear normal equations (defined below) using different autocorrelation estimates.

In the autocorrelation method of linear prediction, the covariance matrix is constructed from the usual Bartlett-window-biased sample autocorrelation function (see Chapter 6), and it has the desirable property that $ {\hat A}(z)$ is always minimum phase (i.e., $ 1/{\hat A}(z)$ is guaranteed to be stable). However, the autocorrelation method tends to overestimate formant bandwidths; in other words, the filter model is typically overdamped. This can be attributed to implicitly ``predicting zero'' outside of the signal frame, resulting in the Bartlett-window bias in the sample autocorrelation.

The covariance method of LP is based on an unbiased autocorrelation estimate (see Eq.$ \,$ (6.4)). As a result, it gives more accurate bandwidths, but it does not guarantee stability.

So-called covariance lattice methods and Burg's method were developed to maintain guaranteed stability while giving accuracy comparable to the covariance method of LP [157].

Computation of Linear Prediction Coefficients

In the autocorrelation method of linear prediction, the linear prediction coefficients $ \{a_i\}_{i=1}^M$ are computed from the Bartlett-window-biased autocorrelation function (Chapter 6):

$\displaystyle r_{y_m}(l) \isdefs \sum_{n=-\infty}^\infty y_m(n)y_m(n+l) \eqsp \hbox{\sc DFT}^{-1}\left\vert Y_m\right\vert^2$ (11.11)

where $ y_m$ denotes the $ m$ th data frame from the signal $ y$ . To obtain the $ M$ th-order linear predictor coefficients $ \{a_1,\ldots,a_M\}$ , we solve the following $ M\times M$ system of linear normal equations (also called Yule-Walker or Wiener-Hopf equations):

$\displaystyle \sum_{i=1}^M a_i r_{y_m}(\vert i-j\vert) \eqsp -r_{y_m}(j), \qquad j=1,2,\ldots,M$ (11.12)

In matlab syntax, the solution is given by `` $ \verb+a=-R\p+$ '', where $ \verb+p(j)+ = r_{y_m}(j)$ , and $ \verb+R(i,j)+=r_{y_m}(\vert i-j\vert)$ . Since the covariance matrix $ R$ is symmetric and Toeplitz by construction, an $ O(M^2)$ solution exists using the Durbin recursion.
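For illustration, one common form of the Durbin recursion is sketched below in Matlab/Octave and compared against a direct solve of the normal equations (11.12). The example autocorrelation (that of a first-order process with pole at 0.9) and the variable names are assumptions made for this sketch; for this input the predictor should come out as approximately [1, -0.9, 0, 0].

```matlab
M = 3;                      % predictor order (assumed)
r = 0.9 .^ (0:M);           % example autocorrelation, r(l+1) holds lag l

a   = 1;                    % order-0 predictor polynomial A_0(z) = 1
err = r(1);                 % order-0 prediction-error energy
for m = 1:M
    acc = a * r(m+1:-1:2).';            % sum over i of a_i r(m-i), i=0..m-1
    k   = -acc / err;                   % reflection coefficient
    a   = [a, 0] + k * [0, fliplr(a)];  % order update of the polynomial
    err = err * (1 - k^2);              % updated prediction-error energy
end

% Reference: direct O(M^3) solution of the normal equations (11.12)
R    = toeplitz(r(1:M));
aref = -(R \ r(2:M+1).');
maxdiff = max(abs(a(2:end).' - aref))   % should be at round-off level
```

Each pass of the loop costs O(m) operations, which gives the O(M^2) total. The final value of err is the minimized prediction-error energy for the given autocorrelation.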

If the rank of the $ M\times M$ autocorrelation matrix $ R[i,j]=r_{y_m}(\vert i-j\vert)$ is $ M$ , then the solution to (11.12) is unique, and this solution is always minimum phase [162] (i.e., all roots of $ A(z)$ are inside the unit circle in the $ z$ plane [263], so that $ 1/A(z)$ is always a stable all-pole filter). In practice, the rank of $ R$ is $ M$ (with probability 1) whenever $ y(n)$ includes a noise component. In the noiseless case, if $ y(n)$ is a sum of sinusoids, each (real) sinusoid at distinct frequency $ 0<\omega_i T < \pi$ adds 2 to the rank. A dc component, or a component at half the sampling rate, adds 1 to the rank of $ R$ .

The choice of time window for forming a short-time sample autocorrelation and its weighting also affect the rank of $ R$ . Equation (11.11) applied to a finite-duration frame yields what is called the autocorrelation method of linear prediction [162]. Dividing out the Bartlett-window bias in such a sample autocorrelation yields a result closer to the covariance method of LP. A matlab example is given in §10.3.3 below.

The classic covariance method computes an unbiased sample covariance matrix by limiting the summation in (11.11) to a range over which $ y_m(n+l)$ stays within the frame (a so-called ``unwindowed'' method). The autocorrelation method sums over the whole frame and replaces $ y_m(n+l)$ by zero when $ n+l$ points outside the frame (a so-called ``windowed'' method, windowed by the rectangular window).
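The Bartlett-window bias and its removal can be seen on a toy frame. In this Matlab/Octave sketch the four-sample frame y and the 1/N normalization are assumptions for illustration (a common scale factor on all lags does not change the solution of the normal equations). Dividing by N - l instead of N removes the triangular taper on the lags; it is not by itself the full covariance method.

```matlab
y = [1, 2, 3, 4];            % toy data frame (assumed)
N = length(y);
maxlag = 2;
rb = zeros(1, maxlag+1);     % biased (Bartlett-weighted) estimate
ru = zeros(1, maxlag+1);     % unbiased estimate
for l = 0:maxlag
    s = y(1:N-l) * y(1+l:N).';   % products with both samples inside frame
    rb(l+1) = s / N;             % lag l is weighted by (N-l)/N
    ru(l+1) = s / (N - l);       % bias divided out
end
rb    % [7.5, 5, 2.75]
ru    % [7.5, 6.6667, 5.5]
```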

Linear Prediction Order Selection

For computing spectral envelopes via linear prediction, the order $ M$ of the predictor should be chosen large enough that the envelope can follow the contour of the spectrum, but not so large that it follows the spectral ``fine structure'' on a scale not considered to belong in the envelope. In particular, for voice, $ M$ should be twice the number of spectral formants, and perhaps a little larger to allow more detailed modeling of spectral shape away from the formants. For a sum of quasi sinusoids, the order $ M$ should be significantly less than twice the number of sinusoids to inhibit modeling the sinusoids as spectral-envelope peaks. For filtered-white-noise, $ M$ should be close to the order of the filter applied to the white noise, and so on.

Summary of LP Spectral Envelopes

In summary, the spectral envelope of the $ m$ th spectral frame, computed by linear prediction, is given by

$\displaystyle {\hat Y}_m(\omega_k) \eqsp \frac{{\hat g}_m}{\left\vert{\hat A}_m\left(e^{j\omega_k }\right)\right\vert}$ (11.13)

where $ {\hat A}_m$ is computed from the solution of the Toeplitz normal equations, and $ {\hat g}_m = \vert\vert\,{\hat E}_m\,\vert\vert _2$ is the estimated rms level of the prediction error in the $ m$ th frame.

The stable, all-pole filter

$\displaystyle \frac{{\hat g}_m}{{\hat A}_m(z)}$ (11.14)

can be driven by unit-variance white noise to produce a filtered-white-noise signal having spectral envelope $ {\hat g}_m/\vert{\hat A}_m(e^{j\omega_k })\vert$ . We may regard $ {\hat g}_m/{\hat A}_m(e^{j\omega_k })$ (no absolute value) as the frequency response of the filter in a source-filter decomposition of the signal $ y_m(n)$ , where the source is white noise.

It bears repeating that $ \log A(e^{j\omega_k })$ is zero mean when $ A(z)$ is monic and minimum phase (all zeros inside the unit circle). This means, for example, that $ \log {\hat g}_m$ can be simply estimated as the mean of the log spectral magnitude $ \log \vert Y_m(e^{j\omega_k })\vert$ .
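A quick numerical check of this gain estimate is sketched below in Matlab/Octave. The gain g, the polynomial A, and the length N are assumed values, and the "signal" is simply the impulse response of the model filter, so the spectrum is exactly of the form g/A up to a negligible truncation error.

```matlab
g = 0.25;                    % assumed model gain
A = [1, -1.2, 0.72];         % assumed monic, minimum-phase A(z)
N = 4096;                    % long enough for the response to decay fully
h = filter(g, A, [1, zeros(1,N-1)]);   % impulse response of g/A(z)
Ymag = abs(fft(h));                    % spectral magnitude on N bins
ghat = exp(mean(log(Ymag)))            % estimated gain, approximately 0.25
```

Since the log magnitude of 1/A averages to zero over the unit circle, only the log of the gain survives the mean.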

For best results, the frequency axis ``seen'' by linear prediction should be warped to an auditory frequency scale, as discussed in Appendix E [123]. This has the effect of increasing the accuracy of low-frequency peaks in the extracted spectral envelope, in accordance with the nonuniform frequency resolution of the inner ear.

Spectral Envelope Examples

This section presents matlab code for computing spectral envelopes by the cepstral and linear prediction methods discussed above. The signal to be modeled is a synthetic ``ah'' vowel (as in ``father'') synthesized using three formants driven by a bandlimited impulse train [128].

Signal Synthesis

% Specify formant resonances for an "ah" [a] vowel:
F = [700, 1220, 2600]; % Formant frequencies in Hz
B = [130,   70,  160]; % Formant bandwidths in Hz

fs = 8192;  % Sampling rate in Hz
	    % ("telephone quality" for speed)
R = exp(-pi*B/fs);     % Pole radii
theta = 2*pi*F/fs;     % Pole angles
poles = R .* exp(j*theta);
[B,A] = zp2tf(0,[poles,conj(poles)],1);

f0 = 200; % Fundamental frequency in Hz
w0T = 2*pi*f0/fs;

nharm = floor((fs/2)/f0); % number of harmonics
nsamps = fs;  % make a second's worth
sig = zeros(1,nsamps);
n = 0:(nsamps-1);
% Synthesize bandlimited impulse train:
for i=1:nharm,
    sig = sig + cos(i*w0T*n);
end
sig = sig/max(sig);
soundsc(sig,fs); % Let's hear it

% Now compute the speech vowel:
speech = filter(1,A,sig);
soundsc([sig,speech],fs); % "buzz", "ahh"
% (it would sound much better with a little vibrato)

The Hamming-windowed bandlimited impulse train sig and its spectrum are plotted in Fig. 10.1.

Figure 10.1: Bandlimited impulse train.
\includegraphics[width=\textwidth ]{eps/ImpulseTrain}

Figure 10.2 shows the Hamming-windowed synthesized vowel speech, and its spectrum overlaid with the true formant envelope.

Figure 10.2: Synthetic vowel in time and frequency domains, with formant envelope overlaid.
\includegraphics[width=\textwidth ]{eps/Speech}

Spectral Envelope by the Cepstral Windowing Method

We now compute the log-magnitude spectrum, perform an inverse FFT to obtain the real cepstrum, lowpass-window the cepstrum, and perform the FFT to obtain the smoothed log-magnitude spectrum:

Nframe = 2^nextpow2(fs/25); % 40 ms rounded up to a power of 2 (512 samples = 62.5 ms)
w = hamming(Nframe)';
winspeech = w .* speech(1:Nframe);
Nfft = 4*Nframe; % factor of 4 zero-padding
sspec = fft(winspeech,Nfft);
dbsspecfull = 20*log10(abs(sspec)); % log-magnitude spectrum in dB
rcep = ifft(dbsspecfull);  % real cepstrum
rcep = real(rcep); % eliminate round-off noise in imag part
period = round(fs/f0) % 41
nspec = Nfft/2+1;
aliasing = norm(rcep(nspec-10:nspec+10))/norm(rcep) % 0.02
nw = 2*period-4; % almost 1 period left and right
if floor(nw/2) == nw/2, nw=nw-1; end; % make it odd
w = ones(1,nw); % rectangular window (same values as boxcar(nw)')
wzp = [w(((nw+1)/2):nw),zeros(1,Nfft-nw), ...
       w(1:(nw-1)/2)];  % zero-phase version
wrcep = wzp .* rcep;  % window the cepstrum ("lifter")
rcepenv = fft(wrcep); % spectral envelope
rcepenvp = real(rcepenv(1:nspec)); % should be real
rcepenvp = rcepenvp - mean(rcepenvp); % normalize to zero mean

Figure 10.3 shows the real cepstrum of the synthetic ``ah'' vowel (top) and the same cepstrum truncated to just under one period on each side of zero (bottom). In theory, this leaves only formant envelope information in the cepstrum. Figure 10.4 shows an overlay of the spectrum, true envelope, and cepstral envelope.

Figure 10.3: Real cepstrum (top) and windowed cepstrum (bottom).
\includegraphics[width=\textwidth ]{eps/CepstrumBoxcar}

Figure 10.4: Overlay of spectrum, true envelope, and cepstral envelope.
\includegraphics[width=\textwidth ]{eps/CepstrumEnvBoxcarC}

Instead of simply truncating the cepstrum (a rectangular windowing operation), we can window it more gracefully. Figure 10.5 shows the result of using a Hann window of the same length. The spectral envelope is smoother as a result.

Figure 10.5: Overlay of spectrum, true envelope, and cepstral envelope.
\includegraphics[width=\textwidth ]{eps/CepstrumEnvHanningC}
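A Hann lifter in the same zero-phase layout as the rectangular one can be sketched as follows in Matlab/Octave. The sizes Nfft and nw are small assumed values for illustration, and the window is written out explicitly (a Hann window with nonzero end points) rather than taken from a toolbox function.

```matlab
Nfft = 64;                            % FFT size (assumed)
nw   = 9;                             % window length, must be odd (assumed)
n = -(nw-1)/2 : (nw-1)/2;             % symmetric index centered on 0
w = 0.5 + 0.5*cos(2*pi*n/(nw+1));     % Hann window, peak value 1 at n = 0
wzp = [w((nw+1)/2:nw), zeros(1,Nfft-nw), ...
       w(1:(nw-1)/2)];                % zero-phase version
imagpart = max(abs(imag(fft(wzp))))   % should be at round-off level
```

Placing the window center at the first sample and wrapping the negative-index half to the end keeps the window even about zero, so its transform is real and the smoothing adds no phase shift.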

Spectral Envelope by Linear Prediction

Finally, let's compute an LPC envelope. It had better be good because the LPC model is exact for this example.

M = 6; % Assume three formants and no noise

% compute Mth-order autocorrelation function:
rx = zeros(1,M+1)';
for i=1:M+1,
  rx(i) = rx(i) + speech(1:nsamps-i+1) ...
                * speech(1+i-1:nsamps)';
end

% prepare the M by M Toeplitz covariance matrix:
covmatrix = zeros(M,M);
for i=1:M,
  covmatrix(i,i:M) = rx(1:M-i+1)';
  covmatrix(i:M,i) = rx(1:M-i+1);
end

% solve "normal equations" for prediction coeffs:

Acoeffs = - covmatrix \ rx(2:M+1)

Alp = [1,Acoeffs']; % LP polynomial A(z)

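dbenvlp = 20*log10(abs(freqz(1,Alp,nspec)'));
% The next three definitions do not appear in the listing;
% they are assumptions based on the cepstral example above:
f = (0:nspec-1)*fs/Nfft;         % frequency axis in Hz
dbsspec = dbsspecfull(1:nspec);  % dB spectrum, nonnegative frequencies
dbenv = rcepenvp;                % cepstral envelope
dbsspecn = dbsspec + ones(1,nspec)*(max(dbenvlp) ...
                   - max(dbsspec)); % normalize
plot(f,[max(dbsspecn,-100);dbenv;dbenvlp]); grid;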

Figure 9.16:
\includegraphics[width=\textwidth ]{eps/LinearPredictionEnvC}

Linear Prediction in Matlab and Octave

In the above example, we implemented essentially the covariance method of LP directly (the autocorrelation estimate was unbiased). The code should run in either Octave or Matlab with the Signal Processing Toolbox.

The Matlab Signal Processing Toolbox has the function lpc available. (LPC stands for ``Linear Predictive Coding.'')

The Octave-Forge lpc function (version 20071212) is a wrapper for the lattice function which implements Burg's method by default. Burg's method has the advantage of guaranteeing stability ($ A(z)$ is minimum phase) while yielding accuracy comparable to the covariance method. By uncommenting lines in lpc.m, one can instead use the ``geometric lattice'' or classic autocorrelation method (called ``Yule-Walker'' in lpc.m). For details, ``type lpc''.
