Classic Spectrograms

The spectrogram is a basic tool in audio spectral analysis and other applications. It has been used extensively in speech analysis [56]. The spectrogram can be defined as an intensity plot (usually on a log scale, such as dB) of the Short-Time Fourier Transform (STFT) magnitude. As defined in the previous section, the STFT is simply a sequence of FFTs of windowed data segments, where the windows are allowed to overlap in time, typically by at least 50% [9]. Parameters of the spectrogram include the

  • window length $M$,
  • window type (Hamming, Kaiser, etc.),
  • hop-size $R$, and
  • FFT length $N$.
As discussed in Chapter 5, the window length $M$ controls frequency resolution, the window type controls side-lobe suppression (at the expense of resolution when $M$ is fixed), and the FFT length $N$ determines how much spectral oversampling (interpolation) is to be provided. The new hop-size parameter $R$ determines how much oversampling there will be along the time dimension. For $R=1$ (the "sliding FFT"), there is no downsampling over time, so oversampling is maximized. For a periodic Hamming window, $R=(M-1)/2$ gives maximum downsampling of the sliding FFT without time aliasing. Avoiding time aliasing corresponds to retaining "robust perfect reconstruction" in the inverse STFT.
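
The connection between hop size and reconstruction can be checked numerically. The following Python sketch (standard library only, and separate from the Matlab used elsewhere in this section) assumes a periodic Hamming window with period P, w(n) = 0.54 - 0.46 cos(2 pi n / P), and verifies that copies shifted by R = P/2 overlap-add to the constant 1.08, which is one ingredient of overlap-add resynthesis. The function names are hypothetical, and identifying P with M-1 assumes that the last sample of a symmetric length-M window is dropped to make it periodic.

```python
import math

def periodic_hamming(P):
    """One period of the Hamming window 0.54 - 0.46*cos(2*pi*n/P)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / P) for n in range(P)]

def overlap_add_sum(w, R, num_frames):
    """Sum num_frames copies of w, each shifted by the hop size R."""
    total = [0.0] * (R * (num_frames - 1) + len(w))
    for m in range(num_frames):
        for n, wn in enumerate(w):
            total[m * R + n] += wn
    return total

P = 64                       # window period (assumed even)
R = P // 2                   # hop size giving 50% overlap
w = periodic_hamming(P)
s = overlap_add_sum(w, R, num_frames=8)

# Skip the first and last R samples, where only one window contributes.
steady = s[R:len(s) - R]
deviation = max(abs(v - 1.08) for v in steady)
print(f"max deviation from 1.08 at R = P/2: {deviation:.2e}")
```

The constant 1.08 is twice the Hamming offset 0.54: the cosine terms of two windows half a period apart have opposite signs and cancel.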

The spectrogram is an important representation of audio data because human hearing is based on a kind of real-time spectrogram encoded by the cochlea of the inner ear [199]. The spectrogram has been used extensively in the field of computer music as a guide during the development of sound synthesis algorithms. When working with an appropriate synthesis model, matching the spectrogram often corresponds to matching the sound extremely well. In fact, spectral modeling synthesis (SMS) is based on synthesizing the short-time spectrum directly by some means (see §10.4) [303].

Spectrogram of Speech

Figure 7.2: Classic spectrogram of speech sample.

An example spectrogram for recorded speech data is shown in Fig. 7.2. It was generated using the Matlab code displayed in Fig. 7.3, which calls the function spectrogram listed in §F.3. That function computes the spectrogram as a sequence of FFTs of windowed data segments and plots it using imagesc.
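
As a rough illustration of "a sequence of FFTs of windowed data segments," the following Python sketch computes a dB-magnitude spectrogram using only the standard library. It is not the spectrogram function of §F.3: the names spectrogram_db and half_dft are hypothetical, a slow direct DFT stands in for the FFT for clarity, and the test signal (a 1 kHz sinusoid at an assumed 8 kHz sampling rate) is made up for the example.

```python
import cmath
import math

def hamming(M):
    """Symmetric Hamming window of length M (M >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (M - 1)) for n in range(M)]

def half_dft(x, N):
    """Bins 0..N/2 of the length-N DFT of x, zero-padded to N (O(N^2))."""
    return [sum(xn * cmath.exp(-2j * math.pi * k * n / N) for n, xn in enumerate(x))
            for k in range(N // 2 + 1)]

def spectrogram_db(x, M, R, N, floor_db=-100.0):
    """Return one list of N//2 + 1 dB magnitudes per frame (hop size R)."""
    w = hamming(M)
    frames = []
    for start in range(0, len(x) - M + 1, R):
        segment = [x[start + n] * w[n] for n in range(M)]   # windowed data segment
        mags = [abs(Xk) for Xk in half_dft(segment, N)]     # STFT magnitude
        frames.append([max(20.0 * math.log10(m), floor_db) if m > 0.0 else floor_db
                       for m in mags])
    return frames

fs = 8000.0                                                  # assumed sampling rate
x = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(400)]
M, R, N = 64, 16, 128                                        # window, hop, FFT length
S = spectrogram_db(x, M, R, N)
peak_bins = [frame.index(max(frame)) for frame in S]
print(f"{len(S)} frames x {len(S[0])} bins; "
      f"peak in first frame at {peak_bins[0] * fs / N:.1f} Hz")
```

Each inner list is one column of the intensity plot. Here the hop R = M/4 gives 75% overlap, and N = 2M gives a factor of two of spectral interpolation.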

Figure 7.3: Matlab for computing a speech spectrogram.

 
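[y,fs] = audioread('SpeechSample.wav'); % wavread in older Matlab
soundsc(y,fs); % Let's hear it
% for classic look (dark = high energy):
colormap('gray'); map = colormap; imap = flipud(map);
M = round(0.02*fs);  % 20 ms window is typical
N = 2^nextpow2(4*M); % zero padding for interpolation
w = hamming(M);      % length-M Hamming window
% spectrogram here is the function listed in Section F.3,
% not the Signal Processing Toolbox function of the same name
spectrogram(y,N,fs,w,-M/8,1,60);
title('Speech Sample Spectrogram');
colormap(imap);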

In this example, the Hamming window length was chosen to be 20 ms, a common choice in speech analysis. This is short enough that any single 20 ms frame will typically contain data from only one phoneme, yet long enough to include at least two periods of the fundamental frequency during voiced speech, assuming the lowest voiced pitch is around 100 Hz.
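
The arithmetic behind these choices can be spelled out for a concrete case. The Python sketch below mirrors the parameter lines of Fig. 7.3 under two assumptions that the text does not state: a sampling rate of 8 kHz for the speech file, and that the negative fifth argument -M/8 passed to spectrogram denotes a hop size of M/8 samples. The function name analysis_parameters is hypothetical.

```python
import math

def analysis_parameters(fs, window_seconds=0.02, lowest_pitch_hz=100.0):
    """Derive STFT parameters the way the Matlab listing does."""
    M = int(math.floor(window_seconds * fs + 0.5))  # like round(0.02*fs)
    N = 1 << (4 * M - 1).bit_length()               # like 2^nextpow2(4*M)
    return {
        "M": M,                                     # window length, samples
        "N": N,                                     # FFT length, samples
        "hop": M / 8,                               # assumed hop size, samples
        "zero_padding_factor": N / M,               # spectral interpolation
        "bin_spacing_hz": fs / N,                   # spacing of FFT bins
        "periods": (M / fs) * lowest_pitch_hz,      # pitch periods per frame
    }

p = analysis_parameters(fs=8000.0)                  # 8 kHz is an assumption
for name, value in p.items():
    print(f"{name:>20}: {value}")
```

At this assumed rate the 20 ms window is 160 samples long and spans two periods of a 100 Hz fundamental, matching the reasoning above.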

More generally, for speech and the singing voice (and any periodic tone), the STFT analysis parameters are chosen to trade off among the following conflicting criteria:

  1. The harmonics should be resolved.
  2. Pitch and formant variations should be closely followed.
The formants in speech are the low-frequency resonances in the vocal tract. They appear as dark groups of harmonics in Fig. 7.2. The first two formants largely determine the "vowel" in voiced speech. In telephone speech, nominally between 200 and 3200 Hz, only three or four formants are usually present in the band.
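
The tension between the two criteria above can be made concrete with a small calculation. The Python sketch below assumes the common rule of thumb that the main lobe of a Hamming window is about four bins wide, i.e. roughly 4/T Hz for a window lasting T seconds, and treats harmonics as resolved when that width does not exceed the harmonic spacing. This is a deliberately conservative criterion chosen for illustration, and the function names are hypothetical.

```python
def hamming_mainlobe_width_hz(window_ms):
    """Approximate null-to-null main-lobe width (Hz) of a Hamming window."""
    return 4000.0 / window_ms          # 4/T with T given in milliseconds

def harmonics_resolved(window_ms, pitch_hz):
    """True if adjacent harmonics are at least one main-lobe width apart."""
    return hamming_mainlobe_width_hz(window_ms) <= pitch_hz

pitch_hz = 100.0                       # lowest voiced pitch assumed in the text
for window_ms in (10, 20, 40):
    width = hamming_mainlobe_width_hz(window_ms)
    verdict = "resolved" if harmonics_resolved(window_ms, pitch_hz) else "not fully resolved"
    print(f"{window_ms:3d} ms window: main lobe ~{width:5.0f} Hz, "
          f"{pitch_hz:.0f} Hz harmonics {verdict}")
```

Lengthening the window narrows the main lobe and favors criterion 1, but each frame then averages over a longer stretch of time, so pitch and formant changes are followed less closely (criterion 2).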

