It seems to me that cepstral mean subtraction (CMS), which is commonly
performed in computer speech recognition systems, corresponds to a
circular convolution. I am curious what people think about how I'm
viewing this and about the possible effects of changing CMS to use a
linear convolution.
This post turned out very long, so if you're in a hurry please just
read from the paragraph beginning "Computer speech recognition systems"
through the paragraph beginning "One problem with the proposed approach".
I'll start by giving an example of what I consider to be standard
practice in DSP. In the Audacity audio editor there is an
Equalization filter dialog in which the user can draw the magnitude
response of a filter with the mouse. This gives a magnitude spectrum
M(k) with N points. I don't suppose that Audacity processes the audio
by taking N points of audio at a time, calculating the DFT of those N
points, multiplying it by
M(k), and then going back to the time domain by calculating the
IDFT. That approach would result in a circular convolution rather
than linear convolution. Instead, Audacity presumably either
converts M(k) to the time domain and does linear convolution in the
time domain, or uses zero padding so that multiplication in the DFT
domain will correspond to linear convolution. (A quick look at the
source code suggests it uses the latter approach.)
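
To illustrate that difference concretely, here is a little Python/NumPy
sketch (not Audacity's actual code): multiplying the N-point DFTs
directly gives circular convolution, while zero-padding both signals
first gives linear convolution.

  import numpy as np

  x = np.random.randn(64)    # one block of N = 64 audio samples
  h = np.random.randn(16)    # impulse response of the filter (L = 16 points)

  # Circular convolution: multiply the 64-point DFTs directly.
  y_circ = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, 64)))

  # Linear convolution: zero-pad both to at least 64 + 16 - 1 points first.
  nfft = 128
  y_lin = np.real(np.fft.ifft(np.fft.fft(x, nfft) * np.fft.fft(h, nfft)))[:64 + 16 - 1]

  print(np.allclose(y_lin, np.convolve(x, h)))         # True
  print(np.allclose(y_circ, np.convolve(x, h)[:64]))   # False, due to wrap-around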
Computer speech recognition systems normally window the signal (e.g.,
with 25 millisecond Hamming windows) to obtain a sequence of frames,
and calculate a feature vector (i.e., a signal description) for each
frame. Usually, the feature vector is made of Mel-Frequency Cepstral
Coefficients (MFCCs), which are a type of smoothed spectral
representation. It is common to process the MFCCs using CMS to
reduce the effects of convolutional distortions. (I'll explain the
idea behind CMS at the bottom of this message, for those who are not
familiar with it.)
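
For readers who haven't seen this pipeline, here is a rough sketch of it
in Python using the librosa library (the file name and the exact
parameter values below are just placeholders, not a claim about any
particular recognizer):

  import librosa

  # Hypothetical input file; any speech recording will do.
  signal, sr = librosa.load("utterance.wav", sr=16000)

  # 25 ms frames with a 10 ms hop, 13 static MFCCs (typical ASR-style settings).
  mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                              n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

  # CMS: subtract the per-coefficient mean over all frames of the utterance.
  mfcc_cms = mfcc - mfcc.mean(axis=1, keepdims=True)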
To simplify the discussion, I'll pretend CMS is done before the DCT
stage of MFCC calculation, instead of after the DCT. (I guess this
does not affect the final values of the "static" MFCC features, due to
the linearity of the DCT. I guess that's also true for the "delta"
features, but I'm less sure about those, so I'll just ignore them in
this discussion.) In this case, each feature vector contains 23 (or
so) logarithmic, Mel-warped spectral magnitude values. (The Mel-
warping is a warping of the frequency axis.) To perform CMS, we find
the mean M of the feature vectors, and then subtract M from each
feature vector. This is meant to remove the effect of a
fixed (i.e., not time varying) convolutional distortion. The
distortion could come from a microphone, a telephone connection, room
acoustics, and so on. The subtraction
of M from all the logarithmic spectral vectors is equivalent to
dividing the corresponding linear spectral vectors by exp(M), and
that is equivalent to multiplying the linear spectral vectors by
1/exp(M). So it seems that an equivalent way to look
at CMS is that we design a 23-point filter 1/exp(M), and then apply it
to the linear spectral vectors by multiplying the 23 data points from
each of those vectors by the 23 filter points. But that corresponds
to circular convolution, right?
So I'm wondering: how would the effect of the processing be different
if we instead transformed 1/exp(M) to the time domain, did a linear
convolution of the input waveform with that time-domain filter, and then calculated
our final MFCCs using the output of that linear convolution? I'm
curious what people think.
One reason I suppose the proposed approach might be better is that the
effect of the convolutional distortion can cross frame boundaries.
If a phoneme (speech sound) has length L_p, and the convolutional
distortion has length L_c, then the convolution of the two will have
length L_c + L_p - 1. Thus the convolution can cause speech
information that was previously confined to one frame to spread across
a frame boundary. When we multiply by 1/exp(M), we are trying to undo
the effects of the convolutional distortion, but we are multiplying each
frame by 1/exp(M) separately, so there is no way to push energy back
from a frame to the previous frame.
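
A toy example of that spreading, in Python/NumPy (the lengths here are
made up just for illustration):

  import numpy as np

  x = np.zeros(200)
  x[90:100] = 1.0              # a burst confined to the end of frame 0 (samples 0-99)
  c = np.ones(20) / 20.0       # a 20-tap convolutional distortion

  y = np.convolve(x, c)        # linear convolution, length 200 + 20 - 1

  print(np.abs(x[100:200]).sum())   # 0.0: frame 1 of the clean signal is empty
  print(np.abs(y[100:200]).sum())   # > 0: the distortion smeared energy into frame 1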
One problem with the proposed approach is that we don't have a phase
spectrum for 1/exp(M) to use when transforming it to the time domain.
I hope assigning a minimum-phase phase spectrum to 1/exp(M) would be
reasonable: minimum phase corresponds to minimum energy delay, I guess
many convolutional channels will have impulse responses which start
strong and then decay quickly, and (if I remember correctly) the
inverse of a minimum-phase system is itself minimum phase. Given a
magnitude spectrum, a minimum-phase phase spectrum can be computed
quickly. (I have Matlab code for it.) If you have another idea for
the phase spectrum, I would like to hear about it.
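
In case it is useful, here is a sketch of the usual real-cepstrum
construction in Python/NumPy (a rough translation of the idea, not my
Matlab code; how 1/exp(M) gets interpolated from 23 Mel-warped points
onto a full DFT grid is glossed over, so "mag" below is just a
placeholder magnitude spectrum):

  import numpy as np

  def minimum_phase_ir(mag):
      """Given an N-point magnitude spectrum (all N DFT bins, symmetric as for
      a real signal, N even), return a minimum-phase impulse response with
      that magnitude response, via folding of the real cepstrum."""
      n = len(mag)
      cep = np.fft.ifft(np.log(np.maximum(mag, 1e-12))).real   # real cepstrum
      fold = np.zeros(n)
      fold[0] = 1.0
      fold[1:n // 2] = 2.0
      fold[n // 2] = 1.0
      return np.fft.ifft(np.exp(np.fft.fft(cep * fold))).real

  # Placeholder magnitude spectrum standing in for 1/exp(M) on a DFT grid.
  mag = np.abs(np.fft.fft(np.exp(-np.arange(64) / 4.0)))
  h = minimum_phase_ir(mag)
  print(np.allclose(np.abs(np.fft.fft(h)), mag))   # the magnitude is preserved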
Alternatively, since the MFCC values are based only on the magnitude
spectrum of the framed data and not on the phase spectrum of the
framed data, maybe there is a way to find the MFCC values resulting
from linear convolution without needing to assign a phase spectrum to
1/exp(M)? If this is possible, I suppose it would be better than
assuming a minimum-phase phase spectrum.
When I wrote, "But that corresponds to circular convolution, right?",
I glossed over the effect of the Mel-warping. I don't know if or how
the Mel-warping interacts with whether or not circular convolution
occurs. However, in any case, it's possible to do something very
similar to CMS before the Mel-warping, by doing the mean calculation
and subtraction with logarithmic, unwarped magnitude spectra instead
of logarithmic, Mel-warped magnitude spectra. This alternative to CMS
has been named log-DFT mean normalization (LDMN) and was discussed in
C. Avendano and H. Hermansky's paper "On the Effects of Short-Term
Spectrum Smoothing in Channel Normalization" (IEEE Transactions on
Speech and Audio Processing, 1997; paper "avendano97c" at
http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all)
and Neumeyer et al.'s paper "Training Issues and Channel Equalization
Techniques for the Construction of Telephone Acoustic Models Using a
High-Quality Speech Corpus" (IEEE Transactions on Speech and Audio
Processing, 1994; http://citeseer.ist.psu.edu/278439.html).
In fact, I have unpublished results showing LDMN sometimes performs
better than CMS, although in most of my tests it did not make a
statistically significant difference whether I used LDMN or CMS.
(By the way, the reason I did not publish those results is that I felt
I did not have enough comparisons to make a good paper. Please get in touch
with me if you are interested in trying to co-author a good paper by
running more LDMN and CMS comparisons to augment the ones I've already
done.)
If doing the mean subtraction before the Mel-warping is not
attractive, I note that it is possible to transform a Mel-warped
spectrum to the time domain. I have seen this done as part of the Mel-
warped Wiener filter in Agarwal and Cheng's ASRU 99 paper and in the
ETSI Advanced DSR (Distributed Speech Recognition) standard (http://webapp.etsi.org/WorkProgram/
Report_WorkItem.asp?WKI_ID=25817).
Thanks for your time,
David Gelbart
PS Here is the explanation of CMS that I promised. I'll ignore the
Mel-warping and the DCT, so this will be a discussion of LDMN, not
CMS. But the same ideas carry over to CMS (although, as discussed in
the papers on LDMN cited above, doing the subtraction after the Mel-
warping seems to require making additional assumptions).
Say we have a time-domain signal y which is the convolution of a
speech signal x and a convolutional distortion c. We window our
signal (e.g., with a Hamming window) to obtain a sequence of frames.
Let's assume that c does not change from frame to frame.
Let's further assume that, therefore, in the DFT domain, we have
|Y(n,k)| = |X(n,k) C(k)|, where n is the frame index and k is the
frequency bin. (By the way, this assumption that convolution becomes
multiplication is inaccurate, because it ignores the effect that the
window function used for framing has on Y. This is discussed in
Avendano's PhD thesis, which is available as paper "avendano97a" at
http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all.
But this assumption is useful for deriving LDMN and CMS. The
assumption will be more accurate when the window function is long
compared to the length of c.)
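
Here is a small synthetic illustration of that last point in
Python/NumPy (a quick sketch, not a careful experiment): the relative
error of the |Y(n,k)| = |X(n,k) C(k)| model for a single windowed frame
shrinks as the frame becomes long compared to c.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_normal(16000)               # stand-in for a speech signal
  c = np.exp(-np.arange(64) / 8.0)             # short, decaying channel impulse response
  y = np.convolve(x, c)                        # channel-distorted signal

  def model_error(frame_len, start=4000):
      """Relative error of |Y(k)| ~= |X(k)| |C(k)| for one Hamming-windowed frame."""
      w = np.hamming(frame_len)
      X = np.fft.rfft(w * x[start:start + frame_len])
      Y = np.fft.rfft(w * y[start:start + frame_len])
      C = np.fft.rfft(c, frame_len)
      err = np.abs(Y) - np.abs(X) * np.abs(C)
      return np.linalg.norm(err) / np.linalg.norm(np.abs(Y))

  print(model_error(128), model_error(4096))   # the longer frame gives a smaller error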
Thus, log|Y(n,k)| = log|X(n,k)| + log|C(k)|. Therefore the mean over n
of log|Y(n,k)| is M(k) = F(k) + log|C(k)|, where F(k) =
mean_over_n( log|X(n,k)| ).
If we subtract M(k) from log|Y(n,k)|, the new frames will have the
value Z(n,k) = log|Y(n,k)| - M(k) = log|X(n,k)| - F(k). So we have
removed the effect of the convolutional
distortion c. We have also removed F(k), which wasn't our goal, but
if there are enough frames then F(k) will be a long-term average that
does not carry much information about the words that were spoken. I
suppose that we can, if we like, view F as corresponding to a
hypothetical convolutional distortion applied to a hypothetical
speech signal z which corresponds to Z(n,k).
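
In code, the LDMN operation described above is only a few lines (a
Python/NumPy sketch; the framing and windowing that produce the input
are glossed over):

  import numpy as np

  def ldmn(frames, eps=1e-10):
      """frames: 2-D array with one windowed time-domain frame per row.
      Returns the log magnitude spectra Z(n,k) with the per-bin mean over
      frames, M(k), removed."""
      log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)   # log|Y(n,k)|
      M = log_mag.mean(axis=0)                                      # M(k)
      return log_mag - M                                            # Z(n,k)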