It seems to me that cepstral mean subtraction (CMS), which is commonly performed in computer speech recognition systems, corresponds to a circular convolution. I am curious what people think about how I'm viewing this and about the possible effects of changing CMS to use a linear convolution.

This post turned out very long, so if you're in a hurry please just read from the fourth paragraph to the eighth paragraph.

I'll start by giving an example of what I consider to be standard practice in DSP. In the Audacity audio editor there is an Equalization filter dialog in which the user can draw the magnitude response of a filter with the mouse. This gives a magnitude spectrum M(k) with N points. I don't suppose that Audacity processes the audio by taking N points of audio at a time, calculating the DFT of those N points, multiplying it by M(k), and then going back to the time domain by calculating the IDFT. That approach would result in a circular convolution rather than a linear convolution. Instead, Audacity presumably either converts M(k) to the time domain and does linear convolution in the time domain, or uses zero padding so that multiplication in the DFT domain will correspond to linear convolution. (A quick look at the source code suggests it uses the latter approach.)

Computer speech recognition systems normally window the signal (e.g., with 25 millisecond Hamming windows) to obtain a sequence of frames, and calculate a feature vector (i.e., a signal description) for each frame. Usually, the feature vector is made of Mel-Frequency Cepstral Coefficients (MFCCs), which are a type of smoothed spectral representation. It is common to process the MFCCs using CMS to reduce the effects of convolutional distortions. (I'll explain the idea behind CMS at the bottom of this message, for those who are not familiar with it.)

To simplify the discussion, I'll pretend CMS is done before the DCT stage of MFCC calculation, instead of after the DCT.
(I guess this does not affect the final values of the "static" MFCC features, due to the linearity of the DCT. I guess that's also true for the "delta" features, but I'm less sure about those, so I'll just ignore them in this discussion.) In this case, each feature vector contains 23 (or so) logarithmic, Mel-warped spectral magnitude values. (The Mel-warping is a warping of the frequency axis.) To perform CMS, we find the mean M of the feature vectors, and then subtract M from each feature vector. This is meant to remove the effect of a fixed (i.e., not time-varying) convolutional distortion. The distortion could come from a microphone, a telephone connection, room acoustics, and so on.

The subtraction of M from all the logarithmic spectral vectors is equivalent to dividing the corresponding linear spectral vectors by exp(M), and that is equivalent to multiplying the linear spectral vectors by 1/exp(M). So it seems that an equivalent way to look at CMS is that we design a 23-point filter 1/exp(M), and then apply it to the linear spectral vectors by multiplying the 23 data points from each of those vectors by the 23 filter points. But that corresponds to circular convolution, right?

So I'm wondering: how would the effect of the processing be different if we instead transformed 1/exp(M) to the time domain, did a linear convolution of the input waveform with 1/exp(M), and then calculated our final MFCCs using the output of that linear convolution? I'm curious what people think.

One reason I suppose the proposed approach might be better is that the effect of the convolutional distortion can cross frame boundaries. If a phoneme (speech sound) has length L_p, and the convolutional distortion has length L_c, then the convolution of the two will have length L_c + L_p - 1. Thus the convolution can cause speech information that was previously confined to one frame to spread across a frame boundary.
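As an aside, the equivalence claimed above (subtracting the mean M in the log domain = multiplying the linear spectral values by the fixed per-bin gain 1/exp(M)) is easy to check numerically. This is just an illustrative sketch with random numbers standing in for log Mel-spectral vectors; the sizes are arbitrary:

```python
import numpy as np

# Numerical check: subtracting the per-bin mean M from log spectral vectors
# is the same as multiplying the linear spectral vectors by 1/exp(M).
rng = np.random.default_rng(1)
n_frames, n_bins = 100, 23
log_spec = rng.standard_normal((n_frames, n_bins))   # stand-in log Mel spectra

M = log_spec.mean(axis=0)                 # per-bin mean over frames
cms = log_spec - M                        # CMS-style mean subtraction

lin_spec = np.exp(log_spec)               # corresponding linear values
filtered = lin_spec * (1.0 / np.exp(M))   # apply the 23-point "filter"

print(np.allclose(np.exp(cms), filtered)) # True: the two views agree
```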
When we multiply by 1/exp(M), we are trying to undo the effects of the convolutional distortion, but we are multiplying each frame by 1/exp(M) separately, so there is no way to push energy back from a frame to the previous frame.

One problem with the proposed approach is that we don't have a phase spectrum for exp(M) to use when transforming exp(M) to the time domain. I hope assigning a minimum-phase phase spectrum to exp(M) would be reasonable, because minimum phase corresponds to minimum energy delay, and I guess many convolutional channels will have impulse responses which start strong and then decay quickly. Given a magnitude spectrum, a minimum-phase phase spectrum can be computed quickly. (I have Matlab code for it.) If you have another idea for the phase spectrum, I would like to hear about it. Alternatively, since the MFCC values are based only on the magnitude spectrum of the framed data and not on the phase spectrum of the framed data, maybe there is a way to find the MFCC values resulting from linear convolution without needing to assign a phase spectrum to exp(M)? If this is possible, I suppose it would be better than assuming a minimum-phase phase spectrum.

When I wrote, "But that corresponds to circular convolution, right?", I glossed over the effect of the Mel-warping. I don't know if or how the Mel-warping interacts with whether or not circular convolution occurs. However, in any case, it's possible to do something very similar to CMS before the Mel-warping, by doing the mean calculation and subtraction with logarithmic, unwarped magnitude spectra instead of logarithmic, Mel-warped magnitude spectra. This alternative to CMS has been named log-DFT mean normalization (LDMN) and was discussed in C. Avendano and H.
Hermansky's paper "On the Effects of Short-Term Spectrum Smoothing in Channel Normalization" (IEEE Transactions on Speech and Audio Processing, 1997; paper "avendano97c" at http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all) and Neumeyer et al.'s paper "Training Issues and Channel Equalization Techniques for the Construction of Telephone Acoustic Models Using a High-Quality Speech Corpus" (IEEE Transactions on Speech and Audio Processing, 1994; http://citeseer.ist.psu.edu/278439.html).

In fact, I have unpublished results showing LDMN sometimes performs better than CMS, although in most of my tests it did not make a statistically significant difference whether I did LDMN or CMS. (By the way, the reason I did not publish those results is that I felt I did not have enough comparisons to make a good paper. Please get in touch with me if you are interested in trying to co-author a good paper by running more LDMN and CMS comparisons to augment the ones I've already done.)

If doing the mean subtraction before the Mel-warping is not attractive, I note that it is possible to transform a Mel-warped spectrum to the time domain. I have seen this done as part of the Mel-warped Wiener filter in Agarwal and Cheng's ASRU 99 paper and in the ETSI ADSR standard (http://webapp.etsi.org/WorkProgram/Report_WorkItem.asp?WKI_ID=25817).

Thanks for your time,
David Gelbart

PS Here is the explanation of CMS that I promised. I'll ignore the Mel-warping and the DCT, so this will be a discussion of LDMN, not CMS. But the same ideas carry over to CMS (although, as discussed in the papers on LDMN cited above, doing the subtraction after the Mel-warping seems to require making additional assumptions).

Say we have a time-domain signal y which is the convolution of a speech signal x and a convolutional distortion c. We window our signal (e.g., with a Hamming window) to obtain a sequence of frames. Let's assume that c does not change from frame to frame.
Let's further assume that, therefore, in the DFT domain, we have |Y(n,k)| = |X(n,k) C(k)|, where n is the frame index and k is the frequency bin. (By the way, this assumption that convolution becomes multiplication is inaccurate, because it ignores the effect that the window function used for framing has on Y. This is discussed in Avendano's PhD thesis, which is available as paper "avendano97a" at http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all. But this assumption is useful for deriving LDMN and CMS. The assumption will be more accurate when the window function is long compared to the length of c(n).)

Thus, log|Y(n,k)| = log|X(n,k)| + log|C(k)|. Therefore the mean over n of log|Y(n,k)| is M(k) = F(k) + log|C(k)|, where F(k) = mean_over_n( log|X(n,k)| ). If we subtract M(k) from log|Y(n,k)|, the new frames will have the value Z(n,k) = log|Y(n,k)| - M(k) = log|X(n,k)| - F(k). So we have removed the effect of the convolutional distortion c. We have also removed F(k), which wasn't our goal, but if there are enough frames then F(k) will be a long-term average that does not carry much information about the words that were spoken. I suppose that we can, if we like, view F as corresponding to a hypothetical convolutional distortion applied to a hypothetical speech signal z which corresponds to Z(n,k).
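The derivation in the PS can be checked numerically. In this sketch, random numbers stand in for log|X(n,k)| and an arbitrary fixed vector stands in for the channel log|C(k)|; the sizes are illustrative:

```python
import numpy as np

# Check of the LDMN derivation: if |Y(n,k)| = |X(n,k)| |C(k)| with a fixed
# channel C, subtracting the per-bin mean of log|Y| over frames leaves
# log|X(n,k)| - F(k), with no dependence on C at all.
rng = np.random.default_rng(2)
n_frames, n_bins = 50, 129
log_X = rng.standard_normal((n_frames, n_bins))  # stand-in for log|X(n,k)|
log_C = rng.standard_normal(n_bins)              # fixed channel log|C(k)|

log_Y = log_X + log_C                            # log|Y| = log|X| + log|C|
Z = log_Y - log_Y.mean(axis=0)                   # LDMN: subtract mean over n

F = log_X.mean(axis=0)                           # F(k) = mean over n of log|X|
print(np.allclose(Z, log_X - F))                 # True: channel term is gone
```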

# Convolution and cepstral mean subtraction

Started by ●March 9, 2007

Reply by ●March 9, 2007

Sorry, the URL http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all should have been http://www.bme.ogi.edu/~hynek/cgi-bin/publications/showbib_asp.pl?all both times it occurred. That's the URL for the Avendano papers that I referenced. A shorter URL, http://www.bme.ogi.edu/~hynek/publications, also works.

Reply by ●March 10, 2007

On 10 Mar, 00:16, "David Gelbart" <for-spam-o...@mailinator.com> wrote:
> It seems to me that cepstral mean subtraction (CMS), which is commonly
> performed in computer speech recognition systems, corresponds to a
> circular convolution.

I wrote a lengthy response this morning, but it seems it might have become lost somewhere. The short version is that the circular vs linear convolution issue comes up whenever one uses the DFT/IDFT transform pair. If circular convolution is a concern, then the user has two choices: 1) not use the DFT/IDFT pair, or 2) deal with the issues that arise when using the DFT/IDFT pair. As it happens, it is impossible to compute the cepstrum without using the DFT/IDFT pair -- twice, actually -- so one has no choice but to identify any problems caused by circular convolution and then deal with them accordingly.

Rune
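[A minimal sketch of the two uses of the transform pair Rune mentions, here for the real cepstrum; the signal length and the small floor constant are illustrative:]

```python
import numpy as np

# The DFT/IDFT pair appears twice on the way to the cepstrum: once to reach
# the spectrum, once more (after the log nonlinearity) to reach quefrency.
rng = np.random.default_rng(3)
x = rng.standard_normal(512)

spectrum = np.fft.fft(x)                       # first transform: time -> frequency
log_mag = np.log(np.abs(spectrum) + 1e-12)     # log magnitude (floor avoids log 0)
cepstrum = np.real(np.fft.ifft(log_mag))       # second transform: -> quefrency

# The real cepstrum of a real signal is even (c[n] == c[N-n]),
# because log|X(k)| is real and even.
print(np.allclose(cepstrum[1:], cepstrum[:0:-1]))  # True
```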

Reply by ●March 10, 2007

On Mar 9, 6:16 pm, "David Gelbart" <for-spam-o...@mailinator.com> wrote:
> It seems to me that cepstral mean subtraction (CMS), which is commonly
> performed in computer speech recognition systems, corresponds to a
> circular convolution. [...]
> Instead, Audacity presumably either
> converts M(k) to the time domain and does linear convolution in the
> time domain, or uses zero padding so that multiplication in the DFT
> domain will correspond to linear convolution. (A quick look at the
> source code suggests it uses the latter approach.)

sounds like it could be "fast convolution", which is something we discuss here and is in the texts like O&S. but one thing to keep in mind: not just any arbitrary old function for M(k) can result in an FIR short enough to fit in the length of the zero-padding you mention. only a certain class of M(k) works, so it couldn't be both fast convolution and "draw your own frequency response", unless there is a step in there that time-limits the impulse response and ends up fudging your M(k).

> [...] So it seems that an equivalent way to look
> at CMS is that we design a 23-point filter 1/exp(M), and then apply it
> to the linear spectral vectors by multiplying the 23 data points from
> each of those vectors by the 23 filter points. But that corresponds
> to circular convolution, right?
>
> So I'm wondering: how would the effect of the processing be different
> if we instead transformed 1/exp(M) to the time domain, did a linear
> convolution of the input waveform with 1/exp(M), and then calculated
> our final MFCCs using the output of that linear convolution?

and then slowly slew the coefs to the linear FIR for each frame, and you don't have frame-edge effects (that you would get with circular convolution).

> One reason I suppose the proposed approach might be better is that the
> effect of the convolutional distortion can cross frame boundaries.

but not in your case, since you're doing linear convolution.

> If a phoneme (speech sound) has length L_p, and the convolution
> distortion has length L_c, then the convolution of the two will have
> length L_c + L_p - 1. Thus the convolution can cause speech
> information that was previously confined to one frame to spread across
> a frame boundary.

but the frame boundaries happen at sorta random times from the POV of the audio. this information is likely spread across these artificially placed boundaries anyway. using uniformly-spaced frames is like uniform sampling of the parameters (like the magnitudes of spectral components), so those parameters aren't allowed to change very rapidly relative to the frame rate, lest the frame rate not be fast enough.

there is more to your post to read. feel free to restate whatever i'm missing ahead. i need more time to read and to "get" it.
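[A sketch of the kind of time-limiting step described above: take an arbitrary "drawn" magnitude response M(k), go to the time domain, truncate and window the impulse response, and accept that the realized response no longer matches M(k) exactly. The window choice and sizes here are illustrative:]

```python
import numpy as np

rng = np.random.default_rng(7)
N = 256
half = rng.uniform(0.1, 1.0, N // 2 + 1)      # arbitrary "drawn" magnitudes
M = np.concatenate([half, half[-2:0:-1]])     # even symmetry => real IDFT

h = np.real(np.fft.ifft(M))                   # full-length impulse response
h = np.fft.fftshift(h)                        # center it for windowing
taps = 33
mid = N // 2
h_short = h[mid - taps // 2 : mid + taps // 2 + 1] * np.hamming(taps)

# the short FIR no longer realizes M(k) exactly -- M(k) has been "fudged"
# into a smoothed version of itself
M_actual = np.abs(np.fft.fft(h_short, N))
print(len(h_short))                           # 33
```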

Reply by ●March 10, 2007

On 10 Mar, 07:54, "robert bristow-johnson" <r...@audioimagination.com> wrote:
> On Mar 9, 6:16 pm, "David Gelbart" <for-spam-o...@mailinator.com> wrote:
> > Thus the convolution can cause speech
> > information that was previously confined to one frame to spread across
> > a frame boundary.
>
> but the frame boundaries happen at sorta random times from the POV of
> the audio. this information is likely spread across these artificially
> placed boundaries anyway.

If you do cepstrum processing, you have no choice but to use the DFT/IDFT pair to get between the time domain and the spectrum domain. I am not too worried about circular convolution caused by the DFT/IDFT pair between the spectrum domain and the cepstrum domain, since the spectrum is conjugate symmetric, periodic and all that. What we are left with, then, is DFT/IDFT in and out of the time domain, right? Isn't that standard practice in speech processing? There ought to be standard frameworks to handle that sort of stuff? Overlap-Add? Overlap-Save?

Rune

Reply by ●March 10, 2007

Hello Rune and Robert,

Thanks very much for your replies. It seems my initial response to you also got lost, so I am posting again.

> The short version is that the circular vs linear convolution issue comes up
> whenever one uses the DFT/IDFT transform pair.

I'm sorry, but I don't understand why circular convolution is an inescapable aspect of the DFT and IDFT?

> > Instead, Audacity presumably either
> > converts M(k) to the time domain and does linear convolution in the
> > time domain, or uses zero padding so that multiplication in the DFT
> > domain will correspond to linear convolution. (A quick look at the
> > source code suggests it uses the latter approach.)
>
> sounds like it could be "fast convolution" which is something we
> discuss here and is in the texts like O&S. but one thing to keep in
> mind, not any arbitrary old function for M(k) can result in a short
> enough FIR to fit in the length of the zero-padding you mention. only
> a certain class of M(k) works, so it couldn't be both fast convolution
> and "draw your own frequency response", unless there is a step in
> there that time-limits the impulse response and ends up fudging your
> M(k).

What I had in mind was filtering L samples of audio at a time using overlap-add. The L audio samples would be padded with N-1 zeros, and then taken to the frequency domain. Similarly, the N-point M(k) would be taken to the time domain, padded with L-1 zeros, and then taken back to the frequency domain to obtain an (N+L-1)-point M(k). It seems to me that any N-point M(k) can be used in this way, and so the zero-padding in itself does not require time-limiting the impulse response or using only certain classes of M(k)? But maybe I have misunderstood you. I notice that there is Audacity code that time-limits the impulse response using a window function. And I also notice that this is recommended in chapter 17 of the online Scientist and Engineer's Guide to Digital Signal Processing (at http://www.dspguide.com/ch17/1.htm). The reason given there is that the frequency response of the filter is likely to be lousy if a window function is not used.

> and then slowly slew the coefs to the linear FIR for each frame and
> you don't have frame-edge effects (that you would get with circular
> convolution).

The mean is calculated over all frames, so the same mean is being removed from all frames. So I don't see what there is to slew between. Did I misunderstand something?

> > One reason I suppose the proposed approach might be better is that the
> > effect of the convolutional distortion can cross frame boundaries.
>
> but not in your case since you're doing linear convolution.

By 'convolutional distortion' I meant the distortion I want to remove, which is due to telephone equipment, microphone, or room acoustics. In other words, the impulse response between the speech coming out of the mouth of the person talking and the data that ends up on the computer (I'm glossing over the analog-to-digital conversion for the sake of simplicity). So the effects of this can cross frame boundaries, since it happens before framing, and attempting to deconvolve it by doing cepstral mean subtraction in the usual way is not ideal for undoing the cross-frame smearing of energy, since CMS amounts to a circular convolution which does not extend past frame boundaries.

> but the frame boundaries happen at sorta random times from the POV of
> the audio. this information is likely spread across these artifically
> placed boundaries anyway.

That's a good point; uniformly spaced frames (which are the usual technique) will often be out of synch with the boundaries between different speech sounds. However, a convolutional distortion can make the situation worse. A convolutional distortion makes the received speech signal at a given point in time have an extra dependency on the past speech. Since the exact form of this dependency depends on the impulse response of the distortion, and these impulse responses vary, the dependency may be hard for the speech recognition system to model. (Speech recognition systems do tend to have some ability to model context-dependent effects. For example, it's common to model speech sounds with 'triphone' models in which the model for a particular speech sound can vary depending on the previous speech sound and the next speech sound. This is intended to reflect 'coarticulation', which is how the positions of the tongue and so on while a human produces a speech sound can be affected by the preceding and following speech sounds.)

How hard to model, I'm not sure. Presumably, all else being equal, it will be harder if the convolutional distortion is longer. Some information comes from Takiguchi et al. (in ICASSP 2004 and also in a 2006 journal paper). They "consider the reflection signal of the reverberant speech as additive noise and approximate it by a linear prediction from the preceding frame", which, if I read their paper correctly, improved speech recognition word accuracy from 91.2% to 94.0%. This was using audio corresponding to a microphone 2 m from the person talking, which corresponds to a longer convolutional distortion than many speech recognition applications. (By "reverberant speech" they mean the speech affected by the convolutional distortion.) I don't see a way to rule out the possibility that their linear prediction approach is also helping with the modeling of coarticulation, so I'm not sure if their gain comes completely from better modeling of the convolutional distortion.

Regards,
David
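[The overlap-add scheme described above can be sketched as follows. The particular M(k) and the block/filter sizes are illustrative; the point is that any N-point M(k) with even symmetry (so its IDFT is real) works, with no time-limiting required:]

```python
import numpy as np

rng = np.random.default_rng(4)
L, N = 128, 32                                 # block length, filter length
nfft = L + N - 1
x = rng.standard_normal(4 * L)                 # input split into 4 blocks

k = np.arange(N)
M = 1.0 + 0.5 * np.cos(2 * np.pi * k / N)      # an N-point real, even M(k)
h = np.real(np.fft.ifft(M))                    # its length-N impulse response
H = np.fft.fft(h, nfft)                        # re-sampled on the padded grid

y = np.zeros(len(x) + N - 1)
for start in range(0, len(x), L):
    block = x[start:start + L]                 # L samples at a time
    yb = np.real(np.fft.ifft(np.fft.fft(block, nfft) * H))
    y[start:start + nfft] += yb                # add the overlapping tail

print(np.allclose(y, np.convolve(x, h)))       # True: exact linear convolution
```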

Reply by ●March 11, 2007

On 11 Mar, 00:02, "David Gelbart" <for-spam-o...@mailinator.com> wrote:
> > The short version is that the circular vs linear convolution issue comes up
> > whenever one uses the DFT/IDFT transform pair.
>
> I'm sorry, but I don't understand why circular convolution is an
> inescapable aspect of the DFT and IDFT?

The DFT is defined as the Fourier transform of a sequence of N numbers. The basis functions of the transform are all of the general form

  e_k(n) = exp(j*2*pi*k*n/N),  k = 0,1,...,N-1

Now, if one lets n exceed N, say, n = q*N+m, m>0, one gets

  e_k(q*N+m) = exp(j*2*pi*k*(q*N+m)/N)
             = exp(j*2*pi*k*q) * exp(j*2*pi*k*m/N)
             = 1^(k*q) * exp(j*2*pi*k*m/N)
             = exp(j*2*pi*k*m/N)

This is a basic property of exponential functions and causes wrap-around if one does not take active steps to prevent it.

As you know, linear convolution between an L-length sequence and an M-length sequence produces a sequence of length L+M-1. The only way to ensure that the convolution implemented in the frequency domain is linear is to actively prevent the wrap-around indicated above from ever happening. The only way to ensure this is to make sure the FFT length, N, is at least as large as the resulting sequence,

  N >= L+M-1.

Everything is caused by the periodicity

  e_k(q*N+m) = e_k(m)

as indicated above.

Rune
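[Rune's wrap-around can be demonstrated directly: with FFT length N < L+M-1, the frequency-domain product gives the linear convolution time-aliased modulo N; with N >= L+M-1 there is nothing left to wrap. Sizes are illustrative:]

```python
import numpy as np

rng = np.random.default_rng(5)
L, M = 48, 17
x = rng.standard_normal(L)
h = rng.standard_normal(M)
full = np.convolve(x, h)                       # linear result, length L+M-1

N = L                                          # too short: wrap-around occurs
circ = np.real(np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)))
aliased = full[:N].copy()
aliased[: L + M - 1 - N] += full[N:]           # fold the tail back onto the head
print(np.allclose(circ, aliased))              # True: circular = aliased linear

N2 = L + M - 1                                 # long enough: no wrap-around
lin = np.real(np.fft.ifft(np.fft.fft(x, N2) * np.fft.fft(h, N2)))
print(np.allclose(lin, full))                  # True: exactly linear
```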

Reply by ●March 11, 2007

On Mar 11, 11:04 am, "Rune Allnor" <all...@tele.ntnu.no> wrote:
> > I'm sorry, but I don't understand why circular convolution is an
> > inescapable aspect of the DFT and IDFT?

if Y[k] = H[k] * X[k]  ("*" means multiply)  then

          N-1
  y[n] =  SUM{ h[i] * x[n-i] }
          i=0

but it must be understood that x[n+N] = x[n] for all integer n, which results in y[n+N] = y[n] (for all n).

> The DFT is defined as the Fourier transform of a sequence
> of N numbers. The basis functions of the transform are
> all of the general form
>
>   e_k(n) = exp(j*2*pi*k*n/N),  k = 0,1,...,N-1

i would say that this basis applies for any contiguous set of N integers k. my spin, which sometimes gets me into fights here, but i still hold to doggedly (*and* i believe is supported by anal-retentive texts like O&S), is that the DFT maps a periodic and discrete-time function (with period N) to a periodic and discrete-frequency function having the same period. the DFT is precisely the same thing, from its mathematical definition to the applicable theorems, as the Discrete Fourier Series.

> This is a basic property of exponential functions and
> causes wrap-around if one does not take active steps
> to prevent it.
>
> [...]
>
> The only way to ensure this is to make sure the FFT
> length, N, is at least as large as the resulting
> sequence,
>
>   N >= L+M-1.

yeah, what he said. but i have to read your (long) post a little more to understand fully what it is that is going on to comment further.

--
r b-j  rbj@audioimagination.com

"Imagination is more important than knowledge."

Reply by ●March 12, 2007

Rune and Robert, thanks again. I haven't read your latest posts in full yet. I am posting to share comments from Dan Ellis about my original post. (By the way, his web page at www.ee.columbia.edu/~dpwe has an interesting collection of audio-related software and lecture slides.)

From: Dan Ellis
To: David Gelbart

David -

The point you raise is subtle and valid, but I don't believe it would have much effect on a task like speech recognition.

The difference between circular and linear convolution relies on the extent to which the convolution of the two sequences exceeds the basic FFT frame length, i.e., if both signals are length N, then the time-aliasing involved in circular convolution could have a large impact. If, on the other hand, one of the signals is K << N points, then over N-K+1 points of the resulting convolution, linear and circular convolution will match, and the discrepancy is limited to K-1 points.

Now, a short impulse response corresponds to a smooth frequency response (think, for instance, about windowing a longer time response, corresponding to convolving (smoothing) the spectrum with a broad window) - and thus multiplying the frequency response by a smooth function will be equivalent to convolving with a short impulse response (not always, but potentially - and particularly if the signal is zero phase, i.e., the multiplication is by a pure-real, smooth function). A 256-point spectrum projected onto 13 Mel cepstra will mostly be very smooth (except perhaps in the lowest frequency bands), so the effective impulse response is short.

The other way to think about it is to think about applying a linear filter to a continuous signal, and how that would play out in the MFCCs. The obvious difference between linear and circular convolution here is the small spill-over from the points in one frame to the next frame due to the time-extent of the impulse response. So if you have one frame with energy, and then the next frame is exactly zero, the effect of applying a (causal) filter will be to spill some energy into the otherwise silent frame. But in the absence of this kind of situation (where the energy of the spillover is significant in comparison to the energy within the frame), the effect of the linear filtering on the spectrum (and hence MFCCs) is pretty much what you expect - a frequency-dependent gain.

Again, the difference between linear convolution and scaling FFT bins gets larger as the equivalent impulse response becomes larger compared to the FFT frame size. But this is, I think, rare or unlikely for a filter defined in the MFCC domain, because of the spectral smoothness implicit in a low-dimensional representation.

DAn.

[I then asked Dan whether he felt the same way about LDMN as about CMS.]

From: Dan Ellis
To: David Gelbart

LDMN is removing the mean of each of 513 (or whatever) log spectral channels? In theory that could result in an effective filter that would differ significantly from linear filtering. But again, it's quite unlikely that a real signal would have such rapidly-varying statistics to result in such a filter. I think you can get around this by zero-padding all your frames prior to analysis. Doesn't force the effective filter to be N/2 long, but I doubt it wouldn't be.
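[Dan's point that a low-dimensional (smooth) log spectrum implies a short effective impulse response can be illustrated numerically: keep only ~13 cepstral coefficients of a rough random log spectrum, exponentiate back to a magnitude, and look at where the energy of the (zero-phase) impulse response sits. All sizes and the random spectrum are illustrative:]

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 256, 13
half = rng.standard_normal(N // 2 + 1)
log_mag = np.concatenate([half, half[-2:0:-1]])   # even => real cepstrum

cep = np.real(np.fft.ifft(log_mag))               # real cepstrum of the log spectrum
cep_smooth = np.zeros(N)
cep_smooth[:K] = cep[:K]                          # keep low quefrencies only
cep_smooth[-(K - 1):] = cep[-(K - 1):]            # (and their mirror images)

def energy_near_origin(mag, k):
    """Fraction of impulse-response energy within k taps of lag 0."""
    g = np.real(np.fft.ifft(mag))                 # zero-phase impulse response
    e = g ** 2
    return (e[:k].sum() + e[-(k - 1):].sum()) / e.sum()

frac_smooth = energy_near_origin(np.exp(np.real(np.fft.fft(cep_smooth))), K)
frac_raw = energy_near_origin(np.exp(log_mag), K)
print(round(frac_smooth, 3), round(frac_raw, 3))  # smoothed IR is far more compact
```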