# Convolution and cepstral mean subtraction

Started by March 9, 2007
```It seems to me that cepstral mean subtraction (CMS), which is commonly
performed in computer speech recognition system, corresponds to a
circular convolution.  I am curious what people think about how I'm
viewing this and about the possible effects of changing CMS to use a
linear convolution.

This post turned out very long, so if you're in a hurry please just
read from the fourth paragraph to the eighth paragraph.

I'll start by giving an example of what I consider to be standard
practice in DSP.  In the Audacity audio editor there is an
response of a filter with the mouse.  This gives a magnitude spectrum
M(k) with N points.  I don't suppose that Audacity processes the audio
by taking N points of audio at a time, calculating the DFT of those N
points, multiplying it by
M(k), and then going back to the time domain by  calculating the
IDFT.  That approach would result in a  circular convolution rather
than linear convolution.   Instead, Audacity presumably either
converts M(k) to the time domain and does linear convolution in the
time domain, or uses zero padding so  that multiplication in the DFT
domain will correspond to linear convolution.  (A quick look at the
source code suggests it uses the latter approach.)

Computer speech recognition systems normally window the signal (e.g.,
with 25 millisecond Hamming windows) to obtain a sequence of frames,
and calculate a feature vector (i.e., a signal description) for each
frame.  Usually, the feature vector is made of Mel-Frequency Cepstral
Coefficients (MFCCs), which are a type of smoothed spectral
representation.   It is common to process the MFCCs using CMS to
reduce the effects of convolutional distortions.  (I'll explain the
idea behind CMS at the bottom of this message, for those who are not
familiar with it.)

To simplify the discussion, I'll pretend CMS is done before the DCT
stage of MFCC calculation, instead of after the DCT.  (I guess this
does not affect the final values of the "static" MFCC features, due to
the linearity of the DCT.  I guess that's also true for the "delta"
features, but I'm less sure about those, so I'll just ignore them in
this discussion.)  In this case, each feature vector contains 23 (or
so) logarithmic, Mel-warped spectral magnitude values.  (The Mel-
warping is a warping of the frequency axis.) To perform CMS, we find
the mean M of the feature vectors, and then subtract M from each
feature vector.  This is meant to remove the effect of a
fixed (i.e., not time varying) convolutional distortion. The
distortion could come from a microphone, a telephone connection, room
acoustics, and so on.  The subtraction
of M from all the logarithmic spectral vectors is equivalent to
dividing the  corresponding linear spectral vectors by exp(M), and
that is equivalent to multiplying the linear spectral vectors by 1/
exp(M).  So it seems that an equivalent way to look
at CMS is that we design a 23-point filter 1/exp(M), and then apply it
to the linear spectral vectors by multiplying the 23 data points from
each of those vectors by the 23 filter points.   But that corresponds
to circular convolution, right?

So I'm wondering: how would the effect of the processing be different
if we instead transformed 1/exp(M) to the time domain, did a linear
convolution of the input waveform with 1/exp(M), and then calculated
our final MFCCs using the output of that linear convolution?   I'm
curious what people think.

One reason I suppose the proposed approach might be better is that the
effect of the convolutional distortion can cross frame boundaries.
If a phoneme (speech sound) has length L_p, and the convolution
distortion has length L_c, then the convolution of the two will have
length L_c + L_p - 1.  Thus the convolution can cause speech
information that was previously confined to one frame to spread across
a frame boundary.   When we divide by  1/exp(M), we are trying to undo
the effects of the convolutional distortion, but we are dividing each
frame by 1/exp(M) separately, so there is no way to push energy back
from a frame to the previous frame.

One problem with the proposed approach is that we don't have a phase
spectrum for exp(M) to use when transforming exp(M) to the time
domain.  I hope assigning a minimum-phase phase spectrum to exp(M)
would be reasonable, because  min phase corresponds to minimum energy
delay, and I guess many convolutional channels will have impulse
responses which start strong and then decay quickly.  Given a
magnitude spectrum,  a minimum-phase phase spectrum can be computed
quickly.  (I have Matlab code for it.)    If you have another idea for
the phase spectrum, I would like to hear about it.

Alternatively, since the MFCC values are based only on the magnitude
spectrum of the framed data and  not on the phase spectrum of the
framed data, maybe there is a way to find the MFCC values resulting
from linear convolution without needing to assign a phase spectrum to
exp(M)?  If this is possible, I suppose it would be better than
assuming a min-phase phase spectrum.

When I wrote, "But that corresponds to circular convolution, right?",
I glossed over the effect of the Mel-warping.  I don't know if or how
the Mel-warping interacts with whether or not circular convolution
occurs.    However, in any case, it's possible to do something very
similar to CMS before the Mel warping, by doing the mean calculation
and subtraction with a logarithmic, unwarped magnitude spectra instead
of a
logarithmic, Mel-warped magnitude spectra.  This alternative to CMS
has been named log-DFT mean normalization (LDMN) and was discussed in
C. Avendano and H. Hermansky 's paper "On the Effects of Short-Term
Spectrum Smoothing in Channel Normalization" (IEEE Transactions on
Speech and Audio Processing, 1997; paper "avendano97c" at
http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all)
and Neumeyer et al.'s paper "Training Issues and Channel Equalization
Techniques for the Construction of Telephone Acoustic Models Using a
High-Quality Speech Corpus" (IEEE Transactions on Speech and Audio
Processing, 1994; http://citeseer.ist.psu.edu/278439.html).
In fact, I have unpublished results showing LDMN sometimes performs
better than CMS, although in most of my tests it did not  make a
statistically significant difference  whether I did LDMN or did
CMS.    (By the way, the reason I did not publish those results is
because I felt I did not have enough comparisons to make a good
with me if you are interested in trying to co-author a good paper by
running more LDMN and CMS comparisons to augment the ones I've already
done.)

If doing the mean subtraction before the Mel-warping is not
attractive, I note that it is possible to transform a Mel-warped
spectrum to the time domain.  I have seen this done as part of the Mel-
warped Wiener filter in Agarwal and Cheng's ASRU 99 paper and in the
Report_WorkItem.asp?WKI_ID=25817).

David Gelbart

PS Here is the explanation of CMS that I promised.  I'll ignore the
Mel-warping and the DCT, so this will be a discussion of LDMN, not
CMS.  But the same ideas carry over to CMS (although, as discussed in
the papers on LDMN cited above,  doing the subtraction after the Mel-
warping seems to require making additional assumptions).

Say we have a time-domain signal y which is the convolution of a
speech signal x and a convolutional distortion c.  We window our
signal (e.g., with a Hamming window) to obtain a sequence of frames.
Let's assume that c does not change from frame to frame.

Let's further assume that, therefore, in the DFT domain, we have |
Y(n,k)| = |X(n,k) C(k)| where n is the frame index and k is the
frequency bin.   (By the way, this assumption that convolution becomes
multiplication is inaccurate, because it ignores the effect that the
window  function used for framing has on Y.  This is discussed in
Avendano's PhD thesis, which is available as paper "avendano97a" a
http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all.
But this assumption is useful for deriving LDMN and CMS.  The
assumption will be more accurate when the window function is long
compared to the length of c(n).)

Thus, log|Y(n,k)| = log|X(n,k)| + log|C(k)|. Therefore the mean over n
of log|Y(n,k)| is M(k) = F(k) + log|C(k)|, where F(k) =
mean_over_n( log|X(n,k)| ).

If we subtract M(k) from log|Y(n,k)|, the
new frames will have the value Z(n,k) = log|Y(n,k)| - M(k) =  log |
X(n,k)| - F(k).  So we have removed the effect of the convolutional
distortion c.  We have also removed F(k), which wasn't our goal, but
if there are enough frames then F(k) will be a long-term average that
does not carry much information about the words that were spoken.  I
suppose that we can, if we like, view F as corresponding to  a
hypothetical convolutional distortion applied  to a hypothetical
speech signal z which corresponds to Z(n,k).

```
```Sorry, the URL http://www.bme.ogi.edu/~hynek/cgibin/publications/showbib_asp.pl?all
should have been
http://www.bme.ogi.edu/~hynek/cgi-bin/publications/showbib_asp.pl?all
both times it occurred.   That's the URL for the Avendano papers that
I referenced.  A shorter URL, http://www.bme.ogi.edu/~hynek/publications,
also works.

```
```On 10 Mar, 00:16, "David Gelbart" <for-spam-o...@mailinator.com>
wrote:
> It seems to me that cepstral mean subtraction (CMS), which is commonly
> performed in computer speech recognition system, corresponds to a
> circular convolution.

I wrote a lengthy response this morning, but it
seems it might have become lost somewhere.

The short version is that the circular vs linear
convolution issue comes up whenever one uses the
DFT/IDFT transform pair. If circular convolution
is a concern, then the user has two choises:

1) Not use the DFT/IDFT pair
2) Deal with the issues that arise when using
the DFT/IDFT pair.

As it happens, it is impossible to compute the
cepstrum without using the DFT/IDFT pair -- twice,
actually -- so one has no choise but to identify
any problems caused by circular convolution and
then deal with them accordingly.

Rune

```
```On Mar 9, 6:16 pm, "David Gelbart" <for-spam-o...@mailinator.com>
wrote:
> It seems to me that cepstral mean subtraction (CMS), which is commonly
> performed in computer speech recognition system, corresponds to a
> circular convolution.  I am curious what people think about how I'm
> viewing this and about the possible effects of changing CMS to use a
> linear convolution.
>
> This post turned out very long, so if you're in a hurry please just
> read from the fourth paragraph to the eighth paragraph.
>
> I'll start by giving an example of what I consider to be standard
> practice in DSP.  In the Audacity audio editor there is an
> Equalization filter dialog in which the user can draw the magnitude
> response of a filter with the mouse.  This gives a magnitude spectrum
> M(k) with N points.  I don't suppose that Audacity processes the audio
> by taking N points of audio at a time, calculating the DFT of those N
> points, multiplying it by
> M(k), and then going back to the time domain by  calculating the
> IDFT.  That approach would result in a  circular convolution rather
> than linear convolution.   Instead, Audacity presumably either
> converts M(k) to the time domain and does linear convolution in the
> time domain, or uses zero padding so  that multiplication in the DFT
> domain will correspond to linear convolution.  (A quick look at the
> source code suggests it uses the latter approach.)

sounds like it could be "fast convolution" which is something we
discuss here and is in the texts like O&S.  but one thing to keep in
mind, not any arbitrary old function for M(k) can result in a short
enough FIR to fit in the length of the zero-padding you mention.  only
a certain class of M(k) works, so it couldn't be both fast convolution
and "draw your own frequency response", unless there is a step in
there that time-limits the impulse response and ends up fudging your
M(k).

> Computer speech recognition systems normally window the signal (e.g.,
> with 25 millisecond Hamming windows) to obtain a sequence of frames,
> and calculate a feature vector (i.e., a signal description) for each
> frame.  Usually, the feature vector is made of Mel-Frequency Cepstral
> Coefficients (MFCCs), which are a type of smoothed spectral
> representation.   It is common to process the MFCCs using CMS to
> reduce the effects of convolutional distortions.  (I'll explain the
> idea behind CMS at the bottom of this message, for those who are not
> familiar with it.)
>
> To simplify the discussion, I'll pretend CMS is done before the DCT
> stage of MFCC calculation, instead of after the DCT.  (I guess this
> does not affect the final values of the "static" MFCC features, due to
> the linearity of the DCT.  I guess that's also true for the "delta"
> features, but I'm less sure about those, so I'll just ignore them in
> this discussion.)  In this case, each feature vector contains 23 (or
> so) logarithmic, Mel-warped spectral magnitude values.  (The Mel-
> warping is a warping of the frequency axis.) To perform CMS, we find
> the mean M of the feature vectors, and then subtract M from each
> feature vector.  This is meant to remove the effect of a
> fixed (i.e., not time varying) convolutional distortion. The
> distortion could come from a microphone, a telephone connection, room
> acoustics, and so on.  The subtraction
> of M from all the logarithmic spectral vectors is equivalent to
> dividing the  corresponding linear spectral vectors by exp(M), and
> that is equivalent to multiplying the linear spectral vectors by 1/
> exp(M).  So it seems that an equivalent way to look
> at CMS is that we design a 23-point filter 1/exp(M), and then apply it
> to the linear spectral vectors by multiplying the 23 data points from
> each of those vectors by the 23 filter points.   But that corresponds
> to circular convolution, right?
>
> So I'm wondering: how would the effect of the processing be different
> if we instead transformed 1/exp(M) to the time domain, did a linear
> convolution of the input waveform with 1/exp(M), and then calculated
> our final MFCCs using the output of that linear convolution?   I'm
> curious what people think.

and then slowly slew the coefs to the linear FIR for each frame and
you don't have frame-edge effects (that you would get with circular
convolution).

> One reason I suppose the proposed approach might be better is that the
> effect of the convolutional distortion can cross frame boundaries.

but not in your case since you're doing linear convolution.

> If a phoneme (speech sound) has length L_p, and the convolution
> distortion has length L_c, then the convolution of the two will have
> length L_c + L_p - 1.  Thus the convolution can cause speech
> information that was previously confined to one frame to spread across
> a frame boundary.

but the frame boundaries happen at sorta random times from the POV of
the audio.  this information is likely spread across these artifically
placed boundaries anyway.  using uniformly-spaced frames is like
uniform sampling of the parameters (like the magnitudes of spectral
components), so those parameters aren't allowed to change very rapidly
relative to the frame rate lest it not be fast enough.

there is more to your post to read.  feel free to restate whatever i'm
missing ahead.  i need more time to read and to "get" it.

```
```On 10 Mar, 07:54, "robert bristow-johnson" <r...@audioimagination.com>
wrote:
> On Mar 9, 6:16 pm, "David Gelbart" <for-spam-o...@mailinator.com>

> > Thus the convolution can cause speech
> > information that was previously confined to one frame to spread across
> > a frame boundary.
>
> but the frame boundaries happen at sorta random times from the POV of
> the audio.  this information is likely spread across these artifically
> placed boundaries anyway.

If you do cepstrum processing, you have no choise
but to use the DFT/IDFT pair to get between time
domain and spectrum domain. I am not too worried
about circular convolution caused by the DFT/IDFT
pair between spectrum domain and cepstrum domain,
since the spectrum is conjugate symmetric, periodic
and all that.

What we are left with, then, is DFT/IDFT in and
out of time domain, right? Isn't that standard
practice in speech processing? There ought to
be starndard frameworks to handle that sort

Rune

```
```Hello Rune and Robert,

Thanks very much for your replies.  It seems my initial response to
you also got lost, so I am posting again.

> The short version is that the circular vs linear convolution issue comes up
> whenever one uses the DFT/IDFT transform pair.

I'm sorry, but I don't understand why circular convolution is an
inescapable aspect of the DFT and IDFT?

>> converts M(k) to the time domain and does linear convolution in the
>> time domain, or uses zero padding so  that multiplication in the DFT
>> domain will correspond to linear convolution.  (A quick look at the
>> source code suggests it uses the latter approach.)
>
> sounds like it could be "fast convolution" which is something we
> discuss here and is in the texts like O&S.  but one thing to keep in
> mind, not any arbitrary old function for M(k) can result in a short
> enough FIR to fit in the length of the zero-padding you mention.  only
> a certain class of M(k) works, so it couldn't be both fast convolution
> and "draw your own frequency response", unless there is a step in
> there that time-limits the impulse response and ends up fudging your
> M(k).

What I had in mind was filtering L samples of audio at a time using
overlap-add.  The L audio samples would be padded with N-1 zeros, and
then taken to  the frequency domain.  Similarly, the N-point M(k)
would be taken to the time domain, padded with L-1 zeros, and then
taken back to the frequency domain to obtain an (N+L-1)-point M(k).
It seems to me that any N-point M(k) can be used in this way, and so
the zero-padding in itself does not require time-limiting the impulse
response or using only certain classes of M(k)?

But maybe I have misunderstood you.  I notice that there is Audacity
code that time-limits the impulse response using a window function.
And I also notice that this is recommended in chapter 17 of the online
Scientist and Engineer's Guide to Digital Signal Processing (at
http://www.dspguide.com/ch17/1.htm).   The reason given there is that
the frequency response of the filter is likely to be lousy if a window
function is not used.

> and then slowly slew the coefs to the linear FIR for each frame and
> you don't have frame-edge effects (that you would get with circular
> convolution).

The mean is calculated over all frames, so the same mean is being
removed from all frames.  So I don't see what there is to slew
between.  Did I misunderstand something?

>> One reason I suppose the proposed approach might be better is that the
>> effect of the convolutional distortion can cross frame boundaries.
> but not in your case since you're doing linear convolution.

By 'convolutional distortion' I meant the distortion I want to remove
which is due to telephone equipment, microphone, or room acoustics.
In other words, the impulse response between the speech coming out of
the mouth of the person talking and the data that ends up on the
computer (I'm glossing over the analog-to-digital conversion for the
sake of simplicity).  So the effects of this can cross frame
boundaries since it happens before framing, and attempting to
deconvolve this by doing cepstral mean subtraction in the usual way is
not ideal for undoing the cross-frame smearing of energy, since CMS
amounts to a circular convolution which does not extend past frame
boundaries.

> but the frame boundaries happen at sorta random times from the POV of
> the audio.  this information is likely spread across these artifically
> placed boundaries anyway.

That's a good point; uniformly spaced frames (which are the usual
technique) will often be out of synch with the boundaries between
different speech sounds.

However, a convolutional distortion can make the situation worse.  A
convolutional distortion makes the received speech signal at a given
point in time have an extra dependency on the past speech.  Since the
exact form of this dependency depends on the impulse response of the
distortion, and these impulse responses vary, the dependency may be
hard for the speech recognition system to model.  (Speech recognition
systems do tend to have some ability to model context dependent
effects.  For example, it's common to model speech sounds with
'triphone' models in which the model for a particular speech sound can
vary depending on the previous speech sound and the next speech
sound.  This is intended to reflect 'coarticulation', which is how the
positions of the tongue and so on while a human produces a speech
sound can be affected by the preceding and following speech sounds.)

How hard to model, I'm not sure.  Presumably, all else being equal, it
will be harder if the convolutional distortion is longer.   Some
information comes from Takiguchi et al. (in ICASSP 2004 and also in a
2006 journal paper). They "consider the reflection signal of the
reverberant speech as additive noise and approximate it by a linear
prediction from the preceding frame", which, if I read their paper
correctly, improved speech recognition word accuracy from 91.2% to
94.0%.  This was using audio corresponding to a microphone 2 m from
the person talking, which corresponds to a longer convolutional
distortion than many speech recognition applications.  (By
"reverberant speech" they mean the speech affected by the
convolutional distortion.)  I don't see a way to rule out the
possibility that their linear prediction approach is also helping with
the modeling of coarticulation, so I'm not sure if their gain comes
completely from better modeling of the convolutional distortion.

Regards,
David

```
```On 11 Mar, 00:02, "David Gelbart" <for-spam-o...@mailinator.com>
wrote:
> Hello Rune and Robert,
>
> Thanks very much for your replies.  It seems my initial response to
> you also got lost, so I am posting again.
>
> > The short version is that the circular vs linear convolution issue comes up
> > whenever one uses the DFT/IDFT transform pair.
>
> I'm sorry, but I don't understand why circular convolution is an
> inescapable aspect of the DFT and IDFT?

The DFT is defined as the Fourier tranform of a sequence
of N numbers. The basis functions of the transform are
all on the general form

e_k(n) = exp(j2*pi*k*n/N)       k = 0,1,...,N-1

Now, if one lets n exceed N, say, n = q*N+m, m>0
one gets

e'_k(q*N+m) = exp(j2*pi*k*(q*N+m)/N)
= exp(j2*pi*k*q*N/N)*exp(j2*pi*m/N)
= 1^q*exp(j2*pi*m/N)
= exp(j2*pi*m/N)

This is a basic propery of exponential functins and
causes wrap-around if one does not take active steps
to prevent it.

As you know, linear convolution between one L-length
sequence and a M-length sequence produces a sequence
of length L+M-1.

The only way to way to ensure the convolution in
frequency domain to be linear, is to actively prevent
the wrap-around as indicated above, from ever happening.
The only way to ensure this, is to make sure the FFT
length, N, is at least as large as the resulting
sequence,

N >= L+M-1.

Everything is caused by the periodicity

e'_k(q*N+m) = e_k(m)

as indicated above.

Rune

```
```On Mar 11, 11:04 am, "Rune Allnor" <all...@tele.ntnu.no> wrote:
> On 11 Mar, 00:02, "David Gelbart" <for-spam-o...@mailinator.com>
> wrote:
>
...
>
> > > The short version is that the circular vs linear convolution issue comes up
> > > whenever one uses the DFT/IDFT transform pair.
>
> > I'm sorry, but I don't understand why circular convolution is an
> > inescapable aspect of the DFT and IDFT?

if   Y[k] = H[k] * X[k]    ("*" means multiply)

then
N-1
y[n] = SUM{ h[i] * x[n-i] }
i=0

but, it must be understood that x[n+N] = x[n]  for all integer n which
results in y[n+N]=y[n] (for all n).

> The DFT is defined as the Fourier tranform of a sequence
> of N numbers. The basis functions of the transform are
> all on the general form
>
> e_k(n) = exp(j2*pi*k*n/N)       k = 0,1,...,N-1

i would say that this basis applies for any contiguous set of N
integers of k.

my spin, which sometimes gets me into fights here, but i still hold to
doggedly (*and* i believe is supported by anal-retentive texts like
O&S), is that the DFT maps a periodic and discete-time function (with
period N) to a periodic and discrete-frequency function having the
same period.  the DFT is precisely the same thing, from its
mathematical definition to the applicable theorems, as the Discrete
Fourier Series.

> This is a basic propery of exponential functins and
> causes wrap-around if one does not take active steps
> to prevent it.
>
> As you know, linear convolution between one L-length
> sequence and a M-length sequence produces a sequence
> of length L+M-1.
>
> The only way to way to ensure the convolution in
> frequency domain to be linear, is to actively prevent
> the wrap-around as indicated above, from ever happening.
> The only way to ensure this, is to make sure the FFT
> length, N, is at least as large as the resulting
> sequence,
>
>   N >= L+M-1.
>
> Everything is caused by the periodicity
>
>  e'_k(q*N+m) = e_k(m)
>
> as indicated above.

yeah, what he said.

but i have to read your (long) post a little more to understand fully
what it is that is going on to comment further.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."

```
```Rune and Robert, thanks again.  I haven't read your latest posts in
full yet.  I am posting to share comments from Dan Ellis about my
original post.   (By the way, his web page at  www.ee.columbia.edu/~dpwe
has an interesting collection of audio-related software and lecture
slides.)

From: Dan Ellis
To: David Gelbart

David -

The point you raise is subtle and valid, but I don't believe it would
have much effect on a task like speech recognition.  The difference
between circular and linear convolution relies on the extent to which
the convolution of the two sequences exceeds the length of the basic
FFT frame length i.e. if both signals are length N, then the time-
aliasing involved in circular convolution could have a large impact.
If, on the other hand, one of the signals is K << N points, then over
N-K+1 of the resulting convolution, linear and circular convolution
will match, and the discrepency is limited to K-1 points. Now, a short
impulse response corresponds to a smooth frequency response (think,
for instance, about windowing a longer time response, corresponding to
convolving (smoothing) the spectrum with a broad window) - and thus
multiplying the frequency response by a smooth function will be
equivalent to convolving with a short impulse response (not always,
but potentially - and particularly if the signal
is zero phase i.e. the multiplication is by a pure-real, smooth
function).  A 256 point spectrum projected onto 13 Mel Cepstra will
mostly be very smooth (except perhaps in the lowest frequency bands),
so the effective impulse response is short.

The other way to think about it is to think about applying a linear
filter to a continuous signal, and how that would play out in the
MFCCs.  The obvious difference between linear and circular convolution
here is the small spill-over from the points in one frame to the next
frame due to the time-extent of the impulse response.  So if you have
one frame with energy, then the next frame is exactly zero, the effect
of applying a (causal) filter will be to spill some energy into the
otherwise silent frame.  But in the absence of this kind of situation
(where the energy of the spillover is significant in comparison to the
energy within the frame), the effect of the linear filtering on the
spectrum (and hence MFCCs) is pretty much what you expect - a
frequency-dependent gain.  Again, the difference between linear
convolution and scaling FFT bins gets larger as the equivalent impulse
response becomes larger compared to the FFT frame size.  But this is,
I think, rare or unlikely for a filter defined in the MFCC domain,
because of the spectral smoothness implicit in a low-dimensional
representation.

DAn.

CMS.]

From: Dan Ellis
To: David Gelbart

LDMN is removing the mean of each of 513 (or whatever) log spectral
channels? In theory that could result in an effective filter that
would differ significantly from linear filtering.  But again, it's
quite unlikely that a real signal would have such rapidly-varying
statistics to result in such a filter.

I think you can get around this by zero-padding all your frames prior
to analysis.  Doesn't force the effective filter to be N/2 long, but I
doubt it wouldn't be.

```