Hi

I've been watching the lectures in the Coursera class Audio Signal Processing for Music Applications.

https://www.coursera.org/learn/audio-signal-processing?

I don't have much prior experience with DSP programming, or a higher education in mathematics. The course explains how to use a windowed STFT to decompose a signal into a sinusoidal (or harmonic) part and a residual (or stochastic) part, so that transformations can be applied to it. In one of the chapters, week 8, the lecturer demonstrates how Audacity uses this type of model to time-stretch audio without changing the pitch. He explains that Audacity uses a sinusoidal model very similar to the one presented in the class, but that it splits the signal into octaves and processes them individually. Quote:

"... the major difference, which is an important one, is that this sinusoidal model is based on a sub-band type of processing. So it does sinusoidal modeling by splitting the whole sound in octaves and modeling every octave with a different analysis, synthesis approach." - Week 8, Time Scaling, 3:55 - 4:20

So, how would one go about doing that? Suppose we make it easy by starting with only two bands. My first hunch was that one should low-pass filter the lowest band, and high-pass filter the higher band, in such a way that the two sum back to the original signal. This way, one would ensure that the modeled results of the higher band can be summed with the modeled results in the lower band without, for instance, losing or gaining energy.
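The two-band idea above can be sketched as follows. This is only an illustration under assumptions that are not in the course or in Audacity: the low band comes from a Hamming-windowed sinc FIR filter (the names `lowpass_taps` and `split_two_bands`, the 31 taps, and the cutoff of 0.25 cycles/sample are all hypothetical choices), and the high band is defined as whatever the low-pass removed, so the two bands sum back to the input by construction.

```python
import math

def lowpass_taps(num_taps=31, cutoff=0.25):
    """Hamming-windowed sinc low-pass; cutoff in cycles/sample (0..0.5).
    num_taps is assumed to be odd so the filter has a center tap."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC

def split_two_bands(x, taps):
    """Return (low, high) such that low[n] + high[n] == x[n]."""
    mid = (len(taps) - 1) // 2
    low = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            i = n + mid - k  # centered convolution, so the low band is not delayed
            if 0 <= i < len(x):
                acc += h * x[i]
        low.append(acc)
    # The high band is the complement: everything the low-pass removed.
    high = [xn - ln for xn, ln in zip(x, low)]
    return low, high

# Usage: a slow sine should land almost entirely in the low band.
x = [math.sin(2 * math.pi * 0.02 * n) for n in range(200)]
low, high = split_two_bands(x, lowpass_taps())
```

Defining the high band as a difference guarantees that the sum is exact, at the price of a high-pass response that is only as good as the low-pass it is derived from.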

My first, naive approach was to apply a (Hamming) window to a segment of the sound, take the FFT, and simply discard the upper FFT bins by setting them to zero. Then I processed the upper band the same way, discarding the lower bins instead. This would probably work if the window and hop sizes were the same in both bands (?), but to take advantage of this sort of parallel processing, one would probably do better to process the upper band with a shorter window to get better time resolution. Otherwise I don't see the point. So I process the upper band in four times as many segments (1/4 the window size) and discard the lower FFT bins. In the lower band I discard FFT bins 32 and up, and in the higher band I discard bins [0, 8). My "transformation" does nothing: it just converts from the time domain to the frequency domain and back, then overlap-adds, to see whether I get the same sound back.
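The experiment described above can be reproduced with a small sketch. Several details are assumptions, since the post does not give them: a low-band window of 128 samples and a high-band window of 32 (chosen so that "bins 32 and up" and "bins [0, 8)" meet at the same frequency), a hop of a quarter window, a periodic Hamming window, overlap-add normalized by the summed window, and white noise as the test signal. A naive DFT is used only to keep the example dependency-free; the helper names are hypothetical.

```python
import cmath
import math
import random

def dft(values, inverse=False):
    """Naive O(n^2) DFT; a real FFT would be used in practice."""
    n = len(values)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * m / n) for m in range(n)]
    out = [sum(values[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def resynth_band(x, size, keep):
    """Window, DFT, zero the bins where keep(freq) is False, inverse DFT, overlap-add.
    Assumes len(x) - size is a multiple of the hop, so every sample is covered."""
    hop = size // 4
    win = [0.54 - 0.46 * math.cos(2 * math.pi * i / size) for i in range(size)]
    out = [0.0] * len(x)
    norm = [0.0] * len(x)
    for start in range(0, len(x) - size + 1, hop):
        spec = dft([x[start + i] * win[i] for i in range(size)])
        for k in range(size):
            freq = min(k, size - k) / size  # cycles/sample; mirrors negative frequencies
            if not keep(freq):
                spec[k] = 0.0
        frame = dft(spec, inverse=True)
        for i in range(size):
            out[start + i] += frame[i].real
            norm[start + i] += win[i]
    return [o / w for o, w in zip(out, norm)]

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(256)]

one_band = resynth_band(x, 128, lambda f: True)
low = resynth_band(x, 128, lambda f: f < 0.25)    # drops bins 32 and up
high = resynth_band(x, 32, lambda f: f >= 0.25)   # drops bins [0, 8)
two_band = [l + h for l, h in zip(low, high)]

print("one band :", rms_error(one_band, x))
print("two bands:", rms_error(two_band, x))
```

With equal window sizes the kept bins of the two bands would be exact complements and the sum would be the original again. With different sizes, the two bin-zeroing operations have different transition shapes around the crossover, so they are no longer complementary and the sum deviates from the input.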

The distortion I get is considerable, whereas if I process only one band I get distortion of only about -320 dBFS. So, before doing more experimentation I'd like to know what is the commonly accepted way of doing this type of sub-band splitting? I can imagine a few:

1) Do the HPF/LPF in the time domain, before applying the windows and FFT

2) Do the filtering in the frequency domain, after windowing and FFT, but more properly with a FIR filter

3) Don't do any filtering at all. Just make sure to match any sinusoid detected in the lower band with a sinusoid detected in the higher band, so you don't detect the same sinusoid twice.

4) Forget about FFT, use Wavelets instead.

I'm not sure if (2) would even work.... would it? What is the generally accepted approach, if any?

This is called mel-scale sub-band coding (SBC), and it is used in many codecs.

The easiest way to implement it is with a tree structure. First divide your signal into two half-bands that sum to the original signal, and decimate each by 2. Then repeat the process on the low band, dividing it into two bands again; each further split of the low band adds one more octave-wide band.
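A minimal sketch of such a tree, under loudly stated assumptions: it uses the two-tap Haar filter pair because that pair reconstructs perfectly in a few lines, not because it is what any codec or Audacity uses (its frequency selectivity is poor, and practical designs use longer filters). The sketch recurses on the low band, since that is the choice that produces octave-wide bands. The function names are hypothetical, and the input length is assumed to be divisible by `2 ** levels`.

```python
import math

SQRT2 = math.sqrt(2.0)

def analyze(x):
    """Split x into a low and a high half-band, each decimated by 2 (Haar pair)."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high

def synthesize(low, high):
    """Inverse of analyze: interleave the two half-bands back into one signal."""
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / SQRT2)
        x.append((l - h) / SQRT2)
    return x

def octave_split(x, levels):
    """Repeatedly split the low band; returns bands from highest octave to lowest."""
    bands = []
    for _ in range(levels):
        x, high = analyze(x)
        bands.append(high)
    bands.append(x)  # what remains is the lowest band
    return bands

def octave_merge(bands):
    """Rebuild the signal by merging from the lowest band upward."""
    x = bands[-1]
    for high in reversed(bands[:-1]):
        x = synthesize(x, high)
    return x

signal = [math.sin(0.3 * n) + 0.5 * math.sin(2.5 * n) for n in range(16)]
bands = octave_split(signal, 3)
rebuilt = octave_merge(bands)
```

Because each band is decimated, the lower bands run at lower sample rates, which is what lets each octave be analyzed with its own time and frequency resolution.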

Since Audacity is open source and the code is quite readable, I also encourage you to download the code and look for yourself. I did this for a different effect once and that helped a great deal.