comp.dsp | time stretching for dictation

I am a windows programmer developing an application for medical
transcription. I have an audio player/recorder that supports windows
wav files (PCM). I am looking for help doing time stretching, or
simply put, slowing down the playback without changing the pitch, so
that the transcriptionist can more easily transcribe the dictation.

I've been looking in Google for a day now - I found lot's of brief
descriptions on granular synthesis and Fourier Transforms.
Unfortunately, I'm not a "dsp guy" an a lot of this went over my head.

I also downloaded the sourcecode for the CSound application, but I've
been unable to decipher just where the granular synthesis is being
done, nor do I think I could figure it out because it isn't documented
very well.

Can anyone help by posting a "simple" algorithm (or explanation) for
granular synthesis of PCM data? Source code would be great too!

Reply by robert bristow-johnson ●January 5, 20042004-01-05

In article b4c52ba8.0401041200.259077d1@posting.google.com, George Kustas at
gkustas@hvc.rr.com wrote on 01/04/2004 15:00:

> I am a windows programmer developing an application for medical
> transcription. I have an audio player/recorder that supports windows
> wav files (PCM). I am looking for help doing time stretching, or
> simply put, slowing down the playback without changing the pitch, so
> that the transcriptionist can more easily transcribe the dictation.
> 
> I've been looking in Google for a day now - I found lot's of brief
> descriptions on granular synthesis and Fourier Transforms.
> Unfortunately, I'm not a "dsp guy" an a lot of this went over my head.
> 
> I also downloaded the sourcecode for the CSound application, but I've
> been unable to decipher just where the granular synthesis is being
> done, nor do I think I could figure it out because it isn't documented
> very well.
> 
> Can anyone help by posting a "simple" algorithm (or explanation) for
> granular synthesis of PCM data? Source code would be great too!

i am not sure that you would want exactly "granular synthesis" but instead a
simple variant of the PSOLA (pitch synchronous overlap add) method or
sometimes called WSOLA (windowed synchronous overlap add) method.  the
concept is pretty simple:

given your input x[n], the output is a sum of delayed and cross-faded (the
"window" would be the concatenation of the fade-up and fade-down functions
applied to the adjacent sections of x[n]) sections of the input:

assuming identical length of fade-up and fade-down of N samples (so the
non-zero length of the window is 2*N-1 samples), the window function would
be w[n].  for example, if linear cross-fading was used, w[n] would be an
even symmetric triangular function that would be:


           { 1 - |n|/N      for  -N < n < N
    w[n] = {
           {    0           otherwise


the output would look like:

     y[n] = x[n]*w[n]  +  x[n-O(1)]*w[n-N]  +  x[n-O(2)]*w[n-2N]
                                                 +  x[n-O(3)]*w[n-3N]  + ...


    or, more generally:

    y[n] = SUM{ x[n-O(k)]*w[n-k*N] }
            k

where O(k) is the offset of the k_th segment to accomplish slowing down the
sound.  some people will call that segment, x[n-O(k)]*w[n-k*N], the k_th
"grain" but i am not sure i would.  only two windows will be non-zero at any
one time so you are really only mixing 2 windowed input segments.

if "alpha" is your time stretching factor (if you're stretching, alpha>1,
and if you're time-compressing, alpha<1) then

    n/alpha - P/2 <  n - O(k) < n/alpha + P/2

    for  k*N - N/2 <= n < k*N + N/2

where P is the maximum period of the input x[n] around the time of n/alpha.
the latter inequality is the same as

    k - 1/2 <= n/N < k + 1/2

which is the same as

    k = round(n/N)  =   floor(n/N + 1/2)

so     n/alpha - P/2 <  n - O(k) < n/alpha + P/2     is the same as

    |(1-1/alpha)*n - O(round(n/N))| < P/2

for all n.


the offset O(k) is an recursive result such that

    O(0) = 0

and the average magnitude difference function,

                (k+1)*N-1
    AMDF(k+1) =    SUM{    |x[n-O(k)] - x[n-O(k+1)]|  }
                  n=k*N

is minimum but still satisfies n/alpha - P/2 <  n - O(k+1) < n/alpha + P/2 .

try to detangle what i wrote and try it.  for monophonic dialog (only one
voice speaking), it should work pretty well.

r b-j

Reply by Jim Adamthwaite ●January 5, 20042004-01-05

Something which may help is the "Amazing Slow Downer" from Roni Music.  I
downloaded an evaluation copy (only slightly crippled) and was quite
impressed with it as a means of learning fast musical passages by ear.

They are contactable at http://www.ronimusic.com and info@ronimusic.com, or
at least they were 6 months ago.

Jim Adamthwaite.

Reply by George Kustas ●January 5, 20042004-01-05

> i am not sure that you would want exactly "granular synthesis" but instead a
> simple variant of the PSOLA (pitch synchronous overlap add) method or
> sometimes called WSOLA (windowed synchronous overlap add) method.

That's fine... as you say down below, monophonic voice is all I'm
using, so whatever the simplest solution is...

> given your input x[n], the output is a sum of delayed and cross-faded (the
> "window" would be the concatenation of the fade-up and fade-down functions
> applied to the adjacent sections of x[n]) sections of the input:

OK, you lost me already - sorry! What does "fading" mean? I understand
that my PCM data is simply integers representing amplitude. I
understand that I can determine the pitch by seeking the second
largest (or smallest) value, because this would be one "cycle" or
complete wave (am I using the right terminology?), or the "frequency"
of the waves.

Forgive me, I've only had jr. high physics, but is this windowing you
are describing have something to do with repeating cycles as I've
described them above? As I see it, I can preserve the pitch if I
repeat cycles, because the frequency should remain the same, no?

Reply by Cliff Chase ●January 6, 20042004-01-06

gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>...
> > i am not sure that you would want exactly "granular synthesis" but instead a
> > simple variant of the PSOLA (pitch synchronous overlap add) method or
> > sometimes called WSOLA (windowed synchronous overlap add) method.
> 
> That's fine... as you say down below, monophonic voice is all I'm
> using, so whatever the simplest solution is...
> 
> > given your input x[n], the output is a sum of delayed and cross-faded (the
> > "window" would be the concatenation of the fade-up and fade-down functions
> > applied to the adjacent sections of x[n]) sections of the input:
> 
> OK, you lost me already - sorry! What does "fading" mean? I understand
> that my PCM data is simply integers representing amplitude. I
> understand that I can determine the pitch by seeking the second
> largest (or smallest) value, because this would be one "cycle" or
> complete wave (am I using the right terminology?), or the "frequency"
> of the waves.
> 
> Forgive me, I've only had jr. high physics, but is this windowing you
> are describing have something to do with repeating cycles as I've
> described them above? As I see it, I can preserve the pitch if I
> repeat cycles, because the frequency should remain the same, no?

Your term granular synthesis is actually somewhat correct in that
PSOLA is in essence a subset of granular synthesis.  That aside, if
you've only had Jr. High Physics I think that this task is beyond you.
 What RBJ is describing is a method whereby the PCM data is read out
of a circular buffer at a different rate than it is written in.  Since
the rates are different you actually need two read pointers and you
fade between the pointers.  The term "fade" describes changing the
volume as you switch from one pointer to the next.

Also, your pitch estimation technique won't work.  Typical methods of
pitch estimation use autocorrelation, AMDF, or more exotic methods
like wavelet transforms.  You need to know the pitch for PSOLA to work
properly.

What you are trying to do is a mid-level DSP problem.  This kind of
thing is difficult for those with Bachelor's degrees and I would think
nearly impossible for someone with no formal DSP training.

CC

Reply by Jerry Avins ●January 6, 20042004-01-06

Cliff Chase wrote:

> gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>...
> 
>>>i am not sure that you would want exactly "granular synthesis" but instead a
>>>simple variant of the PSOLA (pitch synchronous overlap add) method or
>>>sometimes called WSOLA (windowed synchronous overlap add) method.
>>
>>That's fine... as you say down below, monophonic voice is all I'm
>>using, so whatever the simplest solution is...
>>
>>
>>>given your input x[n], the output is a sum of delayed and cross-faded (the
>>>"window" would be the concatenation of the fade-up and fade-down functions
>>>applied to the adjacent sections of x[n]) sections of the input:
>>
>>OK, you lost me already - sorry! What does "fading" mean? I understand
>>that my PCM data is simply integers representing amplitude. I
>>understand that I can determine the pitch by seeking the second
>>largest (or smallest) value, because this would be one "cycle" or
>>complete wave (am I using the right terminology?), or the "frequency"
>>of the waves.
>>
>>Forgive me, I've only had jr. high physics, but is this windowing you
>>are describing have something to do with repeating cycles as I've
>>described them above? As I see it, I can preserve the pitch if I
>>repeat cycles, because the frequency should remain the same, no?
> 
> 
> Your term granular synthesis is actually somewhat correct in that
> PSOLA is in essence a subset of granular synthesis.  That aside, if
> you've only had Jr. High Physics I think that this task is beyond you.
>  What RBJ is describing is a method whereby the PCM data is read out
> of a circular buffer at a different rate than it is written in.  Since
> the rates are different you actually need two read pointers and you
> fade between the pointers.  The term "fade" describes changing the
> volume as you switch from one pointer to the next.
> 
> Also, your pitch estimation technique won't work.  Typical methods of
> pitch estimation use autocorrelation, AMDF, or more exotic methods
> like wavelet transforms.  You need to know the pitch for PSOLA to work
> properly.
> 
> What you are trying to do is a mid-level DSP problem.  This kind of
> thing is difficult for those with Bachelor's degrees and I would think
> nearly impossible for someone with no formal DSP training.
> 
> CC

You're probably right, but it isn't absolute. A guy with no high-school
physics at all taught me to grind and figure telescope mirrors and how
to find Aldebaran.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;

Reply by Richard Owlett ●January 6, 20042004-01-06

Given an audio file in WAV format
If it is desired to REDUCE SPEECH rate

Where is faulr in following senario

ASSUME speech rate to be modified by factor "K"

Read original saving sample rate
Do FFT of original
(multiply/divide) each bin by "K"
Do INVERSE FFT
Save to WAV file (multiplying/dividing) sample rate by "K"

[I may have inverted 1 too many times,   but you get the idea ;]


Cliff Chase wrote:

> gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>...
> 
>>>i am not sure that you would want exactly "granular synthesis" but instead a
>>>simple variant of the PSOLA (pitch synchronous overlap add) method or
>>>sometimes called WSOLA (windowed synchronous overlap add) method.
>>
>>That's fine... as you say down below, monophonic voice is all I'm
>>using, so whatever the simplest solution is...
>>
>>
>>>given your input x[n], the output is a sum of delayed and cross-faded (the
>>>"window" would be the concatenation of the fade-up and fade-down functions
>>>applied to the adjacent sections of x[n]) sections of the input:
>>
>>OK, you lost me already - sorry! What does "fading" mean? I understand
>>that my PCM data is simply integers representing amplitude. I
>>understand that I can determine the pitch by seeking the second
>>largest (or smallest) value, because this would be one "cycle" or
>>complete wave (am I using the right terminology?), or the "frequency"
>>of the waves.
>>
>>Forgive me, I've only had jr. high physics, but is this windowing you
>>are describing have something to do with repeating cycles as I've
>>described them above? As I see it, I can preserve the pitch if I
>>repeat cycles, because the frequency should remain the same, no?
> 
> 
> Your term granular synthesis is actually somewhat correct in that
> PSOLA is in essence a subset of granular synthesis.  That aside, if
> you've only had Jr. High Physics I think that this task is beyond you.
>  What RBJ is describing is a method whereby the PCM data is read out
> of a circular buffer at a different rate than it is written in.  Since
> the rates are different you actually need two read pointers and you
> fade between the pointers.  The term "fade" describes changing the
> volume as you switch from one pointer to the next.
> 
> Also, your pitch estimation technique won't work.  Typical methods of
> pitch estimation use autocorrelation, AMDF, or more exotic methods
> like wavelet transforms.  You need to know the pitch for PSOLA to work
> properly.
> 
> What you are trying to do is a mid-level DSP problem.  This kind of
> thing is difficult for those with Bachelor's degrees and I would think
> nearly impossible for someone with no formal DSP training.
> 
> CC

Reply by Jerry Avins ●January 6, 20042004-01-06

Richard Owlett wrote:

> Given an audio file in WAV format
> If it is desired to REDUCE SPEECH rate
> 
> Where is faulr in following senario
> 
> ASSUME speech rate to be modified by factor "K"
> 
> Read original saving sample rate
> Do FFT of original
> (multiply/divide) each bin by "K"
> Do INVERSE FFT
> Save to WAV file (multiplying/dividing) sample rate by "K"
> 
> [I may have inverted 1 too many times,   but you get the idea ;]

   ...

Be explicit about what "(multiply/divide) each bin by 'K'" means,
supplying detail, and I think you'll answer your own question.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;

Reply by George Kustas ●January 8, 20042004-01-08

cliff@netway.com (Cliff Chase) wrote in message 
> Your term granular synthesis is actually somewhat correct in that
> PSOLA is in essence a subset of granular synthesis.  That aside, if
> you've only had Jr. High Physics I think that this task is beyond you.
>  What RBJ is describing is a method whereby the PCM data is read out
> of a circular buffer at a different rate than it is written in.  Since
> the rates are different you actually need two read pointers and you
> fade between the pointers.  The term "fade" describes changing the
> volume as you switch from one pointer to the next.
> 
> Also, your pitch estimation technique won't work.  Typical methods of
> pitch estimation use autocorrelation, AMDF, or more exotic methods
> like wavelet transforms.  You need to know the pitch for PSOLA to work
> properly.
> 
> What you are trying to do is a mid-level DSP problem.  This kind of
> thing is difficult for those with Bachelor's degrees and I would think
> nearly impossible for someone with no formal DSP training.
> 
> CC

Thanks for that simpler explanation, Cliff. So, "fading" is simply
smoothing out the connections between frames in the output buffer so
that you don't get some nasty side effect? I actually have a higher
education than jr. high. I have a BA in computer science, but if you
can believe it, I never had to take physics. I regret it, but not
enough to take it now ;-)

Even so, doing this from scratch IS beyond me, especially the pitch
estimation. Programming it is the EASY part (for me anyway).

Fortunately, I did find source code that I was able to get working
within my application. For those interested, check out:
http://sourceforge.net/projects/mffmtimescale/

It is a C/C++ implementation of WSOLA. It uses another download called
FFTW (fastest fourier transform in the west - cute, huh?).

It took quite a while to get the package loaded, built, and running
under Windows, but it does work, and pretty well. Certainly good
enough for my purposes (voice transcription). It runs faster than real
time, so I can apply rate changes "on the fly".

I'm looking through the code now in an effort to understand it better,
and perhaps "re-package" it as a generic windows DLL or "plug-in". No
offense to its author, but there was a lot of clean-up to get it to
build, and it has too many dependencies (other DLLs and source code
that carry excess baggage not needed for time scaling).

time stretching for dictation - granular synthesis

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group