DSPRelated.com
Forums

time stretching for dictation - granular synthesis

Started by George Kustas January 4, 2004
I am a windows programmer developing an application for medical
transcription. I have an audio player/recorder that supports windows
wav files (PCM). I am looking for help doing time stretching, or
simply put, slowing down the playback without changing the pitch, so
that the transcriptionist can more easily transcribe the dictation.

I've been looking in Google for a day now - I found lot's of brief
descriptions on granular synthesis and Fourier Transforms.
Unfortunately, I'm not a "dsp guy" an a lot of this went over my head.

I also downloaded the sourcecode for the CSound application, but I've
been unable to decipher just where the granular synthesis is being
done, nor do I think I could figure it out because it isn't documented
very well.

Can anyone help by posting a "simple" algorithm (or explanation) for
granular synthesis of PCM data? Source code would be great too!
In article b4c52ba8.0401041200.259077d1@posting.google.com, George Kustas at
gkustas@hvc.rr.com wrote on 01/04/2004 15:00:

> I am a windows programmer developing an application for medical > transcription. I have an audio player/recorder that supports windows > wav files (PCM). I am looking for help doing time stretching, or > simply put, slowing down the playback without changing the pitch, so > that the transcriptionist can more easily transcribe the dictation. > > I've been looking in Google for a day now - I found lot's of brief > descriptions on granular synthesis and Fourier Transforms. > Unfortunately, I'm not a "dsp guy" an a lot of this went over my head. > > I also downloaded the sourcecode for the CSound application, but I've > been unable to decipher just where the granular synthesis is being > done, nor do I think I could figure it out because it isn't documented > very well. > > Can anyone help by posting a "simple" algorithm (or explanation) for > granular synthesis of PCM data? Source code would be great too!
i am not sure that you would want exactly "granular synthesis" but instead a simple variant of the PSOLA (pitch synchronous overlap add) method or sometimes called WSOLA (windowed synchronous overlap add) method. the concept is pretty simple: given your input x[n], the output is a sum of delayed and cross-faded (the "window" would be the concatenation of the fade-up and fade-down functions applied to the adjacent sections of x[n]) sections of the input: assuming identical length of fade-up and fade-down of N samples (so the non-zero length of the window is 2*N-1 samples), the window function would be w[n]. for example, if linear cross-fading was used, w[n] would be an even symmetric triangular function that would be: { 1 - |n|/N for -N < n < N w[n] = { { 0 otherwise the output would look like: y[n] = x[n]*w[n] + x[n-O(1)]*w[n-N] + x[n-O(2)]*w[n-2N] + x[n-O(3)]*w[n-3N] + ... or, more generally: y[n] = SUM{ x[n-O(k)]*w[n-k*N] } k where O(k) is the offset of the k_th segment to accomplish slowing down the sound. some people will call that segment, x[n-O(k)]*w[n-k*N], the k_th "grain" but i am not sure i would. only two windows will be non-zero at any one time so you are really only mixing 2 windowed input segments. if "alpha" is your time stretching factor (if you're stretching, alpha>1, and if you're time-compressing, alpha<1) then n/alpha - P/2 < n - O(k) < n/alpha + P/2 for k*N - N/2 <= n < k*N + N/2 where P is the maximum period of the input x[n] around the time of n/alpha. the latter inequality is the same as k - 1/2 <= n/N < k + 1/2 which is the same as k = round(n/N) = floor(n/N + 1/2) so n/alpha - P/2 < n - O(k) < n/alpha + P/2 is the same as |(1-1/alpha)*n - O(round(n/N))| < P/2 for all n. the offset O(k) is an recursive result such that O(0) = 0 and the average magnitude difference function, (k+1)*N-1 AMDF(k+1) = SUM{ |x[n-O(k)] - x[n-O(k+1)]| } n=k*N is minimum but still satisfies n/alpha - P/2 < n - O(k+1) < n/alpha + P/2 . try to detangle what i wrote and try it. for monophonic dialog (only one voice speaking), it should work pretty well. r b-j
Something which may help is the "Amazing Slow Downer" from Roni Music.  I
downloaded an evaluation copy (only slightly crippled) and was quite
impressed with it as a means of learning fast musical passages by ear.

They are contactable at http://www.ronimusic.com and info@ronimusic.com, or
at least they were 6 months ago.

Jim Adamthwaite.


> i am not sure that you would want exactly "granular synthesis" but instead a > simple variant of the PSOLA (pitch synchronous overlap add) method or > sometimes called WSOLA (windowed synchronous overlap add) method.
That's fine... as you say down below, monophonic voice is all I'm using, so whatever the simplest solution is...
> given your input x[n], the output is a sum of delayed and cross-faded (the > "window" would be the concatenation of the fade-up and fade-down functions > applied to the adjacent sections of x[n]) sections of the input:
OK, you lost me already - sorry! What does "fading" mean? I understand that my PCM data is simply integers representing amplitude. I understand that I can determine the pitch by seeking the second largest (or smallest) value, because this would be one "cycle" or complete wave (am I using the right terminology?), or the "frequency" of the waves. Forgive me, I've only had jr. high physics, but is this windowing you are describing have something to do with repeating cycles as I've described them above? As I see it, I can preserve the pitch if I repeat cycles, because the frequency should remain the same, no?
gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>...
> > i am not sure that you would want exactly "granular synthesis" but instead a > > simple variant of the PSOLA (pitch synchronous overlap add) method or > > sometimes called WSOLA (windowed synchronous overlap add) method. > > That's fine... as you say down below, monophonic voice is all I'm > using, so whatever the simplest solution is... > > > given your input x[n], the output is a sum of delayed and cross-faded (the > > "window" would be the concatenation of the fade-up and fade-down functions > > applied to the adjacent sections of x[n]) sections of the input: > > OK, you lost me already - sorry! What does "fading" mean? I understand > that my PCM data is simply integers representing amplitude. I > understand that I can determine the pitch by seeking the second > largest (or smallest) value, because this would be one "cycle" or > complete wave (am I using the right terminology?), or the "frequency" > of the waves. > > Forgive me, I've only had jr. high physics, but is this windowing you > are describing have something to do with repeating cycles as I've > described them above? As I see it, I can preserve the pitch if I > repeat cycles, because the frequency should remain the same, no?
Your term granular synthesis is actually somewhat correct in that PSOLA is in essence a subset of granular synthesis. That aside, if you've only had Jr. High Physics I think that this task is beyond you. What RBJ is describing is a method whereby the PCM data is read out of a circular buffer at a different rate than it is written in. Since the rates are different you actually need two read pointers and you fade between the pointers. The term "fade" describes changing the volume as you switch from one pointer to the next. Also, your pitch estimation technique won't work. Typical methods of pitch estimation use autocorrelation, AMDF, or more exotic methods like wavelet transforms. You need to know the pitch for PSOLA to work properly. What you are trying to do is a mid-level DSP problem. This kind of thing is difficult for those with Bachelor's degrees and I would think nearly impossible for someone with no formal DSP training. CC
Cliff Chase wrote:

> gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>... > >>>i am not sure that you would want exactly "granular synthesis" but instead a >>>simple variant of the PSOLA (pitch synchronous overlap add) method or >>>sometimes called WSOLA (windowed synchronous overlap add) method. >> >>That's fine... as you say down below, monophonic voice is all I'm >>using, so whatever the simplest solution is... >> >> >>>given your input x[n], the output is a sum of delayed and cross-faded (the >>>"window" would be the concatenation of the fade-up and fade-down functions >>>applied to the adjacent sections of x[n]) sections of the input: >> >>OK, you lost me already - sorry! What does "fading" mean? I understand >>that my PCM data is simply integers representing amplitude. I >>understand that I can determine the pitch by seeking the second >>largest (or smallest) value, because this would be one "cycle" or >>complete wave (am I using the right terminology?), or the "frequency" >>of the waves. >> >>Forgive me, I've only had jr. high physics, but is this windowing you >>are describing have something to do with repeating cycles as I've >>described them above? As I see it, I can preserve the pitch if I >>repeat cycles, because the frequency should remain the same, no? > > > Your term granular synthesis is actually somewhat correct in that > PSOLA is in essence a subset of granular synthesis. That aside, if > you've only had Jr. High Physics I think that this task is beyond you. > What RBJ is describing is a method whereby the PCM data is read out > of a circular buffer at a different rate than it is written in. Since > the rates are different you actually need two read pointers and you > fade between the pointers. The term "fade" describes changing the > volume as you switch from one pointer to the next. > > Also, your pitch estimation technique won't work. Typical methods of > pitch estimation use autocorrelation, AMDF, or more exotic methods > like wavelet transforms. You need to know the pitch for PSOLA to work > properly. > > What you are trying to do is a mid-level DSP problem. This kind of > thing is difficult for those with Bachelor's degrees and I would think > nearly impossible for someone with no formal DSP training. > > CC
You're probably right, but it isn't absolute. A guy with no high-school physics at all taught me to grind and figure telescope mirrors and how to find Aldebaran. Jerry -- Engineering is the art of making what you want from things you can get. &#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;
Given an audio file in WAV format
If it is desired to REDUCE SPEECH rate

Where is faulr in following senario

ASSUME speech rate to be modified by factor "K"

Read original saving sample rate
Do FFT of original
(multiply/divide) each bin by "K"
Do INVERSE FFT
Save to WAV file (multiplying/dividing) sample rate by "K"

[I may have inverted 1 too many times,   but you get the idea ;]


Cliff Chase wrote:

> gkustas@hvc.rr.com (George Kustas) wrote in message news:<b4c52ba8.0401050923.2efc875b@posting.google.com>... > >>>i am not sure that you would want exactly "granular synthesis" but instead a >>>simple variant of the PSOLA (pitch synchronous overlap add) method or >>>sometimes called WSOLA (windowed synchronous overlap add) method. >> >>That's fine... as you say down below, monophonic voice is all I'm >>using, so whatever the simplest solution is... >> >> >>>given your input x[n], the output is a sum of delayed and cross-faded (the >>>"window" would be the concatenation of the fade-up and fade-down functions >>>applied to the adjacent sections of x[n]) sections of the input: >> >>OK, you lost me already - sorry! What does "fading" mean? I understand >>that my PCM data is simply integers representing amplitude. I >>understand that I can determine the pitch by seeking the second >>largest (or smallest) value, because this would be one "cycle" or >>complete wave (am I using the right terminology?), or the "frequency" >>of the waves. >> >>Forgive me, I've only had jr. high physics, but is this windowing you >>are describing have something to do with repeating cycles as I've >>described them above? As I see it, I can preserve the pitch if I >>repeat cycles, because the frequency should remain the same, no? > > > Your term granular synthesis is actually somewhat correct in that > PSOLA is in essence a subset of granular synthesis. That aside, if > you've only had Jr. High Physics I think that this task is beyond you. > What RBJ is describing is a method whereby the PCM data is read out > of a circular buffer at a different rate than it is written in. Since > the rates are different you actually need two read pointers and you > fade between the pointers. The term "fade" describes changing the > volume as you switch from one pointer to the next. > > Also, your pitch estimation technique won't work. Typical methods of > pitch estimation use autocorrelation, AMDF, or more exotic methods > like wavelet transforms. You need to know the pitch for PSOLA to work > properly. > > What you are trying to do is a mid-level DSP problem. This kind of > thing is difficult for those with Bachelor's degrees and I would think > nearly impossible for someone with no formal DSP training. > > CC
Richard Owlett wrote:

> Given an audio file in WAV format > If it is desired to REDUCE SPEECH rate > > Where is faulr in following senario > > ASSUME speech rate to be modified by factor "K" > > Read original saving sample rate > Do FFT of original > (multiply/divide) each bin by "K" > Do INVERSE FFT > Save to WAV file (multiplying/dividing) sample rate by "K" > > [I may have inverted 1 too many times, but you get the idea ;]
... Be explicit about what "(multiply/divide) each bin by 'K'" means, supplying detail, and I think you'll answer your own question. Jerry -- Engineering is the art of making what you want from things you can get. &#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;&#4294967295;
cliff@netway.com (Cliff Chase) wrote in message 
> Your term granular synthesis is actually somewhat correct in that > PSOLA is in essence a subset of granular synthesis. That aside, if > you've only had Jr. High Physics I think that this task is beyond you. > What RBJ is describing is a method whereby the PCM data is read out > of a circular buffer at a different rate than it is written in. Since > the rates are different you actually need two read pointers and you > fade between the pointers. The term "fade" describes changing the > volume as you switch from one pointer to the next. > > Also, your pitch estimation technique won't work. Typical methods of > pitch estimation use autocorrelation, AMDF, or more exotic methods > like wavelet transforms. You need to know the pitch for PSOLA to work > properly. > > What you are trying to do is a mid-level DSP problem. This kind of > thing is difficult for those with Bachelor's degrees and I would think > nearly impossible for someone with no formal DSP training. > > CC
Thanks for that simpler explanation, Cliff. So, "fading" is simply smoothing out the connections between frames in the output buffer so that you don't get some nasty side effect? I actually have a higher education than jr. high. I have a BA in computer science, but if you can believe it, I never had to take physics. I regret it, but not enough to take it now ;-) Even so, doing this from scratch IS beyond me, especially the pitch estimation. Programming it is the EASY part (for me anyway). Fortunately, I did find source code that I was able to get working within my application. For those interested, check out: http://sourceforge.net/projects/mffmtimescale/ It is a C/C++ implementation of WSOLA. It uses another download called FFTW (fastest fourier transform in the west - cute, huh?). It took quite a while to get the package loaded, built, and running under Windows, but it does work, and pretty well. Certainly good enough for my purposes (voice transcription). It runs faster than real time, so I can apply rate changes "on the fly". I'm looking through the code now in an effort to understand it better, and perhaps "re-package" it as a generic windows DLL or "plug-in". No offense to its author, but there was a lot of clean-up to get it to build, and it has too many dependencies (other DLLs and source code that carry excess baggage not needed for time scaling).