DSPRelated.com
Forums

Time scale modification?

Started by John McDermick March 29, 2011
Which algorithm is most commonly used for real-time time scale
modification of an audio signal?




If by scale you mean amplitude, multiplication (with attention to saturation).

Jerry
-- 
Engineering is the art of making what you want from things you can get.
On Mar 29, 12:19 pm, Jerry Avins <j...@ieee.org> wrote:
> If by scale you mean amplitude, multiplication (with attention to saturation).
Thanks Jerry...but I meant time-scale modification (increasing or decreasing the length of an audio signal without distorting the audio too much).....
On 3/29/2011 9:28 AM, John McDermick wrote:
> On Mar 29, 12:19 pm, Jerry Avins <j...@ieee.org> wrote:
>> If by scale you mean amplitude, multiplication (with attention to saturation).
>
> Thanks Jerry...but I meant time-scale modification (increasing or
> decreasing the length of an audio signal without distorting the audio
> too much).....
John,

This is done in video scan converters all the time. Either one wants to reduce the frame rate or one wants to increase the frame rate - relative to the source. Well, at least that's a *type* of time scale conversion. You didn't say if you were wanting to change the sample rate.

So let's take the case where you're willing to change the sample rate first: The implementation for this is to put the data into a circular buffer at the input sample rate and to read it out at the desired output sample rate. NOTE: It's a matter of physics and not one of implementation that a faster output rate will cause buffer repetition and a slower output rate will cause buffers to be skipped - as the read/write points in the memory are overtaken by the faster rate.

Now, let's say that you want the sample rate to stay the same: In this case you still need to speed up or slow down the output relative to the input. So, I think you have to do the above in some fashion no matter what. Then it's a matter of sample rate conversion, which you find under "interpolation" or "decimation". There are likely judicious choices of input sample rate, output sample rate and the interpolation/decimation factor "I".

Fred
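Fred's circular-buffer picture, for the case where you do let the sample rate (and hence the pitch) change, amounts to reading the signal back at a fractional step. A minimal sketch in Python/NumPy (the function name is mine, not from any library):

```python
import numpy as np

def resample_linear(x, rate_ratio):
    """Read the signal at a fractional step of `rate_ratio` samples.

    A ratio < 1 stretches the signal (more output samples); > 1
    compresses it. Pitch shifts along with duration, which is why a
    separate step is needed if the pitch must stay put.
    """
    n_out = int(round(len(x) / rate_ratio))
    # fractional read positions into the input signal
    pos = np.arange(n_out) * rate_ratio
    return np.interp(pos, np.arange(len(x)), x)
```

For example, `resample_linear(x, 0.5)` produces twice as many samples, i.e. the signal plays back twice as long (and an octave lower) at the original sample rate.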
Thank you Fred..

Let me elaborate...

I would like the sampling rate to stay the same....N samples in...N
samples out...

I have a block-processing module which acquires N samples from a file,
processes N samples and sends N samples to the speaker.

Let's say that the audio samples in the file have been sampled at 8 kHz
and that N = 80. Furthermore, let's say that the file contains 8000
samples (100 audio blocks).

In that case, it will take 1 second to play the audio samples.

Now..let's say that I would like to increase the time it takes to play
the audio from 1 second to 1.3 seconds.

Those extra 0.3 seconds correspond to 0.3 s * 8000 Hz = 2400 samples.

So I have to insert 2400 samples somehow into the audio stream
without distorting the audio, changing the pitch etc. etc.

On average I would have to insert 24 samples per block (calculated as
2400 samples divided by 100 audio blocks ....or 0.3 * 80 samples/
block)
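That bookkeeping can be double-checked in a few lines of Python (variable names are mine, mirroring the numbers above):

```python
fs = 8000           # sample rate, Hz
n_block = 80        # samples per block (10 ms at 8 kHz)
n_total = 8000      # samples in the file, i.e. 1 second of audio
target_s = 1.3      # desired playback duration, seconds

extra = round((target_s - n_total / fs) * fs)   # samples to insert
per_block = extra / (n_total // n_block)        # samples per block
```

which gives 2400 extra samples in total, or 24 per 80-sample block.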

So the way I see it, the block-processing module must acquire
80 samples and put them in a buffer which is 24+80 samples long, do some
processing, and then grab the oldest 80 samples and send those samples
to the speaker.

My question is: what processing should be done on the buffer so that
the audio playback is not distorted? I guess some kind of filter
operation?







If you are sampling at 8k/s I assume this is speech. Google for TDHS (time domain harmonic scaling) and its many variants - PSOLA, PICOLA, etc. This technique isn't great for music, but it can produce very good results for speech, without incurring too massive a compute load. Get the details right, and you can achieve a very realistic "same person talking faster or slower" effect. Steve
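A bare-bones stretcher in the overlap-add family Steve mentions might look like the sketch below (Python/NumPy; the function name and frame sizes are my own choices, and a real TDHS/PSOLA implementation would place splices at detected pitch marks rather than using a blind correlation search):

```python
import numpy as np

def sola_stretch(x, ratio, frame=400, overlap=100, search=80):
    """Time-stretch x by `ratio` (>1 = slower) via synchronized overlap-add.

    Frames are read from the input at hop (frame-overlap)/ratio and laid
    down in the output at hop (frame-overlap); each new frame is slid by
    up to +/- `search` samples so its overlap best matches the tail of
    what is already written, then crossfaded in.
    """
    hop_out = frame - overlap
    hop_in = int(hop_out / ratio)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in

    out = list(x[:frame].astype(float))
    read = hop_in
    while read + frame + search < len(x):
        tail = np.asarray(out[-overlap:])
        # slide the candidate frame to find the best-aligned overlap
        best_k, best_score = 0, -np.inf
        for k in range(max(-search, -read), search + 1):
            score = np.dot(tail, x[read + k : read + k + overlap])
            if score > best_score:
                best_k, best_score = k, score
        seg = x[read + best_k : read + best_k + frame].astype(float)
        # crossfade the overlap region, then append the rest of the frame
        out[-overlap:] = (fade_out * tail + fade_in * seg[:overlap]).tolist()
        out.extend(seg[overlap:].tolist())
        read += hop_in
    return np.asarray(out)
```

The correlation search is what makes the splices land on similar-looking waveform cycles, which is the "get the details right" part: without it the crossfades produce audible thumps at every frame boundary.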
John,

So maybe you want to do pitch shifting? If you want to make a Yankee's rapid-fire speech slow to a Georgia drawl while keeping the sample rate and pitch constant, it will be very hard to accomplish in real time.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
On 3/29/2011 10:44 AM, John McDermick wrote:
> Now..let's say that I would like to increase the time it takes to play
> the audio from 1 second to 1.3 seconds.
OK. And you don't want to slow down the original record (reducing the frequencies) in order to extend the time. That means you need to add something in order to extend the time. Maybe you should look at speech encoding so that the components can be extended in time - keeping the pitch the same but the duration longer. Fred
On 3/29/2011 10:44 AM, John McDermick wrote:
A very simple-minded approach would be to do the equivalent of frame repetition and run through the same "block" of samples until enough time has passed and then do the same with the next block and so forth. If the blocks are small enough then you might get away with this.

The problem of course is that you will "get behind" and will eventually have to skip input data. Is that what you have in mind? There is no way to extend the time and keep up without skipping.

Think of a racetrack with a horse and a racecar. The racecar being faster regularly laps the horse. If the racecar is "writing" on the track and the horse is "reading" from the track then when the horse is lapped, the read data become one track length newer suddenly. And, a track's worth of data is skipped.

Fred
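Fred's frame-repetition idea, applied offline to John's 80-sample blocks, can be sketched as follows (a deliberately naive illustration, names mine; each repeat is an audible splice unless the block happens to line up with the pitch period):

```python
import numpy as np

def stretch_by_repetition(x, block, ratio):
    """Naive time stretch: replay whole blocks until enough extra
    time has accumulated. Cheap, but glitchy for most material."""
    out = []
    owed = 0.0                        # extra samples still to be inserted
    for start in range(0, len(x) - block + 1, block):
        b = x[start : start + block]
        out.append(b)
        owed += (ratio - 1.0) * block
        while owed >= block:          # replay the block to pay the debt
            out.append(b)
            owed -= block
    return np.concatenate(out)
```

With John's numbers (80-sample blocks, ratio 1.3) every block owes 24 samples, so roughly every fourth block gets replayed once, which is exactly the "run through the same block until enough time has passed" idea in code.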
On Mar 29, 12:09 pm, John McDermick <johnthedsp...@gmail.com> wrote:
> Which algorithm is most commonly used for real-time time scale
> modification of an audio signal?
On Mar 29, 11:22 pm, Jerry Avins <j...@ieee.org> wrote:
> So maybe you want to do pitch shifting?
Jerry, "pitch shifting" (changing the pitch without changing the tempo) is best thought of in terms of "time scaling" (what John McD is inquiring about) and sample interpolation (essentially the same as sample rate conversion) which is fairly well-defined, mathematically.
> If you want to make a Yankee's rapid-fire speech slow to a Georgia
> drawl while keeping the sample rate and pitch constant, it will be
> very hard to accomplish in real time.
precisely. since more samples go in than come out (or vice versa), the buffer will run dry or be overrun in a real-time context.

so John, "time scaling" is something you normally think is done to a sound file or a sound buffer, which makes it longer or shorter. time scaling in conjunction with sample interpolation can be, and often is, real time because the number of samples going out, on average, is the same as going in. the I/O buffer neither runs dry nor explodes.

as to your question, i would say that it is common to use a time-domain method often called TDHS (for Time Domain Harmonic Scaling) and sometimes called PSOLA (which i think is not quite right). it essentially inserts or splices in repeated cycles (if you're time-stretching) or deletes or splices out unneeded cycles (if you're time-compressing) and crossfades to make the splices clean. a pitch *detection* alg is used to know how long (in samples) the period or cycle is to splice in or out.

that's a time-domain method and it works well on human voice or bird tweets or a monophonic musical instrument (like a horn or reed or wind instrument) or some other single-voice source. slowing down or speeding up a polyphonic source (like a whole orchestra or a mix of sounds) using this method can sometimes sound like crap (lots of "glitches" when the splice ain't so seamless).

to time scale polyphonic material, you might use the phase vocoder or "sinusoidal modeling", both frequency-domain algorithms, which essentially break the source signal up into individual frequency components and time-scale each frequency component individually, then add it all back together. since harmonics of the same source are stretched (or compressed) separately, they can get outa phase, relative to each other, and the result for a single voice can sound "phasey".

i think there are ways to sorta combine the two approaches (time domain vs. frequency domain) so that you can, with a single algorithm, stretch a monophonic voice and not have it sound "phasey" and also stretch a polyphonic mix and not have it sound "glitchy". check out what Melodyne is doing now. they are the current kings of this discipline.

r b-j