DSPRelated.com
Forums

Phase Vocoder Vs. Vocoder, and Phase Vocoder relationship with Pitch Shifting

Started by Randy Yates December 17, 2012
Hi Folks,

In following up some of the posts here recently, I realized that there
are some serious holes in my understanding of vocoders and pitch
shifting.

1. I looked around and couldn't find a good definition of a plain old
"vocoder." My thought is that it is a device with two inputs x1 and
x2 and one output, such that output is input x1 with its magnitude
spectrum modified by the magnitude spectrum of x2. That is,

  Y(f) = X1(f) * |X2(f)|,

or

  Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),

Is this correct?

2. Then what's the difference between a PHASE vocoder and a plain
old vocoder?

3. What has pitch shifting got to do with a phase vocoder?

-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> Hi Folks,
>
> In following up some of the posts here recently, I realized that there
> are some serious holes in my understanding of vocoders and pitch
> shifting.
>
> 1. I looked around and couldn't find a good definition of a plain old
> "vocoder." My thought is that it is a device with two inputs x1 and
> x2 and one output, such that output is input x1 with its magnitude
> spectrum modified by the magnitude spectrum of x2. That is,
>
>   Y(f) = X1(f) * |X2(f)|,
>
> or
>
>   Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>
> Is this correct?
<handwaving and confusion follow>

There are a great many things people call vocoders. I call the Moog
Modular plus accoutrements that "sang" on the Clockwork Orange
soundtrack *the* vocoder, but there are even voice based codecs that
are called vocoders - like CELP. Obviously, this verges into perceptual
encoding (which was what vocoding started out as - they were trying to
figure out how to disassemble and reassemble the human voice, probably
to get rate reduction).

But the basic idea is to separate the formants from the pitch, then
substitute pitch from another source. SFAIK, there is more than one way
to do this.
> 2. Then what's the difference between a PHASE vocoder and a plain
> old vocoder?
I cheated and looked - a phase vocoder scales both frequency and time
domain. I suppose then that a non-phase vocoder just scales (changes
amplitude) in the time domain.

No, I don't know why that's interesting, other than that Autotune uses
it.

Wikipedia pointed me to this:

http://www.ircam.fr/461.html?&L=1

and there is this:

http://people.bath.ac.uk/masrwd/pvplugs.html
> 3. What has pitch shifting got to do with a phase vocoder?
Dunno.

--
Les Cargill
On 12/17/12 9:39 PM, Les Cargill wrote:
> Randy Yates wrote:
>> Hi Folks,
>>
>> [...]
>>
>> 1. I looked around and couldn't find a good definition of a plain
>> old "vocoder." My thought is that it is a device with two inputs x1
>> and x2 and one output, such that output is input x1 with its
>> magnitude spectrum modified by the magnitude spectrum of x2. That
>> is,
>>
>> Y(f) = X1(f) * |X2(f)|,
i thought it was more like

   Y(f) = ( |X1(f)| + epsilon )^(-1) * X1(f) * |X2(f)|

so it's trying to keep the same frequency components in x1(t), but
apply the amplitude that x2(t) has at a particular frequency to x1.
what this does is impose the spectral envelope of x2 onto x1 after
first normalizing out the spectral envelope in x1. the value
epsilon > 0 is very small and it is there to prevent division by zero.
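[In code, that bin-by-bin rule might look like the following
single-frame sketch. The naive DFT and the function names are purely
illustrative, not from any particular library:]

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT; fine for a short illustrative frame.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def vocode_frame(x1, x2, eps=1e-12):
    # Y(f) = X1(f) * |X2(f)| / (|X1(f)| + eps): keep x1's frequency
    # content and phase, but impose x2's magnitude spectrum; eps
    # guards against division by zero in empty bins.
    X1, X2 = dft(x1), dft(x2)
    Y = [a * abs(b) / (abs(a) + eps) for a, b in zip(X1, X2)]
    return idft(Y)
```

[Because x1's magnitude is normalized out, feeding in the same tone at
two different amplitudes returns the modulator's amplitude, not the
carrier's.]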
>> or
>>
>> Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>>
>> Is this correct?
i dunno about that latter equation. it doesn't look familiar to me.
> <handwaving and confusion follow>
>
> There are a great many things people call vocoders. I call the Moog
> Modular plus accoutrements that "sang" on the Clockwork Orange
> soundtrack *the* vocoder,
boy, i was thinking of the other Kubrick production: 2001: A Space
Odyssey, and the song HAL 9000 was singing as his consciousness was
slipping away. i think it was Daisy. that was an early use of a
vocoder of some sort.
> but there are even voice based codecs that are called vocoders -
> like CELP. Obviously, this verges into perceptual encoding (which
> was what vocoding started out as - they were trying to figure out
> how to disassemble and reassemble the human voice, probably to get
> rate reduction).
>
> But the basic idea is to separate the formants from the pitch, then
> substitute pitch from another source. SFAIK, there is more than one
> way to do this.
>
>> 2. Then what's the difference between a PHASE vocoder and a plain
>> old vocoder?
in the EE and DSP disciplines, i always thought that "vocoder" meant
the "phase vocoder". ever since Portnoff.

but in the music industry, there are vocoders that work sorta in the
time domain with a big bank of resonant filters in parallel. a
multi-band thingie. each filter in the bank would look sorta like:


   x2 ----->[BPF]---->|u|----->[LPF]------.
                                          |
                                          |
                                          V
   x1 ----->[BPF]----------------------->(x)-----> y


the two BPFs are identically tuned and the LPF is there to smooth out
the envelope coming out of the rectified output |u| from the x2 BPF.
it's hard to do this, but you try to make the BPFs of adjacent bands
complementary so that if there is no effect from x2, the outputs of
all of the x1 BPFs will add to about the same x1 that goes in.

the output of this particular band is added to the outputs of the
other bands. this does not normalize the spectrum of x1, just applies
the spectrum of x2 onto x1.

it's "sorta" time domain because no FFT is used, but a multiband bank
of filters is a sorta Fourier analyzer in itself.

the phase vocoder might be programmed to try to accomplish the same
effect but in the frequency domain after the FFT. the spectrum of one
input is applied to the other input.
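[One band of that block diagram can be sketched in a few lines. The
biquad below uses the common "cookbook" bandpass form; the Q and
smoothing values are placeholder choices, not anything canonical:]

```python
import math

def bandpass_coeffs(fc, q, fs):
    # 0 dB peak-gain biquad bandpass (cookbook form), normalized by a0.
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2 * math.cos(w0) / a0, (1 - alpha) / a0)

def filt(coeffs, x):
    # Direct-form I biquad filter over the whole signal.
    b0, b1, b2, a1, a2 = coeffs
    xm1 = xm2 = ym1 = ym2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * xm1 + b2 * xm2 - a1 * ym1 - a2 * ym2
        xm1, xm2, ym1, ym2 = xn, xm1, yn, ym1
        y.append(yn)
    return y

def vocoder_band(x1, x2, fc, fs, q=5.0, smooth=0.99):
    coeffs = bandpass_coeffs(fc, q, fs)
    carrier = filt(coeffs, x1)   # x1 through this band's BPF
    modpath = filt(coeffs, x2)   # x2 through the identically tuned BPF
    out, env = [], 0.0
    for c, m in zip(carrier, modpath):
        env = smooth * env + (1 - smooth) * abs(m)  # |u| then one-pole LPF
        out.append(c * env)      # x2's band envelope modulates x1's band
    return out
```

[A full channel vocoder sums the outputs of many such bands at
different centre frequencies.]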
> I cheated and looked - a phase vocoder scales both frequency and
> time domain. I suppose then that a non-phase vocoder just scales
> (changes amplitude) in the time domain.
>
> No, I don't know why that's interesting, other than that Autotune
> uses it.
as best as i can tell (but i haven't looked Autotune over closely for
about a decade), Autotune is a time-domain alg. not a phase vocoder or
anything frequency domain. my guess is that Autotune remains a
time-domain alg. maybe they went to multiband pitch shifting, but i
don't know that.
>> 3. What has pitch shifting got to do with a phase vocoder?
>
> Dunno.
some pitch shifters are a combination of two algs, a time-scaler
(something that slows down or speeds up the audio without affecting
the pitch) and a resampler (which is a digital counterpart to
varispeed on analog tape) which will affect the speed and the pitch
predictably. combined, these two algorithms can change the pitch
without changing the speed.

the resampler is mathematically a pretty well solved problem even
though there are different commercial products that have different
performance metrics: http://src.infinitewave.ca/ . anyway, if you look
at the resampling issue as one of just getting the sinc function long
enough and good enough and simply applying the Shannon reconstruction
formula, resampling is a "mature technology".

but the time scaler (either time compression or time stretching) is
not. that's still a little bit like alchemy (and not the old
music/audio product). to do glitch-free time-scaling requires some
kind of algorithmic guessing and some tradeoffs. this time scaler can
be done totally in the time domain by splicing in repeated audio (if
you're stretching) or splicing out audio (if you're compressing).
this is where all of that SOLA and PSOLA and WSOLA come from. they
sorta all divide the audio into frames of some sort and overlap-add in
the output. the frames of audio are spaced farther apart in the output
than in the input if you're stretching, or closer together in the
output than in the input if you're compressing.

the traditional electrical engineering definition of the phase vocoder
can be programmed to do this time stretching (or compressing) in the
frequency domain, frame-by-frame. in the original use, the phase
vocoder would adjust the phase of sinusoidal components so that, in
the reconstruction, the phase of a certain sinusoid in the previous
frame is coherent with the same or corresponding sinusoid in the
current frame.
we can sorta do that in the time domain without a phase vocoder, but
we can only impose the same delay (or the same linear phase component)
on all frequency components in the time domain, whereas with a phase
vocoder we can do it for each frequency component independently of the
others.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
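[The two-stage pitch shifter rbj describes — time scaler plus
resampler — can be caricatured in a few dozen lines. This is a deliberately
naive OLA stretcher with no splice-point search, so it will exhibit
exactly the phase-coherence artifacts discussed above; real SOLA/WSOLA
implementations work much harder. The frame/hop numbers are arbitrary:]

```python
import math

def resample(x, ratio):
    # Linear-interpolation resampler (digital "varispeed"): ratio > 1
    # raises pitch and shortens the signal, like speeding up a tape.
    out, pos = [], 0.0
    while pos < len(x) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1 - frac) * x[i] + frac * x[i + 1])
        pos += ratio
    return out

def ola_stretch(x, factor, frame=256, hop=64):
    # Naive OLA time stretch: frames taken every `hop` input samples
    # are laid down every `hop * factor` output samples. Real
    # SOLA/WSOLA also searches for splice points that keep the
    # waveforms coherent; that step is omitted here.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    out_len = int(len(x) * factor) + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    t = 0
    while t + frame <= len(x):
        o = int(t * factor)
        for n in range(frame):
            out[o + n] += win[n] * x[t + n]
            norm[o + n] += win[n]
        t += hop
    return [v / w if w > 1e-6 else 0.0 for v, w in zip(out, norm)]

def pitch_shift(x, semitones):
    # Stretch time by r, then resample by r: pitch moves by r while
    # the duration comes back to (roughly) the original.
    r = 2.0 ** (semitones / 12.0)
    return resample(ola_stretch(x, r), r)
```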
robert bristow-johnson <rbj@audioimagination.com> writes:

> On 12/17/12 9:39 PM, Les Cargill wrote:
>> Randy Yates wrote:
>>> [...]
>>>
>>> Y(f) = X1(f) * |X2(f)|,
>
> i thought it was more like
>
>    Y(f) = ( |X1(f)| + epsilon )^(-1) * X1(f) * |X2(f)|
>
> so it's trying to keep the same frequency components in x1(t), but
> apply the amplitude that x2(t) has at a particular frequency to x1.
> what this does is impose the spectral envelope of x2 onto x1 after
> first normalizing out the spectral envelope in x1. the value
> epsilon > 0 is very small and it is there to prevent division by
> zero.
OK, good point - I was wondering about that (whether the original spectrum should be cancelled or not).
>>> or
>>>
>>> Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>>>
>>> Is this correct?
>
> i dunno about that latter equation. it doesn't look familiar to me.
Duh. I meant to just reexpress the previous Y(f) in polar form, e.g., Y(f) = r2(f) * e^(j*phi1(f)), where I'm cancelling the original X1 magnitude as you suggested.
>> <handwaving and confusion follow>
>>
>> There are a great many things people call vocoders. I call the Moog
>> Modular plus accoutrements that "sang" on the Clockwork Orange
>> soundtrack *the* vocoder,
>
> boy, i was thinking of the other Kubrick production: 2001: A Space
> Odyssey, and the song HAL 9000 was singing as his consciousness was
> slipping away. i think it was Daisy. that was an early use of a
> vocoder of some sort.
>
>> but there are even voice based codecs that are called vocoders -
>> like CELP. Obviously, this verges into perceptual encoding (which
>> was what vocoding started out as - they were trying to figure out
>> how to disassemble and reassemble the human voice, probably to get
>> rate reduction).
Yes, there are those too, true. I meant the musical variety though.
>>> 2. Then what's the difference between a PHASE vocoder and a plain
>>> old vocoder?
>
> in the EE and DSP disciplines, i always thought that "vocoder" meant
> the "phase vocoder". ever since Portnoff.
>
> [...]
>
>>> 3. What has pitch shifting got to do with a phase vocoder?
>>
>> Dunno.
>
> some pitch shifters are a combination of two algs, a time-scaler
> (something that slows down or speeds up the audio without affecting
> the pitch) and a resampler (which is a digital counterpart to
> varispeed on analog tape) which will affect the speed and the pitch
> predictably. combined, these two algorithms can change the pitch
> without changing the speed.
>
> [...]
>
> the traditional electrical engineering definition of the phase
> vocoder can be programmed to do this time stretching (or
> compressing) in the frequency domain, frame-by-frame. in the
> original use, the phase vocoder would adjust the phase of sinusoidal
> components so that, in the reconstruction, the phase of a certain
> sinusoid in the previous frame is coherent with the same or
> corresponding sinusoid in the current frame. we can sorta do that in
> the time domain without a phase vocoder, but we can only impose the
> same delay (or the same linear phase component) on all frequency
> components in the time domain, whereas with a phase vocoder we can
> do it for each frequency component independently of the others.
OK, thanks for that Robert, but I'm still a little confused. I
normally think of the frequency control input of a pitch shifter as
controlled by a human (rather than by a signal). Are you saying that
in a pitch-shifter-based phase vocoder the frequency control input is
the "frequency of x1" (using the convention I established previously)?

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On 18/12/2012 02:39, Les Cargill wrote:
..
> and there is this:
> http://people.bath.ac.uk/masrwd/pvplugs.html
<grin> I did those a long time ago now, and it is even a bit
surprising people are still using them.

The canonical reference for the phase vocoder is the paper by Mark
Dolson:

www.eumus.edu.uy/eme/ensenanza/electivas/dsp/presentaciones/PhaseVocoderTutorial.pdf

It's called a "phase vocoder" (as distinct from a "channel vocoder")
because it tracks the running phase of each bin between analysis
frames, to produce more or less accurate frequency estimates.
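[Per bin, that running-phase trick reduces to a few lines: take the
phase advance over one hop, subtract the advance the bin centre would
predict, wrap the remainder, and convert back to Hz. A sketch, with a
naive single-bin DFT used only for the demonstration:]

```python
import cmath
import math

def dft_bin(x, k):
    # One bin of a naive DFT over the frame x.
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))

def bin_frequency(phase_prev, phase_cur, k, N, hop, fs):
    # Measured phase advance over one hop, minus the advance bin k's
    # centre frequency would predict, wrapped into [-pi, pi], gives
    # the deviation from the bin centre; add it back, convert to Hz.
    expected = 2 * math.pi * k * hop / N
    delta = phase_cur - phase_prev - expected
    delta -= 2 * math.pi * round(delta / (2 * math.pi))
    return (2 * math.pi * k / N + delta / hop) * fs / (2 * math.pi)
```

[The bare FFT only says "something near bin k"; the phase difference
pins the frequency down to a fraction of a bin, provided the hop is
short enough that the wrapped deviation is unambiguous.]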
>> 3. What has pitch shifting got to do with a phase vocoder?
It's just one of the things musicians use it for. All that happens is
that, given we have a stream of amplitude/frequency frames, we simply
multiply each frequency value by whatever, make sure the result goes
into an appropriate bin, and resynthesize using the IFFT. If we use an
oscillator bank, we don't even need to worry about appropriate bins.

Basically, pvoc analysis gives us a stream of (overlapping)
amplitude/frequency frames, with which we can do what we like - in
effect we treat the data as input to an oscillator bank, and if we
choose to and fiddle with things as needed for the FFT, we can use
pvoc for fast resynthesis. Musicians are scant respecters of DSP
propriety, and will for example happily "zero" bins any which way, in
pursuit of interesting effects. We also use pvoc for cross-synthesis,
morphing, and so forth.

In my latest um, invention, I use the sliding DFT (new frame every
sample), with frequency-domain windowing, and apply audio-rate FM to
each bin, to produce what I have called "Transformational FM" - i.e.
modulating the pitch sample by sample. Like classic Chowning-style FM,
but with an arbitrary audio input. Veerrryyy slow on an ordinary
computer (not least because there may be > 1024 oscillators involved),
but we (me and my chums at Bath Uni) got it running in real time on a
mid-range commodity GPU. We have submitted a research funding
application so we can develop it properly. My home PC is too old to
install a modern GPU card :-(.

And of course all this stuff is in Csound too.

Richard Dobson
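[The "multiply each frequency and put it in an appropriate bin" step
can be sketched for a single analysis frame. This is pure bookkeeping;
windowing, running-phase accumulation and the IFFT resynthesis are all
omitted, and the collision handling (summing amplitudes) is just one
possible choice:]

```python
def shift_frame(amps, freqs, ratio, N, fs):
    # One pvoc frame as parallel per-bin lists: amplitude, and
    # estimated frequency in Hz. Scale each frequency by the pitch
    # ratio and move the pair to the nearest analysis bin, summing
    # amplitudes when two partials land in the same bin.
    new_amps = [0.0] * len(amps)
    new_freqs = [0.0] * len(amps)
    for a, f in zip(amps, freqs):
        if a == 0.0:
            continue
        nf = f * ratio
        k = int(round(nf * N / fs))  # nearest bin for the new frequency
        if 0 <= k < len(amps):       # partials shifted past Nyquist are dropped
            new_amps[k] += a
            new_freqs[k] = nf
    return new_amps, new_freqs
```

[With an oscillator-bank resynthesis the re-binning is unnecessary, as
noted above: each (amplitude, frequency) pair just drives its own
oscillator directly.]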
What type of vocoder was this?

http://www.youtube.com/watch?v=w5beTy9SnkU
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> What type of vocoder was this?
>
> http://www.youtube.com/watch?v=w5beTy9SnkU
http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html

Allegedly also used for the "Cylon" voices on the original, Lorne
Greene "Battlestar Galactica".

--
Les Cargill
Les Cargill <lcargill99@comcast.com> writes:

> Randy Yates wrote:
>> What type of vocoder was this?
>>
>> http://www.youtube.com/watch?v=w5beTy9SnkU
>
> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>
> Allegedly also used for the "Cylon" voices on the original, Lorne
> Greene "Battlestar Galactica".
Well that's cool, but my question was, is it a "plain" vocoder or a
"phase vocoder"?

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> Les Cargill <lcargill99@comcast.com> writes:
>> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>>
>> Allegedly also used for the "Cylon" voices on the original, Lorne
>> Greene "Battlestar Galactica".
>
> Well that's cool, but my question was, is it a "plain" vocoder or a
> "phase vocoder?"
Heh. I read "type" as meaning "model".... Doh!

I do not know which class of device that is.

--
Les Cargill
On 19/12/2012 04:44, Randy Yates wrote:
> Les Cargill <lcargill99@comcast.com> writes:
>> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>>
>> Allegedly also used for the "Cylon" voices on the original, Lorne
>> Greene "Battlestar Galactica".
>
> Well that's cool, but my question was, is it a "plain" vocoder or a
> "phase vocoder?"
Plain vocoder, aka "analog vocoder". Essentially the same type as used
by ELO ("Mr Blue Sky" etc). A filter bank receives the speech input
and splits it into bands (more or less roughly capturing the vocal
formants); the varying output level of each filter controls a matching
filter applied to the synth input; a bit like a rapidly modulated
graphic EQ.

A video describing the EMS vocoder (easily the most famous model) is
here:

http://www.youtube.com/watch?v=Sli2JRQ0i8w

Another classic one was the Roland vocoder:

http://www.youtube.com/watch?v=cOO6xTXTeiA

Needless to say, we can do this sort of thing with the phase vocoder
too. The analog vocoder might have a dozen filter bands; the phase
vocoder will have 100s or 1000s depending on the FFT size used. So
sometimes some work has to be done to simulate a smaller number of
bands when using the phase vocoder.

To that extent, they are both forms of input-controlled time-varying
filter banks. The phase vocoder is just a lot bigger, and unlikely
ever to be implemented in analogue form.

Richard Dobson