DSPRelated.com
Forums

Phase Vocoder Vs. Vocoder, and Phase Vocoder relationship with Pitch Shifting

Started by Randy Yates December 17, 2012
Hi Folks,

In following up some of the posts here recently, I realized that there
are some serious holes in my understanding of vocoders and pitch
shifting.

1. I looked around and couldn't find a good definition of a plain old
"vocoder." My thought is that it is a device with two inputs x1 and
x2 and one output, such that output is input x1 with its magnitude
spectrum modified by the magnitude spectrum of x2. That is,

  Y(f) = X1(f) * |X2(f)|,

or

  Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),

Is this correct?

2. Then what's the difference between a PHASE vocoder and a plain
old vocoder?

3. What has pitch shifting got to do with a phase vocoder?

-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> Hi Folks,
>
> In following up some of the posts here recently, I realized that there
> are some serious holes in my understanding of vocoders and pitch
> shifting.
>
> 1. I looked around and couldn't find a good definition of a plain old
> "vocoder." My thought is that it is a device with two inputs x1 and
> x2 and one output, such that output is input x1 with its magnitude
> spectrum modified by the magnitude spectrum of x2. That is,
>
>   Y(f) = X1(f) * |X2(f)|,
>
> or
>
>   Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>
> Is this correct?
<handwaving and confusion follow>

There are a great many things people call vocoders. I call the Moog
Modular plus accoutrements that "sang" on the Clockwork Orange
soundtrack *the* vocoder, but there are even voice based codecs that
are called vocoders - like CELP. Obviously, this verges into perceptual
encoding (which was what vocoding started out as - they were trying to
figure out how to disassemble and reassemble the human voice, probably
to get rate reduction).

But the basic idea is to separate the formants from the pitch, then
substitute pitch from another source. SFAIK, there is more than one way
to do this.
> 2. Then what's the difference between a PHASE vocoder and a plain
> old vocoder?
I cheated and looked - a phase vocoder scales both frequency and time
domain. I suppose then that a non-phase vocoder just scales (changes
amplitude) in the time domain.

No, I don't know why that's interesting, other than that Autotune uses
it.

Wikipedia pointed me to this:

http://www.ircam.fr/461.html?&L=1

and there is this:

http://people.bath.ac.uk/masrwd/pvplugs.html
> 3. What has pitch shifting got to do with a phase vocoder?
Dunno.

--
Les Cargill
On 12/17/12 9:39 PM, Les Cargill wrote:
> Randy Yates wrote:
>> Hi Folks,
>>
>> [...]
>>
>> 1. I looked around and couldn't find a good definition of a plain
>> old "vocoder." My thought is that it is a device with two inputs x1
>> and x2 and one output, such that output is input x1 with its
>> magnitude spectrum modified by the magnitude spectrum of x2. That
>> is,
>>
>> Y(f) = X1(f) * |X2(f)|,
i thought it was more like

   Y(f) = ( |X1(f)| + epsilon )^(-1) * X1(f) * |X2(f)|

so it's trying to keep the same frequency components in x1(t), but
apply the amplitude that x2(t) has at a particular frequency to x1.
what this does is impose the spectral envelope of x2 onto x1 after
first normalizing out the spectral envelope in x1. the value
epsilon > 0 is very small and it is there to prevent division by zero.
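[In code, that bin-by-bin rule might look like the following
single-frame sketch. The naive DFT and the function names are purely
illustrative, not from any particular library:]

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT; fine for a short illustrative frame.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def vocode_frame(x1, x2, eps=1e-12):
    # Y(f) = X1(f) * |X2(f)| / (|X1(f)| + eps): keep x1's frequency
    # content and phase, but impose x2's magnitude spectrum; eps
    # guards against division by zero in empty bins.
    X1, X2 = dft(x1), dft(x2)
    Y = [a * abs(b) / (abs(a) + eps) for a, b in zip(X1, X2)]
    return idft(Y)
```

[Because x1's magnitude is normalized out, feeding in the same tone at
two different amplitudes returns the modulator's amplitude, not the
carrier's.]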
>> or
>>
>> Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>>
>> Is this correct?
i dunno about that latter equation. it doesn't look familiar to me.
> <handwaving and confusion follow>
>
> There are a great many things people call vocoders. I call the Moog
> Modular plus accoutrements that "sang" on the Clockwork Orange
> soundtrack *the* vocoder,
boy, i was thinking of the other Kubrick production: 2001: A Space
Odyssey, and the song HAL 9000 was singing as his consciousness was
slipping away. i think it was Daisy. that was an early use of a
vocoder of some sort.
> but there are even voice based codecs that are called vocoders -
> like CELP. Obviously, this verges into perceptual encoding (which
> was what vocoding started out as - they were trying to figure out
> how to disassemble and reassemble the human voice, probably to get
> rate reduction).
>
> But the basic idea is to separate the formants from the pitch, then
> substitute pitch from another source. SFAIK, there is more than one
> way to do this.
>
>> 2. Then what's the difference between a PHASE vocoder and a plain
>> old vocoder?
in the EE and DSP disciplines, i always thought that "vocoder" meant
the "phase vocoder". ever since Portnoff.

but in the music industry, there are vocoders that work sorta in the
time domain with a big bank of resonant filters in parallel. a
multi-band thingie. each filter in the bank would look sorta like:


   x2 ----->[BPF]---->|u|----->[LPF]------.
                                          |
                                          |
                                          V
   x1 ----->[BPF]----------------------->(x)-----> y


the two BPFs are identically tuned and the LPF is there to smooth out
the envelope coming out of the rectified output |u| from the x2 BPF.
it's hard to do this, but you try to make the BPFs of adjacent bands
complementary so that if there is no effect from x2, the outputs of
all of the x1 BPFs will add to about the same x1 that goes in.

the output of this particular band is added to the outputs of the
other bands. this does not normalize the spectrum of x1, just applies
the spectrum of x2 onto x1.

it's "sorta" time domain because no FFT is used, but a multiband bank
of filters is a sorta Fourier analyzer in itself.

the phase vocoder might be programmed to try to accomplish the same
effect but in the frequency domain after the FFT. the spectrum of one
input is applied to the other input.
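[One band of that block diagram can be sketched in a few lines. The
biquad below uses the common "cookbook" bandpass form; the Q and
smoothing values are placeholder choices, not anything canonical:]

```python
import math

def bandpass_coeffs(fc, q, fs):
    # 0 dB peak-gain biquad bandpass (cookbook form), normalized by a0.
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2 * math.cos(w0) / a0, (1 - alpha) / a0)

def filt(coeffs, x):
    # Direct-form I biquad filter over the whole signal.
    b0, b1, b2, a1, a2 = coeffs
    xm1 = xm2 = ym1 = ym2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * xm1 + b2 * xm2 - a1 * ym1 - a2 * ym2
        xm1, xm2, ym1, ym2 = xn, xm1, yn, ym1
        y.append(yn)
    return y

def vocoder_band(x1, x2, fc, fs, q=5.0, smooth=0.99):
    coeffs = bandpass_coeffs(fc, q, fs)
    carrier = filt(coeffs, x1)   # x1 through this band's BPF
    modpath = filt(coeffs, x2)   # x2 through the identically tuned BPF
    out, env = [], 0.0
    for c, m in zip(carrier, modpath):
        env = smooth * env + (1 - smooth) * abs(m)  # |u| then one-pole LPF
        out.append(c * env)      # x2's band envelope modulates x1's band
    return out
```

[A full channel vocoder sums the outputs of many such bands at
different centre frequencies.]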
> I cheated and looked - a phase vocoder scales both frequency and
> time domain. I suppose then that a non-phase vocoder just scales
> (changes amplitude) in the time domain.
>
> No, I don't know why that's interesting, other than that Autotune
> uses it.
as best as i can tell (but i haven't looked Autotune over closely for
about a decade), Autotune is a time-domain alg. not a phase vocoder or
anything frequency domain. my guess is that Autotune remains a
time-domain alg. maybe they went to multiband pitch shifting, but i
don't know that.
>> 3. What has pitch shifting got to do with a phase vocoder?
>
> Dunno.
some pitch shifters are a combination of two algs, a time-scaler
(something that slows down or speeds up the audio without affecting
the pitch) and a resampler (which is a digital counterpart to
varispeed on analog tape) which will affect the speed and the pitch
predictably. combined, these two algorithms can change the pitch
without changing the speed.

the resampler is mathematically a pretty well solved problem even
though there are different commercial products that have different
performance metrics: http://src.infinitewave.ca/ . anyway, if you look
at the resampling issue as one of just getting the sinc function long
enough and good enough and simply applying the Shannon reconstruction
formula, resampling is a "mature technology".

but the time scaler (either time compression or time stretching) is
not. that's still a little bit like alchemy (and not the old
music/audio product). to do glitch-free time-scaling requires some
kind of algorithmic guessing and some tradeoffs. this time scaler can
be done totally in the time domain by splicing in repeated audio (if
you're stretching) or splicing out audio (if you're compressing).
this is where all of that SOLA and PSOLA and WSOLA come from. they
sorta all divide the audio into frames of some sort and overlap-add in
the output. the frames of audio are spaced farther apart in the output
than in the input if you're stretching, or closer together in the
output than in the input if you're compressing.

the traditional electrical engineering definition of the phase vocoder
can be programmed to do this time stretching (or compressing) in the
frequency domain, frame-by-frame. in the original use, the phase
vocoder would adjust the phase of sinusoidal components so that, in
the reconstruction, the phase of a certain sinusoid in the previous
frame is coherent with the same or corresponding sinusoid in the
current frame.
we can sorta do that in the time domain without a phase vocoder, but
we can only impose the same delay (or the same linear phase component)
on all frequency components in the time domain, whereas with a phase
vocoder we can do it for each frequency component independently of the
others.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
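[The two-stage pitch shifter rbj describes — time scaler plus
resampler — can be caricatured in a few dozen lines. This is a deliberately
naive OLA stretcher with no splice-point search, so it will exhibit
exactly the phase-coherence artifacts discussed above; real SOLA/WSOLA
implementations work much harder. The frame/hop numbers are arbitrary:]

```python
import math

def resample(x, ratio):
    # Linear-interpolation resampler (digital "varispeed"): ratio > 1
    # raises pitch and shortens the signal, like speeding up a tape.
    out, pos = [], 0.0
    while pos < len(x) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1 - frac) * x[i] + frac * x[i + 1])
        pos += ratio
    return out

def ola_stretch(x, factor, frame=256, hop=64):
    # Naive OLA time stretch: frames taken every `hop` input samples
    # are laid down every `hop * factor` output samples. Real
    # SOLA/WSOLA also searches for splice points that keep the
    # waveforms coherent; that step is omitted here.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    out_len = int(len(x) * factor) + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    t = 0
    while t + frame <= len(x):
        o = int(t * factor)
        for n in range(frame):
            out[o + n] += win[n] * x[t + n]
            norm[o + n] += win[n]
        t += hop
    return [v / w if w > 1e-6 else 0.0 for v, w in zip(out, norm)]

def pitch_shift(x, semitones):
    # Stretch time by r, then resample by r: pitch moves by r while
    # the duration comes back to (roughly) the original.
    r = 2.0 ** (semitones / 12.0)
    return resample(ola_stretch(x, r), r)
```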
robert bristow-johnson <rbj@audioimagination.com> writes:

> On 12/17/12 9:39 PM, Les Cargill wrote:
>> Randy Yates wrote:
>>> [...]
>>>
>>> Y(f) = X1(f) * |X2(f)|,
>
> i thought it was more like
>
>    Y(f) = ( |X1(f)| + epsilon )^(-1) * X1(f) * |X2(f)|
>
> so it's trying to keep the same frequency components in x1(t), but
> apply the amplitude that x2(t) has at a particular frequency to x1.
> what this does is impose the spectral envelope of x2 onto x1 after
> first normalizing out the spectral envelope in x1. the value
> epsilon > 0 is very small and it is there to prevent division by
> zero.
OK, good point - I was wondering about that (whether the original spectrum should be cancelled or not).
>>> or
>>>
>>> Y(f) = r1 * r2 * exp(j*2*pi*f + phi1),
>>>
>>> Is this correct?
>
> i dunno about that latter equation. it doesn't look familiar to me.
Duh. I meant to just reexpress the previous Y(f) in polar form, e.g., Y(f) = r2(f) * e^(j*phi1(f)), where I'm cancelling the original X1 magnitude as you suggested.
>> <handwaving and confusion follow>
>>
>> There are a great many things people call vocoders. I call the Moog
>> Modular plus accoutrements that "sang" on the Clockwork Orange
>> soundtrack *the* vocoder,
>
> boy, i was thinking of the other Kubrick production: 2001: A Space
> Odyssey, and the song HAL 9000 was singing as his consciousness was
> slipping away. i think it was Daisy. that was an early use of a
> vocoder of some sort.
>
>> but there are even voice based codecs that are called vocoders -
>> like CELP. Obviously, this verges into perceptual encoding (which
>> was what vocoding started out as - they were trying to figure out
>> how to disassemble and reassemble the human voice, probably to get
>> rate reduction).
Yes, there are those too, true. I meant the musical variety though.
>>> 2. Then what's the difference between a PHASE vocoder and a plain
>>> old vocoder?
>
> in the EE and DSP disciplines, i always thought that "vocoder" meant
> the "phase vocoder". ever since Portnoff.
>
> [...]
>
>>> 3. What has pitch shifting got to do with a phase vocoder?
>>
>> Dunno.
>
> some pitch shifters are a combination of two algs, a time-scaler
> (something that slows down or speeds up the audio without affecting
> the pitch) and a resampler (which is a digital counterpart to
> varispeed on analog tape) which will affect the speed and the pitch
> predictably. combined, these two algorithms can change the pitch
> without changing the speed.
>
> [...]
>
> the traditional electrical engineering definition of the phase
> vocoder can be programmed to do this time stretching (or
> compressing) in the frequency domain, frame-by-frame. in the
> original use, the phase vocoder would adjust the phase of sinusoidal
> components so that, in the reconstruction, the phase of a certain
> sinusoid in the previous frame is coherent with the same or
> corresponding sinusoid in the current frame. we can sorta do that in
> the time domain without a phase vocoder, but we can only impose the
> same delay (or the same linear phase component) on all frequency
> components in the time domain, whereas with a phase vocoder we can
> do it for each frequency component independently of the others.
OK, thanks for that Robert, but I'm still a little confused. I
normally think of the frequency control input of a pitch shifter as
controlled by a human (rather than by a signal). Are you saying that
in a pitch-shifter-based phase vocoder the frequency control input is
the "frequency of x1" (using the convention I established previously)?

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On 18/12/2012 02:39, Les Cargill wrote:
..
> and there is this:
> http://people.bath.ac.uk/masrwd/pvplugs.html
<grin> I did those a long time ago now, and it is even a bit
surprising people are still using them.

The canonical reference for the phase vocoder is the paper by Mark
Dolson:

www.eumus.edu.uy/eme/ensenanza/electivas/dsp/presentaciones/PhaseVocoderTutorial.pdf

It's called a "phase vocoder" (as distinct from a "channel vocoder")
because it tracks the running phase of each bin between analysis
frames, to produce more or less accurate frequency estimates.
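[Per bin, that running-phase trick reduces to a few lines: take the
phase advance over one hop, subtract the advance the bin centre would
predict, wrap the remainder, and convert back to Hz. A sketch, with a
naive single-bin DFT used only for the demonstration:]

```python
import cmath
import math

def dft_bin(x, k):
    # One bin of a naive DFT over the frame x.
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))

def bin_frequency(phase_prev, phase_cur, k, N, hop, fs):
    # Measured phase advance over one hop, minus the advance bin k's
    # centre frequency would predict, wrapped into [-pi, pi], gives
    # the deviation from the bin centre; add it back, convert to Hz.
    expected = 2 * math.pi * k * hop / N
    delta = phase_cur - phase_prev - expected
    delta -= 2 * math.pi * round(delta / (2 * math.pi))
    return (2 * math.pi * k / N + delta / hop) * fs / (2 * math.pi)
```

[The bare FFT only says "something near bin k"; the phase difference
pins the frequency down to a fraction of a bin, provided the hop is
short enough that the wrapped deviation is unambiguous.]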
>> 3. What has pitch shifting got to do with a phase vocoder?
It's just one of the things musicians use it for. All that happens is
that, given we have a stream of amplitude/frequency frames, we simply
multiply each frequency value by whatever, make sure the result goes
into an appropriate bin, and resynthesize using the IFFT. If we use an
oscillator bank, we don't even need to worry about appropriate bins.

Basically, pvoc analysis gives us a stream of (overlapping)
amplitude/frequency frames, with which we can do what we like - in
effect we treat the data as input to an oscillator bank, and if we
choose to and fiddle with things as needed for the FFT, we can use
pvoc for fast resynthesis. Musicians are scant respecters of DSP
propriety, and will for example happily "zero" bins any which way, in
pursuit of interesting effects. We also use pvoc for cross-synthesis,
morphing, and so forth.

In my latest um, invention, I use the sliding DFT (new frame every
sample), with frequency-domain windowing, and apply audio-rate FM to
each bin, to produce what I have called "Transformational FM" - i.e.
modulating the pitch sample by sample. Like classic Chowning-style FM,
but with an arbitrary audio input. Veerrryyy slow on an ordinary
computer (not least because there may be > 1024 oscillators involved),
but we (me and my chums at Bath Uni) got it running in real time on a
mid-range commodity GPU. We have submitted a research funding
application so we can develop it properly. My home PC is too old to
install a modern GPU card :-(.

And of course all this stuff is in Csound too.

Richard Dobson
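[The "multiply each frequency and put it in an appropriate bin" step
can be sketched for a single analysis frame. This is pure bookkeeping;
windowing, running-phase accumulation and the IFFT resynthesis are all
omitted, and the collision handling (summing amplitudes) is just one
possible choice:]

```python
def shift_frame(amps, freqs, ratio, N, fs):
    # One pvoc frame as parallel per-bin lists: amplitude, and
    # estimated frequency in Hz. Scale each frequency by the pitch
    # ratio and move the pair to the nearest analysis bin, summing
    # amplitudes when two partials land in the same bin.
    new_amps = [0.0] * len(amps)
    new_freqs = [0.0] * len(amps)
    for a, f in zip(amps, freqs):
        if a == 0.0:
            continue
        nf = f * ratio
        k = int(round(nf * N / fs))  # nearest bin for the new frequency
        if 0 <= k < len(amps):       # partials shifted past Nyquist are dropped
            new_amps[k] += a
            new_freqs[k] = nf
    return new_amps, new_freqs
```

[With an oscillator-bank resynthesis the re-binning is unnecessary, as
noted above: each (amplitude, frequency) pair just drives its own
oscillator directly.]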
What type of vocoder was this?

http://www.youtube.com/watch?v=w5beTy9SnkU
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> What type of vocoder was this?
>
> http://www.youtube.com/watch?v=w5beTy9SnkU
http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html

Allegedly also used for the "Cylon" voices on the original, Lorne
Greene "Battlestar Galactica".

--
Les Cargill
Les Cargill <lcargill99@comcast.com> writes:

> Randy Yates wrote:
>> What type of vocoder was this?
>>
>> http://www.youtube.com/watch?v=w5beTy9SnkU
>
> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>
> Allegedly also used for the "Cylon" voices on the original, Lorne
> Greene "Battlestar Galactica".
Well that's cool, but my question was, is it a "plain" vocoder or a
"phase vocoder"?

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates wrote:
> Les Cargill <lcargill99@comcast.com> writes:
>> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>>
>> Allegedly also used for the "Cylon" voices on the original, Lorne
>> Greene "Battlestar Galactica".
>
> Well that's cool, but my question was, is it a "plain" vocoder or a
> "phase vocoder?"
Heh. I read "type" as meaning "model".... Doh!

I do not know which class of device that is.

--
Les Cargill
On 19/12/2012 04:44, Randy Yates wrote:
> Les Cargill <lcargill99@comcast.com> writes:
>> http://m.matrixsynth.com/2006/05/ems-vocoder-5000.html
>>
>> Allegedly also used for the "Cylon" voices on the original, Lorne
>> Greene "Battlestar Galactica".
>
> Well that's cool, but my question was, is it a "plain" vocoder or a
> "phase vocoder?"
Plain vocoder, aka "analog vocoder". Essentially the same type as used
by ELO ("Mr Blue Sky" etc). A filter bank receives the speech input
and splits it into bands (more or less roughly capturing the vocal
formants); the varying output level of each filter controls a matching
filter applied to the synth input; a bit like a rapidly modulated
graphic EQ.

A video describing the EMS vocoder (easily the most famous model) is
here:

http://www.youtube.com/watch?v=Sli2JRQ0i8w

Another classic one was the Roland vocoder:

http://www.youtube.com/watch?v=cOO6xTXTeiA

Needless to say, we can do this sort of thing with the phase vocoder
too. The analog vocoder might have a dozen filter bands; the phase
vocoder will have 100s or 1000s depending on the FFT size used. So
sometimes some work has to be done to simulate a smaller number of
bands when using the phase vocoder.

To that extent, they are both forms of input-controlled time-varying
filter banks. The phase vocoder is just a lot bigger, and unlikely
ever to be implemented in analogue form.

Richard Dobson