The Phase Vocoder Transform

Christian Yost, February 12, 2019


I would like to look at the phase vocoder in a fairly "abstract" way today. The purpose is to discuss a method for measuring the quality of various phase vocoder algorithms, building on a measure proposed in [2]. We will spend some time in the domain of continuous mathematics, defining a phase vocoder function, or map, rather than an algorithm, and taking a perspective reminiscent of group theory, ultimately concluding that the phase vocoder map is well-defined. After going through all of this, we should be confident that the given measurement of phase vocoder quality is a good one.


Let's start by laying out some notation. 


    $\alpha = $ time modification factor

    $\beta = $ frequency modification factor

    $x(t) = $ analysis/input time domain signal

    $y(t) = $ synthesis/output time domain signal

    $X(\omega) = $ analysis/input frequency domain signal 

    $Y(\omega) = $ synthesis/output frequency domain signal

When the discussion is more theoretical, we will refer to $x(t)$ as simply $x$, per [1]. We will see why later on. 


The phase vocoder map $PV(x,\alpha,\beta)$ is defined by the following equation

$$PV\big(x,\alpha,\beta\big) = \int_{-\infty}^{\infty}\big|X\big(\frac{\omega}{\beta},\alpha t\big)\big|\cdot e^{i\phi_{pv}(\omega,t)}d\omega = y(t)$$

where the phase vocoder phase function $\phi_{pv}(\omega, t)$ is 

$$\phi_{pv}(\omega, t) = \angle X\Big(\frac{\omega}{\beta},0\Big) + \int_{0}^{t} \frac{\partial}{\partial \tau}\big[\phi(\omega,\tau)\big]\, d\tau = \angle X\Big(\frac{\omega}{\beta},0\Big) + \phi(\omega,t) - \phi(\omega,0)$$

The thinking here is that our time modification factor $\alpha$ is a "percent of the original speed". So, if you want to slow the signal down by a factor of two, in other words play it at fifty percent of the original speed, then $\alpha = 0.5$. The frequency modification factor $\beta$ is a scaling of the original frequency domain data. So if you want to shift the frequencies of the input signal up an octave ($\beta = 2$), then the data in the output signal at frequency $\omega$ is equal to the frequency domain data in the input signal at frequency $\frac{\omega}{2}$.
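As a quick numerical check of these conventions, here is a minimal sketch; the helper names are hypothetical, not part of any phase vocoder library.

```python
# Hypothetical helpers illustrating the alpha/beta conventions above.

def output_duration(n, alpha):
    """Duration of the output signal for time modification factor alpha: N / alpha."""
    return n / alpha

def source_frequency(omega, beta):
    """Input frequency whose data supplies output frequency omega: omega / beta."""
    return omega / beta

# Fifty percent of the original speed (alpha = 0.5) doubles the duration:
print(output_duration(44100, 0.5))   # -> 88200.0
# Shifting up an octave (beta = 2) reads from half the frequency:
print(source_frequency(880.0, 2.0))  # -> 440.0
```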

In the equation for $\phi_{pv}$, integrating the derivative may at first seem strange, but here we are simply giving a continuous representation to the discrete algorithm we already know and love. The important bit in the equation for $\phi_{pv}$ is that the phase offset is set to the initial phase of the properly scaled frequency information of the input signal.
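Since the continuous map mirrors the discrete algorithm, a minimal sketch of the classic discrete phase vocoder may make the correspondence concrete. This is an illustrative toy in Python/NumPy covering time-stretching only ($\beta = 1$), not the linked MATLAB code; the window, hop, and function names are all assumptions.

```python
import numpy as np

def stft(x, win, hop):
    """Analysis frames of x: windowed FFTs taken every `hop` samples."""
    n = len(win)
    return np.array([np.fft.rfft(x[i:i + n] * win)
                     for i in range(0, len(x) - n + 1, hop)])

def istft(frames, win, hop):
    """Overlap-add resynthesis with window-squared normalization."""
    n = len(win)
    out = np.zeros(hop * (len(frames) - 1) + n)
    norm = np.zeros_like(out)
    for j, F in enumerate(frames):
        out[j * hop:j * hop + n] += np.fft.irfft(F, n) * win
        norm[j * hop:j * hop + n] += win ** 2
    return out / np.maximum(norm, 1e-12)

def phase_vocoder(x, alpha, n_fft=1024, hop=256):
    """Classic time-stretch sketch: alpha = 0.5 means half speed (twice as long)."""
    win = np.hanning(n_fft)
    X = stft(x, win, hop)
    # Expected per-hop phase advance of each bin center.
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft
    phase = np.angle(X[0])          # phase offset: initial analysis phase
    out = []
    for s in np.arange(0.0, len(X) - 1, alpha):
        i = int(s)
        out.append(np.abs(X[i]) * np.exp(1j * phase))
        # Measured phase increment, wrapped to its principal value.
        dphi = np.angle(X[i + 1]) - np.angle(X[i]) - omega
        dphi -= 2.0 * np.pi * np.round(dphi / (2.0 * np.pi))
        # Accumulate phase: the discrete analog of integrating d/dt[phi].
        phase = phase + omega + dphi
    return istft(np.array(out), win, hop)
```

The inner loop does exactly what $\phi_{pv}$ describes: it starts from the initial analysis phase and accumulates the per-hop phase increment, a running sum standing in for the integral of the phase derivative.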

The Weeds

The phase vocoder acts like a map between two sets whose elements are signals. These signals have finite energy and can be understood as vectors in a Hilbert space $\mathbb{H}$, known as a signal space, as elaborated in [1]; when referring to such a vector as a whole, we write $x$ rather than $x(t)$. The domain is the singleton set containing the input signal itself: $\mathcal{X} = \{ x \}$. The range is the set $\mathcal{Y}$ of all signals $y$ such that $y = PV(x,\alpha,\beta)$ for some $\alpha, \beta \in \mathbb{R} \setminus \{0\}$. In total,

$$ PV: \mathcal{X} \to \mathcal{Y}$$

$$ \text{such that } x,y \in \mathbb{H} \ \ \forall\, (x,y) \in \mathcal{X} \times \mathcal{Y}$$

We see here that the input signal $x$ acts like a generator of the co-domain $\mathcal{Y}$. When $\alpha = 1$ and $\beta = 1$, $PV$ acts like the identity map since $PV(x,1,1) = x$. 

We will make the additional claim that this phase vocoder map is bijective. To motivate this, consider the characteristics of the input signal that we are modifying: duration and pitch. Let the duration of $x(t)$ be $N$. Since $x(t)$ is a signal of finite energy, we will assume that $N$ is finite. Consequently, the duration of our output signal is $\frac{N}{\alpha}$. The durations of the input and output signals are linearly related, so the duration of the output signal determines $\alpha$ uniquely.

Furthermore, consider the frequency characteristics of the output signal. For a frequency modification factor $\beta$, we want the frequency domain characteristics of $Y(\omega)$ to be those of $X(\frac{\omega}{\beta})$. The frequency information of the input and output signals is linearly related, so the frequency scaling of the output signal determines $\beta$ uniquely.

An output signal with a given duration and frequency relationship to the input is only possible for a unique pair $(\alpha_{0}, \beta_{0})$. Because of this we will say that the phase vocoder map is one-to-one and onto, in other words, a bijection.
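The uniqueness claim can be phrased computationally: from the observed duration ratio and frequency ratio we can solve back for $(\alpha, \beta)$ exactly. A small sketch, with hypothetical names:

```python
# Recover (alpha, beta) from the observable modifications, inverting
# n_out = n_in / alpha and f_out = beta * f_in.
def factors_from_observables(n_in, n_out, f_in, f_out):
    return n_in / n_out, f_out / f_in

# Half speed, shifted up an octave:
print(factors_from_observables(44100, 88200, 440.0, 880.0))  # -> (0.5, 2.0)
```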

Since $PV(x,\alpha,\beta)$ is a bijective map, there must exist an inverse, $PV^{-1}$. We will define the inverse phase vocoder map in the following equation.

$$PV^{-1}(y,\alpha,\beta) = PV\Big(y,\frac{1}{\alpha},\frac{1}{\beta}\Big)$$

Because $\alpha, \beta \neq 0$, the reciprocals $\frac{1}{\alpha}$ and $\frac{1}{\beta}$ always exist, so $PV^{-1}$ is a well-behaved and well-defined mapping. 

The fact that the phase vocoder acts like a bijective map is important because it tells us that the signals we generate from an initial input $x$ are unique for each ordered pair $(\alpha,\beta) \in (\mathbb{R} \setminus \{0\}) \times (\mathbb{R} \setminus \{0\})$. Because of this, we can be confident that operations performed on $y \in \mathcal{Y}$ reflect the phase vocoder map $PV$ itself, and not some other curious circumstance that was overlooked. 

In the world of continuous mathematics and analog signals, the inverse phase vocoder returns the exact input signal we originally gave it. However, in the world of DSP, certain "phasey" artifacts arise in the phase vocoder's output as a result of spectral leakage and sinusoids of varying frequency. Thus, when we perform the inverse phase vocoder in a digital signal processing context, the signal we get back isn't identical to the one we started with. We will see how to use these artifacts to judge the effectiveness of a phase vocoder algorithm.

Digital Considerations

The properties we just laid out change a bit when we reenter the digital signal realm. Specifically, our frequency modification is no longer injective because of aliasing: frequencies are now cyclic, and for a sampling frequency of $f_{s}$ the scaled frequency only matters modulo $f_{s}$, so the effective output frequency is

$$\omega^{*} \equiv \beta\omega \mod f_{s}$$
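This wrap-around can be checked directly: at sampling rate $f_s$, a sinusoid at $f$ and one at $f + f_s$ produce identical sample sequences. A tiny NumPy demonstration, with arbitrary example values:

```python
import numpy as np

fs = 1000                     # sampling rate (samples per second)
n = np.arange(fs)             # one second of sample indices
a = np.sin(2 * np.pi * 40 * n / fs)          # 40 Hz sinusoid
b = np.sin(2 * np.pi * (40 + fs) * n / fs)   # 1040 Hz sinusoid, aliased

# The two sampled signals are indistinguishable up to rounding error.
print(np.max(np.abs(a - b)))
```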

Furthermore, our choice of $\alpha$ is restricted such that 

$$\frac{N}{\alpha} \in [1,\infty)$$

since we can't have a signal shorter than $1$ sample, and we have limited data storage. However, it is nearly impossible to think of a phase vocoder application that violates these bounds, so we will assume these restrictions are met from here on. We will use this notion of the phase vocoder transform as a measure of the resolution of our DSP phase vocoders. 

Quality Measurement

Laroche and Dolson give the following consistency measure in [2] to quantify the effectiveness of a phase vocoder algorithm. 

$$D_{M} = \frac{\sum_{u = 1}^{P}\sum_{k=0}^{N-1}\big[ | Z(t_{s}^{u},\omega_{k}) | - | X(t_{s}^{u},\omega_{k}) | \big]^{2}}{\sum_{u = 1}^{P}\sum_{k=0}^{N-1} | X(t_{s}^{u},\omega_{k}) |^{2}}$$

This has been slightly modified from the original version: here we compare the twice-modified signal, $z(t) = PV^{-1}\big(PV(x(t),\alpha,\beta),\alpha,\beta\big)$, to the original input, $x(t)$. $D_{M}$ compares the squared energy added by the phase vocoder algorithm to the squared energy of the original signal. If perfect reconstruction is achieved, $D_{M} = 0$. In the following section, we will look at the consistency measure of the Identity Phase Locking algorithm proposed by Laroche and Dolson in [2]. 
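As a sketch, the measure can be computed directly from magnitude STFTs. This is a hypothetical NumPy implementation, with the window, hop, and frame alignment assumed rather than taken from [2]:

```python
import numpy as np

def consistency_measure(x, z, n_fft=4096, hop=1024):
    """D_M: relative squared error between the STFT magnitudes of the
    round-tripped signal z and the original signal x."""
    win = np.hanning(n_fft)
    m = min(len(x), len(z))
    num = 0.0
    den = 0.0
    for i in range(0, m - n_fft + 1, hop):
        X = np.abs(np.fft.rfft(x[i:i + n_fft] * win))
        Z = np.abs(np.fft.rfft(z[i:i + n_fft] * win))
        num += np.sum((Z - X) ** 2)   # energy added (or removed) per frame
        den += np.sum(X ** 2)         # energy of the original frame
    return num / den
```

Two sanity checks follow from the definition: a perfect round trip gives $D_M(x, x) = 0$, and halving the signal's amplitude gives magnitudes $\frac{1}{2}|X|$, hence $D_M = 0.25$.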

MATLAB Results

The linked MATLAB code performs this forward and inverse phase vocoder operation, and compares the resultant signal with the original one using the consistency measure $D_{M}$. We see these results in Figure 1 and Table 1, where our FFT size is $4096$ with a hop factor of $4$. 

In both phase vocoder reconstructions, we see that a fair amount of energy is added by the phase vocoder. However, we should note that the energy added is exaggerated in our $D_{M}$, since we are performing the discrete phase vocoder algorithm twice, and in the second instance taking in an already phasey signal as input. We see that the Identity Phase Locking algorithm consistently outperforms the classic phase vocoder algorithm in terms of $D_{M}$, as also shown in [2]. 


This investigation has taken us through some fringe areas of DSP (Hilbert spaces, group theory, continuous mathematics) in order to give us confidence in using the idea of a phase vocoder transform to judge the quality of a given algorithm. Not only does it give us a slightly different application of the consistency measure $D_{M}$, but we have also thought about the phase vocoder and some of the ideas behind it in a perhaps new way: continuously. This is always a healthy practice as we move toward a more complete and physical understanding of the powerful ideas employed in digital signal processing.


[1] Robert G. Gallager. Signal Space Concepts.

[2] Jean Laroche and Mark Dolson. Improved Phase Vocoder Time-Scale Modification of Audio. IEEE Transactions on Speech and Audio Processing, 7(3):323–332, 1999.


