# How to compare two audio clips for similarity?

Started by Bjarke, July 17, 2007
```Hi DSP experts!

I have a question that i hope some of you can help me with. ;)

What i want to do is: given two audio clips, calculate a score for how
similar they are. (how similar they sound)

I assume i have to apply the Fourier transformation on the two clips,
and somehow analyze the two frames (for example by comparing peaks) to
see how similar they are.

How should i do this?

I will be eternally grateful for any pointers! (ideas, explanations,
pointers to literature (websites, etc), ...)

Let me just note that i'm very inexperienced with digital signal
processing - i don't know much about DSP or DSP terminology. All i've
had was a CS course "introduction to digital audio", where i made a
FFT algorithm which i used for doing some filters and timestretch.

--------------------------------------------------------------------

Now that i've asked the question, maybe i should briefly explain what
i'm gonna use it for, to give you some idea of what i'm after. This
may be boring - if you think so, just skip the rest of this post :).

I'm trying to make a program that takes a normal audio clip as input
(wav-file) and then approximates the input sound with simple waveforms
(triangle, sawtooth, pulse, noise). Why? The reason i want to do this,
is so that the approximation to the input sound can be played on an
old computer, which can play 3 voices of these simple waveforms, but
is incapable of playing digitized sounds.

I use FFT with windowing, and thus only approximate small parts of
the input sound by the 3 waveforms at a time (not the entire sound -
it would of course be impossible to approximate anything but the
simplest input sound by 3 simple waveforms, if the 3 waveforms didn't
vary over time).

There are some parameters of the 3 waveforms i can vary (freq, volume,
etc). For the frame F of each burst B of the input sound, i run
through all values of these parameters (frequency, volume, etc), to
find which set of parameters best approximates the input sound. For
each of these sets of parameters, i generate the sound-samples for the
3 waveforms, and do the FFT on them to get the frame F'. So now i have
the two frames F and F' (one for the input sound and one for the
generated sound). What i want to do, is to compare these two frames,
and get a score for how similar they are, so that i can find the set
of parameters that best approximates the burst B.

I have made a simple comparator, to compare the two frames F and F',
just to test that the rest of the code works. It simply returns a
score for how well the peaks in F match the peaks in F' (and ignores
everything else but the peaks). (and the way it compares the peaks is
a bit too naive and simple)

This simple method works a bit (it can often follow tones but not
always), but as i said it's just naive sloppy work to see if the rest
worked. Before i begin putting too much work into improving it, it
might be best to get to know what other people have done. Is this the
best approach to compare two frames? If so, could you point me to some
literature (websites, etc) about it? If not, how should i compare the

```
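The search loop described in the post above (try each parameter set, synthesize the candidate, transform it, score it against the input frame) can be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: a single sawtooth voice instead of three, a naive DFT with a Hann window standing in for the poster's own FFT code, and a sum-of-squared-differences score used purely as a placeholder for the comparator being asked about. All names (`FrameSearch`, `bestFrequency`, and so on) are hypothetical.

```java
// Hypothetical sketch: brute-force search for the sawtooth frequency whose
// windowed magnitude spectrum is closest to that of an input frame.
class FrameSearch {

    // Magnitude spectrum (bins 0..n/2) of a Hann-windowed frame, naive O(n^2) DFT.
    static double[] magnitudeSpectrum(double[] frame) {
        int n = frame.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k <= n / 2; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double w = 0.5 - 0.5 * Math.cos(2.0 * Math.PI * t / n); // Hann window
                double angle = 2.0 * Math.PI * k * t / n;
                re += w * frame[t] * Math.cos(angle);
                im -= w * frame[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }

    // One frame of a naive (not band-limited) sawtooth in [-amp, amp].
    static double[] sawtooth(double freqHz, double amp, int n, double sampleRate) {
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            double phase = (freqHz * t / sampleRate) % 1.0;
            out[t] = amp * (2.0 * phase - 1.0);
        }
        return out;
    }

    // Placeholder score: sum of squared magnitude differences (lower is better).
    static double spectralError(double[] f, double[] g) {
        double sum = 0.0;
        for (int k = 0; k < f.length; k++) {
            double d = f[k] - g[k];
            sum += d * d;
        }
        return sum;
    }

    // Try every candidate frequency and return the one with the smallest error.
    static double bestFrequency(double[] inputFrame, double[] candidatesHz, double sampleRate) {
        double[] target = magnitudeSpectrum(inputFrame);
        double best = candidatesHz[0];
        double bestErr = Double.POSITIVE_INFINITY;
        for (double f : candidatesHz) {
            double[] trial = magnitudeSpectrum(sawtooth(f, 1.0, inputFrame.length, sampleRate));
            double err = spectralError(target, trial);
            if (err < bestErr) {
                bestErr = err;
                best = f;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double sampleRate = 8000.0;
        double[] input = sawtooth(440.0, 1.0, 256, sampleRate); // stand-in for a real input frame
        double[] candidates = {110.0, 220.0, 330.0, 440.0, 550.0, 660.0};
        System.out.println("best match: " + bestFrequency(input, candidates, sampleRate) + " Hz");
    }
}
```

With three voices and several parameters per voice the number of combinations grows quickly, so the exhaustive loop shown here is only practical for small parameter grids.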
```Bjarke wrote:
> What i want to do is: given two audio clips, calculate a score for how
> similar they are. (how similar they sound)
>
> ...

I don't know if this can be done at all, but I'm certain that you can't
do it without being able to define *quantitatively* what you mean by
"similar". I recognized my grandfather's cousin's grandson as a family
member the first time I saw him. (I was anticipating embarrassment at
having forgotten who he was, but it turned out upon introduction that we
had neither met nor known of one another's existence.) Could a computer
program do that?

Jerry
--
Engineering is the art of making what you want from things you can get.
```
```If you manage to switch a register of your soundchip fast enough, and if
that switching changes the speaker voltage in some way, you
could do a simple PWM. There were programs that did this with the PC
speaker some time ago, and they sounded not that bad.
Some also managed to drive the stepper motor of a floppy drive with a
PWM signal, resulting in music out of the floppy drive.

Best regards,

Andre

```
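As a rough picture of what the PWM idea above amounts to, the sketch below turns samples into a 1-bit on/off stream whose duty cycle follows the sample value. It is an assumption-heavy illustration, not code for any particular sound chip: the class and method names are made up, samples are assumed to lie in [-1, 1], and `steps` (the number of output ticks per sample) is a free parameter.

```java
// Hypothetical sketch of 1-bit pulse-width modulation.
class PwmSketch {

    // Each input sample becomes `steps` output ticks; the output is high for a
    // share of those ticks proportional to the sample value mapped from [-1, 1] to [0, 1].
    static boolean[] toPwm(double[] samples, int steps) {
        boolean[] bits = new boolean[samples.length * steps];
        for (int i = 0; i < samples.length; i++) {
            double clamped = Math.max(-1.0, Math.min(1.0, samples[i]));
            int high = (int) Math.round((clamped + 1.0) / 2.0 * steps); // 0..steps ticks high
            for (int s = 0; s < steps; s++) {
                bits[i * steps + s] = s < high;
            }
        }
        return bits;
    }

    // Number of high ticks in bits[from..to), handy for checking duty cycles.
    static int countHigh(boolean[] bits, int from, int to) {
        int count = 0;
        for (int i = from; i < to; i++) {
            if (bits[i]) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[] bits = toPwm(new double[] {-1.0, 0.0, 1.0}, 8);
        StringBuilder sb = new StringBuilder();
        for (boolean b : bits) sb.append(b ? '1' : '0');
        System.out.println(sb);
    }
}
```

The output has to be switched `steps` times faster than the sample rate, and the amplitude resolution is only `steps + 1` levels, which is one way to see why quality is limited when the register cannot be switched very fast.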
```Hi Bjarke,

In DSP terminology, similarity is detected by using Cross-Correlation.
Use that. Do you have that routine, or will you have to develop it?
Many DSP/communications software packages have such a routine. LabVIEW
has it, MATLAB has it, and so on.

Pretty neat application.

Let me know if this helps.
Sastry

On Jul 17, 8:54 pm, Bjarke <bjarke....@gmail.com> wrote:
> What i want to do is: given two audio clips, calculate a score for how
> similar they are. (how similar they sound)
>
> ...
```
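For anyone without MATLAB or LabVIEW, a direct time-domain normalized cross-correlation is short enough to write by hand. The sketch below is an illustration only, not a copy of any library routine: the names (`CrossCorrelation`, `normalizedAt`, `peak`) are made up, the lag range is chosen by the caller, and the plain nested loop is kept for clarity rather than speed.

```java
// Hypothetical sketch of time-domain normalized cross-correlation.
class CrossCorrelation {

    // Correlation of x and y at one lag: sum of x[t] * y[t + lag], divided by
    // sqrt(energy(x) * energy(y)). Samples that fall outside y count as zero,
    // so the result always lies in [-1, 1].
    static double normalizedAt(double[] x, double[] y, int lag) {
        double dot = 0.0, ex = 0.0, ey = 0.0;
        for (double v : x) ex += v * v;
        for (double v : y) ey += v * v;
        if (ex == 0.0 || ey == 0.0) return 0.0; // silence correlates with nothing
        for (int t = 0; t < x.length; t++) {
            int u = t + lag;
            if (u >= 0 && u < y.length) dot += x[t] * y[u];
        }
        return dot / Math.sqrt(ex * ey);
    }

    // Largest normalized correlation over lags -maxLag..maxLag.
    static double peak(double[] x, double[] y, int maxLag) {
        double best = -1.0;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            best = Math.max(best, normalizedAt(x, y, lag));
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {0, 0, 1, 2, 1, 0, 0, 0};
        double[] y = {0, 0, 0, 0, 1, 2, 1, 0}; // x delayed by two samples
        System.out.println("lag 0: " + normalizedAt(x, y, 0));
        System.out.println("lag 2: " + normalizedAt(x, y, 2));
        System.out.println("peak:  " + peak(x, y, 3));
    }
}
```

Because this works on raw samples, it rewards matching waveforms rather than matching spectra, so two frames need to line up in time and phase to score well.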
```Sastry wrote:
> Hi Bjarke,
>
> In DSP terminology, similarity is detected by using Cross-Correlation.
> Use that. Do you have that routine, or will you have to develop it?
> Many DSP/communications software packages have such a routine. LabVIEW
> has it, MATLAB has it, and so on.
>
> Pretty neat application.
>
> Let me know if this helps.
> Sastry

I suspect it's not so simple. Phase differences may interfere. Maybe
comparing FFTs converted to magnitude without accounting for phase has
merit; I don't really know.

...

Jerry
--
Engineering is the art of making what you want from things you can get.
```
```Hi

To compare two audio clips, you can try taking an acoustic fingerprint of
each clip and then matching the two. You can find more details about
acoustic fingerprinting at
http://en.wikipedia.org/wiki/Acoustic_fingerprint

This may be more than you need.
Hope this helps
Alok

```
```Jerry Avins wrote:
> I suspect it's not so simple. Phase differences may interfere. Maybe
> comparing FFTs converted to magnitude without accounting for phase has
> merit; I don't really know.

That has been done before, and works OK.  We ran the magnitude
(squared?) of an FFT through a low-pass filter and then cross-correlated
it with an exemplar.

--
Jim Thomas            Principal Applications Engineer  Bittware, Inc
jthomas@bittware.com  http://www.bittware.com    (603) 226-0404 x536
The secret to enjoying your job is to have a hobby that's even worse
```
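One possible reading of the approach in the post above is sketched below. The post does not say how the low-pass filter was applied or how the correlation was normalized, so several things here are assumptions: the smoothing is a moving average across frequency bins (it could equally have been across successive frames in time), the correlation is taken at zero shift only, and a naive DFT stands in for a real FFT. All names are hypothetical.

```java
// Hypothetical sketch: smooth a power spectrum, then correlate it with an exemplar.
class SmoothedSpectrumMatch {

    // Power spectrum |X[k]|^2 for bins 0..n/2, using a naive O(n^2) DFT.
    static double[] powerSpectrum(double[] frame) {
        int n = frame.length;
        double[] power = new double[n / 2 + 1];
        for (int k = 0; k <= n / 2; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            power[k] = re * re + im * im;
        }
        return power;
    }

    // Crude low-pass filter: moving average over 2 * halfWidth + 1 neighboring bins
    // (fewer at the edges). ASSUMPTION: smoothing runs across frequency, not time.
    static double[] movingAverage(double[] x, int halfWidth) {
        double[] out = new double[x.length];
        for (int k = 0; k < x.length; k++) {
            int lo = Math.max(0, k - halfWidth);
            int hi = Math.min(x.length - 1, k + halfWidth);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) sum += x[j];
            out[k] = sum / (hi - lo + 1);
        }
        return out;
    }

    // Zero-shift normalized correlation of two equal-length non-negative spectra, in [0, 1].
    static double normalizedCorrelation(double[] a, double[] b) {
        double dot = 0.0, ea = 0.0, eb = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            ea += a[k] * a[k];
            eb += b[k] * b[k];
        }
        return (ea == 0.0 || eb == 0.0) ? 0.0 : dot / Math.sqrt(ea * eb);
    }

    // Similarity of a frame to an exemplar frame of the same length.
    static double score(double[] frame, double[] exemplar, int halfWidth) {
        double[] a = movingAverage(powerSpectrum(frame), halfWidth);
        double[] b = movingAverage(powerSpectrum(exemplar), halfWidth);
        return normalizedCorrelation(a, b);
    }

    // Test tone that falls exactly on DFT bin `bin`.
    static double[] tone(int n, int bin) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) x[t] = Math.sin(2.0 * Math.PI * bin * t / n);
        return x;
    }

    public static void main(String[] args) {
        double[] a = tone(128, 10);
        double[] b = tone(128, 11);
        System.out.println("unsmoothed: " + score(a, b, 0));
        System.out.println("smoothed:   " + score(a, b, 2));
    }
}
```

The point of the smoothing step is visible in the two printed numbers: tones one bin apart barely correlate when the raw spectra are compared, but score much higher once each spectral line has been spread over its neighbors.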
```Hi, thanx for the answers!

Sorry for the double post and my late reply. I used the google groups
interface and it seems like it took a day or two before my post showed
up at my place (which is the reason for my double post - i tried

> *** Jerry Avins wrote: ***
> I don't know if this can be done at all, but I'm certain that you can't
> do it without being able to define *quantitatively* what you mean by
> "similar".

Yes, i know. My initial (and probably totally oversimplified) idea
for a definition is that if the input-burst and the generated-burst
have peaks at the same frequencies and with the same amplitudes, the
sounds sound (almost) the same. Very loosely put, the closer the peaks
in the generated-sound are to the peaks in the input-sound, the more
similar they are, in my (initial simple) definition (and if peaks
are missing in the generated sound, it degrades the score).

I'm still trying to improve exactly how scores should be given and i
don't have too much experience in this field, so perhaps i lack some
theory. I'm sure other people have made much better definitions for
when two sounds sound the same, much superior to mine. The score test-
algorithm i'm making is probably way too simple. At this point, i've
ignored everything but peak frequencies/amplitudes, and i'm still
having problems turning this into good scores. I was hoping some of
you knew of a good way of giving scores, for how similar two sounds
are perceived, or could tell me if i was going in the right direction
or not. :)

About the phases: I've completely ignored phases. I have no control of
the phases of the old computer, and i'm hoping the phases don't
matter too much to how sound is perceived - only amplitude of the
frequencies. But maybe i'm wrong? (perhaps you can't ignore
interference?)

> *** Andre wrote: ***
> If you manage to switch a register of your soundchip fast enough, and if
> you manage to have this switching the speaker voltage in a way, you
> could do a simple PWM.

Thanx for the suggestion. Yes, it is possible to do PWM on the
computer (and perhaps on its diskdrive too :)), but only at very low
quality. This is just an experiment to see if it's possible to get
more quality this way. (perhaps at the sacrifice of some similarity to
the original sound - hopefully not too much)

> *** Sastry wrote: ***
> In DSP terminology, similarity is detected by using Cross-Correlation.
> Use that.

but for some reason, i got the impression that cross-correlation was
mostly about finding the time-difference between two signals which are
identical apart from being shifted in time. I came to believe that its
score didn't put much weight on identical frequencies for dominant
peaks, when giving a score (which i think is very important for
audio). But it should work?

Although i doubted it, i tried implementing the formulas from
https://ftirsearch.com/help/algo.htm to see if they would work, a few
days ago, but they didn't work satisfactorily at all (should they?). Of
course i now see that it's called "correlation search" and not "cross
correlation search". I will try to look deeper into cross-
correlation. ;)

> Do you have that routine, or will you have to develop it?
> Many DSP/communications software packages have such a routine. LabVIEW
> has it, MATLAB has it, and so on.

I don't have matlab, so i would have to write it myself or perhaps
find some free already written library (i'm writing this in java).

> *** aloknrao wrote: ***
> To compare two audio clips, you can try taking an acoustic fingerprint of
> the two clips and then a match. You can find more details about acoustic
> fingerprint at http://en.wikipedia.org/wiki/Acoustic_fingerprint
>
> This may be more than you need.

Thanx for the pointer! Sounds very interesting! I will try looking
into that.

Hehe, i must sound like a complete beginner - which i am. ;) I don't
even know if it's possible to approximate a sound by 3 simple
waveforms (and a filter + ringmod) in such a way that they sound
(almost) the same, or at least very similar, so i don't know if this
has a chance of working at all or if it's just a waste of time. (is
it?)

Again, if anyone knows of any scoring algorithm (or theory) for
comparing audio or would like to elaborate on some of the above, i am
still very eager to hear what you have to say :)

```
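The peak-based definition given in the post above (peaks at the same frequencies with the same amplitudes score well, displaced peaks score less, missing peaks score nothing) could be written down along these lines. This is a guess at one reasonable formulation, not the poster's actual comparator: the threshold, the linear distance penalty, the amplitude ratio, and all names are assumptions.

```java
// Hypothetical sketch of a peak-matching score between two magnitude spectra.
class PeakScore {

    // Indices of local maxima whose magnitude is at least `threshold`.
    static int[] findPeaks(double[] mag, double threshold) {
        int[] tmp = new int[mag.length];
        int count = 0;
        for (int k = 1; k < mag.length - 1; k++) {
            if (mag[k] >= threshold && mag[k] > mag[k - 1] && mag[k] >= mag[k + 1]) {
                tmp[count++] = k;
            }
        }
        int[] peaks = new int[count];
        System.arraycopy(tmp, 0, peaks, 0, count);
        return peaks;
    }

    // Score in [0, 1]: the average, over the peaks of the input spectrum, of how well
    // the best nearby peak of the generated spectrum matches in position and amplitude.
    // An input peak with no generated peak within maxBinDistance contributes 0.
    // `threshold` is assumed to be positive, so every peak has a nonzero magnitude.
    static double score(double[] input, double[] generated, double threshold, int maxBinDistance) {
        int[] inPeaks = findPeaks(input, threshold);
        int[] genPeaks = findPeaks(generated, threshold);
        if (inPeaks.length == 0) return genPeaks.length == 0 ? 1.0 : 0.0;
        double total = 0.0;
        for (int p : inPeaks) {
            double best = 0.0;
            for (int q : genPeaks) {
                int dist = Math.abs(p - q);
                if (dist > maxBinDistance) continue;
                double closeness = 1.0 - (double) dist / (maxBinDistance + 1);
                double ampRatio = Math.min(input[p], generated[q]) / Math.max(input[p], generated[q]);
                best = Math.max(best, closeness * ampRatio);
            }
            total += best;
        }
        return total / inPeaks.length;
    }

    public static void main(String[] args) {
        double[] input     = {0, 0, 5, 0, 0, 0, 3, 0, 0};
        double[] generated = {0, 0, 0, 5, 0, 0, 0, 0, 0}; // one peak shifted, one missing
        System.out.println("identical: " + score(input, input, 1.0, 2));
        System.out.println("degraded:  " + score(input, generated, 1.0, 2));
    }
}
```

As written, extra peaks in the generated spectrum are not penalized. Averaging `score(a, b, ...)` and `score(b, a, ...)` would be one way to make the measure symmetric.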
```Bjarke wrote:

...

> About the phases: I've completely ignored phases. I have no control of
> the phases of the old computer, and i'm hoping the phases don't
> matter too much to how sound is perceived - only amplitude of the
> frequencies. But maybe i'm wrong? (perhaps you can't ignore
> interference?)

...

Phase can be a problem for you because, while it has practically no
effect on the perceived sound, changes in it can create major differences
in wave shape. An analysis tool like FFT magnitude that ignores phase
sidesteps the problem. Looking at the waveshape directly burdens you
with it.

Jerry
--
Engineering is the art of making what you want from things you can get.
```
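The effect described in the post above can be checked numerically with a small experiment. The sketch below builds two test signals with the same two harmonics, shifting only the phase of the second harmonic by 90 degrees in one of them, and then compares them both ways. The signals, the naive DFT, and all names are assumptions made for the demonstration.

```java
// Hypothetical demo: same harmonic amplitudes, different phase.
class PhaseDemo {

    // Fundamental on DFT bin `bin` plus a half-amplitude second harmonic
    // whose phase is offset by `phase2` radians.
    static double[] twoHarmonics(int n, int bin, double phase2) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            double w = 2.0 * Math.PI * bin * t / n;
            x[t] = Math.sin(w) + 0.5 * Math.sin(2.0 * w + phase2);
        }
        return x;
    }

    // Magnitude spectrum for bins 0..n/2, naive O(n^2) DFT.
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k <= n / 2; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }

    // Normalized correlation of the raw waveforms at zero lag.
    static double zeroLagCorrelation(double[] a, double[] b) {
        double dot = 0.0, ea = 0.0, eb = 0.0;
        for (int t = 0; t < a.length; t++) {
            dot += a[t] * b[t];
            ea += a[t] * a[t];
            eb += b[t] * b[t];
        }
        return dot / Math.sqrt(ea * eb);
    }

    // Largest bin-by-bin difference between two spectra.
    static double maxAbsDifference(double[] a, double[] b) {
        double max = 0.0;
        for (int k = 0; k < a.length; k++) max = Math.max(max, Math.abs(a[k] - b[k]));
        return max;
    }

    public static void main(String[] args) {
        double[] a = twoHarmonics(64, 4, 0.0);
        double[] b = twoHarmonics(64, 4, Math.PI / 2.0);
        System.out.println("waveform correlation: " + zeroLagCorrelation(a, b));
        System.out.println("max magnitude difference: "
                + maxAbsDifference(magnitudeSpectrum(a), magnitudeSpectrum(b)));
    }
}
```

For this particular pair the zero-lag waveform correlation works out to 0.8, while the magnitude spectra agree to within rounding error, so a waveform-based score would call the frames different even though their harmonic content is the same.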