How to compare two audio clips for similarity?

Started by Bjarke July 17, 2007
Hi DSP experts!

I have a question that I hope some of you can help me with. ;)

What I want to do is: given two audio clips, calculate a score for how
similar they are (how similar they sound).

I assume I have to apply the Fourier transform to the two clips,
and somehow analyze the two frames (for example by comparing peaks) to
see how similar they are.

How should I do this?

I will be eternally grateful for any pointers! (ideas, explanations,
pointers to literature, websites, etc.)

Let me just note that I'm very inexperienced with digital signal
processing - I don't know much about DSP or DSP terminology. All I've
had was a CS course, "Introduction to Digital Audio", where I wrote an
FFT algorithm that I used for some filters and time-stretching.

--------------------------------------------------------------------

Now that I've asked the question, maybe I should briefly explain what
I'm going to use it for, to give you some idea of what I'm after. This
may be boring - if you think so, just skip the rest of this post :).

I'm trying to make a program that takes a normal audio clip as input
(a WAV file) and then approximates the input sound with simple waveforms
(triangle, sawtooth, pulse, noise). Why? The reason I want to do this
is so that the approximation of the input sound can be played on an
old computer, which can play 3 voices of these simple waveforms but
is incapable of playing digitized sounds.

I use an FFT with windowing, and thus only approximate small parts of
the input sound with the 3 waveforms at a time (not the entire sound -
it would of course be impossible to approximate anything but the
simplest input sound with 3 simple waveforms if the 3 waveforms didn't
vary over time).

There are some parameters of the 3 waveforms that I can vary (frequency,
volume, etc.). For the frame F of each burst B of the input sound, I run
through all values of these parameters to find which set of parameters
best approximates the input sound. For each of these sets of
parameters, I generate the sound samples for the 3 waveforms and do
the FFT on them to get the frame F'. So now I have the two frames F and
F' (one for the input sound and one for the generated sound). What I
want to do is compare these two frames and get a score for how similar
they are, so that I can find the set of parameters that best
approximates the burst B.
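
The exhaustive search described above can be sketched as follows for a single voice. This is only an illustration under stated assumptions, not the poster's actual program: the class `VoiceSearch`, the `Scorer` interface, the waveform formulas and the placeholder score `negativeSpectralError` are hypothetical names made up here, no window is applied, and a plain O(n^2) DFT stands in for the FFT.

```java
/** Sketch: exhaustive search over one voice's parameters for a single frame. */
final class VoiceSearch {

    /** Hypothetical comparator interface: a higher score means "more similar". */
    interface Scorer {
        double score(double[] target, double[] candidate);
    }

    /** Best parameter set found so far, together with its score. */
    static final class Candidate {
        final String shape;
        final double freqHz, volume, score;
        Candidate(String shape, double freqHz, double volume, double score) {
            this.shape = shape; this.freqHz = freqHz; this.volume = volume; this.score = score;
        }
    }

    /** One sample of a simple waveform; phase is the position within the period, in [0, 1). */
    static double waveform(String shape, double phase) {
        switch (shape) {
            case "sawtooth": return 2 * phase - 1;
            case "pulse":    return phase < 0.5 ? 1 : -1;
            case "triangle": return 1 - 4 * Math.abs(phase - 0.5);
            default: throw new IllegalArgumentException("unknown shape: " + shape);
        }
    }

    /** Generates n samples of one voice. */
    static double[] generate(String shape, double freqHz, double volume, double sampleRate, int n) {
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            double phase = (freqHz * t / sampleRate) % 1.0;
            out[t] = volume * waveform(shape, phase);
        }
        return out;
    }

    /** Magnitude spectrum via a plain DFT (a real program would call its own FFT here). */
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k < mag.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** Placeholder score: negative squared error between the two magnitude spectra. */
    static double negativeSpectralError(double[] target, double[] candidate) {
        double[] a = magnitudes(target), b = magnitudes(candidate);
        double err = 0;
        for (int k = 0; k < a.length; k++) err += (a[k] - b[k]) * (a[k] - b[k]);
        return -err;
    }

    /** Tries every combination of shape, frequency and volume and keeps the best-scoring one. */
    static Candidate search(double[] target, String[] shapes, double[] freqs, double[] volumes,
                            double sampleRate, Scorer scorer) {
        Candidate best = null;
        for (String shape : shapes)
            for (double f : freqs)
                for (double v : volumes) {
                    double s = scorer.score(target, generate(shape, f, v, sampleRate, target.length));
                    if (best == null || s > best.score) best = new Candidate(shape, f, v, s);
                }
        return best;
    }
}
```

The comparator is passed in as a `Scorer` so that different ways of scoring F against F' can be swapped without touching the search loop. With three voices the loops nest three times over, so the number of candidates grows very quickly.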

I have made a simple comparator to compare the two frames F and F',
just to test that the rest of the code works. It simply returns a
score for how well the peaks in F match the peaks in F' (and ignores
everything else but the peaks). The way it compares the peaks is
still a bit too naive and simple.

This simple method works to some extent (it can often follow tones, but
not always), but as I said, it's just naive, sloppy work to see if the
rest worked. Before I put too much work into improving it, it
might be best to find out what other people have done. Is this the
best approach for comparing two frames? If so, could you point me to some
literature (websites, etc.) about it? If not, how should I compare the
two frames instead?

I don't know if this can be done at all, but I'm certain that you can't
do it without being able to define *quantitatively* what you mean by
"similar". I recognized my grandfather's cousin's grandson as a family
member the first time I saw him. (I was anticipating embarrassment at
having forgotten who he was, but it turned out upon introduction that we
had neither met nor known of one another's existence.) Could a computer
program do that?

Jerry
--
Engineering is the art of making what you want from things you can get.

If you can switch a register of your sound chip fast enough, and if
that switching changes the speaker voltage in some way, you could do
simple PWM. Some time ago there were programs that did this with the PC
speaker, and it didn't sound that bad.
Some people also managed to drive the stepper motor of a floppy drive
with a PWM signal, resulting in music coming out of the floppy drive.

Best regards,

Andre


Hi Bjarke,

In DSP terminology, similarity is detected by using cross-correlation.
Use that. Do you have such a routine, or will you have to develop it?
Many DSP/communications software packages have one. LabVIEW has it,
Matlab has it, and so on.

Pretty neat application.

Let me know if this helps.
Sastry
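
For readers without LabVIEW or Matlab, the cross-correlation suggested above can be sketched in plain Java. This is a minimal sketch under assumptions, not a library routine: the class and method names are made up for illustration, the correlation is normalized by the energy of the overlapping samples (so it is a cosine similarity without mean removal), and lags are searched by brute force rather than with an FFT.

```java
/** Sketch: normalized cross-correlation between two sample buffers. */
final class CrossCorrelation {

    /**
     * Normalized correlation of a[i] with b[i + lag], using only the overlapping samples.
     * The result lies in [-1, 1]; 0 is returned if either overlapping part is silent.
     */
    static double correlationAtLag(double[] a, double[] b, int lag) {
        double dot = 0, energyA = 0, energyB = 0;
        for (int i = 0; i < a.length; i++) {
            int j = i + lag;
            if (j < 0 || j >= b.length) continue; // outside the overlap
            dot += a[i] * b[j];
            energyA += a[i] * a[i];
            energyB += b[j] * b[j];
        }
        if (energyA == 0 || energyB == 0) return 0;
        return dot / Math.sqrt(energyA * energyB);
    }

    /** Largest normalized correlation over all lags in -maxLag..maxLag. */
    static double bestCorrelation(double[] a, double[] b, int maxLag) {
        double best = -1;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            best = Math.max(best, correlationAtLag(a, b, lag));
        }
        return best;
    }

    /** Test signal: n samples of a sine wave with the given start phase. */
    static double[] sine(double freqHz, double sampleRate, int n, double phaseRad) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            x[t] = Math.sin(2 * Math.PI * freqHz * t / sampleRate + phaseRad);
        }
        return x;
    }

    public static void main(String[] args) {
        double[] a = sine(440, 8000, 512, 0);
        double[] shifted = sine(440, 8000, 512, Math.PI / 2);
        double[] other = sine(1000, 8000, 512, 0);
        System.out.println("same tone, lag 0:         " + correlationAtLag(a, a, 0));
        System.out.println("shifted tone, lag 0:      " + correlationAtLag(a, shifted, 0));
        System.out.println("shifted tone, best lag:   " + bestCorrelation(a, shifted, 20));
        System.out.println("different tone, best lag: " + bestCorrelation(a, other, 20));
    }
}
```

Note that at lag 0 a tone and a phase-shifted copy of itself correlate poorly; only the search over lags recovers the match. For a sum of several partials with unrelated phase shifts, no single lag lines everything up.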

Sastry wrote:
> In DSP terminology, similarity is detected by using Cross-Correlation.
> Use that.

I suspect it's not so simple. Phase differences may interfere. Maybe
comparing FFTs converted to magnitude without accounting for phase has
merit; I don't really know.

   ...

Jerry

Hi,

To compare two audio clips, you can try taking an acoustic fingerprint of
the two clips and then matching them. You can find more details about
acoustic fingerprints at
http://en.wikipedia.org/wiki/Acoustic_fingerprint

This may be more than you need.

Hope this helps,
Alok

Jerry Avins wrote:
> I suspect it's not so simple. Phase differences may interfere. maybe
> comparing FFTs converted to magnitude without accounting for phase has
> merit; I don't really know.

That has been done before, and works OK. We ran the magnitude
(squared?) of an FFT through a low-pass filter and then cross-correlated
it with an exemplar.

--
Jim Thomas
Principal Applications Engineer
Bittware, Inc
jthomas@bittware.com
http://www.bittware.com
(603) 226-0404 x536
The secret to enjoying your job is to have a hobby that's even worse
- Calvin's Dad

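
One way to read the approach described above is sketched below. Several details are assumptions, since the post does not give them: the low-pass filter is taken to be a moving average across frequency bins, the correlation is evaluated only at zero lag and normalized, a Hann window is applied, and a plain O(n^2) DFT stands in for the FFT. All names are hypothetical.

```java
/** Sketch: compare a frame with an exemplar via smoothed power spectra. */
final class SpectrumCompare {

    /** Power spectrum |X[k]|^2 for k = 0..n/2 of the Hann-windowed input, via a plain DFT. */
    static double[] powerSpectrum(double[] x) {
        int n = x.length;
        double[] p = new double[n / 2 + 1];
        for (int k = 0; k < p.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double w = 0.5 - 0.5 * Math.cos(2 * Math.PI * t / n); // Hann window
                double angle = 2 * Math.PI * k * t / n;
                re += w * x[t] * Math.cos(angle);
                im -= w * x[t] * Math.sin(angle);
            }
            p[k] = re * re + im * im; // phase is discarded here
        }
        return p;
    }

    /** Simple low-pass across bins: moving average over up to 2 * radius + 1 neighbours. */
    static double[] smooth(double[] p, int radius) {
        double[] out = new double[p.length];
        for (int k = 0; k < p.length; k++) {
            int lo = Math.max(0, k - radius);
            int hi = Math.min(p.length - 1, k + radius);
            double sum = 0;
            for (int j = lo; j <= hi; j++) sum += p[j];
            out[k] = sum / (hi - lo + 1);
        }
        return out;
    }

    /** Normalized correlation at zero lag; in [0, 1] for non-negative spectra. */
    static double similarity(double[] a, double[] b) {
        double dot = 0, ea = 0, eb = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            ea += a[k] * a[k];
            eb += b[k] * b[k];
        }
        return (ea == 0 || eb == 0) ? 0 : dot / Math.sqrt(ea * eb);
    }

    /** Score of a frame against an exemplar of the same length. */
    static double score(double[] frame, double[] exemplar, int radius) {
        return similarity(smooth(powerSpectrum(frame), radius),
                          smooth(powerSpectrum(exemplar), radius));
    }

    /** Test signal: n samples of a sine wave with the given start phase. */
    static double[] sine(double freqHz, double sampleRate, int n, double phaseRad) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            x[t] = Math.sin(2 * Math.PI * freqHz * t / sampleRate + phaseRad);
        }
        return x;
    }

    public static void main(String[] args) {
        double[] a = sine(440, 8000, 256, 0.0);
        double[] b = sine(440, 8000, 256, 1.0);   // same tone, different phase
        double[] c = sine(1500, 8000, 256, 0.0);  // different tone
        System.out.println("same tone, other phase: " + score(a, b, 2));
        System.out.println("different tone:         " + score(a, c, 2));
    }
}
```

Because the phase is thrown away before the comparison, a phase-shifted copy of a tone scores close to 1, which addresses the concern about phase differences raised earlier in the thread.
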
Hi, thanks for the answers!

Sorry for the double post and my late reply. I used the Google Groups
interface, and it seems it took a day or two before my post showed
up on my end (which is the reason for the double post - I tried
posting again...). But I'm glad to see all your answers here!

> *** Jerry Avins wrote: ***
> I don't know if this can be done at all, but I'm certain that you can't
> do it without being able to define *quantitatively* what you mean by
> "similar".

Yes, I know. My initial (and probably totally oversimplified) idea for
a definition is that if the input burst and the generated burst have
peaks at the same frequencies and with the same amplitudes, the sounds
sound (almost) the same. Very loosely put, the closer the peaks in the
generated sound are to the peaks in the input sound, the more similar
they are in my (initial, simple) definition (and if peaks are missing
in the generated sound, it degrades the score). I'm still trying to
improve exactly how scores should be given, and I don't have much
experience in this field, so perhaps I lack some theory.

I'm sure other people have made much better definitions of when two
sounds sound the same, much superior to mine. The scoring test
algorithm I'm making is probably way too simple. At this point, I've
ignored everything but peak frequencies and amplitudes, and I'm still
having problems turning this into good scores. I was hoping some of you
knew of a good way of scoring how similar two sounds are perceived to
be, or could tell me whether I was going in the right direction or
not. :)

About the phases: I've completely ignored phases. I have no control of
the phases on the old computer, and I'm hoping the phases don't
matter too much to how sound is perceived - only the amplitudes of the
frequencies. But maybe I'm wrong? (Perhaps you can't ignore
interference?)

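
The loose definition above (same peak frequencies, same amplitudes, missing peaks lower the score) can be made quantitative in many ways. The following is one hypothetical way, not an established metric and not the poster's comparator: each input peak is matched to the nearest generated peak, frequency distance is measured in semitones against an assumed tolerance, amplitude agreement is the ratio of the smaller to the larger amplitude, and louder input peaks weigh more. All names and the scoring formula itself are assumptions.

```java
/** Sketch: one possible peak-matching score in [0, 1]. */
final class PeakScore {

    /** Distance between two frequencies in semitones (12 per octave). */
    static double semitones(double f1, double f2) {
        return 12 * Math.abs(Math.log(f1 / f2) / Math.log(2));
    }

    /**
     * Amplitude-weighted fraction of input peaks that are matched by a generated peak.
     * A peak with no generated peak within tolSemitones counts as missing (contributes 0).
     */
    static double score(double[] inFreq, double[] inAmp,
                        double[] genFreq, double[] genAmp, double tolSemitones) {
        if (tolSemitones <= 0) throw new IllegalArgumentException("tolerance must be positive");
        double total = 0, matched = 0;
        for (int i = 0; i < inFreq.length; i++) {
            total += inAmp[i];
            int nearest = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < genFreq.length; j++) {
                double d = semitones(inFreq[i], genFreq[j]);
                if (d < bestDist) { bestDist = d; nearest = j; }
            }
            if (nearest < 0 || bestDist > tolSemitones) continue; // missing peak
            double freqCloseness = 1 - bestDist / tolSemitones;   // 1 = same frequency
            double hi = Math.max(inAmp[i], genAmp[nearest]);
            double ampCloseness = hi == 0 ? 1 : Math.min(inAmp[i], genAmp[nearest]) / hi;
            matched += inAmp[i] * freqCloseness * ampCloseness;
        }
        return total == 0 ? 0 : matched / total;
    }

    /** Averages both directions so that extra peaks in the generated sound are penalized too. */
    static double symmetricScore(double[] aFreq, double[] aAmp,
                                 double[] bFreq, double[] bAmp, double tolSemitones) {
        return 0.5 * (score(aFreq, aAmp, bFreq, bAmp, tolSemitones)
                    + score(bFreq, bAmp, aFreq, aAmp, tolSemitones));
    }

    public static void main(String[] args) {
        double[] inFreq = {440, 880}, inAmp = {1.0, 0.5};
        double[] genFreq = {440}, genAmp = {1.0};
        System.out.println("one peak missing: " + score(inFreq, inAmp, genFreq, genAmp, 2.0));
    }
}
```

Measuring frequency distance in semitones rather than hertz is a design choice made here because pitch perception is roughly logarithmic; whether this score tracks perceived similarity would have to be checked by listening.
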
> *** Andre wrote: ***
> If you manage to switch a register of your soundchip fast enough, and if
> you manage to have this switching the speaker voltage in a way, you
> could do a simple PWM.

Thanks for the suggestion. Yes, it is possible to do PWM on the computer
(and perhaps on its disk drive too :)), but only at very low quality.
This is just an experiment to see if it's possible to get more quality
this way (perhaps at the sacrifice of some similarity to the original
sound - hopefully not too much).

> *** Sastry wrote: ***
> In DSP terminology, similarity is detected by using Cross-Correlation.
> Use that.

Thanks for the pointer to cross-correlation. I read about this before,
but for some reason I got the impression that cross-correlation was
mostly about finding the time difference between two signals that are
identical apart from being shifted in time. I came to believe that its
score didn't put much weight on dominant peaks having identical
frequencies (which I think is very important for audio). But it should
work?

Although I doubted it, a few days ago I tried implementing the formulas
from https://ftirsearch.com/help/algo.htm to see if they would work,
but they didn't work satisfactorily at all (should they?). Of course, I
now see that it's called "correlation search" and not "cross
correlation search". I will try to look deeper into
cross-correlation. ;)

> Do you have that routine or you will have to develop it?
> Many DSP/Communications softwares have such routine. LabVIEW has it.
> Matlab has it. and so on.

I don't have Matlab, so I would have to write it myself or perhaps find
a free, already-written library (I'm writing this in Java).

> *** aloknrao wrote: ***
> To compare two audio clips, you can try taking an acoustic fingerprint of
> the two clips and then a match. You can find more details about acoustic
> fingerprint at http://en.wikipedia.org/wiki/Acoustic_fingerprint
>
> This may be more than you need.

Thanks for the pointer! Sounds very interesting! I will try looking into
that.

Hehe, I must sound like a complete beginner - which I am. ;) I don't
even know if it's possible to approximate a sound with 3 simple
waveforms (and a filter + ring modulation) in such a way that they
sound (almost) the same, or at least very similar, so I don't know if
this has a chance of working at all or if it's just a waste of time.
(Is it?)

Again, if anyone knows more about any scoring algorithm (or theory) for
comparing audio, or would like to elaborate on some of the above, I am
still very eager to hear what you have to say :)

Thanks for all your answers!

Bjarke wrote:

   ...

> About the phases: I've completely ignored phases. I have no control of
> the phases of the old computer, and i'm hoping the phases doesn't
> matter too much to how sound is perceived - only amplitude of the
> frequencies. But maybe i'm wrong? (perhaps you can't ignore
> interference?)

   ...

Phase can be a problem for you because, while it has practically no
effect on the perceived sound, changes in phase can create major
differences in wave shape. An analysis tool like FFT magnitude that
ignores phase sidesteps the problem. Looking at the wave shape directly
burdens you with it.

Jerry
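
The effect on wave shape can be checked numerically with a small sketch. The test signal is an assumption chosen for illustration: a fundamental plus a half-amplitude third harmonic, once with the harmonic in phase and once shifted by 180 degrees, with a frame length that holds a whole number of periods so that no window is needed. All names are hypothetical.

```java
/** Sketch: same magnitude spectrum, different wave shape. */
final class PhaseDemo {

    /** Fundamental at `bin` cycles per frame plus a half-amplitude third harmonic. */
    static double[] tone(int n, int bin, double harmonicPhase) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            double w = 2 * Math.PI * bin * t / n;
            x[t] = Math.sin(w) + 0.5 * Math.sin(3 * w + harmonicPhase);
        }
        return x;
    }

    /** Magnitude of DFT bin k, computed directly. */
    static double binMagnitude(double[] x, int k) {
        double re = 0, im = 0;
        for (int t = 0; t < x.length; t++) {
            double angle = 2 * Math.PI * k * t / x.length;
            re += x[t] * Math.cos(angle);
            im -= x[t] * Math.sin(angle);
        }
        return Math.hypot(re, im);
    }

    /** Largest absolute sample value. */
    static double peak(double[] x) {
        double p = 0;
        for (double v : x) p = Math.max(p, Math.abs(v));
        return p;
    }

    /** Normalized correlation of the raw waveforms at zero lag. */
    static double waveformCorrelation(double[] a, double[] b) {
        double dot = 0, ea = 0, eb = 0;
        for (int t = 0; t < a.length; t++) {
            dot += a[t] * b[t];
            ea += a[t] * a[t];
            eb += b[t] * b[t];
        }
        return dot / Math.sqrt(ea * eb);
    }

    public static void main(String[] args) {
        double[] a = tone(256, 8, 0.0);
        double[] b = tone(256, 8, Math.PI);
        System.out.println("fundamental magnitudes: " + binMagnitude(a, 8) + " vs " + binMagnitude(b, 8));
        System.out.println("harmonic magnitudes:    " + binMagnitude(a, 24) + " vs " + binMagnitude(b, 24));
        System.out.println("waveform peaks:         " + peak(a) + " vs " + peak(b));
        System.out.println("waveform correlation:   " + waveformCorrelation(a, b));
    }
}
```

Both signals have the same magnitudes at the fundamental and the harmonic, yet their peak levels differ and their waveforms correlate only partially. A comparison on the raw samples would therefore rate them as different even though a magnitude-only comparison rates them as identical.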