DSPRelated.com
Forums

Sound Identification/Matching - good starting point?

Started by roschler November 22, 2008
Can anyone recommend a good starting point for creating code that does
Sound Identification/Matching?  I was going to start creating a
library of FFT snapshots consisting of varying time window lengths for
the sounds I want to identify in audio streams.  Then I was going to
start matching up sounds that way by comparing snapshots taken from
incoming audio and seeing how close they are to the library, but then I
thought I'd better ask to see if there are any known techniques, papers,
etc. that would help me avoid reinventing the wheel.  Anything with
source code examples would be appreciated.

I know there's a lot of stuff out there, but I'm not technical enough
to quickly sift out the best technique or to understand the possible
caveats of using one technique over another; especially if the answer
is given in high level math form.  That's why I'm asking for some
tips.

Thanks,
Robert
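For concreteness, the snapshot idea Robert describes can be sketched as follows: take the magnitude spectrum of a fixed-length window as a "fingerprint" and compare fingerprints by cosine similarity. This is a toy illustration of the naive approach, not a robust identifier; the test tones, FFT size, window, and similarity measure are all assumptions chosen for demonstration.

```python
import numpy as np

def spectral_fingerprint(signal, n_fft=1024):
    """Magnitude spectrum of one analysis window, normalized to unit length."""
    windowed = signal[:n_fft] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(windowed))
    return mag / (np.linalg.norm(mag) + 1e-12)

def match(candidate, library):
    """Return (best_name, similarity) by cosine similarity against a library."""
    scores = {name: float(np.dot(candidate, fp)) for name, fp in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy demo: two reference tones and a noisy query near 440 Hz.
fs = 8000
t = np.arange(1024) / fs
library = {
    "tone_440": spectral_fingerprint(np.sin(2 * np.pi * 440 * t)),
    "tone_880": spectral_fingerprint(np.sin(2 * np.pi * 880 * t)),
}
rng = np.random.default_rng(0)
query = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(1024)
name, score = match(spectral_fingerprint(query), library)
```

Note that this already hides the hard problems the replies below raise: it only works because the query is loud, clean, and time-aligned with the library entries.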
On 22 Nov, 13:54, roschler <robert.osch...@gmail.com> wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Right... have you seen any products or services out there which actually do what you're attempting? Or did you see some cool gadget on some CSI show? If you've actually seen something, why don't you contact the creators of these services and ask what techniques they used?

Rune
On Nov 22, 7:54 am, roschler <robert.osch...@gmail.com> wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Robert,

How do you capture the sounds? Are you working from recordings? What kind of sounds? In what kind of environment (recording studio, quiet room, public room, jail cell, street, shopping mall...)? Are there competing sounds (speakers, music, traffic, ...)? Is there reverberation? How about some other filtering effect on the sounds? ..., ...

There are so many things that will affect your FFTs. Why have you picked FFTs? How do you plan to match the FFTs? Is this a real project (for government, industry) or a toy project (for a class)?

The first step is to define your requirements.

Dirk
Hi Robert,

First, ignore Rune's sarcastic comments; he aspires to be mediocre.

What you are trying to do is not easy; it is pushing the limits of what
current technology can do.  Nevertheless, your brain can do it, so it is
certainly possible.  As an example, suppose you want to recognize the sound
of a brick being dropped on the ground.  You might start by dropping 100
different kinds of bricks, recording the sounds, and then comparing the
data.  From hearing these 100 sounds your brain can easily learn to
recognize the event.  However, when you graph the 100 sound waves they all
look different.  In other words, you cannot recognize a sound simply by
matching the sound wave with some previously recorded sound wave.  To solve
this problem, you need to find what these 100 sound waves have in common. 
These commonalities are what define the "sound of a falling brick."
These are such things as the frequency content, the duration, the onset
abruptness, ending decay, and so on.  If you want to fool around in this
area, start with extremely simple tasks, and you might see some success. 
Good luck.
Steve   
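Steve's commonalities (frequency content, duration, onset abruptness, ending decay) can be turned into a crude feature vector. The sketch below runs on a synthetic "dropped object" sound; the envelope smoothing, the 10%-of-peak threshold, and the choice of features are illustrative assumptions, not a proven recipe.

```python
import numpy as np

def sound_features(x, fs):
    """Crude feature vector: dominant frequency (Hz), effective duration (s),
    attack time (onset to peak, s), and decay time (peak to end, s)."""
    env = np.abs(x)                              # raw amplitude envelope
    env = np.convolve(env, np.ones(64) / 64, mode="same")  # smooth it
    peak_idx = int(np.argmax(env))
    above = np.where(env > 0.1 * env.max())[0]   # samples above 10% of peak
    duration = (above[-1] - above[0]) / fs
    attack = (peak_idx - above[0]) / fs
    decay = (above[-1] - peak_idx) / fs
    spectrum = np.abs(np.fft.rfft(x))
    dominant = np.argmax(spectrum) * fs / len(x)
    return np.array([dominant, duration, attack, decay])

# Synthetic impact-like sound: 200 Hz tone, 10 ms attack, exponential decay.
fs = 8000
t = np.arange(fs // 2) / fs
env_shape = np.minimum(t / 0.01, 1.0) * np.exp(-t / 0.05)
x = env_shape * np.sin(2 * np.pi * 200 * t)
f = sound_features(x, fs)
```

The point of Steve's advice is that 100 different recordings of the same event should land near each other in this feature space even though their raw waveforms look nothing alike.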
On 22 Nov, 19:12, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> Hi Robert,
>
> First, ignore Rune's sarcastic comments; he aspires to be mediocre.
Well, I certainly don't steer other people into wasting time (and possibly careers) on BS.
> start with extremely simple tasks,
That's about the only sane thing you've said all week...
> and you might see some success.
... why the cautious term "might"? If you're so certain, why not a more confident statement? It's not *your* time (or career) at stake? Wimp.

Rune
Ouch again!  Once more I have been put in my place by your sharp wit and
compelling arguments.   

roschler wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Robert,

Well, I have some experience with this - although in a different application area. Each application area will have its challenges and likely some things that are easier to deal with as compared to other applications.

A team of US experts assessed 20 years of computer-generated sound matching in difficult situations. They did this because there were plenty of examples of failures as well as examples of some kinds of success.

I would suggest these things to get you started (or to discourage you):

- There will be false alarms/alerts and there will be false rests/misses. One works against the other. Expect that a high false alarm rate will cause the system to be practically useless. Ask yourself how expensive a false rest will be. You will have to give up alerts in order to make the false alert rate acceptable. At what point is the false rest rate unacceptable? You don't have to build a system to postulate what your operating requirements are going to be. This is a very big deal and, in some sense, easy to decide. What's hard is knowing what any system will yield in this regard.

- If starting from scratch, you will need to decide what parameters of the sound to extract in order to do the comparison. Consider this: "the larger the number of parameters, the higher the necessary signal to noise ratio". Why is that? Because you will naturally consider the "evident" parameters first. They are evident because their signal to noise ratio is high. As you add parameters, it's likely that their individual SNRs will be lower than the initially-selected ones. Overall, this pushes up the total required SNR if you're going to make good use of them. Eventually this can make the system unusable in important situations. This suggests keeping the number of parameters suitably low.

- As part of the parameter extraction process you need to decide on suitable temporal epochs. What's the likelihood of identifying Lincoln's Gettysburg Address as compared to a sneeze, or a sneeze as compared to the phrase "help me!"? The objectives here need to make reasonable sense.

- As part of the parameter extraction process you can consider spectral character, temporal character, and the combination thereof.

- Consider the difference between:
  .. signal detection (as in the fundamental process of receiving and possibly prefiltering a signal in noise)
  .. parameter extraction (getting the temporal and spectral and maybe spatial details as a set of numbers / measures)
  .. signal classification (deciding how the parameter measures stack up against a standard)

A matched filter for a radar may do all 3 more or less at the same time. Building a matched filter for a particular voiced phrase may be a lot harder. Building an effective classifier can be pretty difficult.

In speech, there are similar terms referred to in: http://en.wikipedia.org/wiki/Speech_recognition
The fundamentals are pretty much the same though...

It's not impossible. We have all interacted with voice-actuated telephone menu systems, haven't we? There, there's a limited dictionary of one- or two-word phrases. Say the wrong thing and the system doesn't recognize it - fails (in a good sense) to classify the phrase as being in the dictionary. No matter what, it all boils down to those considerations above.

My sense is that you might be working in a high-SNR situation like the voice menus? That makes it easier.

Have you considered something like Dragon NaturallySpeaking? Voice to text already works. Then text to dictionary??

That it might be challenging can be inferred from Wikipedia's comment in Speech Recognition in Linux: http://en.wikipedia.org/wiki/Speech_recognition_in_Linux

"There is currently no open-source equivalent of proprietary speech recognition software (e.g. Nuance's Dragon NaturallySpeaking or Windows Speech Recognition) for Linux."

Gee, with all the great efforts in open source, one can but wonder why? But it appears there *are* open-source pieces. They're mentioned in the Linux link above.

Fred
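Fred's point that false alarms trade against misses can be seen with a tiny simulation. The score distributions below are hypothetical (overlapping unit-variance Gaussians, standing in for any detector's output at finite SNR), chosen only to show that raising the decision threshold lowers the false-alarm rate while raising the miss rate.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical detector scores: "target present" scores sit higher than
# "noise only" scores, but the two distributions overlap.
noise_scores = rng.normal(0.0, 1.0, 10_000)
target_scores = rng.normal(2.0, 1.0, 10_000)

def rates(threshold):
    """False-alarm rate and miss rate at a given decision threshold."""
    false_alarm = float(np.mean(noise_scores > threshold))
    miss = float(np.mean(target_scores <= threshold))
    return false_alarm, miss

fa_low, miss_low = rates(0.5)    # permissive threshold: many alerts
fa_high, miss_high = rates(2.5)  # strict threshold: few alerts, many misses
```

Sweeping the threshold over its whole range traces out the detector's ROC curve; Fred's advice amounts to deciding, before building anything, where on that curve the application can afford to operate.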
On Nov 23, 9:31 am, "Fred Marshall" <fmarshallx@remove_the_x.acm.org>
wrote:
> [snip]
I would look at speech recognition methods as a starting point, and even voice identification. Of course it's possible - not easy, but possible. In fact, bloody hard. Good luck - should take about 10 years minimum.

Hardy
SteveSmith wrote:
> Ouch again!  Once more I have been put in my place by your sharp wit and
> compelling arguments.
We all know that alternate sets of basis functions can serve for decomposition and reconstitution. Just as sinusoids or exponentials serve as basis sets for the Fourier transform, Walsh functions are the basis set for the Hadamard transform. Haar is to Walsh as wavelet is to sinusoid, and so it goes. Clearly, some basis sets suit some problem domains better than others, notwithstanding that all are essentially interchangeable.

I don't think that Rune is unaware of any of this. What sticks in his craw is the notion that a basis set somehow derived from the physics of the problem at hand will provide some holy grail of analysis. While I basically agree with him, I haven't suffered the evidently traumatic experience that gives him his fervor.

Consider the "Fourier" transform we all know and love. As originally propounded, it involved sines and cosines. Sines and cosines of any frequency are orthogonal to those of any other, and sines are orthogonal to cosines. Moreover, any continuous expression can be decomposed (over a finite interval) into a countable set of sines and cosines. Good so far. We can, for many purposes, adopt a better set for that purpose: complex exponentials. Instead of two types, one suffices because of the usual two-for-one complex vis-a-vis real sleight of hand. The manipulative simplicity of the exponential basis set -- we call it a form, but it is in fact a different but equivalent set -- comes at the cost of time or frequency needing to be a signed quantity. Conceptual simplicity is traded for computational simplicity. Other basis sets trade other attributes.

I doubt that it will ever be useful to characterize sounds as superpositions and transpositions of bird calls, although it wouldn't surprise me if it were possible to do that.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
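Jerry's claim that basis sets are interchangeable is easy to demonstrate: the Walsh/Hadamard basis decomposes and perfectly reconstructs a signal just as the Fourier basis does. A minimal sketch, using the Sylvester construction and an arbitrary length-8 example:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n)
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# Forward transform: coefficients of x in the Walsh basis...
coeffs = H @ x / n
# ...and exact reconstruction, since the rows are orthogonal (H H^T = n I).
x_back = H @ coeffs
```

Which basis is *useful* is a separate question from which basis is *valid*: the Walsh coefficients of a sinusoid are spread over many terms, just as the Fourier coefficients of a square wave are, which is the heart of Jerry's point.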
On 22 Nov, 20:31, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> Ouch again!  Once more I have been put in my place by your sharp wit and
> compelling arguments.
There is a long and honorable tradition in a field I have been very peripherally in touch with - submarines - that the responsible designers and builders of new subs join the crew on the vessel's maiden voyage. I can think of two reasons for that: primarily, that the crew sees that the builders have confidence in their own work; and secondly, in case such confidence is misplaced, that the responsible parties don't get a chance to jeopardize another crew, but perish themselves with their flawed work.

These are attitudes that are well observed elsewhere as well: don't encourage others to do what you don't want to do, or are unable to do, yourself. Such behaviours are generally frowned upon - there are even psychiatric diagnoses for some of them.

Rune