DSPRelated.com
Forums

Sound Identification/Matching - good starting point?

Started by roschler November 22, 2008
Can anyone recommend a good starting point for creating code that does
Sound Identification/Matching?  I was going to start creating a
library of FFT snapshots consisting of varying time window lengths for
the sounds I want to identify in audio streams.  Then I was going to
start matching up sounds that way by comparing snapshots taken from
incoming audio and seeing how close they are to the library, but then I
thought I'd better ask to see if there are any known techniques, papers,
etc. that would help me avoid reinventing the wheel.  Anything with
source code examples would be appreciated.

I know there's a lot of stuff out there, but I'm not technical enough
to quickly sift out the best technique or to understand the possible
caveats of using one technique over another; especially if the answer
is given in high level math form.  That's why I'm asking for some
tips.

Thanks,
Robert
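For concreteness, the snapshot idea Robert describes can be sketched as follows: take the magnitude spectrum of a fixed-length window as a "fingerprint" and compare fingerprints by cosine similarity. This is a toy illustration of the naive approach, not a robust identifier; the test tones, FFT size, window, and similarity measure are all assumptions chosen for demonstration.

```python
import numpy as np

def spectral_fingerprint(signal, n_fft=1024):
    """Magnitude spectrum of one analysis window, normalized to unit length."""
    windowed = signal[:n_fft] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(windowed))
    return mag / (np.linalg.norm(mag) + 1e-12)

def match(candidate, library):
    """Return (best_name, similarity) by cosine similarity against a library."""
    scores = {name: float(np.dot(candidate, fp)) for name, fp in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy demo: two reference tones and a noisy query near 440 Hz.
fs = 8000
t = np.arange(1024) / fs
library = {
    "tone_440": spectral_fingerprint(np.sin(2 * np.pi * 440 * t)),
    "tone_880": spectral_fingerprint(np.sin(2 * np.pi * 880 * t)),
}
rng = np.random.default_rng(0)
query = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(1024)
name, score = match(spectral_fingerprint(query), library)
```

Note that this already hides the hard problems the replies below raise: it only works because the query is loud, clean, and time-aligned with the library entries.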
On 22 Nov, 13:54, roschler <robert.osch...@gmail.com> wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Right... have you seen any products or services out there which actually do what you're attempting? Or did you see some cool gadget on some CSI show? If you've actually seen something, why don't you contact the creators of these services and ask what techniques they used?

Rune
On Nov 22, 7:54 am, roschler <robert.osch...@gmail.com> wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Robert,

How do you capture the sounds? Are you working from recordings? What kind of sounds? In what kind of environment (recording studio, quiet room, public room, jail cell, street, shopping mall...)? Are there competing sounds (speakers, music, traffic, ...)? Is there reverberation? How about some other filtering effect on the sounds? ..., ...

There are so many things that will affect your FFTs. Why have you picked FFTs? How do you plan to match the FFTs? Is this a real project (for government, industry) or a toy project (for a class)?

The first step is to define your requirements.

Dirk
Hi Robert,

First, ignore Rune's sarcastic comments; he aspires to be mediocre.

What you are trying to do is not easy; it is pushing the limits of what
current technology can do.  Nevertheless, your brain can do it, so it is
certainly possible.  As an example, suppose you want to recognize the sound
of a brick being dropped on the ground.  You might start by dropping 100
different kinds of bricks, recording the sounds, and then comparing the
data.  From hearing these 100 sounds your brain can easily learn to
recognize the event.  However, when you graph the 100 sound waves they all
look different.  In other words, you cannot recognize a sound simply by
matching the sound wave with some previously recorded sound wave.  To solve
this problem, you need to find what these 100 sound waves have in common. 
These commonalities are what define the "sound of a falling brick."
These are such things as the frequency content, the duration, the onset
abruptness, ending decay, and so on.  If you want to fool around in this
area, start with extremely simple tasks, and you might see some success. 
Good luck.
Steve   
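Steve's commonalities (frequency content, duration, onset abruptness, ending decay) can be turned into a crude feature vector. The sketch below runs on a synthetic "dropped object" sound; the envelope smoothing, the 10%-of-peak threshold, and the choice of features are illustrative assumptions, not a proven recipe.

```python
import numpy as np

def sound_features(x, fs):
    """Crude feature vector: dominant frequency (Hz), effective duration (s),
    attack time (onset to peak, s), and decay time (peak to end, s)."""
    env = np.abs(x)                              # raw amplitude envelope
    env = np.convolve(env, np.ones(64) / 64, mode="same")  # smooth it
    peak_idx = int(np.argmax(env))
    above = np.where(env > 0.1 * env.max())[0]   # samples above 10% of peak
    duration = (above[-1] - above[0]) / fs
    attack = (peak_idx - above[0]) / fs
    decay = (above[-1] - peak_idx) / fs
    spectrum = np.abs(np.fft.rfft(x))
    dominant = np.argmax(spectrum) * fs / len(x)
    return np.array([dominant, duration, attack, decay])

# Synthetic impact-like sound: 200 Hz tone, 10 ms attack, exponential decay.
fs = 8000
t = np.arange(fs // 2) / fs
env_shape = np.minimum(t / 0.01, 1.0) * np.exp(-t / 0.05)
x = env_shape * np.sin(2 * np.pi * 200 * t)
f = sound_features(x, fs)
```

The point of Steve's advice is that 100 different recordings of the same event should land near each other in this feature space even though their raw waveforms look nothing alike.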
On 22 Nov, 19:12, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> Hi Robert,
>
> First, ignore Rune's sarcastic comments; he aspires to be mediocre.
Well, I certainly don't steer other people into wasting time (and possibly careers) on BS.
> start with extremely simple tasks,
That's about the only sane thing you've said all week...
> and you might see some success.
... why the cautious term "might"? If you're so certain, why not a more confident statement? It's not *your* time (or career) at stake? Wimp.

Rune
Ouch again!  Once more I have been put in my place by your sharp wit and
compelling arguments.   

roschler wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? [snip]
Robert,

Well, I have some experience with this - although in a different application area. Each application area will have its challenges and likely some things that are easier to deal with as compared to other applications.

A team of US experts assessed 20 years of computer-generated sound matching in difficult situations. They did this because there were plenty of examples of failures as well as examples of some kinds of success.

I would suggest these things to get you started (or to discourage you):

- There will be false alarms/alerts and there will be false rests/misses. One works against the other. Expect that a high false alarm rate will cause the system to be practically useless. Ask yourself how expensive a false rest will be. You will have to give up alerts in order to make the false alert rate acceptable. At what point is the false rest rate unacceptable? You don't have to build a system to postulate what your operating requirements are going to be. This is a very big deal and, in some sense, easy to decide. What's hard is knowing what any system will yield in this regard.

- If starting from scratch, you will need to decide what parameters of the sound to extract in order to do the comparison. Consider this: "the larger the number of parameters, the higher the necessary signal to noise ratio". Why is that? Because you will naturally consider the "evident" parameters first. They are evident because their signal to noise ratio is high. As you add parameters, it's likely that their individual SNRs will be lower than the initially-selected ones. Overall, this pushes up the total required SNR if you're going to make good use of them. Eventually this can make the system unusable in important situations. This suggests keeping the number of parameters suitably low.

- As part of the parameter extraction process you need to decide on suitable temporal epochs. What's the likelihood of identifying Lincoln's Gettysburg Address as compared to a sneeze, or a sneeze as compared to the phrase "help me!"? The objectives here need to make reasonable sense.

- As part of the parameter extraction process you can consider spectral character, temporal character, and the combination thereof.

- Consider the difference between:
  .. signal detection (as in the fundamental process of receiving and possibly prefiltering a signal in noise)
  .. parameter extraction (getting the temporal and spectral and maybe spatial details as a set of numbers / measures)
  .. signal classification (deciding how the parameter measures stack up against a standard)

A matched filter for a radar may do all 3 more or less at the same time. Building a matched filter for a particular voiced phrase may be a lot harder. Building an effective classifier can be pretty difficult.

In speech, there are similar terms referred to in: http://en.wikipedia.org/wiki/Speech_recognition
The fundamentals are pretty much the same though...

It's not impossible. We have all interacted with voice-actuated telephone menu systems, haven't we? There, there's a limited dictionary of one- or two-word phrases. Say the wrong thing and the system doesn't recognize it - fails (in a good sense) to classify the phrase as being in the dictionary. No matter what, it all boils down to those considerations above.

My sense is that you might be working in a high-SNR situation like the voice menus? That makes it easier.

Have you considered something like Dragon NaturallySpeaking? Voice to text already works. Then text to dictionary??

That it might be challenging can be inferred from Wikipedia's comment in Speech Recognition in Linux: http://en.wikipedia.org/wiki/Speech_recognition_in_Linux

"There is currently no open-source equivalent of proprietary speech recognition software (e.g. Nuance's Dragon NaturallySpeaking or Windows Speech Recognition) for Linux."

Gee, with all the great efforts in open source, one can but wonder why? But it appears there *are* open-source pieces. They're mentioned in the Linux link above.

Fred
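Fred's point that false alarms trade against misses can be seen with a tiny simulation. The score distributions below are hypothetical (overlapping unit-variance Gaussians, standing in for any detector's output at finite SNR), chosen only to show that raising the decision threshold lowers the false-alarm rate while raising the miss rate.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical detector scores: "target present" scores sit higher than
# "noise only" scores, but the two distributions overlap.
noise_scores = rng.normal(0.0, 1.0, 10_000)
target_scores = rng.normal(2.0, 1.0, 10_000)

def rates(threshold):
    """False-alarm rate and miss rate at a given decision threshold."""
    false_alarm = float(np.mean(noise_scores > threshold))
    miss = float(np.mean(target_scores <= threshold))
    return false_alarm, miss

fa_low, miss_low = rates(0.5)    # permissive threshold: many alerts
fa_high, miss_high = rates(2.5)  # strict threshold: few alerts, many misses
```

Sweeping the threshold over its whole range traces out the detector's ROC curve; Fred's advice amounts to deciding, before building anything, where on that curve the application can afford to operate.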
On Nov 23, 9:31 am, "Fred Marshall" <fmarshallx@remove_the_x.acm.org>
wrote:
> [snip]
I would look at speech recognition methods as a starting point, and even voice identification. Of course it's possible - not easy, but possible. In fact, bloody hard. Good luck - should take about 10 years minimum.

Hardy
SteveSmith wrote:
> Ouch again!  Once more I have been put in my place by your sharp wit and
> compelling arguments.
We all know that alternate sets of basis functions can serve for decomposition and reconstitution. Just as sinusoids or exponentials serve as basis sets for the Fourier transform, Walsh functions are the basis set for the Hadamard transform. Haar is to Walsh as wavelet is to sinusoid, and so it goes. Clearly, some basis sets suit some problem domains better than others, notwithstanding that all are essentially interchangeable.

I don't think that Rune is unaware of any of this. What sticks in his craw is the notion that a basis set somehow derived from the physics of the problem at hand will provide some holy grail of analysis. While I basically agree with him, I haven't suffered the evidently traumatic experience that gives him his fervor.

Consider the "Fourier" transform we all know and love. As originally propounded, it involved sines and cosines. Sines and cosines of any frequency are orthogonal to those of any other, and sines are orthogonal to cosines. Moreover, any continuous expression can be decomposed (over a finite interval) into a countable set of sines and cosines. Good so far. We can, for many purposes, adopt a better set for that purpose: complex exponentials. Instead of two types, one suffices because of the usual two-for-one complex vis-a-vis real sleight of hand. The manipulative simplicity of the exponential basis set -- we call it a form, but it is in fact a different but equivalent set -- comes at the cost of time or frequency needing to be a signed quantity. Conceptual simplicity is traded for computational simplicity. Other basis sets trade other attributes.

I doubt that it will ever be useful to characterize sounds as superpositions and transpositions of bird calls, although it wouldn't surprise me if it were possible to do that.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
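Jerry's claim that basis sets are interchangeable is easy to demonstrate: the Walsh/Hadamard basis decomposes and perfectly reconstructs a signal just as the Fourier basis does. A minimal sketch, using the Sylvester construction and an arbitrary length-8 example:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n)
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# Forward transform: coefficients of x in the Walsh basis...
coeffs = H @ x / n
# ...and exact reconstruction, since the rows are orthogonal (H H^T = n I).
x_back = H @ coeffs
```

Which basis is *useful* is a separate question from which basis is *valid*: the Walsh coefficients of a sinusoid are spread over many terms, just as the Fourier coefficients of a square wave are, which is the heart of Jerry's point.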
On 22 Nov, 20:31, "SteveSmith" <Steve.Smi...@SpectrumSDI.com> wrote:
> Ouch again!  Once more I have been put in my place by your sharp wit and
> compelling arguments.
There is a long and honorable tradition in a field I have been very peripherally in touch with - submarines - that the responsible designers and builders of new subs join the crew on the vessel's maiden voyage. I can think of two reasons for that: primarily, that the crew sees that the builders have confidence in their own work; and secondly, in case such confidence is misplaced, that the responsible parties don't get a chance to jeopardize another crew, but perish themselves with their flawed work.

These are attitudes that are well observed elsewhere as well: don't encourage others to do what you don't want to do, or are unable to do, yourself. Such behaviours are generally frowned upon - there are even psychiatric diagnoses for some of them.

Rune