
Sound Identification/Matching - good starting point?

Started by roschler November 22, 2008
On Nov 23, 1:46 pm, "Fred Marshall" <fmarshallx@remove_the_x.acm.org> wrote:

> Well, I don't want to complicate things, but you might consider doing
> pattern recognition on a 2-D "image" of spectral density vs. time with a
> particular set of temporal characteristics. That would bring in image
> processing techniques, but is somewhat the idea in identifying sounds - to
> look at the time variation of the spectral density as a pattern.
>
> When you add the complexity of the "uninteresting" TV, then I can imagine
> it tuned to "The Dog Whisperer" while you're trying to tell if your own
> dog is barking!! Anyway, that suggests perhaps a noise canceller using
> the TV output as a pre-processing step. I'm not sure how hard or easy
> that would be, as you'd likely have to delay the classification stream to
> accommodate dealing with the rapidly varying TV output. I don't know if
> that's been done. It might look like this:
>
> microphone >> delay           >> summing point
> TV         >> adaptive filter >>      ^
>
> The delay is necessary so that the delay of the adaptive filter doesn't
> misalign the TV signal at the summing point. You need the cancellation to
> be aligned, I do believe.
>
> It's a lot less complicated without this... and it's still complicated!!
>
> Fred
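Fred's diagram translates to something like the following - a minimal NumPy sketch under illustrative assumptions (the function name, delay, filter length, and step size are all placeholders; in practice the step size needs tuning against the signal power, and a normalised-LMS update is usually safer):

import numpy as np

def lms_canceller(mic, tv, delay=32, taps=64, mu=1e-3):
    """Delayed-reference LMS noise canceller, per Fred's diagram.

    mic: microphone signal (wanted sound + TV leakage)
    tv:  reference taken directly from the TV output
    Both are equal-length 1-D arrays. The microphone path is delayed
    so the adaptive filter's own delay does not misalign the two
    signals at the summing point.
    """
    n = len(mic)
    mic_delayed = np.concatenate([np.zeros(delay), mic])[:n]
    w = np.zeros(taps)                      # adaptive filter weights
    out = np.zeros(n)                       # error = TV-cancelled output
    for k in range(taps, n):
        x = tv[k - taps:k][::-1]            # most recent TV samples first
        e = mic_delayed[k] - np.dot(w, x)   # subtract estimated TV leakage
        w += mu * e * x                     # LMS weight update
        out[k] = e
    return out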
Fred,

Well then, I'll plan on at least doing a spectrum-vs-time analysis, rather than a static snapshot of one long sound sample. The great thing about a problem like this is that even if I don't solve it completely, it's still incredibly fun to try.

Thanks,
Robert
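For concreteness, a spectrum-vs-time analysis of this kind is just a short-time Fourier transform. A minimal sketch, with purely illustrative frame and hop sizes:

import numpy as np

def spectrogram(x, fft_size=1024, hop=256):
    """Magnitude spectrogram: spectral density vs. time as a 2-D array
    (rows = time frames, columns = frequency bins), i.e. the 'image'
    Fred suggests doing pattern recognition on."""
    window = np.hanning(fft_size)
    frames = [np.abs(np.fft.rfft(x[s:s + fft_size] * window))
              for s in range(0, len(x) - fft_size + 1, hop)]
    return np.array(frames)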
Rune, Rune, Rune...

Now you've motivated me to take action.  Look for the new post. 

On Nov 24, 7:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 23 Nov, 19:06, HardySpicer <gyansor...@gmail.com> wrote:
>
> > I cannot agree that because humans can do it then it would take a
> > machine eons to do likewise. Speech recognition is a hard task and is
> > just about there now. Some engines are very accurate - say 99%, albeit
> > in an environment with a high SNR. That has taken about 50 years.
>
> The problem in speech processing is rather well-conditioned
> compared to most other applications:
>
> - The signal is constrained (e.g. a phone line), so one
>   knows the source is human
> - The characteristics of human speech can be identified
>   from experiments
> - Such experiments can be performed in an ideal environment
>   (anechoic chamber)
>
> and *still* it's a non-trivial exercise.
>
> Steve said that "it's easy for the human brain" to do
> these identifications; I'm thinking it takes a human
> brain to achieve it.
>
> Rune
Speech recognition is a bloody hard problem. It has taken whole teams of linguists, computer scientists and engineers to sort it out.

Hardy
On Nov 22, 12:54 pm, roschler <robert.osch...@gmail.com> wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching? I was going to start creating a library
> of FFT snapshots consisting of varying time window lengths for the
> sounds I want to identify in audio streams. Then I was going to start
> matching up sounds that way by comparing snapshots taken from incoming
> audio, seeing how close they are to the library, but then I thought I'd
> better ask to see if there are any known techniques, papers, etc. that
> would help me avoid reinventing the wheel. Anything with source code
> examples would be appreciated.
>
> I know there's a lot of stuff out there, but I'm not technical enough
> to quickly sift out the best technique or to understand the possible
> caveats of using one technique over another; especially if the answer
> is given in high-level math form. That's why I'm asking for some tips.
>
> Thanks,
> Robert
Hi, I couldn't be bothered to read every post, so if what I have to add has already come forward, feel free to ignore it.

I've written software which does exactly what you are referring to, and it was divided into two main parts: a DSP part and a database part. It works like this: for any length of audio I want to create a description of, the DSP part analyzes the incoming audio, creating overlapping snapshots of FFT data. These snapshots are then added together, and this summing helps to amplify the most distinctive frequencies in the signal.

The next step was storing the data. The global snapshot was now processed into frequency buckets and a float number constructed for each of the buckets, its value based on the amplitude of its frequencies. Each bucket's number now denotes a fictional length. From the lengths of all the buckets, multidimensional vectors are constructed.

Vectors would be a good fit for your project since they work on a nearest-neighbour approach. This basically means that the vector constructed from one recording of a church bell doesn't need to be exactly the same for it to match another recording of a church bell; i.e., the difference between the sounds is the distance between the vectors describing them.

I hope some of this makes sense.
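A rough sketch of how that pipeline might look (an interpretation of the description above, not Esteban's actual code; the function name, FFT parameters, and bucket count are illustrative assumptions):

import numpy as np

def describe(x, fft_size=2048, hop=512, n_buckets=32):
    """Sum overlapping FFT snapshots, split the summed spectrum into
    frequency buckets, and take one float per bucket as the vector."""
    window = np.hanning(fft_size)
    summed = np.zeros(fft_size // 2 + 1)
    for start in range(0, len(x) - fft_size + 1, hop):
        summed += np.abs(np.fft.rfft(x[start:start + fft_size] * window))
    buckets = np.array_split(summed, n_buckets)   # equal-width buckets
    vec = np.array([b.sum() for b in buckets])    # one float per bucket
    return vec / np.linalg.norm(vec)              # compare by shape, not level

Normalising the vector means two recordings of the same sound at different levels still land close together under the nearest-neighbour comparison described above.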
On Nov 26, 4:58 am, Esteban Gaviz <esteban.supr...@gmail.com> wrote:
> <snip - roschler's question and Esteban's reply, quoted in full above>
Esteban,

Yes it does, and thank you for your suggestions. What classification method did you use to compare vectors? I thought about using PCA (Principal Components Analysis) on the spectral profiles I created for each sound, and then comparing new profiles against the resulting eigenvectors. Also, how did the FFT profile created by using your overlap/add method compare to that of one created by performing an FFT over a time window that encompassed the entire sound?

Thanks,
Robert
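The PCA idea Robert floats might look like this - a hypothetical sketch only (pca_basis and project are made-up names; this is not anything implemented in this thread):

import numpy as np

def pca_basis(profiles, n_components=8):
    """profiles: (n_sounds, n_bins) matrix, one spectral profile per row.
    Returns the mean profile and the top principal components."""
    mean = profiles.mean(axis=0)
    # SVD of the centred data: the rows of vt are the eigenvectors of
    # the covariance matrix, ordered by explained variance
    _, _, vt = np.linalg.svd(profiles - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(profile, mean, components):
    """Coordinates of a profile in the reduced PCA space; two sounds can
    then be compared by the distance between their projections."""
    return components @ (profile - mean)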
roschler wrote:
> Can anyone recommend a good starting point for creating code that does
> Sound Identification/Matching?
NB: those are two distinct jobs. You may be able to match one sound with another, without being able to identify either of them.

Keywords for searching: spectral modelling; audio content analysis.

I finally tracked down a source for the reference I think may be a good starting point, the key subject being MPEG-7 sound descriptors:

"A proposal for the description of audio in the context of MPEG-7"
http://en.scientificcommons.org/541620
(download via the citeseer link)

And the full MPEG-7 Reference Software (mostly C++) is available on a DVD with the moderately expensive book:

"Introduction to MPEG-7: Multimedia Content Description Interface"
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0471486787.html

I downloaded it all only a year ago for my archives, so the software package ~may~ still be available online somewhere. Whether it counts as a "good starting point" for creating code I leave to you to decide.

Someone wrote earlier that "the brain can do it easily". It has to be pointed out, however, that the brain is also easily deceived, in different ways depending (among other things) on the absence or presence of visual support. Most of the sounds in a film, for example, are not produced by what you see on the screen!

Richard Dobson
On Nov 26, 7:01 pm, roschler <robert.osch...@gmail.com> wrote:
> <snip - the preceding exchange with Esteban, quoted in full above>
> What classification method did you use to compare vectors?
For development I used simple Euclidean distance, but there are a number of high-dimensional database approaches that can match the vectors more effectively. Since I am not really following the latest developments in the field, however, and the different techniques are tailored for different things (e.g. speed vs. size of data set vs. number of false positives), I'll leave googling them to you.
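For reference, the brute-force version of that matching is tiny (a sketch; the high-dimensional index structures Esteban alludes to would replace this linear scan for large libraries):

import numpy as np

def nearest(query, library):
    """Brute-force nearest neighbour: the best match is the library row
    at the smallest Euclidean distance from the query vector.
    library: (n_entries, n_dims) array; query: (n_dims,) array."""
    d = np.linalg.norm(library - query, axis=1)
    best = int(np.argmin(d))
    return best, float(d[best])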
> how did the FFT profile created by using your overlap/add method
> compare to that of one created by performing an FFT over a time window
> that encompassed the entire sound?
The simpler the input, the more similar the FFT you describe would be; the more complex, the more dissimilar. What I wanted to do was describe complex polyphonic signals, and the overlapping served to draw out the most relevant frequencies, since these contained all the information I wanted. Keep in mind that these overlapping slices are tiny.

PCA did not apply, since my frequency buckets were bound to certain values based on the logarithmic scale of musical notes, and I presume PCA would have skewed that scale. Might work, though.
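A hedged sketch of that kind of note-aligned bucketing (an illustration of the idea, not Esteban's code; f0, the note count, and the FFT parameters are assumptions):

import numpy as np

def semitone_buckets(spectrum, fs=44100, fft_size=4096, f0=55.0, n_notes=72):
    """Sum FFT magnitude bins into semitone-spaced buckets, one per
    musical note from f0 upward. spectrum must have fft_size // 2 + 1
    bins, as produced by np.fft.rfft."""
    freqs = np.fft.rfftfreq(fft_size, d=1.0 / fs)   # bin centre frequencies
    # note index of each bin on the 12-notes-per-octave log scale
    note = np.floor(12.0 * np.log2(np.maximum(freqs, 1e-9) / f0)).astype(int)
    out = np.zeros(n_notes)
    for i, k in enumerate(note):
        if 0 <= k < n_notes:
            out[k] += spectrum[i]   # DC and out-of-range bins are dropped
    return out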
On Nov 26, 2:34 pm, Richard Dobson <richarddob...@blueyonder.co.uk> wrote:

> <snip - Richard's post, quoted in full above>
Thanks, Richard. I'm going to check those references out now.

Robert
> On Nov 26, 4:58 am, Esteban Gaviz <esteban.supr...@gmail.com> wrote:
> <snip - Esteban's replies, quoted in full above>
Thanks, Esteban. Definite food for thought.

Robert