Finding a selected word in an audio recording file

Started by Unknown October 30, 2006
Hi.
I want to write a program that lets the user select a certain word
(or sound) in an audio recording and search the recording for other
occurrences of it. I am not interested in making the program recognize
the word, only in finding other sounds (words) in the recording that
resemble the selected portion.
Please tell me how I could do this.
Thank you very much.
Paul

paulgfx@gmail.com wrote:
> Hi.
> I want to make a program which lets the user to select a certain word
> (or sound) from an audio recording and search it for other occurrences of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds (words) from the recording which resemble the
> selected portion of the sound.
> Please tell me how could I do this.

First of all, forget about *words*. The acoustic signatures of a child and an old drunk (imagine a plastered Louis Armstrong or Lee Marvin) who utter the same word are vastly different.

Finding similar *sounds* is easy: use cross correlation. Just remember that any background noise in either the reference or the measurement tends to degrade the detection rate.

Rune

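To make the cross-correlation suggestion concrete, here is a minimal sketch in pure Python. It assumes the recording and the selected portion are plain lists of samples at the same sample rate. The function name `normalized_cross_correlation`, the toy data, and the 0.99 threshold are illustrative assumptions, not anything given in the thread. Each window is mean-subtracted and scaled to unit energy, so a louder or quieter copy of the same waveform still scores close to 1.

```python
import math

def normalized_cross_correlation(signal, template):
    """Return one score in [-1, 1] for every lag of `template` inside `signal`."""
    m = len(template)
    t_mean = sum(template) / m
    t = [x - t_mean for x in template]
    t_norm = math.sqrt(sum(x * x for x in t))
    scores = []
    for lag in range(len(signal) - m + 1):
        window = signal[lag:lag + m]
        w_mean = sum(window) / m
        w = [x - w_mean for x in window]
        w_norm = math.sqrt(sum(x * x for x in w))
        denom = t_norm * w_norm
        # A constant (e.g. silent) window has zero energy; score it as 0.
        dot = sum(a * b for a, b in zip(t, w))
        scores.append(dot / denom if denom > 0 else 0.0)
    return scores

# Toy data: the template appears once as-is and once at double amplitude.
template = [0.0, 1.0, 0.0, -1.0]
signal = ([0.0] * 5 + template + [0.0] * 5
          + [2 * x for x in template] + [0.0] * 3)

scores = normalized_cross_correlation(signal, template)
matches = [lag for lag, s in enumerate(scores) if s > 0.99]
print(matches)
```

This direct loop costs O(N*M) multiplications. For real recordings an FFT-based correlation would be the usual choice, but the scoring idea is the same.
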
<paulgfx@gmail.com> wrote in message
news:1162208722.821066.277560@e64g2000cwd.googlegroups.com...
> Hi.
> I want to make a program which lets the user to select a certain word
> (or sound) from an audio recording and search it for other occurrences of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds (words) from the recording which resemble the
> selected portion of the sound.
> Please tell me how could I do this.
> Thank you very much.
> Paul

Ask the NSA or CIA - they do it all the time on your phone calls.

M.

Rune Allnor wrote:


> Finding similar *sounds* is easy: Use cross correlation.

Not a good idea. Frequency or phase distortion, as well as lossy compression such as MP3, will destroy the cross correlation completely.

> Just remember that any background noise in either the reference or the
> measurement tend to deteriorate the detection rate.

Compute the energy spectrum every ~10 ms, warp it to log(frequency) and log(amplitude), and correlate the resulting 2-D sequences of spectra.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

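A rough sketch of the spectrum-correlation idea, in pure Python so that it runs without extra packages. It is a simplification under stated assumptions: an 8 kHz sample rate, non-overlapping 10 ms frames (80 samples), a naive DFT instead of an FFT, and log amplitude only. The log-frequency warp mentioned above is left out for brevity. All function names and the synthetic two-tone "word" are made up for illustration.

```python
import cmath
import math

def log_power_spectrum(frame):
    """Log power spectrum of one frame (Hann window, naive DFT, bins 0..n/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(acc) ** 2 + 1e-10))  # floor avoids log(0)
    return spectrum

def spectrogram(samples, frame_len):
    """One log power spectrum per non-overlapping frame."""
    return [log_power_spectrum(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, frame_len)]

def correlation(a, b):
    """Normalized correlation of two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom > 0 else 0.0

def slide(spec, template):
    """Correlate the 2-D template (frames x bins) at every frame offset."""
    flat_t = [v for frame in template for v in frame]
    n = len(template)
    return [correlation([v for frame in spec[lag:lag + n] for v in frame], flat_t)
            for lag in range(len(spec) - n + 1)]

def tone(freq, n_frames, rate=8000, frame_len=80):
    """Synthetic test signal: a sine wave lasting n_frames frames."""
    return [math.sin(2 * math.pi * freq * i / rate)
            for i in range(n_frames * frame_len)]

# A fake "word" (500 Hz then 1500 Hz) placed twice between filler tones.
word = tone(500, 3) + tone(1500, 3)
samples = tone(1000, 5) + word + tone(2000, 6) + word + tone(1000, 4)

spec = spectrogram(samples, 80)
template = spec[5:11]            # frames covering the first occurrence
scores = slide(spec, template)
best = sorted(sorted(range(len(scores)), key=scores.__getitem__)[-2:])
print(best)                      # frame offsets of the two best matches
```

In this toy case the second occurrence is sample-identical to the first, so it says nothing about robustness to changes in pitch or speaking rate.
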
What I need exactly is to find similar sounds (or words pronounced by
the same person).

> Compute the energy spectrum on every ~10ms, warp it log(frq)
> log(amplitude), and correlate the 2d sequences of the spectrums.

I did a simple test and found that spectrograms of the same word look quite similar, so a correlation should work. I will try this. Thanks.

Vladimir Vassilevsky wrote:
> Rune Allnor wrote:
>
> > Finding similar *sounds* is easy: Use cross correlation.
>
> Not a good idea. A frequency/phase distortion as well as a compression
> like mp3 will destroy the cross correlation completely.

Maybe I used the wrong term. I interpreted the original post as a search for similar waveforms. The term "sound" above is a synonym for "waveform". You are right, but the OP did not mention anything about data formats.

> > Just remember that any background noise in either the reference or
> > the measurement tend to deteriorate the detection rate.
>
> Compute the energy spectrum on every ~10ms, warp it log(frq)
> log(amplitude), and correlate the 2d sequences of the spectrums.

Granted, that is not what I suggested, but it's still a cross correlation, isn't it?

Out of curiosity: how robust is this with respect to pitch, speed of talking, intonation, etc.? I can't really see how one can avoid the full speech recognition machinery if words or phrases are to be detected with any degree of robustness.

Rune

Vladimir Vassilevsky wrote:


> Rune Allnor wrote:
>> Finding similar *sounds* is easy: Use cross correlation.
> Not a good idea. A frequency/phase distortion as well as a compression
> like mp3 will destroy the cross correlation completely.

I think cross correlation is right. What he didn't say was which function to cross correlate. As you say, it is quite likely not the digitized waveform itself but some function of the waveform.

-- glen

paulgfx@gmail.com writes:

> I want to make a program which lets the user to select a certain word
> (or sound) from an audio recording and search it for other occurrences of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds (words) from the recording which resemble the
> selected portion of the sound.
> Please tell me how could I do this.

Here's one way. First, split all the audio into 10 ms frames of cepstral or Perceptual Linear Prediction (PLP) coefficients. The selected portion of sound is your template; the rest is what you have to search over. Slide the template over the rest and compute the Euclidean distance. Points with low Euclidean distance will be similar sounds.

Of course this doesn't allow for different lengths, so the next step is to construct a Hidden Markov Model (HMM) from the template, perhaps using minimum description length (MDL) to get the right number of states. You'll have to set the variances of each state carefully, as you'll only have a few observations. There's an efficient dynamic programming solution, similar to the Viterbi algorithm, which allows you to slide the HMM template over all your audio and get a match.

That'll work better and may be good enough for what you want; if not, you are into the realm of reading research papers. I've cross-posted to comp.speech.research. I'd suggest googling whatever you don't know regarding cepstral/PLP/MDL/HMM/Viterbi and also word spotting, looking at HTK from http://htk.eng.cam.ac.uk, and asking more questions on comp.speech.research.

Hope that gives you some pointers to get started.

Tony

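The first step described above (slide the template and compute the Euclidean distance) might look roughly like this. The sketch assumes the feature extraction has already been done, so each frame is just a short list of numbers standing in for cepstral or PLP coefficients. The toy 2-D frames, the name `sliding_distance`, and the 0.5 threshold are invented for illustration.

```python
import math

def sliding_distance(features, template):
    """Mean per-frame Euclidean distance for every alignment of the template."""
    n = len(template)
    distances = []
    for start in range(len(features) - n + 1):
        total = 0.0
        for frame, ref in zip(features[start:start + n], template):
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(frame, ref)))
        distances.append(total / n)
    return distances

# Toy feature frames; real ones would have a dozen or so coefficients each.
features = [
    [0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [0.0, 0.0],
    [5.0, 5.0], [1.1, 2.0], [2.9, 1.1], [0.0, 0.0],
]
template = features[1:3]          # the portion the user selected

distances = sliding_distance(features, template)
hits = [start for start, d in enumerate(distances) if d < 0.5]
print(hits)
```

The selected portion itself always shows up with distance 0, so a real program would exclude that position from the results.
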
Since the number of observations is so small, perhaps it would be
easier to use dynamic time warping instead of an HMM?

Some references on DTW that might be handy:

http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html
http://web1.mtnl.net.in/~nilami/dtw.html
http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
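
For readers who have not met DTW before, here is a minimal sketch of the textbook recurrence, with the three standard moves (repeat a frame of one sequence, repeat a frame of the other, or advance both). It compares two whole sequences of feature frames and is not taken from any of the pages linked above. The names `euclidean` and `dtw_distance` and the 1-D toy frames are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(a, b):
    """Accumulated cost of the cheapest warping path aligning a with b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame of b
                                 cost[i][j - 1],      # repeat frame of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [1.0], [1.0], [2.0]]      # same shape, one frame held longer
other = [[2.0], [1.0], [0.0]]            # reversed shape

same = dtw_distance(fast, slow)
different = dtw_distance(fast, other)
print(same, different)
```

Unlike the fixed-length sliding comparison, the stretched sequence matches with zero cost, which is the property that makes DTW attractive when only one example of the word is available.
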

"this-email-address-is-invalid" <dave_gelbart@yahoo.com> writes:

> Since the number of observations is so small, perhaps it would be
> easier to use dynamic time warping instead of an HMM?
>
> Some references on DTW that might be handy:
>
> http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html
> http://web1.mtnl.net.in/~nilami/dtw.html
> http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/

Yes, good point, DTW is a useful learning step to pass through between template matching and HMMs. In practice the HMMs will end up being really simple (single mean, fixed variances), and the DTW will have to be modified from the standard transitions to something HMM-like if you are going to do efficient matching, so I think the two converge to almost the same thing.

Tony
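
One possible reading of "DTW with HMM-like transitions" is sketched below. It is an interpretation, not necessarily what was meant above. Each template frame acts as a state; for every new recording frame the path either stays in the same state or advances to the next one, and a match may start at any recording frame at no cost. A single left-to-right pass then gives, for every recording frame, the cheapest cost of a match ending there. The name `spot_template`, the scalar toy features, and the absolute-difference distance are assumptions.

```python
def spot_template(recording, template, dist):
    """For each recording frame, the lowest cost of a template match ending there."""
    inf = float("inf")
    n = len(template)
    prev = [inf] * n      # prev[j]: best cost with state j on the previous frame
    end_costs = []
    for frame in recording:
        cur = [inf] * n
        for j in range(n):
            if j == 0:
                best = 0.0                        # a match may start here for free
            else:
                best = min(prev[j], prev[j - 1])  # stay in state j, or advance
            cur[j] = dist(frame, template[j]) + best
        end_costs.append(cur[n - 1])              # cost of finishing at this frame
        prev = cur
    return end_costs

# Scalar stand-ins for feature frames; real frames would be vectors.
template = [1.0, 2.0, 3.0]
recording = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 0.0, 5.0, 1.0, 1.0, 2.0, 3.0, 3.0]

end_costs = spot_template(recording, template, lambda a, b: abs(a - b))
exact_ends = [t for t, c in enumerate(end_costs) if c == 0.0]
print(exact_ends)
```

The cost is O(T*N) for T recording frames and N template frames, with no need to rerun DTW at every offset. Because costs are not normalized by path length, a practical version would probably compare length-normalized scores or use a threshold tuned on real data.
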