comp.dsp | Finding a selected word in an audio recording file

Hi.
I want to make a program which lets the user select a certain word
(or sound) from an audio recording and search the recording for other
occurrences of it. I am not interested in making the program recognize
that word, only in finding other sounds (words) in the recording which
resemble the selected portion.
Please tell me how I could do this.
Thank you very much.
Paul

Reply by Rune Allnor ● October 30, 2006

paulgfx@gmail.com skrev:
> Hi.
> I want to make a program which let's the user to select a certain word
> (or sound) from an audio recording and search it for other occurence of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds(words) from the recording which resembles the
> selected portion of the sound.
> Please tell me how could I do this.

First of all, forget about *words*. The acoustic signatures of a child
and an old drunk (imagine a plastered Louis Armstrong or Lee Marvin)
uttering the same word are vastly different.

Finding similar *sounds* is easy: use cross correlation. Just remember
that any background noise in either the reference or the measurement
tends to degrade the detection rate.
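
For illustration, here is a minimal sketch of waveform cross correlation with NumPy. The function name `normalized_xcorr`, the synthetic test signal, and the 0.8 threshold are assumptions made for this example and are not from the post. The score is normalized per window so that loud passages do not dominate.

```python
import numpy as np

def normalized_xcorr(signal, template):
    """Slide `template` over `signal`; return one score in [-1, 1] per lag."""
    signal = np.asarray(signal, dtype=float)
    template = np.asarray(template, dtype=float)
    n = len(template)
    t = template - template.mean()
    t_norm = np.sqrt(np.sum(t * t))
    # t is zero-mean, so correlating it with the raw signal gives the same
    # numerator as correlating it with mean-removed windows.
    num = np.correlate(signal, t, mode="valid")
    # Running sums give the energy of every length-n window around its mean.
    c1 = np.concatenate(([0.0], np.cumsum(signal)))
    c2 = np.concatenate(([0.0], np.cumsum(signal ** 2)))
    win_sum = c1[n:] - c1[:-n]
    win_sq = c2[n:] - c2[:-n]
    win_energy = np.maximum(win_sq - win_sum ** 2 / n, 0.0)
    denom = np.sqrt(win_energy) * t_norm
    scores = np.zeros_like(num)
    ok = denom > 1e-9          # skip silent windows
    scores[ok] = num[ok] / denom[ok]
    return np.clip(scores, -1.0, 1.0)

# Synthetic check: the same noise burst hidden twice in weaker noise.
rng = np.random.default_rng(0)
template = rng.standard_normal(200)
signal = 0.1 * rng.standard_normal(2000)
signal[300:500] += template
signal[1400:1600] += template

scores = normalized_xcorr(signal, template)
hits = np.flatnonzero(scores > 0.8)   # 0.8 is an arbitrary example threshold
print(hits)
```

`np.correlate` costs time proportional to signal length times template length, so a long recording would probably call for an FFT-based correlation instead.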

Rune

Reply by Major Misunderstanding ● October 30, 2006

<paulgfx@gmail.com> wrote in message
news:1162208722.821066.277560@e64g2000cwd.googlegroups.com...
> Hi.
> I want to make a program which let's the user to select a certain word
> (or sound) from an audio recording and search it for other occurence of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds(words) from the recording which resembles the
> selected portion of the sound.
> Please tell me how could I do this.
> Thank you very much.
> Paul
>

Ask the NSA or CIA - they do it all the time on your phone calls.


M.




Reply by Vladimir Vassilevsky ● October 30, 2006

Rune Allnor wrote:

> 
> Finding similar *sounds* is easy: Use cross correlation.

Not a good idea. Frequency or phase distortion, as well as lossy
compression such as MP3, will destroy the cross correlation completely.

> Just remember
> that any background noise in either the reference or the measurement
> tend to deteriorate the detection rate.

Compute the energy spectrum every ~10 ms, warp it to log(frequency) and
log(amplitude), and correlate the 2-D sequences of spectra.
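
A rough sketch of that recipe, under several assumptions that are not in the post: 25 ms Hann frames with a 10 ms hop, 32 log-spaced frequency points obtained by simple interpolation, and a normalized correlation of the template patch at every frame offset. The names `log_spectrogram` and `match_2d` are made up for the example.

```python
import numpy as np

def log_spectrogram(x, sr, frame_ms=25.0, hop_ms=10.0, n_bands=32, fmin=100.0):
    """Log-power spectrogram resampled onto a log-spaced frequency axis."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    centers = np.geomspace(fmin, sr / 2, n_bands)       # log(frequency) warp
    warped = np.stack([np.interp(centers, freqs, row) for row in power])
    return np.log(warped + 1e-12)                        # log(amplitude) warp

def match_2d(rec_feat, tmpl_feat):
    """Normalized correlation of the template patch at each frame offset."""
    m = len(tmpl_feat)
    t = tmpl_feat - tmpl_feat.mean()
    t = t / (np.linalg.norm(t) + 1e-12)
    scores = np.empty(len(rec_feat) - m + 1)
    for i in range(len(scores)):
        w = rec_feat[i:i + m]
        w = w - w.mean()
        scores[i] = np.sum(w * t) / (np.linalg.norm(w) + 1e-12)
    return scores

# Synthetic check: a rising chirp appears twice, the second time with a
# different phase, which would upset a plain waveform correlation.
sr = 8000
rng = np.random.default_rng(1)
t = np.arange(int(0.3 * sr)) / sr
phase = 2 * np.pi * (500 * t + 2500 * t ** 2)   # 500 Hz rising to 2000 Hz
x = 0.01 * rng.standard_normal(3 * sr)
x[4000:6400] += np.sin(phase)
x[16000:18400] += np.cos(phase)

rec_feat = log_spectrogram(x, sr)
tmpl_feat = log_spectrogram(x[4000:6400], sr)    # the user's selection
scores = match_2d(rec_feat, tmpl_feat)
second = 150 + int(np.argmax(scores[150:]))
print(int(np.argmax(scores)), second)            # frame index = sample / 80
```

With a hop of 80 samples, the two chirps start at frames 50 and 200, so peaks in `scores` near those indices indicate a match.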

Vladimir Vassilevsky

DSP and Mixed Signal Design Consultant

http://www.abvolt.com

Reply by ● October 30, 2006

What I need exactly is to find similar sounds (or words pronounced by
the same person).

>
> Compute the energy spectrum on every ~10ms, warp it log(frq)
> log(amplitude), and correlate the 2d sequences of the spectrums.
>

I did a simple test and found that spectrograms of the same word
look quite similar, so a correlation should work. I will try this.
Thanks.

Reply by Rune Allnor ● October 30, 2006

Vladimir Vassilevsky skrev:
> Rune Allnor wrote:
>
>
> >
> > Finding similar *sounds* is easy: Use cross correlation.
>
> Not a good idea. A frequency/phase distortion as well as a compression
> like mp3 will destroy the cross correlation completely.

Maybe I used the wrong term. I interpreted the original post as a
search for similar waveforms. The term "sound" above is a synonym for
"waveform". You are right, but the OP did not mention anything about
data formats.

>   Just remember
> > that any background noise in either the reference or the measurement
> > tend to deteriorate the detection rate.
>
> Compute the energy spectrum on every ~10ms, warp it log(frq)
> log(amplitude), and correlate the 2d sequences of the spectrums.

Granted, not what I suggested, but it's still a cross correlation?

Out of curiosity: How robust is this with respect to pitch, "speed"
of talking, intonation, etc.? I can't really see how one can avoid the
full speech recognition machinery to detect words or phrases with any
degree of robustness?

Rune

Reply by glen herrmannsfeldt ● October 31, 2006

Vladimir Vassilevsky wrote:

> Rune Allnor wrote:

>> Finding similar *sounds* is easy: Use cross correlation.

> Not a good idea. A frequency/phase distortion as well as a compression 
> like mp3 will destroy the cross correlation completely.

I think cross correlation is right.  What he didn't say was what
function you wanted to cross correlate on.  As you say, reasonably
likely not on the digitized waveform itself, but some function of
the waveform.

-- glen

Reply by ● November 1, 2006

paulgfx@gmail.com writes:

> I want to make a program which let's the user to select a certain word
> (or sound) from an audio recording and search it for other occurence of
> it. I am not interested to make the program to recognize that word, but
> only to find other sounds(words) from the recording which resembles the
> selected portion of the sound.
> Please tell me how could I do this.

Here's one way.  First, split all the audio into 10 ms frames of
cepstral or Perceptual Linear Prediction (PLP) coefficients.  The selected
portion of sound is your template; the rest is what you have to search
over.  Slide the template over the rest and compute the Euclidean
distance at each position.  Points with a low Euclidean distance will be
similar sounds.  Of course, this doesn't allow for different lengths, so
the next step is to construct a Hidden Markov Model (HMM) from the
template, perhaps using minimum description length (MDL) to get the right
number of states.  You'll have to set the variances of each state
carefully, as you'll only have a few observations.  There's an efficient
dynamic programming solution, similar to the Viterbi algorithm, which
allows you to slide the HMM template over all your audio and get a match.
That'll work better, perhaps well enough for what you want; if not, then
you are into the realm of reading research papers.
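
The first step (fixed-length template, Euclidean distance) can be sketched as follows. This is only an illustration: a plain real cepstrum stands in for proper mel-cepstral or PLP features, and the frame sizes, the 13 coefficients, and the names `cepstral_frames` and `sliding_distance` are assumptions made here.

```python
import numpy as np

def cepstral_frames(x, sr, frame_ms=25.0, hop_ms=10.0, n_ceps=13):
    """Low-order real cepstrum per frame (a stand-in for MFCC/PLP)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    log_mag = np.log(np.abs(np.fft.rfft(frames * np.hamming(frame), axis=1)) + 1e-10)
    ceps = np.fft.irfft(log_mag, axis=1)     # inverse FFT of the log spectrum
    return ceps[:, 1:n_ceps + 1]             # drop c0, which tracks loudness

def sliding_distance(rec, tmpl):
    """Euclidean distance between the template and each same-length window."""
    m = len(tmpl)
    return np.array([np.linalg.norm(rec[i:i + m] - tmpl)
                     for i in range(len(rec) - m + 1)])

# Synthetic check: one chirp placed at 0.5 s and again at 2.0 s.
sr = 8000
rng = np.random.default_rng(2)
t = np.arange(int(0.3 * sr)) / sr
chirp = np.sin(2 * np.pi * (500 * t + 2500 * t ** 2))
x = 0.01 * rng.standard_normal(3 * sr)
x[4000:6400] += chirp
x[16000:18400] += chirp

rec = cepstral_frames(x, sr)
tmpl = cepstral_frames(x[4000:6400], sr)     # the selected portion
dist = sliding_distance(rec, tmpl)
second = 150 + int(np.argmin(dist[150:]))
print(int(np.argmin(dist)), second)          # frame index = sample / 80
```

Low values of `dist` mark candidate matches. As the post says, this only works when the repeated sound has the same duration as the template.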

I've cross-posted to comp.speech.research.  I'd suggest googling for
what you don't know as regards cepstral/PLP/MDL/HMM/Viterbi and also
word spotting, looking at HTK from http://htk.eng.cam.ac.uk, and asking
more questions on comp.speech.research.

Hope that gives you some pointers to get started.

Tony

Reply by this-email-address-is-invalid ● November 2, 2006

Since the number of observations is so small, perhaps it would be
easier to use dynamic time warping instead of an HMM?

Some references on DTW that might be handy:

http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html
http://web1.mtnl.net.in/~nilami/dtw.html
http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
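
For readers who have not met DTW before, a minimal textbook version looks roughly like this. It is a sketch with the standard three transitions and no path constraints, not code taken from the references above, and the name `dtw_distance` is an assumption.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment of sequences a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]                       # treat scalars as 1-D features
    if b.ndim == 1:
        b = b[:, None]
    # Local cost: Euclidean distance between every pair of frames.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # a advances
                                                 acc[i, j - 1],      # b advances
                                                 acc[i - 1, j - 1])  # both advance
    return acc[n, m]

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]   # a spoken at half speed
c = [3, 2, 1, 0, 1, 2, 3]                        # a different shape
print(dtw_distance(a, b), dtw_distance(a, c))
```

The stretched copy `b` aligns with `a` at zero cost, which a fixed-length Euclidean comparison cannot do. In practice `a` and `b` would be sequences of feature vectors rather than scalars.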

Reply by ● November 3, 2006

"this-email-address-is-invalid" <dave_gelbart@yahoo.com> writes:

> Since the number of observations is so small, perhaps it would be
> easier to use dynamic time warping instead of an HMM?
> 
> Some references on DTW that might be handy:
> 
> http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html
> http://web1.mtnl.net.in/~nilami/dtw.html
> http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/

Yes, good point, DTW is a useful learning step to pass through between
template matching and HMMs.

In practice the HMMs will end up being really simple (single mean, fixed
variances), and the DTW will have to be modified from the standard
transitions to something HMM-like if you are going to do efficient
matching, so I think the two converge to almost the same thing.
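
One possible reading of "HMM-like" transitions, sketched under assumptions of my own: each template frame is a state, each recording frame either stays in the current state or advances to the next, and a match may start at any frame. That gives a single pass over the recording. The name `spot_template` and the toy data are hypothetical.

```python
import numpy as np

def spot_template(rec, tmpl):
    """For each recording frame j, the cost of the best match ending at j."""
    rec = np.asarray(rec, dtype=float)
    tmpl = np.asarray(tmpl, dtype=float)
    if rec.ndim == 1:
        rec = rec[:, None]
    if tmpl.ndim == 1:
        tmpl = tmpl[:, None]
    n = len(tmpl)
    prev = np.full(n, np.inf)      # best cost of sitting in each state so far
    end_cost = np.empty(len(rec))
    for j, frame in enumerate(rec):
        local = np.linalg.norm(tmpl - frame, axis=1)
        cur = np.empty(n)
        cur[0] = local[0]                                       # start here
        cur[1:] = local[1:] + np.minimum(prev[1:], prev[:-1])   # stay or advance
        end_cost[j] = cur[-1]      # cost if the match ends at this frame
        prev = cur
    return end_cost

tmpl = [1, 2, 3, 2, 1]
# Background of 5s with the template embedded at half speed (frames 3..12).
rec = [5, 5, 5, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 5, 5, 5]
end_cost = spot_template(rec, tmpl)
print(end_cost)
```

Low values in `end_cost` mark frames where a (possibly stretched) copy of the template ends. Without skip transitions the template can be stretched but not compressed, and raw accumulated cost favours short paths, so a real system would likely normalize by path length.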

Tony
