# A Sound Pattern Detection Within A Continuous Audio Stream

Started by January 22, 2006
```Hello everybody!

Here is a kind of silly question, you know.

I am writing a radio station program that reacts to sound events detected
within a continuous audio stream received through a sound card input
channel or similar. The program runs and, at a specified moment, starts to
listen to the input audio stream and tries to detect a sound pattern that
it has been trained on beforehand (a PCM file or the like). The sound
samples for training are recorded well in advance, and the program is told
to react to each of them at the proper time, one at a time. "No
overlapping in time" is guaranteed. Moreover, the audio stream contains
music and speech, just the kind of audio a radio station produces.

The problem seems to be a common one. I've found some explanations on this
forum, but they are not enough. The algorithm has to be fast and reliable,
and preferably precise as well.

Can anyone help? ;)

```
```..some kind of an audio/sound event detection...
```
```hello,

here's an overview of what i would do to solve
stream of samples, preferably something between
8kHz and 44.1kHz in 16 bit. Mono should be much
easier than stereo.

STREAM, FRAMES AND WINDOWING:
first divide your stream into 'frames', which overlap by
50% with the previous and next frame. have 50 up to
100 frames per second.
apply a window-function on the frame-data.
that's multiplying with a cosine-function, so the frame
starts and ends with silence, leaving the data in the
middle almost like it was.
(look up hanning-window-function. the window attenuates
the frame edges, which is why you need overlapping frames,
but it will give better results for the next step.)

FFT AND MAGNITUDE:
now you perform an FFT on the windowed frame, to get the
frequency-coefficients, otherwise called a spectrum.
as each coefficient has a real and imaginary part,
use sqrt(re*re + im*im) to calculate only magnitude.

MEL-SCALED-FILTERBANK:
usually this spectrum has 128, 256, 512 or 1024 coefficients,
so you would group some and continue with 24 or 48.
(look up mel-scale or mel-frequencies to find examples.)
and calculate the logarithm of each value.

CEPSTRUM:
for speech-recognition i apply another DCT or FFT on those
mel-coefficients. that result is called 'real cepstrum'. (no joke!)
cepstrum-coefficients describe the shape (or envelope) of the
spectrum, very useful for comparing speech-samples and music.

COMPARISON:
now you have several coefficients for each frame and need to
compare a sequence of frames with the previously transformed examples,
which you want to recognize.
i use spectrum and lower half of cepstrum for comparisons.

the same speed, you don't have to worry about time-warping,
which deals with comparing slow-spoken and fast-spoken words.

so much for now. i tried to keep it simple, but it may still appear
confusing or frightening. perhaps you can find (or somebody
can recommend) a library that does the processing for you.
in any case you will need to do some work and research of your own.

hope it helps,

carsten neubauer
http://www.c14sw.de/
speech recognition, compression,
custom development

```
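The steps in the post above (overlapping frames, window, FFT magnitude, mel-scaled grouping, logarithm, cepstrum) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Carsten's actual code: the function names, the 256-sample frame, the 24 bands, the 12 cepstral coefficients and the common `2595 * log10(1 + f / 700)` mel formula are all choices made here for the example. At 8 kHz, a 256-sample frame with 50% overlap gives 62.5 frames per second, which falls in the suggested range.

```python
import cmath
import math

def split_frames(samples, size, hop):
    """Cut the stream into frames; hop = size // 2 gives 50% overlap."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def hann(size):
    """Raised-cosine window that tapers both ends of a frame to zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def magnitude_spectrum(frame, window):
    """Window the frame, transform it, keep sqrt(re*re + im*im) per bin."""
    spectrum = fft([s * w for s, w in zip(frame, window)])
    return [abs(c) for c in spectrum[:len(frame) // 2 + 1]]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_bins, sample_rate):
    """Triangular weights whose centres are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # Band edges expressed as fractional FFT-bin positions.
    edges = [mel_to_hz(top * i / (n_bands + 1)) / nyquist * (n_bins - 1)
             for i in range(n_bands + 2)]
    bank = []
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        bank.append([(k - lo) / (mid - lo) if lo < k <= mid
                     else (hi - k) / (hi - mid) if mid < k < hi
                     else 0.0
                     for k in range(n_bins)])
    return bank

def dct(values, n_out):
    """First n_out DCT-II coefficients (unnormalised)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]

def cepstral_features(samples, sample_rate, frame_size=256, n_bands=24, n_ceps=12):
    """One list of n_ceps cepstral coefficients per 50%-overlapping frame."""
    window = hann(frame_size)
    bank = mel_filterbank(n_bands, frame_size // 2 + 1, sample_rate)
    features = []
    for frame in split_frames(samples, frame_size, frame_size // 2):
        mags = magnitude_spectrum(frame, window)
        # Small constant avoids log(0) on silent frames or empty bands.
        log_mel = [math.log(sum(w * m for w, m in zip(row, mags)) + 1e-10)
                   for row in bank]
        features.append(dct(log_mel, n_ceps))
    return features
```

Calling `cepstral_features(samples, 8000)` on a list of mono samples returns one short coefficient vector per frame. The pure-Python FFT is there only to keep the sketch self-contained; in a real program a numerical library would probably be the better choice for speed.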
```if the training audio is an original  source like a CD or a commercial
produced in a studio, and the radio audio is the "off air" audio from

be aware that a radio station does a lot of nasty things to the audio,
so it may "sound" the same but the dynamics will have been severely
modified by the AGC and limiters that almost every radio station uses to
make itself loud. (They wrongly think that makes people want to
listen.) Also, these days many radio stations probably store much of
their audio content on a server, maybe using compression like MP3. So
if you use a "dumb" comparison algorithm that just tries to match bits,
it will fail to match... even though to a human, the audio "sounds"
pretty much the same...

Mark

```
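One hedged way to make a comparison less "dumb" in the sense Mark describes is to compare feature frames with a normalised correlation rather than matching raw sample values. The sketch below is an assumption-laden illustration, not a method from this thread: `correlation`, `find_pattern` and the 0.9 threshold are hypothetical names and values. On log-spectral features a constant gain change shows up as a constant offset, which the mean removal cancels; time-varying AGC, limiting and lossy compression are only partly absorbed this way.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length sequences, in [-1, 1]."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom > 0 else 0.0

def find_pattern(stream_frames, template_frames, threshold=0.9):
    """Slide the template over the stream; return (offset, score) of the best
    match if it reaches the threshold, else None."""
    flat_template = [c for frame in template_frames for c in frame]
    width = len(template_frames)
    best = None
    for offset in range(len(stream_frames) - width + 1):
        window = [c for frame in stream_frames[offset:offset + width] for c in frame]
        score = correlation(window, flat_template)
        if best is None or score > best[1]:
            best = (offset, score)
    return best if best is not None and best[1] >= threshold else None
```

Both arguments are lists of per-frame feature vectors of equal width. A template that reappears in the stream scaled and shifted by constants still scores close to 1.0, whereas a sample-by-sample equality test would reject it.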
```Thank you, Carsten Neubauer, for your response. I think your thorough
explanations will be very useful for me. Thank you again.

I wonder if there is any library for this task? I'm new to audio
programming and DSP, and I think a library that shows the
right way would be very helpful for me.

```
```>
>if the training audio is an original  source like a CD or a commercial
>produced in a studio, and the radio audio is the "off air" audio from
>
>be aware that a radio station does a lot of nasty things to the audio,
>so it may "sound" the same but the dynamics will have been severely
>modified by the AGC and limiters that almost every radio station uses to
>make itself loud. (They wrongly think that makes people want to
>listen.) Also, these days many radio stations probably store much of
>their audio content on a server, maybe using compression like MP3. So
>if you use a "dumb" comparison algorithm that just tries to match bits,
>it will fail to match... even though to a human, the audio "sounds"
>pretty much the same...
>
>Mark
>
>

Yes, Mark, you are absolutely right! The problem is that the training
pattern and the one within the radio stream differ in dynamics and quality.
It may well happen that the radio stream is too noisy.

Anyway, I'm sure there must be a way to solve the problem, because I know
that commercial programs do this efficiently.

BTW, I've heard neural nets (e.g. SOM) can do that. But I don't know how to
integrate neural nets into my program, and what about their performance?

```
```Ptomaine wrote:
> I write a radio station program that reacts on some sound events
> detected within a continuous audio stream received through a sound
> card input channel or whatever.

Not silly at all. Can you modify the audio stream coming from the
radio station? Back when I was working in audio watermarking, we
did some experiments related to getting a toy to react to sounds
coming over a TV set's speaker. We embedded inaudible triggers into
the cartoon's audio track, and when the watermark decoder in the toy
recognized the trigger via its built-in microphone, it would toggle
an output to make the toy do something.

Is that the sort of thing you're looking for?

-- Dave Tweed
```
```David Tweed wrote:
> Ptomaine wrote:
> > I write a radio station program that reacts on some sound events
> > detected within a continuous audio stream received through a sound
> > card input channel or whatever.
>
> Not silly at all. Can you modify the audio stream coming from the
> radio station? Back when I was working in audio watermarking, we
> did some experiments related to getting a toy to react to sounds
> coming over a TV set's speaker. We embedded inaudible triggers into
> the cartoon's audio track, and when the watermark decoder in the toy
> recognized the trigger via its built-in microphone, it would toggle
> an output to make the toy do something.
>
> Is that the sort of thing you're looking for?
>
> -- Dave Tweed

that's interesting...

I am having a hard time thinking of a signal that can pass through a TV
audio channel (50 Hz to 15 kHz) and TV speaker (even less than that) that
could trigger a toy and still be "inaudible".

Can you be more specific?

thanks.

Mark

```
```"Mark" <makolber@yahoo.com> writes:

> I am having a hard time thinking of a signal that can pass through a TV
> audio channel (50 Hz to 15 kHz) and TV speaker (even less than that) that
> could trigger a toy and still be "inaudible".
>
> Can you be more specific?

My hearing cuts out at about 11kHz. :-)

Ciao,

Peter K.

--
"And he sees the vision splendid
of the sunlit plains extended
And at night the wondrous glory of the everlasting stars."

```
```Mark wrote:
> David Tweed wrote:
> > Not silly at all. Can you modify the audio stream coming from the
> > radio station? Back when I was working in audio watermarking, we
> > did some experiments related to getting a toy to react to sounds
> > coming over a TV set's speaker. We embedded inaudible triggers
> > into the cartoon's audio track, and when the watermark decoder in
> > the toy recognized the trigger via its built-in microphone, it
> > would toggle an output to make the toy do something.
>
> I am having a hard time thinking of a signal that can pass through
> a TV audio channel (50 Hz to 15 kHz) and TV speaker (even less than
> that) that could trigger a toy and still be "inaudible".
>
> Can you be more specific?

Well, all audio watermarking is based on steganography -- the hiding
of information in another signal. There are a number of approaches,
and they are usually parameterizable in terms of the amount of data
that needs to be carried, the robustness against signal processing,
and the audibility. The latter is strongly affected by the overall
quality and nature of the "cover" signal.

Our algorithm could be "dialed in" for a wide range of applications.
At one end of the spectrum, we could hide quite a bit of data in CD
quality music with no audibility issues except for some very highly
trained listeners who knew exactly what they were looking for. For
TV audio applications, the data requirements were much less, the
robustness needed to be higher, and audibility was less of a concern.
The low quality of audio in the soundtrack to begin with could hide
a lot. You couldn't hear our signal in a TV speaker, but you could
definitely hear that particular watermark in the CD quality test
environment with a better cover signal.

It was my job to build a very low-cost decoder that could be built
into a toy. I ended up with a cheap electret microphone, about four
stages of opamps for gain and filtering, and an 8-bit microcontroller
(6502 derivative) to run the algorithm. We could play the audio
through a cheap PC speaker at one end of a conference room table,
and our box at the other end would light up a sequence of LEDs to
show that it had seen the triggers, even while people in the room
were talking. It was a pretty impressive demo, but I don't know
whether it ever made its way into any real products. This was about
six years ago.

-- Dave Tweed
```