Hello everybody! Here is a kind of silly question, you know. I am writing a radio station program that reacts to sound events detected within a continuous audio stream received through a sound card input channel or similar. The program runs and, at a specified moment, starts listening to the input audio stream and tries to detect a sound pattern that it has been trained on beforehand (a PCM file or so). The training samples are recorded well in advance, and the program is told to react to each of them at the proper time, one at a time. "No overlapping in time" is guaranteed. Moreover, the audio stream contains music and speech, just the kind of audio a radio station produces. The problem seems to be a common one. I've found some explanations on this forum, but they are not enough. The algorithm has to be fast and reliable, and preferably precise as well. Can anyone help? ;)
A Sound Pattern Detection Within A Continuous Audio Stream
Started by ●January 22, 2006
Reply by ●January 23, 2006
hello,

your task sounds interesting, but it is not simple. here's an overview of what i would do to solve the problem, assuming you start with a constant stream of samples, preferably at a sample rate between 8 kHz and 44.1 kHz in 16 bit. mono should be much easier than stereo.

STREAM, FRAMES AND WINDOWING: first divide your stream into 'frames', each of which overlaps by 50% with the previous and the next frame. aim for 50 to 100 frames per second. apply a window function to the frame data. that means multiplying by a cosine-shaped function, so the frame starts and ends with silence, leaving the data in the middle almost as it was. (look up hanning-window-function. the windowing is why you need overlapping frames, but it will give better results for the next step.)

FFT AND MAGNITUDE: now perform an FFT on the windowed frame to get the frequency coefficients, otherwise called a spectrum. as each coefficient has a real and an imaginary part, use sqrt(re*re + im*im) to calculate the magnitude only.

MEL-SCALED-FILTERBANK: usually this spectrum has 128, 256, 512 or 1024 coefficients, so you would group some of them and continue with 24 or 48 bands. (look up mel-scale or mel-frequencies to find examples.) then calculate the logarithm of each value.

CEPSTRUM: for speech recognition i apply another DCT or FFT to those mel coefficients. the result is called the 'real cepstrum'. (no joke!) cepstrum coefficients describe the shape (or envelope) of the spectrum, which is very useful for comparing speech samples and music.

COMPARISON: now you have several coefficients for each frame and need to compare a sequence of frames with the previously transformed examples that you want to recognize. i use the spectrum and the lower half of the cepstrum for comparisons. as your examples are prerecorded and always played at the same speed, you don't have to worry about time-warping, which deals with comparing slowly spoken and quickly spoken words.

so much for now. i tried to keep it simple, but it probably still appears confusing or frightening.
perhaps you can find (or somebody can recommend) a library that does the processing for you. in any case you will need to do some work and research of your own.

hope it helps,
carsten neubauer
http://www.c14sw.de/
speech recognition, compression, custom development
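
The per-frame steps Carsten outlines (50% overlapping frames, Hann window, FFT magnitude, mel-scaled grouping, logarithm, then a DCT) can be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: the function names are made up, the mel mapping uses the common 2595 * log10(1 + f/700) formula, the filters are plain triangles, and a DCT-II stands in for "another DCT or FFT". Only NumPy is used.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def hz_to_mel(f):
    # Assumed mel formula (a common convention, not taken from the post).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = mel_to_hz(np.linspace(0.0, top_mel, n_bands + 2))
    bin_hz = np.arange(n_bins) * sample_rate / n_fft
    bank = np.zeros((n_bands, n_bins))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x):
    """Unnormalised DCT-II along the last axis."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def extract_features(x, sample_rate, frame_len=512, n_bands=24):
    """Return (log mel spectrum, cepstrum), each with one row per frame."""
    hop = frame_len // 2                                # 50% overlap
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    frames = frames * np.hanning(frame_len)             # Hann window
    magnitude = np.abs(np.fft.rfft(frames, axis=1))     # sqrt(re*re + im*im)
    bank = mel_filterbank(n_bands, frame_len, sample_rate)
    log_mel = np.log(magnitude @ bank.T + 1e-10)        # small offset avoids log(0)
    cepstrum = dct_ii(log_mel)
    return log_mel, cepstrum
```

At a 16 kHz sample rate, frame_len=512 with 50% overlap gives 62.5 frames per second, which falls inside the 50 to 100 range suggested above. The training samples and the live stream would both go through the same function before any comparison.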
Reply by ●January 23, 2006
if the training audio is an original source like a CD or a commercial produced in a studio, and the radio audio is the "off air" audio from the radio station...

be aware that a radio station does a lot of nasty things to the audio, so it may "sound" the same, but the dynamics will have been severely modified by the AGC and limiters that almost every radio station uses to make itself loud. (They wrongly think that makes people want to listen.) Also, these days many radio stations probably store much of their audio content on a server, maybe using compression like MP3. So if you use a "dumb" comparison algorithm that just tries to match bits, it will fail to match, even though to a human the audio "sounds" pretty much the same.

Mark
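
One way to act on Mark's warning is to compare per-frame spectral features instead of raw samples, and to normalise each frame so that overall level drops out. The sketch below is an assumption-laden illustration, not a method from this thread: it expects log-spectral feature frames (one row per frame, for example log band energies), and the function names are hypothetical. In the log domain a gain change adds the same constant to every band, so subtracting each frame's mean removes it. It does nothing about the spectral changes introduced by limiters or MP3 coding.

```python
import numpy as np

def normalise_rows(features):
    """Remove each frame's mean and scale the frame to unit length."""
    centred = features - features.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    return centred / np.maximum(norms, 1e-12)   # guard against all-equal frames

def sliding_match(stream_feats, pattern_feats):
    """Score every alignment of the pattern against the stream.

    Each score is the mean cosine similarity between corresponding
    frames, so it lies between -1 and 1.
    """
    s = normalise_rows(np.asarray(stream_feats, dtype=float))
    p = normalise_rows(np.asarray(pattern_feats, dtype=float))
    n_positions = len(s) - len(p) + 1
    scores = np.empty(n_positions)
    for start in range(n_positions):
        window = s[start:start + len(p)]
        scores[start] = np.mean(np.sum(window * p, axis=1))
    return scores
```

A detector would then report a hit wherever the score exceeds a threshold tuned on real off-air recordings; the right threshold depends on the station's processing chain and has to be found by experiment.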
Reply by ●January 23, 2006
Thank you, Carsten Neubauer, for your response. I think your thorough explanations will be very useful to me. Thank you again. I wonder whether there is any library for this task? I'm new to audio programming and DSP, and I think a library that could show the right way would be very helpful for me.
Reply by ●January 23, 2006
>if the training audio is an original source like a CD or a commercial
>produced in a studio, and the radio audio is the "off air" audio from
>the radio station.......
>
>be aware that a radio station does a lot of nasty things to the audio,
>so it may "sound" the same but the dynamics will have been severely
>modified by the AGC and limiters that most every radio station uses to
>make themselves loud. (They wrongly think that makes people want to
>listen) Also these days many radio stations probably store much of
>their audio content on a server maybe using compression like MP3. So
>if you use a "dumb" comparison algorithm that just tries to match bits.
> it will fail to match.... even though to a human, the audio "sounds"
>pretty much the same...
>
>Mark

Yes, Mark, you are absolutely right! The problem is that the training pattern and the one within the radio stream differ in dynamics and quality. It may well happen that the radio stream is too noisy. Anyway, I'm sure there must be a way to solve the problem, because I know commercial programs do it efficiently. BTW, I've heard that neural nets (e.g. SOM) can do that. But I don't know how to bring neural nets into my program, and what about their performance?
Reply by ●January 24, 2006
Ptomaine wrote:
> I write a radio station program that reacts on some sound events
> detected within a continuous audio stream received through a sound
> card input channel or whatever.

Not silly at all. Can you modify the audio stream coming from the radio station? Back when I was working in audio watermarking, we did some experiments related to getting a toy to react to sounds coming over a TV set's speaker. We embedded inaudible triggers into the cartoon's audio track, and when the watermark decoder in the toy recognized the trigger via its built-in microphone, it would toggle an output to make the toy do something.

Is that the sort of thing you're looking for?

-- Dave Tweed
Reply by ●January 24, 2006
David Tweed wrote:
> Ptomaine wrote:
> > I write a radio station program that reacts on some sound events
> > detected within a continuous audio stream received through a sound
> > card input channel or whatever.
>
> Not silly at all. Can you modify the audio stream coming from the
> radio station? Back when I was working in audio watermarking, we
> did some experiments related to getting a toy to react to sounds
> coming over a TV set's speaker. We embedded inaudible triggers into
> the cartoon's audio track, and when the watermark decoder in the toy
> recognized the trigger via its built-in microphone, it would toggle
> an output to make the toy do something.
>
> Is that the sort of thing you're looking for?
>
> -- Dave Tweed

that's interesting...

I am having a hard time thinking of a signal that can pass through a TV audio channel (50 Hz to 15 kHz) and a TV speaker (even less than that) that could trigger a toy and still be "inaudible".

Can you be more specific?

thanks.

Mark
Reply by ●January 24, 2006
"Mark" <makolber@yahoo.com> writes:
> I am having a hard time thinking of a signal that can pass through a TV
> audio channel (50 Hz to 15 kHz) and TV speaker (even less than that) that
> could trigger a toy and still be "inaudible".
>
> Can you be more specific?

My hearing cuts out at about 11kHz. :-)

Ciao,

Peter K.

--
"And he sees the vision splendid of the sunlit plains extended
And at night the wondrous glory of the everlasting stars."
Reply by ●January 24, 2006
Mark wrote:
> David Tweed wrote:
> > Not silly at all. Can you modify the audio stream coming from the
> > radio station? Back when I was working in audio watermarking, we
> > did some experiments related to getting a toy to react to sounds
> > coming over a TV set's speaker. We embedded inaudible triggers
> > into the cartoon's audio track, and when the watermark decoder in
> > the toy recognized the trigger via its built-in microphone, it
> > would toggle an output to make the toy do something.
>
> I am having a hard time thinking of a signal that can pass through
> a TV audio channel (50 Hz to 15 kHz) and TV speaker (even less than
> that) that could trigger a toy and still be "inaudible".
>
> Can you be more specific?

Well, all audio watermarking is based on steganography -- the hiding of information in another signal. There are a number of approaches, and they are usually parameterizable in terms of the amount of data that needs to be carried, the robustness against signal processing, and the audibility. The latter is strongly affected by the overall quality and nature of the "cover" signal.

Our algorithm could be "dialed in" for a wide range of applications. At one end of the spectrum, we could hide quite a bit of data in CD quality music with no audibility issues except for some very highly trained listeners who knew exactly what they were looking for.

For TV audio applications, the data requirements were much less, the robustness needed to be higher, and audibility was less of a concern. The low quality of audio in the soundtrack to begin with could hide a lot. You couldn't hear our signal in a TV speaker, but you could definitely hear that particular watermark in the CD quality test environment with a better cover signal.

It was my job to build a very low-cost decoder that could be built into a toy. I ended up with a cheap electret microphone, about four stages of opamps for gain and filtering, and an 8-bit microcontroller (6502 derivative) to run the algorithm.

We could play the audio through a cheap PC speaker at one end of a conference room table, and our box at the other end would light up a sequence of LEDs to show that it had seen the triggers, even while people in the room were talking. It was a pretty impressive demo, but I don't know whether it ever made its way into any real products. This was about six years ago.

-- Dave Tweed