
Matching a reference audio signal in a VoIP environment

Started by tj December 22, 2009
hi,

i'm trying to match reference audio signals (ranging between 1 sec and
10 seconds) in a source audio signal from an RTP stream.

i've used cross-correlation, which works extremely well (spikes
nicely) when there is not much time-altering of the reference audio as
it is passed over the RTP stream (i.e., not much packet loss or packet
re-ordering, and the local packet-loss-concealment / jitter buffers
are working their magic well).

however, when there is some modest time-altering of the reference
audio when it's passed over the RTP connection (as a result of an
overloaded network, etc.), the cross correlation doesn't spike (or not
very much). ...but, of course, listening to the recorded stream, you
can definitely hear the reference audio.

this is, i'm sure, obvious - but it leads me to my question: is there
a method to do this type of *simple* audio signal matching (i.e., not
speech recognition) that is resilient against additive noise
(like cross-correlation) but also works on slightly 'lossy' or
'time-stretched' signals?

i'm about to evaluate speech-recognition style approaches (dynamic
time warping (DTW), HMMs, etc.), but i love cross-correlation's
simplicity...and i don't need all the extra features that go with
typical speech-rec approaches - i'm really looking to match a
*specific* segment of reference audio, with modest time-stretching or
lost audio...with reasonable precision.

i wonder if there's a DTW-style approach that i could patch onto an
FFT-based, uniformly partitioned, overlap-and-save cross-correlation
(for the long reference signal length)?
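For reference, the plain (non-partitioned) FFT cross-correlation is only a few lines in Python. A minimal sketch, assuming numpy/scipy and that both signals share one sample rate; scipy's `fftconvolve` stands in here for a hand-rolled uniformly partitioned overlap-save, and the function name and normalization are illustrative choices, not anything from the post:

```python
import numpy as np
from scipy.signal import fftconvolve

def xcorr_peak(stream, ref):
    """FFT-based cross-correlation of `ref` against `stream`.

    Returns (peak correlation value, sample lag of the peak).
    `stream` and `ref` are 1-D float arrays at the same sample rate.
    """
    # remove DC and scale the template so peak heights are comparable
    ref = (ref - ref.mean()) / (np.linalg.norm(ref) + 1e-12)
    # correlation = convolution with the time-reversed template
    corr = fftconvolve(stream, ref[::-1], mode="valid")
    lag = int(np.argmax(np.abs(corr)))
    return corr[lag], lag
```

With a clean embedded copy of the reference, the lag of the spike gives the template's position in the stream directly.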

any ideas would be greatly appreciated. i'm sure this sort of problem
has been solved before...hasn't it?

thanks, tom

tj wrote:

> [original post snipped]
Are you trying to recognize a signal, or to detect the exact time of arrival?

For TOA, break your signal into chunks of ~10 ms and compute the correlation for every chunk, then add.

For recognition, correlate the spectrograms.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
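A rough sketch of the chunked-correlation idea, assuming numpy/scipy: the ~10 ms chunk size and the sum of per-chunk correlation magnitudes come from the suggestion above; the function name and everything else are made up for illustration:

```python
import numpy as np
from scipy.signal import fftconvolve

def chunked_xcorr(stream, ref, fs, chunk_ms=10):
    """Split `ref` into ~chunk_ms chunks, correlate each chunk against
    `stream` separately, then add the per-chunk correlation magnitudes
    back at each chunk's offset in the template.  A chunk that drifted
    or got lost only weakens the summed peak instead of destroying it."""
    n = max(1, int(fs * chunk_ms / 1000))       # samples per chunk
    acc = np.zeros(len(stream) - len(ref) + 1)  # one score per template lag
    for start in range(0, len(ref) - n + 1, n):
        chunk = ref[start:start + n]
        corr = fftconvolve(stream, chunk[::-1], mode="valid")
        # a template at lag t puts this chunk at stream position t + start
        acc += np.abs(corr[start:start + len(acc)])
    return acc
```

The index of the maximum of the returned array is the candidate template position; a threshold on that maximum decides whether the template is present at all.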
On Dec 22, 12:31 pm, tj <tomjohns...@gmail.com> wrote:
> [original post snipped]
How much money you got? You get what you pay for...
On Dec 22, 12:31 pm, tj <tomjohns...@gmail.com> wrote:
> [original post snipped]
How about passing both the reference and the recovered audio through identical low-pass filters, say around 3 kHz, and then performing the cross-correlation?

Mark
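A quick sketch of that, assuming numpy/scipy: the 3 kHz cutoff comes from the post, while the 4th-order Butterworth and the zero-phase `filtfilt` are arbitrary illustrative choices (zero-phase filtering keeps the correlation peak's timing intact):

```python
import numpy as np
from scipy.signal import butter, filtfilt, fftconvolve

def lp_then_xcorr(stream, ref, fs, cutoff_hz=3000.0):
    """Low-pass both signals with the same zero-phase filter, then
    cross-correlate.  Identical filtering preserves relative timing
    while discarding high-frequency content that the VoIP path may
    have mangled differently in each copy."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    s = filtfilt(b, a, stream)   # filtfilt -> zero phase shift
    r = filtfilt(b, a, ref)
    corr = fftconvolve(s, r[::-1], mode="valid")
    return int(np.argmax(np.abs(corr)))  # lag of the correlation peak
```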
> Are you trying to recognize a signal or to detect exact time of arrival?
it's more like recognition - i'm trying to match the template against incoming audio and trigger an event if it is matched. what makes this different from a general speech-rec problem is that, at any point, i have a single/specific audio reference signal (which may or may not be speech) - and i'm looking to match just slight variations of that template.

...however, i guess it is still a recognition problem, just not a generalized one (i.e., not multi-speaker, multi-word, etc.).
> For TOA, break your signal into chunks of ~10ms and compute correlation
> for every chunk, then add.
>
> For recognition, correlate the spectrograms.
so, just so i understand: for recognition, you're suggesting to split the audio into chunks, then correlate the spectrograms? should i do some windowing, etc.? or will that not matter much?

also, if the audio is stretched or shrunk a bit, i'd imagine i have to do a bit more "searching" using cross-correlation? e.g., match a chunk's spectrogram forward/back one, two or a few chunks (saving the max value) in order to capture a little shrink/stretch of the signal?
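The forward/back chunk search described above might look something like this sketch (numpy only; the frame sizes, the ±2-frame search window, the Hann window, and all names are illustrative assumptions, not a tested recipe):

```python
import numpy as np

def spec_frames(x, n_fft=256, hop=128):
    """Magnitude spectrogram: one windowed FFT magnitude row per hop."""
    w = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(w * x[i:i + n_fft]))
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def spec_match_score(stream, ref, search=2, **kw):
    """Slide the reference spectrogram along the stream spectrogram;
    each reference frame may match its nominal counterpart or a frame
    up to `search` hops away, keeping the best match.  This tolerates
    a little local shrink/stretch without full DTW."""
    S, R = spec_frames(stream, **kw), spec_frames(ref, **kw)
    # unit-normalize each frame so the dot product is a cosine similarity
    S = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-12)
    R = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
    scores = []
    for t in range(len(S) - len(R) + 1):
        score = 0.0
        for j in range(len(R)):
            lo = max(0, t + j - search)
            hi = min(len(S), t + j + search + 1)
            score += max(S[k] @ R[j] for k in range(lo, hi))
        scores.append(score / len(R))
    return np.array(scores)  # one similarity per stream frame offset
```

Windowing does matter here: without a taper, frame-boundary discontinuities smear energy across the spectrum and blur the per-frame comparison.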
> any ideas would be greatly appreciated. i'm sure this sort of problem
> has been solved before...hasn't it?
>
> thanks, tom
You are not going to find a canned routine for doing this type of template matching. And no, you do not want to mess with speech-reco stuff like DTW or HMMs if simple cross-correlation works.

It is my understanding that the only changes introduced into your audio waveforms (other than additive noise) are some additions and deletions of waveform chunks - and not something parametric, e.g. speech compression, which can drastically change the waveshape while leaving the perceptual speech qualities intact (in that case cross-correlation would not work).

Just give me one reason to spend time on this seemingly interesting problem and put together a canned routine in e.g. Matlab (other than the "love of dsp", because I'm not a "dsp guy"). Is there a mass-use application of this? Give me some examples.