
Matching a reference audio signal in a VoIP environment

Started by tj December 22, 2009
hi,

i'm trying to match reference audio signals (ranging between 1 sec and
10 seconds) in a source audio signal from an RTP stream.

i've used cross-correlation, which works extremely well (spikes
nicely) when there is not much time-altering of the reference audio as
it is passed over the RTP stream (i.e., not much packet loss or packet
re-ordering, and the local packet-loss-concealment / jitter buffers
are working their magic well).

however, when there is some modest time-altering of the reference
audio when it's passed over the RTP connection (as a result of an
overloaded network, etc.), the cross correlation doesn't spike (or not
very much). ...but, of course, listening to the recorded stream, you
can definitely hear the reference audio.

this is, i'm sure, obvious - but it leads me to my question: is there
a method to do this type of *simple* audio signal matching (i.e., not
speech recognition) that is resilient against additive noise
(like cross-correlation) but also works on slightly 'lossy' or
'time-stretched' signals?

i'm about to evaluate speech-recognition style approaches (dynamic
time warping (DTW), HMMs, etc.), but i love cross-correlation's
simplicity...and i don't need all the extra features that go with
typical speech-rec approaches - i'm really looking to match a
*specific* segment of reference audio, with modest time-stretching or
lost audio...with reasonable precision.

i wonder if there's a DTW-style approach that i could patch onto an
FFT-based, uniformly partitioned, overlap-and-save cross-correlation
(for the long reference signal length)?
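For reference, the plain (non-partitioned) FFT cross-correlation is only a few lines in Python. A minimal sketch, assuming numpy/scipy and that both signals share one sample rate; scipy's `fftconvolve` stands in here for a hand-rolled uniformly partitioned overlap-save, and the function name and normalization are illustrative choices, not anything from the post:

```python
import numpy as np
from scipy.signal import fftconvolve

def xcorr_peak(stream, ref):
    """FFT-based cross-correlation of `ref` against `stream`.

    Returns (peak correlation value, sample lag of the peak).
    `stream` and `ref` are 1-D float arrays at the same sample rate.
    """
    # remove DC and scale the template so peak heights are comparable
    ref = (ref - ref.mean()) / (np.linalg.norm(ref) + 1e-12)
    # correlation = convolution with the time-reversed template
    corr = fftconvolve(stream, ref[::-1], mode="valid")
    lag = int(np.argmax(np.abs(corr)))
    return corr[lag], lag
```

With a clean embedded copy of the reference, the lag of the spike gives the template's position in the stream directly.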

any ideas would be greatly appreciated. i'm sure this sort of problem
has been solved before...hasn't it?

thanks, tom

tj wrote:

> [original post snipped]
Are you trying to recognize a signal, or to detect the exact time of arrival?

For TOA, break your signal into chunks of ~10 ms and compute the correlation for every chunk, then add.

For recognition, correlate the spectrograms.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
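A rough sketch of the chunked-correlation idea, assuming numpy/scipy: the ~10 ms chunk size and the sum of per-chunk correlation magnitudes come from the suggestion above; the function name and everything else are made up for illustration:

```python
import numpy as np
from scipy.signal import fftconvolve

def chunked_xcorr(stream, ref, fs, chunk_ms=10):
    """Split `ref` into ~chunk_ms chunks, correlate each chunk against
    `stream` separately, then add the per-chunk correlation magnitudes
    back at each chunk's offset in the template.  A chunk that drifted
    or got lost only weakens the summed peak instead of destroying it."""
    n = max(1, int(fs * chunk_ms / 1000))       # samples per chunk
    acc = np.zeros(len(stream) - len(ref) + 1)  # one score per template lag
    for start in range(0, len(ref) - n + 1, n):
        chunk = ref[start:start + n]
        corr = fftconvolve(stream, chunk[::-1], mode="valid")
        # a template at lag t puts this chunk at stream position t + start
        acc += np.abs(corr[start:start + len(acc)])
    return acc
```

The index of the maximum of the returned array is the candidate template position; a threshold on that maximum decides whether the template is present at all.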
On Dec 22, 12:31 pm, tj <tomjohns...@gmail.com> wrote:
> [original post snipped]
How much money you got? You get what you pay for...
On Dec 22, 12:31 pm, tj <tomjohns...@gmail.com> wrote:
> [original post snipped]
How about passing both the reference and the recovered audio through identical low-pass filters, say around 3 kHz, and then performing the cross-correlation?

Mark
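A quick sketch of that, assuming numpy/scipy: the 3 kHz cutoff comes from the post, while the 4th-order Butterworth and the zero-phase `filtfilt` are arbitrary illustrative choices (zero-phase filtering keeps the correlation peak's timing intact):

```python
import numpy as np
from scipy.signal import butter, filtfilt, fftconvolve

def lp_then_xcorr(stream, ref, fs, cutoff_hz=3000.0):
    """Low-pass both signals with the same zero-phase filter, then
    cross-correlate.  Identical filtering preserves relative timing
    while discarding high-frequency content that the VoIP path may
    have mangled differently in each copy."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    s = filtfilt(b, a, stream)   # filtfilt -> zero phase shift
    r = filtfilt(b, a, ref)
    corr = fftconvolve(s, r[::-1], mode="valid")
    return int(np.argmax(np.abs(corr)))  # lag of the correlation peak
```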
> Are you trying to recognize a signal or to detect exact time of arrival?
it's more like recognition - i'm trying to match the template against incoming audio and trigger an event if it is matched. what makes this different from a general speech-rec problem is that, at any point, i have a single/specific audio reference signal (which may or may not be speech) - and i'm looking to match just slight variations of that template.

...however, i guess it is still a recognition problem, just not a generalized one (i.e., not multi-speaker, multi-word, etc.).
> For TOA, break your signal into chunks of ~10ms and compute correlation
> for every chunk, then add.
>
> For recognition, correlate the spectrograms.
so, just so i understand: for recognition, you're suggesting to split the audio into chunks, then correlate the spectrograms? should i do some windowing, etc.? or will that not matter much?

also, if the audio is stretched or shrunk a bit, i'd imagine i have to do a bit more "searching" using cross-correlation? e.g., match a chunk's spectrogram forward/back one, two or a few chunks (saving the max value) in order to capture a little shrink/stretch of the signal?
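The forward/back chunk search described above might look something like this sketch (numpy only; the frame sizes, the ±2-frame search window, the Hann window, and all names are illustrative assumptions, not a tested recipe):

```python
import numpy as np

def spec_frames(x, n_fft=256, hop=128):
    """Magnitude spectrogram: one windowed FFT magnitude row per hop."""
    w = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(w * x[i:i + n_fft]))
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def spec_match_score(stream, ref, search=2, **kw):
    """Slide the reference spectrogram along the stream spectrogram;
    each reference frame may match its nominal counterpart or a frame
    up to `search` hops away, keeping the best match.  This tolerates
    a little local shrink/stretch without full DTW."""
    S, R = spec_frames(stream, **kw), spec_frames(ref, **kw)
    # unit-normalize each frame so the dot product is a cosine similarity
    S = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-12)
    R = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
    scores = []
    for t in range(len(S) - len(R) + 1):
        score = 0.0
        for j in range(len(R)):
            lo = max(0, t + j - search)
            hi = min(len(S), t + j + search + 1)
            score += max(S[k] @ R[j] for k in range(lo, hi))
        scores.append(score / len(R))
    return np.array(scores)  # one similarity per stream frame offset
```

Windowing does matter here: without a taper, frame-boundary discontinuities smear energy across the spectrum and blur the per-frame comparison.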
> any ideas would be greatly appreciated. i'm sure this sort of problem
> has been solved before...hasn't it?
>
> thanks, tom
You are not going to find a canned routine for doing this type of template matching. And no, you do not want to mess with speech-reco stuff like DTW or HMMs if simple cross-correlation works.

It is my understanding that the only changes introduced into your audio waveforms (other than additive noise) are some additions and deletions of waveform chunks - and not something parametric, e.g. speech compression, which can drastically change the waveshape while leaving the perceptual speech qualities intact (in that case cross-correlation would not work).

Just give me one reason to spend time on this seemingly interesting problem and put together a canned routine in e.g. Matlab (other than the "love of dsp", because I'm not a "dsp guy"). Is there a mass-use application of this? Give me some examples.