I'm trying to track the delay between two speech signals over time by cross-correlating small chunks. I'm doing this in Python with the approach described below:
chunk_size = 2 seconds' worth of samples  # no preference yet
num_frames = len(signal) / chunk_size
for each of the num_frames frames:
    c = xcorr(x1, x2)
    record the index of max(abs(c))  # shifted by -lags+1 to center at 0
    move on to the next non-overlapping chunk/frame
Both inputs are the same length (L) and the xcorr is a full linear convolution (between x1 and flipped x2).
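The loop described above could be sketched as follows. This is only a minimal illustration, not the poster's actual code: the function name `chunk_delays` and its arguments are made up for the example, and the sign convention (a positive lag means `x2` lags `x1`) is an assumption chosen here by reversing `x1` rather than `x2`.

```python
import numpy as np
from scipy import signal

def chunk_delays(x1, x2, chunk_len):
    """Return one lag estimate (in samples) per non-overlapping chunk.

    Hypothetical helper: a positive lag means x2 is delayed relative to x1.
    """
    n_chunks = min(len(x1), len(x2)) // chunk_len
    # Lag axis of a 'full' correlation of two length-chunk_len inputs
    lags = np.arange(-(chunk_len - 1), chunk_len)
    delays = []
    for i in range(n_chunks):
        a = x1[i * chunk_len:(i + 1) * chunk_len]
        b = x2[i * chunk_len:(i + 1) * chunk_len]
        # Full linear cross-correlation: convolve b with time-reversed a
        c = signal.fftconvolve(b, a[::-1], mode='full')
        delays.append(int(lags[np.argmax(np.abs(c))]))
    return np.array(delays)
```

With a clean, broadband test signal every chunk should report the same lag; the frame-to-frame jumps described in the question only appear when the correlation peak is weak or broad.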
Doing this, I see the delay vary by about 50 to 100 samples. However, the delay between the two signals is actually constant. The speech signals are an acoustic echo captured by a microphone and the corresponding reference for that echo. When I run these signals through an offline echo canceller, the impulse response it models does not shift and shows a consistent peak throughout the period when echo is present, so the cross-correlation result does not match reality. I trust the echo canceller's observation because if the delay really shifted by 50 samples, the adaptive filter would be completely off and would have to re-converge, introducing residual echo while it recovers from its diverged state.
I suspect this variation in delay arises because the echo is very small and the cross-correlation on some frames is "smudged" around its maximum, i.e. there is no single clear maximum unless I apply some smoothing to the cross-correlation itself. Smoothing with a moving-average filter reduces the jumps in the delay from 50 to 100 samples down to around 5 to 10 samples, depending on the smoothing block size. However, I still cannot get a constant delay.
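For reference, the moving-average smoothing of the correlation could look roughly like this. It is a sketch under assumptions: `smoothed_peak_lag` is a hypothetical name, the input is assumed to be a 'full' correlation of two equal-length chunks, and an odd window is used so the averaged curve stays centered on the original samples.

```python
import numpy as np

def smoothed_peak_lag(c, win=21):
    """Lag of the maximum of a moving-averaged |c|.

    c is assumed to be a 'full' cross-correlation of two equal-length
    chunks, so its length is 2*L - 1 and lag 0 sits at index L - 1.
    An odd win keeps the averaging window symmetric around each sample.
    """
    kernel = np.ones(win) / win
    smooth = np.convolve(np.abs(c), kernel, mode='same')
    return int(np.argmax(smooth)) - (len(c) // 2)
```

One caveat: a moving average turns a sharp peak into a flat plateau about `win` samples wide, so the location of the maximum becomes ambiguous within that window. That would be consistent with the remaining jitter of a few samples mentioned above.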
What is the best approach to pre-process the input signals and/or post-process the cross-correlation output to get a better estimate of the actual delay between the two signals?
I ended up using a normalized cross-correlation and the variation in delays from frame to frame (in valid frames where there should not be any delay variation) is down to 1 sample.
If I find some other improvements, I will post my final code/pseudocode here.
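The post does not say which normalization was used, so the following is only one common variant, shown as a sketch: the correlation is divided by the square root of the product of the two frame energies, which bounds it to [-1, 1]. The names `normalized_xcorr` and `peak_lag_and_score` are hypothetical, and a positive lag is taken to mean that `b` lags `a`.

```python
import numpy as np
from scipy import signal

def normalized_xcorr(a, b, eps=1e-12):
    """Full cross-correlation of a and b, scaled to lie in [-1, 1]."""
    a = a - np.mean(a)
    b = b - np.mean(b)
    # Convolving b with time-reversed a gives the cross-correlation
    c = signal.fftconvolve(b, a[::-1], mode='full')
    norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + eps
    return c / norm

def peak_lag_and_score(a, b):
    """Return (lag of the strongest peak, normalized peak height)."""
    c = normalized_xcorr(a, b)
    k = int(np.argmax(np.abs(c)))
    return k - (len(a) - 1), float(np.abs(c[k]))
```

Note that dividing a frame by a single constant does not move its peak. What this form of normalization provides is a peak height that is comparable across frames, so frames with a low score can be discarded as unreliable instead of contributing a spurious delay.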
The echo canceller adapts slowly over time. When fed a broadband signal, it will converge to the room impulse response with the correct delays. If you feed it narrow-band signals or make the learning coefficients large, you might see similar effects (or, more likely, instability).
When using this cross-correlation method of yours, you will get peaks at the delay position, but also peaks at other positions due to correlation within the signal itself.
Does your xcorr function zero-pad the signals when correlating or does it repeat the signal periodically? The first method (zero-padding) should be better for your application, I guess, but the problem remains...
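To make the distinction concrete, here is a small sketch of both variants computed with NumPy's FFT. The function names are made up for the illustration, and a positive lag again means that `b` lags `a`.

```python
import numpy as np

def linear_xcorr(a, b):
    """Zero-padded (linear) cross-correlation, lags -(N-1)..(N-1).

    Assumes len(a) == len(b) == N with N >= 2.
    """
    n = len(a)
    nfft = 2 * n  # at least 2N - 1, so nothing wraps around
    C = np.fft.rfft(b, nfft) * np.conj(np.fft.rfft(a, nfft))
    c = np.fft.irfft(C, nfft)
    # Negative lags sit at the end of c; reorder so lag 0 is at index N - 1
    return np.concatenate([c[-(n - 1):], c[:n]])

def circular_xcorr(a, b):
    """Periodic (circular) cross-correlation, lags 0..N-1 modulo N."""
    C = np.fft.fft(b) * np.conj(np.fft.fft(a))
    return np.real(np.fft.ifft(C))
```

In the circular version a lag of N - 3 cannot be told apart from a lag of -3, and samples from the end of one chunk are correlated against the start of the other, which is why the zero-padded form is usually the safer choice for delay estimation on chunks.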
Yes, the EC does adapt slowly, and I'm looking at a converged section without interfering double talk to ensure that the filter is stable. I should have added that the recordings are from an anechoic chamber, so there are little to no reflections, and the impulse response has a strong peak indicating the delay between the reference signal and the actual echo on the microphone.
Yes, I'm using full linear convolution, so zero-padding is taken care of.
I am not sure about your setup details but if you compare it to radar pulses then you must make sure you are correlating the relevant chunks of your signals, otherwise there is ambiguity.
I ignore noise frames using a simple mean-squared-energy computation. Otherwise I compute the cross-correlation over the entire signal, but when looking at the plot of all the "max_lags" (the index of max(xCorr) per frame), I only consider the portion where the xCorr is reliable, i.e. active echo frames.
import os, sys, fnmatch, math
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile as wf
from scipy import signal
(fs,x) = wf.read(file1)
(fs,z) = wf.read(file2)
x = x/32768
z = z/32768
blk_size = int(5*20e-3*fs) # 5 20ms frames
delay = []
Exs = []
Ezs = []
xCorr = np.array([])
N = 20 # Moving average window
# ind = np.array([])
eps = 1e-6 # noise floor to avoid divide-by-zero error
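for i in range(min(len(x), len(z)) // blk_size):
    # Slice out the i-th non-overlapping block from each signal
    x1 = x[i*blk_size:(i+1)*blk_size]
    z1 = z[i*blk_size:(i+1)*blk_size]
    # Block levels in dB. 10*log10 is the usual factor for a mean-square
    # value; 20* is kept so the thresholds keep their original meaning.
    Ex = 20*np.log10(np.mean(x1**2))
    Ez = 20*np.log10(np.mean(z1**2)+eps)
    if Ex < -160 or Ez < -96:
        continue  # assumption: low-energy (noise) blocks are skipped; the body of this branch is missing from the post
    # Full linear cross-correlation: convolve x1 with time-reversed z1
    xCorr = signal.fftconvolve(x1, z1[::-1],mode='full')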