Nangergong,
1. There are two signals: the original one, and the output of a VOIP channel.
Due to the effects of the coding algorithm and of the channel, the second signal will not be identical to the first one, but it will be similar.
I will explain with an example.
Consider G.729, where the average value of the pitch period is transmitted to the decoder for each block (5 ms subframe) of the input signal.
This average pitch period is used to construct the excitation at the other end.
In voiced speech, there is a major excitation in every pitch period.
In fact, the time interval between two successive major excitations is the pitch period.
At the decoder, the excitation is constructed using the average pitch period instead of the exact pitch period of the original signal.
Hence, in the excitation constructed at the decoder end, the location of each major excitation shifts by a few samples (say 2-4 samples at 8 or 16 kHz).
As a result, the output signal block of the decoder will not be exactly aligned with the input signal block.
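This small shift of a few samples is enough to reduce the cross-correlation values significantly, and because the shift varies from one pitch period to the next, no single delay can line up all the periods at once.
Thus the input and output are not similar sample-wise.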
Note that both signals are perceived as very similar or identical when you listen to them.
But they are not identical sample-wise (even after you compensate for the delay).
Hence, your expectation of a high correlation may not be reasonable.
2. Assume that you use the full lengths of both signals to compute the cross-correlation sequence (CCS).
Let the peak value of the CCS in this case be P1 (which you are getting as 0.1 to 0.3).
Now, break up each signal into blocks of 50 ms or 100 ms (400 or 800 samples at 8 kHz), and then compute the CCS using the corresponding blocks of both signals.
Let the peak value of the CCS in this case be P2.
You will observe that P2 is greater than P1.
You may expect P2 to be about 0.3 to 0.5.
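A minimal Python/NumPy sketch of both points, under stated assumptions: `normalized_xcorr`, `full_length_peak` and `blockwise_peaks` are hypothetical helper names, and the test signals are synthetic, not real codec output. Case 1 is an idealized excitation (a pulse train with an assumed 80-sample pitch period at 8 kHz) whose pulses are moved by a few samples, as in point 1. Case 2 uses white noise with an assumed delay that jumps from 5 to 12 samples halfway through (for example, a jitter buffer changing its playout delay), which is one situation in which the blockwise peak P2 exceeds the full-length peak P1.

```python
import numpy as np


def normalized_xcorr(x, y, max_lag):
    """rho[L] = sum_n x[n] * y[n + L] / (||x|| * ||y||) for L in [-max_lag, max_lag].

    A peak at a positive lag L means y lags x by L samples.
    """
    n = min(len(x), len(y))
    x = np.asarray(x[:n], dtype=float)
    y = np.asarray(y[:n], dtype=float)
    max_lag = min(max_lag, n - 1)
    lags = np.arange(-max_lag, max_lag + 1)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    if denom == 0.0:  # a silent block has no meaningful correlation
        return lags, np.zeros(len(lags))
    rho = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            rho[i] = np.dot(x[: n - lag], y[lag:])
        else:
            rho[i] = np.dot(x[-lag:], y[: n + lag])
    return lags, rho / denom


def full_length_peak(x, y, max_lag):
    """P1: peak of the CCS computed over the full length of both signals."""
    return normalized_xcorr(x, y, max_lag)[1].max()


def blockwise_peaks(x, y, block_len, max_lag):
    """P2 for each pair of corresponding blocks (an incomplete last block is dropped)."""
    n = min(len(x), len(y))
    return np.array([
        normalized_xcorr(x[s : s + block_len], y[s : s + block_len], max_lag)[1].max()
        for s in range(0, n - block_len + 1, block_len)
    ])


fs = 8000  # assumed sampling rate

# Case 1 (point 1): one pulse per assumed 80-sample pitch period; the
# "decoded" pulses are moved by hypothetical shifts of a few samples.
period = 80
shifts = [0, 2, -3, 4, -2]
x = np.zeros(fs)
y = np.zeros(fs)
for k in range(100):
    x[40 + k * period] = 1.0
    y[40 + k * period + shifts[k % 5]] = 1.0
print(full_length_peak(x, y, max_lag=40))  # 0.2: at any lag only 1 pulse in 5 lines up

# Case 2 (point 2): white noise standing in for speech, with an assumed
# delay that jumps from 5 to 12 samples after the first second.
rng = np.random.default_rng(0)
src = rng.standard_normal(2 * fs)
rec = np.zeros_like(src)
rec[5:fs] = src[: fs - 5]
rec[fs:] = src[fs - 12 : 2 * fs - 12]
p1 = full_length_peak(src, rec, max_lag=40)
p2 = blockwise_peaks(src, rec, block_len=800, max_lag=40)  # 100 ms blocks
print(p1, p2.mean())  # P1 comes out near 0.5, the block peaks near 1
```

In real decoded speech the excitation also passes through the synthesis filter, so the drop is likely less extreme than for bare pulses. Also, `max_lag` here covers only a few milliseconds; with real recordings the bulk end-to-end delay would presumably have to be removed first (for instance from a full-length cross-correlation) before the signals are split into blocks.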
I hope that the explanation helps.
Regards,
Guruprasad
--- On Sat, 12/2/11, Jeff Brower wrote:
From: Jeff Brower
Subject: Re: [speechcoding] data format of PCM 16bit signed mono-channel and alignment(line-up)two PCM-16bit .wav files.
To: s...
Cc: "Nangergong"
Date: Saturday, 12 February, 2011, 2:54 AM
Nangergong-
The "cross-correlation coefficient" sounds like some type of overall value.
I'm not sure what program you're using or what this value is.
But no matter. What you want to do is simply display the resulting cross-correlation output (it's a time-domain output, so just another .wav file) and find out where the peak occurs.
The time value of the peak is the delay.
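Jeff's procedure (cross-correlate over all lags, find the peak, read off its time) can be sketched in Python with the standard `wave` module and NumPy. The function names are hypothetical, and the sketch assumes 16-bit mono PCM files. It computes the linear cross-correlation with a zero-padded FFT, which gives the same values as the direct sum but is much faster for files several seconds long. A positive lag means the second signal is delayed relative to the first.

```python
import wave

import numpy as np


def read_pcm16_mono(path):
    """Read a 16-bit signed mono PCM .wav file; return (samples as float64, sample rate)."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        fs = w.getframerate()
        frames = w.readframes(w.getnframes())
    # WAV stores PCM samples as little-endian signed 16-bit integers (-32768..32767).
    return np.frombuffer(frames, dtype="<i2").astype(np.float64), fs


def estimate_delay(x, y, fs):
    """Full cross-correlation c[k] = sum_n x[n] * y[n + k] over all lags k.

    Returns (peak lag in samples, peak lag in seconds, lags, c).
    A positive lag means y is delayed relative to x.
    """
    n_lin = len(x) + len(y) - 1
    # Zero-pad to a power of two >= the linear length so the FFT product
    # gives linear (not circular) correlation.
    nfft = 1 << (n_lin - 1).bit_length()
    c = np.fft.irfft(np.fft.rfft(y, nfft) * np.conj(np.fft.rfft(x, nfft)), nfft)
    # Negative lags sit at the end of the FFT output; move them to the front.
    c = np.concatenate((c[nfft - (len(x) - 1):], c[: len(y)]))
    lags = np.arange(-(len(x) - 1), len(y))
    k = lags[int(np.argmax(c))]
    return int(k), k / fs, lags, c
```

For example, with hypothetical file names: `x, fs = read_pcm16_mono("orig.wav")`, `y, _ = read_pcm16_mono("voip.wav")`, then `lag, seconds, lags, c = estimate_delay(x, y, fs)`. Plotting `c` against `lags / fs` gives the kind of display Jeff describes. Note that `c` is the raw (unnormalized) correlation, so its peak value is not a 0-to-1 coefficient.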
If you send me your two .wav files (just the first few seconds of each one is fine, assuming an 8 kHz sampling rate for voice data), then I can run cross-correlation in Hypersignal and post the resulting waveform display.
Then you can see for sure what I mean.
-Jeff
> I have some .wav files in the PCM 16-bit signed mono-channel format
> described at this link:
>
> https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
>
> I know that each sample of these .wav files (excluding the header) can be
> converted to a value from -32768 to 32767.
>
> I want to time-align two .wav files in the format mentioned above. My
> steps are:
>
> 1) convert the .wav files into .raw files (namely, remove the header)
>
> 2) convert each sample of the two .raw files into a signed value
> between -32768 and 32767, and get two files of signed numbers
>
> I use this command line on Linux:
> od -A n -s -w2 -v a.raw > a
>
> 3) use a cross-correlation function to get the position of the
> maximum correlation coefficient and deduce the offset.
>
> However, I found that the maximum cross-correlation coefficient is very
> low, roughly 0.1 to 0.3. One of the two .wav files is the original file and
> the other is recorded from the VOIP system, with the original file used
> as input. The codec of the VOIP system is either G.711 or G.729.
>
> My questions are:
>
> 1) Is the low value of the maximum correlation coefficient sensible or
> not? From the waveforms of these two files, I think they are nearly the
> same, and thus the correlation should be high.
>
> 2) Or is there a problem in converting each sample of the two .raw
> files into signed values?
>
> Thank you so much!
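On question 2: as far as I know, GNU `od -s` reads 2-byte units in the host's byte order, so on a little-endian machine (e.g. x86) it matches the little-endian samples of a WAV file and the conversion itself should be fine. A more common pitfall in step 1 is assuming the header is always 44 bytes; WAV files may contain extra chunks (such as `LIST`) before the `data` chunk. The sketch below, with hypothetical function names, walks the RIFF chunks to find `fmt ` and `data`, decodes the samples as explicit little-endian signed 16-bit values, and writes one value per line, similar to the `od` command above.

```python
import struct


def wav_pcm16_samples(data):
    """Decode the samples of a 16-bit mono PCM WAV file given as bytes.

    Walks the RIFF chunks instead of assuming a fixed 44-byte header.
    """
    riff, _riff_size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    fmt = None
    pos = 12  # first chunk follows "RIFF", the size field and "WAVE"
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        body = data[pos + 8 : pos + 8 + chunk_size]
        if chunk_id == b"fmt ":
            # format tag, channels, sample rate, byte rate, block align, bits per sample
            fmt = struct.unpack_from("<HHIIHH", body, 0)
        elif chunk_id == b"data":
            if fmt is None or fmt[0] != 1 or fmt[1] != 1 or fmt[5] != 16:
                raise ValueError("expected mono 16-bit PCM (format tag 1)")
            n = len(body) // 2
            # "<h" = little-endian signed 16-bit, i.e. values -32768..32767
            return list(struct.unpack("<%dh" % n, body[: 2 * n]))
        pos += 8 + chunk_size + (chunk_size & 1)  # odd-sized chunks carry a pad byte
    raise ValueError("no data chunk found")


def write_one_per_line(samples, path):
    """Write one signed decimal sample per line (od also adds column padding)."""
    with open(path, "w") as f:
        f.writelines("%d\n" % v for v in samples)
```

For example: `with open("a.wav", "rb") as f: samples = wav_pcm16_samples(f.read())`, then `write_one_per_line(samples, "a")`.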