Data format of 16-bit signed mono PCM, and aligning (lining up) two 16-bit PCM .wav files

Started by nangergong February 11, 2011
Hi all,

I have some .wav files in 16-bit signed PCM, mono format.

From this link:
https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
I understand that each sample in these .wav files (everything after the header) can be
converted to a value from -32768 to 32767.

I want to time-align two .wav files in the format described above. My
steps are:

1) Convert the .wav files into .raw files (that is, remove the header).

2) Convert each sample of the two .raw files into a signed value
between -32768 and 32767, which gives two files of signed numbers.

I use this command line on Linux:
od -A n -s -w2 -v a.raw > a

3) Use a cross-correlation function to find the position of the
maximum correlation coefficient and deduce the offset from it.
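
Steps 1 and 2 can also be done in one go by parsing the file instead of stripping the header by hand. This matters because a .wav header is not always the canonical 44 bytes: extra chunks may precede the sample data, and a wrong byte offset corrupts every sample. Below is a minimal sketch, assuming Python 3 and its standard `wave` module. The function name `read_pcm16_mono` is made up for this example, and this is an alternative to the `od` pipeline, not the method used for the results reported here.

```python
import array
import wave


def read_pcm16_mono(path):
    """Return (samples, sample_rate) for a 16-bit signed mono PCM .wav file."""
    with wave.open(path, "rb") as wav:
        # Refuse anything that is not 2 bytes per sample, one channel.
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # Type code "h" is a signed 16-bit integer, so every value
    # lands in the range -32768..32767.
    samples = array.array("h")
    samples.frombytes(raw)
    return list(samples), rate
```

Calling `read_pcm16_mono("a.wav")` should give the same numbers as the `od` output for a file whose header really is the only thing before the samples, which makes it a quick cross-check of the conversion.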

However, I found that the maximum cross-correlation coefficient is very
low, roughly 0.1 to 0.3. One of the two .wav files is the original, and
the other was recorded from a VoIP system that was given the original file
as input. The codec of the VoIP system is either G.711 or G.729.

My questions are:

1) Is such a low maximum correlation coefficient plausible?
From their waveforms, the two files look nearly the
same, so I expected the correlation to be high.

2) Or is there a problem in how I convert each sample of the two .raw
files into a signed value?

Thank you so much!
Nangergong-

The "cross-correlation coefficient" sounds like some type of overall value. I'm not sure what program you're using
or what this value represents.

But no matter. What you want to do is simply display the resulting cross-correlation output (it's time-domain
output, so just another .wav file) and find where the peak occurs. The time value of the peak is the delay.
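
For readers without a signal-processing tool at hand, the peak search described above can be sketched in plain Python. This is only an illustration under stated assumptions: the samples are already lists of numbers, the delay lies within `max_lag` samples, and the sampling rate is 8 kHz. The names `best_lag` and `max_lag` are introduced here, and the normalization used (a Pearson-style coefficient per lag) is one common choice, not necessarily what any particular program reports.

```python
import math
import random


def best_lag(ref, rec, max_lag):
    """Return (lag, coefficient) maximizing the normalized cross-correlation.

    A positive lag means `rec` is delayed by `lag` samples relative to `ref`.
    """
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        # Pair ref[i] with rec[i + lag] wherever both indices are valid.
        start = max(0, -lag)
        stop = min(len(ref), len(rec) - lag)
        if stop - start < 2:
            continue
        a = ref[start:stop]
        b = rec[start + lag:stop + lag]
        mean_a = sum(a) / len(a)
        mean_b = sum(b) / len(b)
        num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                        sum((y - mean_b) ** 2 for y in b))
        coeff = num / den if den else 0.0
        if coeff > best[1]:
            best = (lag, coeff)
    return best


# Synthetic check: `rec` is `ref` delayed by 25 samples.
rng = random.Random(1)
ref = [rng.gauss(0.0, 1000.0) for _ in range(400)]
rec = [0.0] * 25 + ref[:-25]
lag, coeff = best_lag(ref, rec, max_lag=50)
delay_ms = 1000.0 * lag / 8000  # assumes an 8 kHz sampling rate
```

The loop is quadratic in the worst case, so for long recordings it is worth restricting it to the first few seconds or switching to an FFT-based correlation.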

If you send me your two .wav files (just the first few sec of each one is fine, assuming 8 kHz sampling rate for voice
data), then I can run cross correlation in Hypersignal and post the resulting waveform display. Then for sure you can
see what I mean.

-Jeff

Nangergong,

1. There are two signals: the original, and the output of
a VoIP channel. Because of the coding algorithm and the channel, the
second signal will not be identical to the first, but it will be similar.
I will explain with an example.

Consider G.729, where an average pitch period is transmitted to the decoder
for each block of the input signal and is used to construct
the excitation at the other end.
In voiced speech, there is a major excitation in every pitch period;
in fact, the time interval between two successive major excitations is the pitch period.
Because the decoder constructs the excitation from the average pitch period rather than the exact pitch period of the original signal,
the location of each major excitation in the constructed excitation moves by a few samples (say 2-4 samples at 8 or 16 kHz).
As a result, each output block of the decoder is not exactly aligned with the corresponding input block.
This small shift of a few samples is enough to bring the cross-correlation values down significantly.
Thus the input and output are not similar sample by sample.

Note that the two signals sound very similar or identical when you listen to them,
but they are not identical sample by sample (even after you compensate for the delay).

Hence, your expectation of high correlation may not be reasonable.

2. Assume that you use the full length of both signals to compute the cross-correlation
sequence (CCS).
Let the peak value of the CCS in this case be P1 (which you are getting as 0.1 to 0.3).
Now break each signal into shorter blocks of 50 ms or 100 ms, and
compute the CCS for each pair of corresponding blocks.
Let the peak value of the CCS in this case be P2.
You should observe that P2 is greater than P1;
you may expect P2 to be about 0.3 to 0.5.
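
The P1 versus P2 comparison can be tried on synthetic data first. The sketch below is a toy model, not a codec: it assumes white noise as the "speech" and imitates the effect described in point 1 by displacing each 50 ms block (400 samples at 8 kHz) by a different shift of 0 to 3 samples. The helper names `peak_coeff` and `blockwise_peaks` are made up for this example. White noise exaggerates the effect, so the numbers it produces say nothing about what real G.711 or G.729 output will give.

```python
import math
import random


def peak_coeff(a, b, max_lag):
    """Largest normalized cross-correlation of a and b over lags -max_lag..max_lag."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Pair a[i] with b[i + lag] wherever both indices are valid.
        start, stop = max(0, -lag), min(len(a), len(b) - lag)
        x, y = a[start:stop], b[start + lag:stop + lag]
        if len(x) < 2:
            continue
        mx, my = sum(x) / len(x), sum(y) / len(y)
        num = sum((p - mx) * (q - my) for p, q in zip(x, y))
        den = math.sqrt(sum((p - mx) ** 2 for p in x) *
                        sum((q - my) ** 2 for q in y))
        if den:
            best = max(best, num / den)
    return best


def blockwise_peaks(a, b, block_len, max_lag):
    """Peak coefficient for each pair of corresponding blocks."""
    n = min(len(a), len(b))
    return [peak_coeff(a[i:i + block_len], b[i:i + block_len], max_lag)
            for i in range(0, n - block_len + 1, block_len)]


# Toy stand-in for the codec effect: block k of the "decoded" signal is the
# original displaced by (k % 4) samples.
rng = random.Random(0)
block_len = 400
original = [rng.gauss(0.0, 1000.0) for _ in range(8 * block_len)]
decoded = []
for k in range(8):
    shift = k % 4
    for n in range(k * block_len, (k + 1) * block_len):
        decoded.append(original[max(0, n - shift)])

p1 = peak_coeff(original, decoded, max_lag=8)                  # full length
p2 = blockwise_peaks(original, decoded, block_len, max_lag=8)  # one per block
```

In this toy case no single lag fits more than a quarter of the blocks, so `p1` stays low, while every entry of `p2` is close to 1 because each block is allowed its own lag. On real recordings, the same two calls on the sample lists of the original and recorded files give the P1 and P2 discussed above.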

I hope that the explanation helps.

Regards,
Guruprasad
--- On Sat, 12/2/11, Jeff Brower wrote:

From: Jeff Brower
Subject: Re: [speechcoding] data format of PCM 16bit signed mono-channel and alignment(line-up)two PCM-16bit .wav files.
To: s...
Cc: "Nangergong"
Date: Saturday, 12 February, 2011, 2:54 AM

 
