Reply by Frédéric Jolliton November 1, 2009
[..]
> If this can be done at all in an analytic manner, it can be done by
> deconvolving a pair of (short) samples from the two audio streams against
> each other (meaning deconv(a,b) and then deconv(b,a)). There is a
> complication with that I'll address in a later paragraph.
>
> Once you get the deconvolution signature, pop it up in a wave
> editor. The first "tallest spike" is the best guess for a good
> place to offset the two.
>
> OK - the complication: deconvolution requires the "a" signal to be
> shorter than the "b" signal. So you have to clip accordingly. And it may
> be that the resulting deconvolution signature is indeterminate.

Thanks for pointing me in a new direction. Actually, I've found a
solution using spectrogram comparisons, as I tried to describe in my
other posts, but now restricting the search to a few seconds of shift
between them. It works nicely so far for all the pairings of video and
audio I've tested. But I feel that it is not optimal and could fail for
certain sounds (if there is not enough similarity in the spectrograms,
for example). So, I will look into deconvolution.

I'm not familiar with these transformations, though. The idea is that one
sound can be seen as a convolution of the other one? But can this handle
imperfect recordings? For example, one recorder produces a lot of noise,
while the other one is more sensitive to wind. And of course, they both
have different frequency responses, and even different recording levels.

What sizes do you suggest for the 'a' and 'b' arguments of deconv? My
first attempt was to try with the following data (44.1 kHz sounds):

  octave:1> u = wavread("sound1.wav")(:,1);
  octave:2> v = wavread("sound2.wav")(:,1);
  octave:3> size(u), size(v)
  ans =
     5500495   1
  ans =
     5263002   1

But then trying plot(deconv(...)) with various excerpts from 'u' and 'v'
hasn't produced an obvious answer (even when testing excerpts from the
same sound as both arguments). Would you mind providing more details?

[..]

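One way to get a usable "deconvolution signature" from 'u' and 'v' in
Octave might be to do the division in the frequency domain with some
regularization, rather than calling deconv() directly (deconv() performs
polynomial long division, which tends to behave badly on long, noisy
signals). A minimal sketch, where the excerpt lengths and the
regularization constant are arbitrary assumptions:

  % Frequency-domain deconvolution of two excerpts.
  fs = 44100;
  a  = u(1 : 10*fs);                  % 10 s from the first track
  b  = v(1 : 30*fs);                  % 30 s from the second track
  N  = length(a) + length(b);         % zero-padded FFT length
  A  = fft(a, N);
  B  = fft(b, N);
  lambda = 1e-3 * max(abs(A).^2);     % regularization against tiny |A| bins
  h  = real(ifft(B .* conj(A) ./ (abs(A).^2 + lambda)));
  [peak, k] = max(abs(h));            % the "tallest spike"
  offset_seconds = (k - 1) / fs       % how far 'b' lags 'a'; spikes near the
                                      % end of h mean a negative offset,
                                      % because the result is circular
  plot(abs(h));

In principle, different frequency responses and recording levels mostly
reshape the signature rather than move its main spike, while uncorrelated
noise (wind on one mic) mainly lowers the spike relative to the
background, so the peak position should still be usable as long as enough
common sound was captured by both recorders.
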
> With MATLAB/Octave, you might be able to automagically do the whole
> thing.

I'm using Octave and NumPy for my tests (mostly the latter, because I'm
used to Python, but I use Octave for simpler computations).

--
Frédéric Jolliton

Reply by Les Cargill October 31, 2009
Frédéric Jolliton wrote:
> Hi,
>
> I'm looking for a method to automatically synchronize various audio
> tracks, recorded at the same place, with different devices. This is
> intended to work at post-processing time (not in real time).
>
> Basically, I'm taking two audio tracks: one recorded by a camcorder,
> with poor mic quality, and an extra one recorded at the same time with a
> dedicated sound recorder, recording the same thing.
>
> My naive approach is as follows: I compute a spectrogram for both sounds
> (using the FFT), which gives me a 2D array for each spectrogram, then I try
> various shifts to find the best match between them.
>
> To compare two spectrograms with a given shift, I take the overlapping
> parts once shifted, then I compute the mean value of the absolute
> difference between them, which I divide by the width of the overlap.
> (Hope that makes sense... I'm lacking adequate terminology here.)
>
> Then I keep the best (smallest) answer found while trying various shifts.
>
> This method seems to work well for the few tests I made. When plotting the
> value against the shift, I see a peak toward 0 where the audio tracks match.
>
> Is there a more robust and more efficient way to find how much time
> separates 2 (or more) audio samples recorded at the same place, but with
> different devices?
>

If this can be done at all in an analytic manner, it can be done by
deconvolving a pair of (short) samples from the two audio streams against
each other (meaning deconv(a,b) and then deconv(b,a)). There is a
complication with that I'll address in a later paragraph.

Once you get the deconvolution signature, pop it up in a wave editor. The
first "tallest spike" is the best guess for a good place to offset the two.

OK - the complication: deconvolution requires the "a" signal to be shorter
than the "b" signal. So you have to clip accordingly. And it may be that
the resulting deconvolution signature is indeterminate.

Search for "Voxengo deconvolver" for free software that does deconvolution
for you, unless you have MATLAB or Octave. With MATLAB/Octave, you might
be able to automagically do the whole thing.

--
Les Cargill

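As a toy illustration of that size constraint in Octave (made-up vectors,
unrelated to the audio itself): deconv(b, a) divides a out of b, so the
second argument must be the shorter one, and the quotient has
length(b) - length(a) + 1 samples.

  a = [1 2 3];            % the "short" signal
  x = [1 0 -1 2];         % what we hope to recover
  b = conv(a, x);         % length(b) = length(a) + length(x) - 1 = 6
  xhat = deconv(b, a)     % returns [1 0 -1 2], length 6 - 3 + 1 = 4
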
Reply by Frédéric Jolliton October 29, 2009
[Lip sync]
> So I don't understand why you need an "algorithm" to do this, it is
> usually simply done by hand and eye manually in an editor...

The problem is that I can have tens of video clips per day to sync; this
is why I'm trying to automate the process. The video with poor audio
quality comes from a camera, while the better sound comes from a separate
dedicated device.

I want to pass all these files to my program, which pairs them by their
creation date (since both devices record the time & date); then I deduce
by what amount of time the tracks are out of sync, so that I can create
new video files with the better sound. Then I can use the resulting video
in a video editor as usual. Doing this by hand would be too laborious.

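A hypothetical sketch of the pairing step in Octave. The directory names,
file patterns, and the use of file modification times (the datenum field
returned by dir()) as a stand-in for the recording date are all
assumptions:

  cams = dir("camera/*.MTS");       % camcorder clips
  recs = dir("recorder/*.wav");     % files from the dedicated recorder
  for i = 1:numel(cams)
    % pick the recorder file whose timestamp is closest to this clip's
    [gap, j] = min(abs([recs.datenum] - cams(i).datenum));
    printf("%s <-> %s (%.1f minutes apart)\n", ...
           cams(i).name, recs(j).name, gap * 24 * 60);
  end
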
> as well as rec.audio.pro, you can also try
> rec.arts.movies.production.sound (It's RAMPS)

Ok.

--
Frédéric Jolliton

Reply by Mark October 29, 2009
On Oct 27, 12:09 pm, Frédéric Jolliton
<comp....@frederic.jolliton.com> wrote:
> [..]

If your purpose is to obtain "lip sync" over the time period of a few
minutes, I would agree that with most modern gear you can ignore the
"speed error" and simply match the time offset at any point in the clip;
it will remain in sync throughout the rest of the clip well enough for
"lip sync" purposes.

So I don't understand why you need an "algorithm" to do this, it is
usually simply done by hand and eye manually in an editor...

as well as rec.audio.pro, you can also try
rec.arts.movies.production.sound (It's RAMPS)

Mark

Reply by Frédéric Jolliton October 27, 2009
> be aware that there will not only be a time offset between these two
> recordings but there MAY also be a SPEED offset. [..] If the clocks
> were off by 100 ppm, after 1 hour the time error could be a
> significant fraction of a second.

Indeed, that's actually what my (rough) measurements show (0.1% drift
from the expected sampling rate). However, I'm processing only short
clips (a few minutes maximum), so I don't worry too much about that.

> Depending upon your purpose, lip sync or stereo image phasing, this
> will be significant.

Actually, I'm shooting video footage while recording the sound with an
external device, then I try to merge them using the original sound track
(with poor quality) as a means to properly lip sync the new audio track.

> If this is critical to you, it would be best if the algorithm could be
> made continuously adaptive so that it would track any speed differences.

Maybe this sort of thing could be handled by searching for various
chunks (say, 1 minute's worth of samples) from one audio track in the
other one, and deducing from the resulting positions how the clock
drifted between the two.

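A sketch of that chunk-by-chunk idea in Octave. The helper
estimate_offset() is hypothetical and stands for whatever single-offset
method is used (e.g. the spectrogram search); the chunk size and step are
arbitrary:

  fs     = 44100;
  chunk  = 60 * fs;                  % 1 minute of samples per chunk
  step   = 60 * fs;
  starts = 1 : step : (length(u) - chunk);
  offsets = zeros(size(starts));
  for i = 1:numel(starts)
    a = u(starts(i) : starts(i) + chunk - 1);
    offsets(i) = estimate_offset(a, v);   % best offset of this chunk in v
  end
  t = (starts - 1) / fs;             % chunk start times, in seconds
  p = polyfit(t, offsets, 1);        % linear fit: offset(t) ~ p(1)*t + p(2)
  drift_ppm      = p(1) * 1e6        % relative clock drift, in ppm
  initial_offset = p(2)              % offset at t = 0, in seconds
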
> This technique is a common practice in audio production and you may
> get more insight asking at rec.audio.pro.

Ok, I will check there. Thanks!

--
Frédéric Jolliton

Reply by Mark October 26, 2009
On Oct 25, 7:42 am, Frédéric Jolliton <comp....@frederic.jolliton.com>
wrote:
> [..]

Frederic,

be aware that there will not only be a time offset between these two
recordings but there MAY also be a SPEED offset. Consider that the two
recording devices each have a tolerance on their internal clocks relative
to the playback machine. If you achieved perfect time alignment at the
start of the recordings, and the recordings are, say, 1 hour long, you may
find that after 1 hour they are no longer in time alignment. If the clocks
were off by 100 ppm, after 1 hour the time error could be a significant
fraction of a second.

Depending upon your purpose, lip sync or stereo image phasing, this
will be significant.

If this is critical to you, it would be best if the algorithm could be
made continuously adaptive so that it would track any speed differences.
Fortunately, with modern gear, the speed differences should be very small.

Also remember that if the two recordings were made with different
microphones, there will be about 1 ms per foot of offset in time due to
the speed of sound. So it depends what your goal is in combining these
two recordings.

This technique is a common practice in audio production and you may
get more insight asking at rec.audio.pro.

Mark

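Putting rough numbers on those two effects (343 m/s for the speed of
sound is an assumption; it varies with temperature):

  drift_after_1h = 100e-6 * 3600     % 100 ppm over 1 hour: 0.36 s of slip
  delay_per_foot = 0.3048 / 343      % about 0.89 ms, i.e. roughly 1 ms/foot
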
Reply by Frédéric Jolliton October 25, 2009
[..]
>> Is there a more robust and more efficient way to find how much time
>> separates 2 (or more) audio samples recorded at the same place, but with
>> different devices?

Note: to get a better idea of the spectrograms I'm working with, see:

  http://tuxee.net/tmp/audiosync

The thing is that visually I can easily match them, but I do not see how
to translate that into numbers while avoiding noise and other
perturbations.

> What you do is pretty robust although computationally heavy. Just a
> couple of suggestions:
>
> 1. Compare not the spectrograms, but the time derivatives of the
> spectrograms. That will cancel the static frequency skew.

Are you suggesting something like:

  spectogram = spectogram(2:end,:) - spectogram(1:end-1,:);

(using MATLAB/Octave syntax), assuming the first dimension is the time
axis and the second dimension is the frequency axis? I've tested it, but
the result is less accurate. See:

  http://tuxee.net/tmp/spect-deriv.png

(Here, the expected answer is around 1855.) I've slightly scaled one of
the graphs to match the other one. The red graph clearly indicates the
expected answer, but the green one does not. Maybe you were talking
about a different computation?

> 2. Normalize the spectrograms wrt the power of the signals. That will
> make the similarity measure independent of the volume.

How do I compute the power of the signals? Actually, while both sounds
are close together, one of the recorders can be more sensitive to wind,
or one can record sounds not heard by the other (note that both recorders
are less than 1 meter apart), so it might be hard to find how to
normalize them against each other.

For the spectrograms, I'm actually working on a log scale for the
amplitudes, so when comparing them by subtracting components, this should
cancel possible volume differences a bit, I guess. (Better than if I kept
the FFT output without applying the log when computing the spectrogram.)

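One possible reading of suggestion 2 (this is an assumption, not
something stated in the thread): in the log-amplitude domain a per-frame
gain difference becomes an additive constant, so subtracting each frame's
mean log magnitude removes the overall level and leaves only the spectral
shape to compare.

  % S: log-magnitude spectrogram, one column per time frame (assumed layout).
  S_norm = S - repmat(mean(S, 1), size(S, 1), 1);   % zero-mean each frame
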
> It could be possible to make an adaptive filter to minimize the
> difference between the audio streams. After the filter has converged,
> it is simple enough to derive the time shift from the coefficients.

While usually the tracks will only be a few seconds apart, I don't know
if an adaptive filter would work with a larger time difference. I'm not
familiar enough with such filters, though.

> That would be much more accurate and less computationally demanding
> than the spectrograms. However, it works only if the streams are
> sufficiently correlated for the adaptive algorithm to work. In my
> experience, the correlation between two different microphones is
> rather low, especially if there is lossy compression on the way.

I'm not using compression (both sounds are PCM; one is recorded at
44.1 kHz, mono, 16 bits, and the other one at 96 kHz, stereo, 24 bits,
but downsampled to the format of the first one before processing), but
as you guessed, unfortunately one mic can be more sensitive to some
sounds (like wind, for example) and this can complicate the computation.
(See my first link at the top.)

Thanks for your help!

--
Frédéric Jolliton

Reply by Vladimir Vassilevsky October 24, 2009

Frédéric Jolliton wrote:

> I'm looking for a method to automatically synchronize various audio
> tracks, recorded at the same place, with different devices. This is
> intended to work at post-processing time (not in real time).
>
> Basically, I'm taking two audio tracks: one recorded by a camcorder,
> with poor mic quality, and an extra one recorded at the same time with a
> dedicated sound recorder, recording the same thing.
>
> My naive approach is as follows: I compute a spectrogram for both sounds
> (using the FFT), which gives me a 2D array for each spectrogram, then I try
> various shifts to find the best match between them.
>
> To compare two spectrograms with a given shift, I take the overlapping
> parts once shifted, then I compute the mean value of the absolute
> difference between them, which I divide by the width of the overlap.
> (Hope that makes sense... I'm lacking adequate terminology here.)
>
> Then I keep the best (smallest) answer found while trying various shifts.
>
> This method seems to work well for the few tests I made. When plotting the
> value against the shift, I see a peak toward 0 where the audio tracks match.
>
> Is there a more robust and more efficient way to find how much time
> separates 2 (or more) audio samples recorded at the same place, but with
> different devices?

What you do is pretty robust although computationally heavy. Just a
couple of suggestions:

1. Compare not the spectrograms, but the time derivatives of the
spectrograms. That will cancel the static frequency skew.

2. Normalize the spectrograms wrt the power of the signals. That will
make the similarity measure independent of the volume.

It could be possible to make an adaptive filter to minimize the
difference between the audio streams. After the filter has converged,
it is simple enough to derive the time shift from the coefficients.
That would be much more accurate and less computationally demanding
than the spectrograms. However, it works only if the streams are
sufficiently correlated for the adaptive algorithm to work. In my
experience, the correlation between two different microphones is
rather low, especially if there is lossy compression on the way.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

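For what it's worth, a minimal Octave sketch of the adaptive-filter idea
(a normalized LMS); the filter length, the step size, and the names x and
d for the two mono tracks are assumptions, and the filter length bounds
the largest delay this can find (here under 0.2 s at 44.1 kHz), which is
exactly the concern about larger offsets raised above:

  % x: reference track, d: track to align against x (column vectors,
  % same sample rate).  Run this on a short excerpt; a plain Octave loop
  % over millions of samples is slow.
  L  = 8192;                      % filter length = largest findable delay
  mu = 0.5;                       % NLMS step size
  w  = zeros(L, 1);
  for n = L:length(d)
    xv = x(n:-1:n-L+1);           % most recent L samples of the reference
    e  = d(n) - w' * xv;          % error between d and the filtered reference
    w  = w + mu * e * xv / (xv' * xv + 1e-9);   % normalized LMS update
  end
  [peak, k] = max(abs(w));
  delay_samples = k - 1           % estimated delay of d relative to x
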
Reply by Frédéric Jolliton October 24, 2009
Hi,

I'm looking for a method to automatically synchronize various audio
tracks, recorded at the same place, with different devices. This is
intended to work at post-processing time (not in real time).

Basically, I'm taking two audio tracks: one recorded by a camcorder,
with poor mic quality, and an extra one recorded at the same time with a
dedicated sound recorder, recording the same thing.

My naive approach is as follows: I compute a spectrogram for both sounds
(using the FFT), which gives me a 2D array for each spectrogram, then I try
various shifts to find the best match between them.

To compare two spectrograms with a given shift, I take the overlapping
parts once shifted, then I compute the mean value of the absolute
difference between them, which I divide by the width of the overlap.
(Hope that makes sense... I'm lacking adequate terminology here.)

Then I keep the best (smallest) answer found while trying various shifts.

This method seems to work well for the few tests I made. When plotting the
value against the shift, I see a peak toward 0 where the audio tracks match.
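
(A minimal Octave sketch of that search, for concreteness; the frame size,
hop, window, and the +/- 10 second search range are arbitrary choices, and
u and v stand for the two mono tracks at the same sample rate.)

  fs   = 44100;
  nfft = 1024;
  hop  = 512;
  win  = 0.5 - 0.5 * cos(2*pi*(0:nfft-1)' / nfft);   % Hann window

  % Log-magnitude spectrogram, one column per frame (could be saved as
  % logspec.m).
  function S = logspec (x, nfft, hop, win)
    nframes = floor((length(x) - nfft) / hop) + 1;
    S = zeros(nfft/2, nframes);
    for k = 1:nframes
      X = fft(x((k-1)*hop + (1:nfft)) .* win);
      S(:,k) = log(abs(X(1:nfft/2)) + 1e-6);
    end
  endfunction

  Su = logspec(u, nfft, hop, win);
  Sv = logspec(v, nfft, hop, win);

  maxshift = round(10 * fs / hop);     % search +/- 10 seconds of shift
  best = Inf;
  for s = -maxshift:maxshift
    iu = max(1, 1+s) : min(size(Su,2), size(Sv,2)+s);  % overlapping frames
    iv = iu - s;                       % frame i of u vs frame i-s of v
    d  = mean(mean(abs(Su(:,iu) - Sv(:,iv))));  % mean absolute difference
    if d < best
      best = d;
      best_shift = s;
    end
  end
  offset_seconds = best_shift * hop / fs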

Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?

-- 
Frédéric Jolliton