Reply by Frédéric Jolliton November 1, 2009
[..]
> If this can be done at all in an analytic manner, it can be done by
> deconvolving a pair of (short) samples from the two audio streams against
> each other (meaning deconv(a,b) and then deconv(b,a)). There is a
> complication with that I'll address in a later paragraph.
>
> Once you get the deconvolution signature, pop it up in a wave
> editor. The first "tallest spike" is the best guess for a good
> place to offset the two.
>
> OK - the complication: deconvolution requires the "a" signal to be
> shorter than the "b" signal. So you have to clip accordingly. And it may
> be that the resulting deconvolution signature is indeterminate.

Thanks for pointing me in a new direction. Actually, I've found a
solution using spectrogram comparisons, as I tried to describe in my
other posts, but now restricting the search to a few seconds of shift
between them. It works nicely so far for all the pairings of video and
audio I've tested. But I feel that it is not optimal and could fail for
certain sounds (if there is not enough similarity in the spectrograms,
for example). So, I will look into deconvolution.

I'm not familiar with these transformations, though. The idea is that one
sound can be seen as a convolution of the other one? But can this handle
imperfect recordings? For example, one recorder produces a lot of noise,
while the other one is more sensitive to wind. And of course, they both
have different frequency responses, and even different recording levels.

What sizes do you suggest for the 'a' and 'b' arguments of deconv? My
first attempt was to try with the following data (44.1 kHz sounds):

  octave:1> u = wavread("sound1.wav")(:,1);
  octave:2> v = wavread("sound2.wav")(:,1);
  octave:3> size(u), size(v)
  ans =
     5500495   1
  ans =
     5263002   1

But then trying plot(deconv(...)) with various excerpts from 'u' and 'v'
hasn't produced an obvious answer (even when testing excerpts from the
same sound as both arguments). Would you mind providing more details?

[..]

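One way to get a usable "deconvolution signature" from 'u' and 'v' in
Octave might be to do the division in the frequency domain with some
regularization, rather than calling deconv() directly (deconv() performs
polynomial long division, which tends to behave badly on long, noisy
signals). A minimal sketch, where the excerpt lengths and the
regularization constant are arbitrary assumptions:

  % Frequency-domain deconvolution of two excerpts.
  fs = 44100;
  a  = u(1 : 10*fs);                  % 10 s from the first track
  b  = v(1 : 30*fs);                  % 30 s from the second track
  N  = length(a) + length(b);         % zero-padded FFT length
  A  = fft(a, N);
  B  = fft(b, N);
  lambda = 1e-3 * max(abs(A).^2);     % regularization against tiny |A| bins
  h  = real(ifft(B .* conj(A) ./ (abs(A).^2 + lambda)));
  [peak, k] = max(abs(h));            % the "tallest spike"
  offset_seconds = (k - 1) / fs       % how far 'b' lags 'a'; spikes near the
                                      % end of h mean a negative offset,
                                      % because the result is circular
  plot(abs(h));

In principle, different frequency responses and recording levels mostly
reshape the signature rather than move its main spike, while uncorrelated
noise (wind on one mic) mainly lowers the spike relative to the
background, so the peak position should still be usable as long as enough
common sound was captured by both recorders.
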
> With MATLAB/Octave, you might be able to automagically do the whole
> thing.

I'm using Octave and NumPy for my tests (mostly the latter, because I'm
used to Python, but I use Octave for simpler computations).

--
Frédéric Jolliton

Reply by Les Cargill October 31, 2009
Frédéric Jolliton wrote:
> Hi,
>
> I'm looking for a method to automatically synchronize various audio
> tracks, recorded at the same place, with different devices. This is
> intended to work at post-processing time (not in real time).
>
> Basically, I'm taking two audio tracks: one recorded by a camcorder,
> with poor mic quality, and an extra one recorded at the same time with a
> dedicated sound recorder, recording the same thing.
>
> My naive approach is as follows: I compute a spectrogram for both sounds
> (using the FFT), which gives me a 2D array for each spectrogram, then I try
> various shifts to find the best match between them.
>
> To compare two spectrograms with a given shift, I take the overlapping
> parts once shifted, then I compute the mean value of the absolute
> difference between them, which I divide by the width of the overlap.
> (Hope that makes sense... I'm lacking adequate terminology here.)
>
> Then I keep the best (smallest) answer found while trying various shifts.
>
> This method seems to work well for the few tests I made. When plotting the
> value against the shift, I see a peak toward 0 where the audio tracks match.
>
> Is there a more robust and more efficient way to find how much time
> separates 2 (or more) audio samples recorded at the same place, but with
> different devices?
>

If this can be done at all in an analytic manner, it can be done by
deconvolving a pair of (short) samples from the two audio streams against
each other (meaning deconv(a,b) and then deconv(b,a)). There is a
complication with that I'll address in a later paragraph.

Once you get the deconvolution signature, pop it up in a wave editor. The
first "tallest spike" is the best guess for a good place to offset the two.

OK - the complication: deconvolution requires the "a" signal to be shorter
than the "b" signal. So you have to clip accordingly. And it may be that
the resulting deconvolution signature is indeterminate.

Search for "Voxengo deconvolver" for free software that does deconvolution
for you, unless you have MATLAB or Octave. With MATLAB/Octave, you might
be able to automagically do the whole thing.

--
Les Cargill

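As a toy illustration of that size constraint in Octave (made-up vectors,
unrelated to the audio itself): deconv(b, a) divides a out of b, so the
second argument must be the shorter one, and the quotient has
length(b) - length(a) + 1 samples.

  a = [1 2 3];            % the "short" signal
  x = [1 0 -1 2];         % what we hope to recover
  b = conv(a, x);         % length(b) = length(a) + length(x) - 1 = 6
  xhat = deconv(b, a)     % returns [1 0 -1 2], length 6 - 3 + 1 = 4
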
Reply by Frédéric Jolliton October 29, 2009
[Lip sync]
> So I don't understand why you need an "algorithm" to do this, it is
> usually simply done by hand and eye manually in an editor...

The problem is that I can have tens of video clips per day to sync; this
is why I'm trying to automate the process. The video with poor audio
quality comes from a camera, while the better sound comes from a separate
dedicated device.

I want to pass all these files to my program, which pairs them by their
creation date (since both devices record the time & date); then I deduce
by what amount of time the tracks are out of sync, so that I can create
new video files with the better sound. Then I can use the resulting video
in a video editor as usual. Doing this by hand would be too laborious.

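A hypothetical sketch of the pairing step in Octave. The directory names,
file patterns, and the use of file modification times (the datenum field
returned by dir()) as a stand-in for the recording date are all
assumptions:

  cams = dir("camera/*.MTS");       % camcorder clips
  recs = dir("recorder/*.wav");     % files from the dedicated recorder
  for i = 1:numel(cams)
    % pick the recorder file whose timestamp is closest to this clip's
    [gap, j] = min(abs([recs.datenum] - cams(i).datenum));
    printf("%s <-> %s (%.1f minutes apart)\n", ...
           cams(i).name, recs(j).name, gap * 24 * 60);
  end
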
> as well as rec.audio.pro, you can also try
> rec.arts.movies.production.sound (It's RAMPS)

Ok.

--
Frédéric Jolliton

Reply by Mark October 29, 2009
On Oct 27, 12:09 pm, Frédéric Jolliton
<comp....@frederic.jolliton.com> wrote:
> [..]

If your purpose is to obtain "lip sync" over the time period of a few
minutes, I would agree that with most modern gear you can ignore the
"speed error" and simply match the time offset at any point in the clip;
it will remain in sync throughout the rest of the clip well enough for
"lip sync" purposes.

So I don't understand why you need an "algorithm" to do this, it is
usually simply done by hand and eye manually in an editor...

as well as rec.audio.pro, you can also try
rec.arts.movies.production.sound (It's RAMPS)

Mark

Reply by Frédéric Jolliton October 27, 2009
> be aware that there will not only be a time offset between these two
> recordings but there MAY also be a SPEED offset. [..] If the clocks
> were off by 100 ppm, after 1 hour the time error could be a
> significant fraction of a second.

Indeed, that's actually what my (rough) measurements show (0.1% drift
from the expected sampling rate). However, I'm processing only short
clips (a few minutes maximum), so I don't worry too much about that.

> Depending upon your purpose, lip sync or stereo image phasing, this
> will be significant.

Actually, I'm shooting video footage while recording the sound with an
external device, then I try to merge them using the original sound track
(with poor quality) as a means to properly lip sync the new audio track.

> If this is critical to you, it would be best if the algorithm could be
> made continuously adaptive so that it would track any speed differences.

Maybe this sort of thing could be handled by searching for various
chunks (say, 1 minute's worth of samples) from one audio track in the
other one, and deducing from the resulting positions how the clock
drifted between the two.

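A sketch of that chunk-by-chunk idea in Octave. The helper
estimate_offset() is hypothetical and stands for whatever single-offset
method is used (e.g. the spectrogram search); the chunk size and step are
arbitrary:

  fs     = 44100;
  chunk  = 60 * fs;                  % 1 minute of samples per chunk
  step   = 60 * fs;
  starts = 1 : step : (length(u) - chunk);
  offsets = zeros(size(starts));
  for i = 1:numel(starts)
    a = u(starts(i) : starts(i) + chunk - 1);
    offsets(i) = estimate_offset(a, v);   % best offset of this chunk in v
  end
  t = (starts - 1) / fs;             % chunk start times, in seconds
  p = polyfit(t, offsets, 1);        % linear fit: offset(t) ~ p(1)*t + p(2)
  drift_ppm      = p(1) * 1e6        % relative clock drift, in ppm
  initial_offset = p(2)              % offset at t = 0, in seconds
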
> This technique is a common practice in audio production and you may
> get more insight asking at rec.audio.pro.

Ok, I will check there. Thanks!

--
Frédéric Jolliton

Reply by Mark October 26, 2009
On Oct 25, 7:42 am, Frédéric Jolliton <comp....@frederic.jolliton.com>
wrote:
> [..]

Frederic,

be aware that there will not only be a time offset between these two
recordings but there MAY also be a SPEED offset. Consider that the two
recording devices each have a tolerance on their internal clocks relative
to the playback machine. If you achieved perfect time alignment at the
start of the recordings, and the recordings are, say, 1 hour long, you may
find that after 1 hour they are no longer in time alignment. If the clocks
were off by 100 ppm, after 1 hour the time error could be a significant
fraction of a second.

Depending upon your purpose, lip sync or stereo image phasing, this
will be significant.

If this is critical to you, it would be best if the algorithm could be
made continuously adaptive so that it would track any speed differences.
Fortunately, with modern gear, the speed differences should be very small.

Also remember that if the two recordings were made with different
microphones, there will be about 1 ms per foot of offset in time due to
the speed of sound. So it depends what your goal is in combining these
two recordings.

This technique is a common practice in audio production and you may
get more insight asking at rec.audio.pro.

Mark

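Putting rough numbers on those two effects (343 m/s for the speed of
sound is an assumption; it varies with temperature):

  drift_after_1h = 100e-6 * 3600     % 100 ppm over 1 hour: 0.36 s of slip
  delay_per_foot = 0.3048 / 343      % about 0.89 ms, i.e. roughly 1 ms/foot
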
Reply by Frédéric Jolliton October 25, 2009
[..]
>> Is there a more robust and more efficient way to find how much time
>> separates 2 (or more) audio samples recorded at the same place, but with
>> different devices?

Note: to get a better idea of the spectrograms I'm working with, see:

  http://tuxee.net/tmp/audiosync

The thing is that visually I can easily match them, but I do not see how
to translate that into numbers while avoiding noise and other
perturbations.

> What you do is pretty robust although computationally heavy. Just a
> couple of suggestions:
>
> 1. Compare not the spectrograms, but the time derivatives of the
> spectrograms. That will cancel the static frequency skew.

Are you suggesting something like:

  spectogram = spectogram(2:end,:) - spectogram(1:end-1,:);

(using MATLAB/Octave syntax), assuming the first dimension is the time
axis and the second dimension is the frequency axis? I've tested it, but
the result is less accurate. See:

  http://tuxee.net/tmp/spect-deriv.png

(Here, the expected answer is around 1855.) I've slightly scaled one of
the graphs to match the other one. The red graph clearly indicates the
expected answer, but the green one does not. Maybe you were talking
about a different computation?

> 2. Normalize the spectrograms wrt the power of the signals. That will
> make the similarity measure independent of the volume.

How do I compute the power of the signals? Actually, while both sounds
are close together, one of the recorders can be more sensitive to wind,
or one can record sounds not heard by the other (note that both recorders
are less than 1 meter apart), so it might be hard to find how to
normalize them against each other.

For the spectrograms, I'm actually working on a log scale for the
amplitudes, so when comparing them by subtracting components, this should
cancel possible volume differences a bit, I guess. (Better than if I kept
the FFT output without applying the log when computing the spectrogram.)

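One possible reading of suggestion 2 (this is an assumption, not
something stated in the thread): in the log-amplitude domain a per-frame
gain difference becomes an additive constant, so subtracting each frame's
mean log magnitude removes the overall level and leaves only the spectral
shape to compare.

  % S: log-magnitude spectrogram, one column per time frame (assumed layout).
  S_norm = S - repmat(mean(S, 1), size(S, 1), 1);   % zero-mean each frame
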
> It could be possible to make an adaptive filter to minimize the
> difference between the audio streams. After the filter has converged,
> it is simple enough to derive the time shift from the coefficients.

While usually the tracks will only be a few seconds apart, I don't know
if an adaptive filter would work with a larger time difference. I'm not
familiar enough with such filters, though.

> That would be much more accurate and less computationally demanding
> than the spectrograms. However, it works only if the streams are
> sufficiently correlated for the adaptive algorithm to work. In my
> experience, the correlation between two different microphones is
> rather low, especially if there is lossy compression on the way.

I'm not using compression (both sounds are PCM; one is recorded at
44.1 kHz, mono, 16 bits, and the other one at 96 kHz, stereo, 24 bits,
but downsampled to the format of the first one before processing), but
as you guessed, unfortunately one mic can be more sensitive to some
sounds (like wind, for example) and this can complicate the computation.
(See my first link at the top.)

Thanks for your help!

--
Frédéric Jolliton

Reply by Vladimir Vassilevsky October 24, 2009

Frédéric Jolliton wrote:

> I'm looking for a method to automatically synchronize various audio
> tracks, recorded at the same place, with different devices. This is
> intended to work at post-processing time (not in real time).
>
> Basically, I'm taking two audio tracks: one recorded by a camcorder,
> with poor mic quality, and an extra one recorded at the same time with a
> dedicated sound recorder, recording the same thing.
>
> My naive approach is as follows: I compute a spectrogram for both sounds
> (using the FFT), which gives me a 2D array for each spectrogram, then I try
> various shifts to find the best match between them.
>
> To compare two spectrograms with a given shift, I take the overlapping
> parts once shifted, then I compute the mean value of the absolute
> difference between them, which I divide by the width of the overlap.
> (Hope that makes sense... I'm lacking adequate terminology here.)
>
> Then I keep the best (smallest) answer found while trying various shifts.
>
> This method seems to work well for the few tests I made. When plotting the
> value against the shift, I see a peak toward 0 where the audio tracks match.
>
> Is there a more robust and more efficient way to find how much time
> separates 2 (or more) audio samples recorded at the same place, but with
> different devices?

What you do is pretty robust although computationally heavy. Just a
couple of suggestions:

1. Compare not the spectrograms, but the time derivatives of the
spectrograms. That will cancel the static frequency skew.

2. Normalize the spectrograms wrt the power of the signals. That will
make the similarity measure independent of the volume.

It could be possible to make an adaptive filter to minimize the
difference between the audio streams. After the filter has converged,
it is simple enough to derive the time shift from the coefficients.
That would be much more accurate and less computationally demanding
than the spectrograms. However, it works only if the streams are
sufficiently correlated for the adaptive algorithm to work. In my
experience, the correlation between two different microphones is
rather low, especially if there is lossy compression on the way.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

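For what it's worth, a minimal Octave sketch of the adaptive-filter idea
(a normalized LMS); the filter length, the step size, and the names x and
d for the two mono tracks are assumptions, and the filter length bounds
the largest delay this can find (here under 0.2 s at 44.1 kHz), which is
exactly the concern about larger offsets raised above:

  % x: reference track, d: track to align against x (column vectors,
  % same sample rate).  Run this on a short excerpt; a plain Octave loop
  % over millions of samples is slow.
  L  = 8192;                      % filter length = largest findable delay
  mu = 0.5;                       % NLMS step size
  w  = zeros(L, 1);
  for n = L:length(d)
    xv = x(n:-1:n-L+1);           % most recent L samples of the reference
    e  = d(n) - w' * xv;          % error between d and the filtered reference
    w  = w + mu * e * xv / (xv' * xv + 1e-9);   % normalized LMS update
  end
  [peak, k] = max(abs(w));
  delay_samples = k - 1           % estimated delay of d relative to x
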
Reply by Frédéric Jolliton October 24, 2009
Hi,

I'm looking for a method to automatically synchronize various audio
tracks, recorded at the same place, with different devices. This is
intended to work at post-processing time (not in real time).

Basically, I'm taking two audio tracks: one recorded by a camcorder,
with poor mic quality, and an extra one recorded at the same time with a
dedicated sound recorder, recording the same thing.

My naive approach is as follows: I compute a spectrogram for both sounds
(using the FFT), which gives me a 2D array for each spectrogram, then I try
various shifts to find the best match between them.

To compare two spectrograms with a given shift, I take the overlapping
parts once shifted, then I compute the mean value of the absolute
difference between them, which I divide by the width of the overlap.
(Hope that makes sense... I'm lacking adequate terminology here.)

Then I keep the best (smallest) answer found while trying various shifts.

This method seems to work well for the few tests I made. When plotting the
value against the shift, I see a peak toward 0 where the audio tracks match.
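
(A minimal Octave sketch of that search, for concreteness; the frame size,
hop, window, and the +/- 10 second search range are arbitrary choices, and
u and v stand for the two mono tracks at the same sample rate.)

  fs   = 44100;
  nfft = 1024;
  hop  = 512;
  win  = 0.5 - 0.5 * cos(2*pi*(0:nfft-1)' / nfft);   % Hann window

  % Log-magnitude spectrogram, one column per frame (could be saved as
  % logspec.m).
  function S = logspec (x, nfft, hop, win)
    nframes = floor((length(x) - nfft) / hop) + 1;
    S = zeros(nfft/2, nframes);
    for k = 1:nframes
      X = fft(x((k-1)*hop + (1:nfft)) .* win);
      S(:,k) = log(abs(X(1:nfft/2)) + 1e-6);
    end
  endfunction

  Su = logspec(u, nfft, hop, win);
  Sv = logspec(v, nfft, hop, win);

  maxshift = round(10 * fs / hop);     % search +/- 10 seconds of shift
  best = Inf;
  for s = -maxshift:maxshift
    iu = max(1, 1+s) : min(size(Su,2), size(Sv,2)+s);  % overlapping frames
    iv = iu - s;                       % frame i of u vs frame i-s of v
    d  = mean(mean(abs(Su(:,iu) - Sv(:,iv))));  % mean absolute difference
    if d < best
      best = d;
      best_shift = s;
    end
  end
  offset_seconds = best_shift * hop / fs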

Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?

-- 
Frédéric Jolliton