# subtracting two audio samples, how?

Thread started March 13, 2004
```Hi:

I've given considerable thought to this, but my math is weak. I have
two sound samples, each 10-20 seconds of sound at something like
44 kHz. I need to subtract one from the other, so I end up with a
prototype and a difference, where the difference should be very small
and overall use less storage. Suppose both samples are extremely
similar to the human ear, and that the beats of the samples match
quite well. This problem is very similar to the joint-stereo
compression used by mp3/ogg/etc., if it differs at all, but I'm stuck
at the differencing. Here are the things I've tried:

1. I could do the subtraction in the time domain, but the result is
more like a union of the two samples than a difference - the problem
is that the phases are all different, even though the samples sound
the same to our ears. The difference computed this way is useless to
me because it is too big.

2. I tried the DCT domain - the problem is that when I subtract the
coefficients of the two time series in this domain, the resulting
difference is still too large. After all, I'm not dealing with a
'stereo' difference, but with similarity between samples that could be
out of sync in multiple places, with - say - different background
accompaniment.

Are there existing algorithms for dealing with this?  E.g.
combining time-warping with the DCT?

Bill

p.s. Is this the same problem as removing one audio signal from
another - like noise cancellation - except that instead of creating an
opposite waveform, we have two very similar waveforms whose phases
most likely differ all over the place?  E.g. samples of the same
performance recorded at different times.
```
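The failure Bill describes in point 1 is easy to reproduce. A minimal pure-Python sketch (constants chosen purely for illustration): two tones that are identical except for phase sound the same, yet their sample-wise difference is larger than either signal.

```python
import math

RATE = 44100   # samples per second
FREQ = 500.0   # Hz
N = 441        # 10 ms of audio

def tone(phase):
    # A 500 Hz sine with the given phase offset (radians).
    return [math.sin(2 * math.pi * FREQ * n / RATE + phase) for n in range(N)]

a = tone(0.0)
b = tone(math.pi / 2)   # same pitch, same loudness, shifted a quarter cycle

diff = [x - y for x, y in zip(a, b)]
peak = max(abs(d) for d in diff)
# The two tones sound identical, yet the sample-wise difference peaks
# near sqrt(2) -- bigger than either signal's own peak of 1.0.
```

For a quarter-cycle shift the residual `sin θ − cos θ = √2·sin(θ − π/4)` is actually louder than the inputs, which is exactly the "union rather than difference" effect Bill reports.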
```Bill Chiu wrote:

> <snip>

There might be something you can do in certain very specific situations,
but in general, you're out of luck. It's like trying to subtract two
pictures of a stormy ocean to depict a flat calm.

Jerry
--
Engineering is the art of making what you want from things you can get.

```
```Jerry Avins wrote:
> Bill Chiu wrote:
> > <snip>

> There might be something you can do in certain very specific situations,
> but in general, you're out of luck. It's like trying to subtract two
> pictures of a stormy ocean to depict a flat calm.

All need not be lost (nice analogy though, Jerry!). You say that the
two sound files are similar. How do you know that? Here are some
suggestions:

... the same sound recorded with two different microphones?
... the result of applying two different algorithms to the same
original sound?
... two speakers saying the same thing?
... two bands playing the same song (or one band playing the same song
twice)?

Whatever the case, if you have some a priori knowledge about why the
two sound files are similar, there is a good chance you can use it to
your advantage.

It would also help to know your final goal. Perhaps somebody has
already solved a similar problem.
Regards,
Andor

PS: Noise reduction does not work by subtracting some noise signal. It
is usually based on filtering and gating.
```
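One concrete way to exploit such a priori knowledge, sketched in pure Python (a hypothetical illustration, not something proposed in the thread): if the two takes differ mainly by a constant delay, a brute-force cross-correlation can estimate the lag before subtracting.

```python
def best_lag(a, b, max_lag):
    # Brute-force cross-correlation: the lag at which b best lines up with a.
    def score(lag):
        pairs = zip(a, b[lag:]) if lag >= 0 else zip(a[-lag:], b)
        return sum(x * y for x, y in pairs)
    return max(range(-max_lag, max_lag + 1), key=score)

# Toy example: the second "take" is the first one delayed by 3 samples.
a = [0.0, 0.2, 0.9, 0.4, -0.3, -0.8, -0.1, 0.5, 0.7, 0.1, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0] + a[:-3]

lag = best_lag(a, b, max_lag=5)   # estimated delay of b relative to a
aligned = b[lag:] if lag >= 0 else [0.0] * -lag + b

# Once the takes are lined up, the residual all but vanishes.
residual = [x - y for x, y in zip(a, aligned)]
```

A single global lag only handles the simplest case; Bill's samples drift in and out of sync, which is why he asks about time-warping (dynamic time warping generalizes this idea to a varying lag).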
```Andor,

The sound sources are closest to the example of "one band playing the
same song in two different concerts," where the beat of the two
samples matches almost exactly to 'trained ears', but to those same
ears there are clear though minor vocal, instrumental, arrangement,
and acoustic differences between the two recordings.

One of the things I'm interested in achieving is listening only to
these sonic differences, without also hearing the full song.

More interestingly, I'm looking for a way to compress these two very
similar samples so as to end up with one original (or mean) track and
one difference track - in such a way that from these two tracks I can
reproduce the originals, lossy but at 'high' quality. The difference
track needs to consist of mostly small values (and hence compress
better than the original track(s) would).

I have tested an algorithm that can efficiently identify similar parts
of the same or different audio samples, e.g. the chorus of a pop song.
If this problem of differencing similar samples can be solved, then we
can compress mp3/ogg far more efficiently by collapsing repeated
parts.

Bill

an2or@mailcircuit.com (Andor) wrote in message news:<ce45f9ed.0403140104.7cd1e70@posting.google.com>...
> <snip>
```
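Bill's mean-plus-difference idea is essentially the mid/side trick from joint-stereo coding applied across takes rather than across channels. A toy sketch, assuming the two takes are already perfectly sample-aligned (which, as the rest of the thread argues, is the hard part):

```python
take1 = [0.50, -0.20, 0.80, 0.10, -0.60]
take2 = [0.52, -0.18, 0.79, 0.12, -0.61]   # a very similar, aligned take

mid  = [(x + y) / 2 for x, y in zip(take1, take2)]   # "prototype" track
side = [(x - y) / 2 for x, y in zip(take1, take2)]   # small-valued diff track

# Lossless reconstruction from the two stored tracks:
rec1 = [m + s for m, s in zip(mid, side)]
rec2 = [m - s for m, s in zip(mid, side)]

# The side track's peak is tiny compared with the mid track's, so after
# quantisation it needs far fewer bits -- but only while alignment holds.
```

The storage win comes entirely from the side track's small dynamic range; with the phase drift Bill describes, the side values grow to signal size and the win disappears.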
```Bill Chiu wrote:
> <snip>

Bill,

Let me touch on just one of the difficulties ahead of you: time differences.
The sounds in the recordings consist of frequencies between, say, 30 and
10,000 cycles per second. These sounds are made by vibrating strings or
air columns. A sound at 500 cycles is completely canceled by a replica
of equal strength that is delayed by the time sound needs to travel a
foot in air. The precise time that a sound starts relative to another
(put differently, its phase) is not controllable by someone playing an
instrument. Two guitars doubled on the same part sound like two guitars,
not one louder one. If the situation were simple enough for addition
(hence also subtraction) to work, they would sound like one guitar, just
a bit louder.

Our ears don't bring sound-pressure waves directly to our consciousness.
Instead, we hear abstractions: pitches, timbres, phonemes, and more. You
want to analyze your recordings on the basis of the abstractions you
hear. The simple techniques of DSP deal with waveforms; abstraction is
much harder. As far as I know, no program can "take dictation" (the
dream of going from microphone to written score remains remote), yet
even I, whose musical training ended nearly 60 years ago, can perceive
four-part harmonies as distinct. You are asking for a lot!

Jerry
--
Engineering is the art of making what you want from things you can get.

```
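Jerry's one-foot figure can be checked numerically. A sketch (taking the speed of sound as roughly 1130 ft/s in room-temperature air): the one-foot travel time is close to half a 500 Hz period, so adding the delayed replica to the original mostly, though not perfectly, cancels it.

```python
import math

SPEED = 1130.0   # ft/s, rough speed of sound at room temperature
FREQ = 500.0     # Hz
RATE = 44100     # samples per second

delay = 1.0 / SPEED        # ~0.885 ms: time for sound to travel one foot
half_period = 0.5 / FREQ   # 1.0 ms: half a cycle at 500 Hz

# The one-foot delay is close to (though not exactly) half a period, so
# the original plus the delayed replica interfere destructively:
t = [n / RATE for n in range(441)]    # 10 ms of samples
mix = [math.sin(2 * math.pi * FREQ * x) +
       math.sin(2 * math.pi * FREQ * (x - delay)) for x in t]
peak = max(abs(v) for v in mix)   # ~0.36, versus 2.0 if the copies were in phase
```

Since the sum of the two copies has amplitude 2·|cos(π·FREQ·delay)|, a sub-millisecond timing error swings the 500 Hz component between cancellation and reinforcement - which is why performers can never control it.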
They are getting closer every day. I have heard some impressive demos of blind
source separation from a track with clearly differentiated voices (drums, bass,
guitar, that sort of thing), easily good enough to perform remixing.  This is
all part of the global effort towards MPEG-7, which also demands
sound-classification tools. Monophonic transcription is now very well
established, at least for pitched sources.

The canonical MPEG-7 source distribution is available for anyone to download and
explore:

http://www.lis.e-technik.tu-muenchen.de/research/bv/topics/mmdb/e_mpeg7.html

Put BSS together with beat detection and source classification, and full
transcription of polyphonic sources (if the voices are distinct and not too
dense, I guess) is close to doable. Ditto for separating speakers in the manner
of the cocktail-party effect (which needs spatial information, i.e. at least
two microphones). Work is also intense on the much more difficult problem of
homogeneous sources; see e.g. DAFx-03 (http://www.elec.qmul.ac.uk/dafx03/).
It is certainly far from easy, but I would hesitate to say "remote" now,
especially if you have some a priori information, such as a known number of
voices.

Richard Dobson

Jerry Avins wrote:
> <snip>

```
OK: I retract "remote" as it applies to the industry generally. We agree
on "difficult," but some programs that do difficult things are available
for purchase. On the time scale that interests the OP, "remote" might be
a reasonable description if no one can suggest available software.

Jerry
--
Engineering is the art of making what you want from things you can get.
Richard Dobson wrote:

> <snip>

```