DSPRelated.com
Forums

black box system

Started by Jim Rex October 22, 2004
Hello,

Let's say I have a signal that contains my voice saying "1 2 3" and then I
have another signal that has my voice saying "1 2 3" but this time I paused
for 2 seconds in the beginning. How can I create a filter that would apply
the effect of delay? Or let's say I have my voice saying "1 2 3" and someone
else's voice saying "1 2 3" and I want to try to capture that effect of
changing my voice to someone else's.

I tried something like this, but it didn't work:

a = first voice
b = second voice

aa = fft(a);
bb = fft(b);

cc = aa./bb;

Shouldn't cc now encompass the effect I'm aiming for?
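[For reference: the naive ratio fails for two reasons. First, to capture the box that maps `a` to `b` the ratio should be `bb./aa`, not `aa./bb`; second, the two recordings must be padded to a common length, and bins where the denominator spectrum is near zero blow the ratio up. A regularized sketch of the same idea, in NumPy rather than the MATLAB above, with an arbitrary `eps` chosen for illustration:]

```python
import numpy as np

def estimate_transfer(a, b, eps=1e-3):
    """Estimate H(f) = B(f)/A(f), the 'black box' mapping signal a to b.

    Both signals are zero-padded to a common length first; the small
    regularizer eps keeps near-zero bins of A(f) from blowing up the ratio.
    """
    n = max(len(a), len(b))
    A = np.fft.rfft(a, n)
    B = np.fft.rfft(b, n)
    # Regularized (Wiener-style) division instead of a raw B/A
    return (B * np.conj(A)) / (np.abs(A) ** 2 + eps)

# Toy check: b is a delayed, scaled copy of a
rng = np.random.default_rng(0)
a = rng.standard_normal(256)
b = 0.5 * np.roll(a, 10)
H = estimate_transfer(a, b)
h = np.fft.irfft(H)          # impulse response of the "black box"
print(np.argmax(np.abs(h)))  # peak at sample 10 (the delay)
```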


What technique should I use to get this done?


Thank you


On 2004-10-22 06:14:57 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

This is far more difficult than you would think.

The first scenario is somewhat easier - it involves estimating a delay time for (roughly) the same signal. If your spoken numbers are very similar in both cases (for example, because you copied the data from the first signal and added zeros to delay it by 2 s) you can use correlation to find the lag between the first and the second sentence. If you have a priori knowledge of what the words (numbers) are, then this can be as simple as cutting out the silence between them to match their onsets in time.

But if you have recordings of different people, different numbers, or you speak very differently in both cases, the scenario quickly gets much more complicated. The worst case would involve resorting to techniques used in speech recognition to create a symbolic (maybe even phonetic) representation of the spoken words that you can compare.

Changing your voice to sound like somebody else's is even more involved. Here the problem is that our ear is very good at detecting anything unusual about a voice, so you might not get a perfect result. Second, there are many factors that determine the overall "sound" of a voice - the way we speak (prosodic parameters including timing, intonation, pronunciation, emphasis) and the "tone" of the voice (both at the excitation and formant level). Be prepared that modelling all these factors is a very demanding task...
-- 
Stephan M. Bernsee
http://www.dspdimension.com
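[Stephan's correlation suggestion for the delay case can be sketched in a few lines. NumPy is used for illustration; the 8 kHz rate and the white-noise stand-in for speech are made-up values:]

```python
import numpy as np

def find_lag(x, y):
    """Estimate how many samples y lags behind x via cross-correlation.

    Assumes y is (roughly) a delayed copy of x; the peak of the
    cross-correlation gives the most likely lag.
    """
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

# Toy example: delay x by 2 s worth of samples at 8 kHz
fs = 8000
rng = np.random.default_rng(1)
x = rng.standard_normal(fs)                # 1 s of "speech"
y = np.concatenate([np.zeros(2 * fs), x])  # 2 s of silence, then x
print(find_lag(x, y) / fs)                 # → 2.0
```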
Thank you for your informed response.

Hmmm... What I was trying to say was: given a clear recording of
someone's voice saying a certain sentence, and another recording of
that same person's voice saying that same sentence but with some kind of
noise or 'effect', how can one use this input and output to try to define
that noise or effect, so that next time a person says something and that
noise or effect is present, I can remove it because I have previously
'modeled' that noise or 'effect' in some way.

Thank you again.

"Stephan M. Bernsee" <spam@dspdimension.com> wrote in message
news:2trm9gF237igvU1@uni-berlin.de...
I think it depends on the type of "noise or effect" being applied.  First of
all, is the voice on the 2 recordings the exact same thing?  Not the same person
saying the same thing (like an actor doing 2 different takes of the same
material), but actually the same recording?  If so, there are solutions.  If the
only difference is additive noise, there are noise reduction techniques that can
remove the noise.  The ones I'm familiar with operate in the frequency domain
using "spectral subtraction".  If the effect is a linear filter (including
echo/reverb), then adaptive filters can estimate and compensate for the filter.
This is commonly performed in echo canceling applications.
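[A toy sketch of the adaptive-filter idea Jon mentions, using LMS in NumPy; the 3-tap "room" filter, step size, and tap count are made-up illustration values, and real echo cancellers are considerably more elaborate:]

```python
import numpy as np

def lms_identify(x, d, taps=8, mu=0.01):
    """Identify an unknown FIR filter with the LMS algorithm.

    x is the clean reference, d the observed (filtered) signal; the
    adaptive weights w converge toward the unknown filter's taps.
    """
    w = np.zeros(taps)
    buf = np.zeros(taps)
    for i in range(len(x)):
        buf = np.concatenate(([x[i]], buf[:-1]))  # newest sample first
        e = d[i] - w @ buf                        # prediction error
        w += mu * e * buf                         # LMS weight update
    return w

# Toy check: the "room" is a known 3-tap FIR filter
rng = np.random.default_rng(2)
h_true = np.array([0.8, 0.3, -0.2])
x = rng.standard_normal(20000)
d = np.convolve(x, h_true)[: len(x)]
w = lms_identify(x, d)
print(np.round(w[:3], 2))  # close to [0.8, 0.3, -0.2]
```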

"Jim Rex" <jimrex0@hotmail.com> wrote in message
news:jsCdnXeOn7aAbuXcRVn-1w@rogers.com...
It's like an actor doing 2 different takes saying the same material, but
with the 2nd one having some noise.

What I'm trying to do is take in someone's voice while he/she is close to
the microphone, and another recording when he or she is far away, and
model this effect in some way. So next time a person is far away from the
microphone, I can apply an effect to make it sound as if the person is
talking directly into the microphone in a crisp and clear voice.



"Jon Harris" <goldentully@hotmail.com> wrote in message
news:2tsspnF236o1rU1@uni-berlin.de...
Well, this seemingly simple question isn't quite that simple 
either - let me explain: In the realm of Digital Signal Processing, you 
define the desired output of your process to be the "signal" and the 
undesired output to be "noise". Note that these definitions are 
somewhat arbitrary but we'll use them without defining them exactly for 
now. In these terms, your problem essentially boils down to the 
question "does knowledge of the signal alone help to improve its 
signal-to-noise ratio"?

Now, if you know your signal *exactly*, the answer is "yes", because 
all you need to do is a simple subtraction. Obviously, that case isn't 
very useful because it practically never happens.
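[The exact-knowledge case really is a one-line subtraction; a NumPy illustration with a synthetic signal:]

```python
import numpy as np

# If the clean signal s is known *exactly*, the "noise" is just the residual:
rng = np.random.default_rng(3)
s = np.sin(np.linspace(0, 20, 1000))    # known signal
noise = 0.1 * rng.standard_normal(1000)
observed = s + noise
recovered_noise = observed - s          # plain subtraction suffices
print(np.allclose(recovered_noise, noise))  # → True
```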

The more general (and therefore useful) case would be that your 
observed (i.e., noisy) signal is *similar* to what you are expecting. In 
that case, the numeric subtraction needs to be replaced by something 
that can be seen as a more "general" form of a subtraction. Note that 
it is the difference between "exact" and "similar" that causes you a 
headache here, because "similar" can practically mean anything.

To get rid of the noise in the general case, one usually seeks to 
represent the observed data in a way that separates signal and noise 
from each other as much as possible, i.e., use unique properties of the 
known signal to distinguish it from the noise (or vice-versa). It is 
easy to see that if the noise "behaves" very differently compared to 
the signal you might be lucky and achieve your goal. The more similar 
signal and noise are the more ambiguity will be introduced, and the 
less successful this will be.

Back to your observed data with and without an interfering "effect" of 
some sort. Since you're not exactly defining the type of noise (the 
"effect") you have to deal with in your application, it is difficult to 
say if you can be successful. There's a whole world of possible 
interfering effects that can create all types of noise.

As a rule of thumb, don't jump to the conclusion that having a 
definition of the outcome of an interfering process makes the effect 
of that process on your signal orthogonal to the unprocessed signal 
and therefore easy to undo. In the majority of cases this is not 
possible, or at least not easy. Think about it: what if the effect is 
non-linear, or time-variant? Or if it changes the speed of your 
signal? In each case, removing the interference would require a 
different approach.
-- 
Stephan M. Bernsee
http://www.dspdimension.com


On 2004-10-22 13:46:33 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

Ah, now I see where you're going. This is not noise reduction but 
rather some kind of removal of the room impulse response that gets 
convolved with your signal. Google for the keyword "deconvolution" and 
"echo cancelling", they might lead you in the right direction, even 
though your application is still quite tricky to implement if you need 
it to work in the general case...
-- 
Stephan M. Bernsee
http://www.dspdimension.com



On 2004-10-22 19:43:26 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

In that case, my suggestion about echo canceling does not apply.  Spectral
subtraction methods could still be used to remove "constant" noise*, as this
technique only relies on having a bit of the recording where there is nothing
but noise (so you wouldn't even need the clean recording).  But this probably
wouldn't be of much help with your scenario of making a distant-miked person
sound close-miked.  I don't think there is much hope for this today.

*constant noise like a fan that is always on, or tape hiss, not traffic or wind
noise that is changing.
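[A bare-bones sketch of the spectral subtraction Jon describes, in NumPy. The frame size and rectangular framing are simplifications; real implementations use windowed overlap-add, oversubtraction, and a spectral floor to tame "musical noise":]

```python
import numpy as np

def spectral_subtract(noisy, noise_only, frame=256):
    """Crude spectral subtraction using a noise-only stretch of recording.

    The average magnitude spectrum of the noise-only segment is
    subtracted from each frame of the noisy signal; phases are kept.
    """
    # Average noise magnitude spectrum over noise-only frames
    n_frames = len(noise_only) // frame
    mags = [np.abs(np.fft.rfft(noise_only[i*frame:(i+1)*frame]))
            for i in range(n_frames)]
    noise_mag = np.mean(mags, axis=0)

    out = np.zeros(len(noisy))
    for i in range(len(noisy) // frame):
        seg = noisy[i*frame:(i+1)*frame]
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out[i*frame:(i+1)*frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

# Demo on pure noise: subtracting the noise estimate lowers the power
rng = np.random.default_rng(4)
noise_only = rng.standard_normal(4096)
noisy = rng.standard_normal(4096)
out = spectral_subtract(noisy, noise_only)
print(np.mean(out**2) < np.mean(noisy**2))  # → True
```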

"Jim Rex" <jimrex0@hotmail.com> wrote in message
news:DZmdnRlwk8kt2-TcRVn-qg@rogers.com...
I think deconvolution applies (or perhaps dereverberation), but I'm not so sure
about echo canceling, at least in the traditional teleconferencing usage.  In
that context, echo canceling requires a clean reference signal and an "effected"
signal that can be derived from the reference by (primarily) linear processes.

I think what Jim is probably after is blind, time-variant dereverberation, which
is a very difficult nut to crack!

-Jon

"Stephan M. Bernsee" <spam@dspdimension.com> wrote in message
news:2tt2d6F22iid5U1@uni-berlin.de...
Subtractive noise reduction also tends to degrade the signal quite a 
bit because signal and noise usually overlap in the spectral domain, so 
he will end up with something that may not be usable for his purpose...

-- 
Stephan M. Bernsee
http://www.dspdimension.com




On 2004-10-22 21:19:28 +0200, "Jon Harris" <goldentully@hotmail.com> said:
