Hello,

Let's say I have a signal that contains my voice saying "1 2 3", and then I have another signal of my voice saying "1 2 3", but this time I paused for 2 seconds in the beginning. How can I create a filter that would apply the effect of that delay? Or let's say I have my voice saying "1 2 3" and someone else's voice saying "1 2 3", and I want to try to capture the effect of changing my voice into the other person's.

I tried something like this, but it didn't work:

a = first voice
b = second voice

aa = fft(a);
bb = fft(b);

cc = aa./bb;

Shouldn't cc now encompass the effect I'm aiming for? What technique should I use to get this done?

Thank you
black box system
Started by ●October 22, 2004
Reply by ●October 22, 2004
On 2004-10-22 06:14:57 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

> [quoted message snipped]

This is far more difficult than you would think.

The first scenario is somewhat easier: it involves estimating a delay time for (roughly) the same signal. If your spoken numbers are very similar in both cases (for example, because you copied the data from the first signal and added zeros to delay it by 2 s), you can use correlation to find the lag between the first and the second sentence. If you have a priori knowledge of what the words (numbers) are, this can be as simple as cutting out the silence between them to match their onsets in time.

But if you have recordings of different people, different numbers, or very different deliveries in the two cases, the scenario quickly gets much more complicated. The worst case would involve resorting to techniques used in speech recognition to create a symbolic (maybe even phonetic) representation of the spoken words that you can compare.

Changing your voice to sound like somebody else's is even more involved. First, our ear is very good at detecting anything unusual about a voice, so you might not get a perfect result. Second, many factors determine the overall "sound" of a voice: the way we speak (prosodic parameters including timing, intonation, pronunciation, emphasis) and the "tone" of the voice (both at the excitation and formant level). Be prepared: modelling all these factors is a very demanding task...

--
Stephan M. Bernsee
http://www.dspdimension.com
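The correlation approach described above can be sketched in a few lines of NumPy (the thread's snippets are MATLAB-style, but the idea translates directly). The sample rate and test signals here are invented for illustration, and this only works when the two recordings really contain the same waveform:

```python
import numpy as np

def estimate_delay(a, b, fs):
    """Estimate how many samples (and seconds) b lags behind a
    by locating the peak of the full cross-correlation."""
    xcorr = np.correlate(b, a, mode="full")
    lag = int(np.argmax(xcorr)) - (len(a) - 1)
    return lag, lag / fs

fs = 2000                                    # toy sample rate
rng = np.random.default_rng(0)
a = rng.standard_normal(fs)                  # stand-in for "1 2 3"
b = np.concatenate([np.zeros(2 * fs), a])    # same take, 2 s of silence first
lag, seconds = estimate_delay(a, b, fs)      # lag == 4000, seconds == 2.0
```

Note that the peak is only sharp because b contains an exact copy of a; with two separate takes the correlation gets broad and ambiguous, which is Stephan's point.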
Reply by ●October 22, 2004
Thank you for your informed response.

Hmmm... What I was trying to say: given a clean recording of someone's voice saying a certain sentence, and another recording of that same person saying the same sentence but with some kind of noise or 'effect', how can one use this input and output to define that noise or effect, so that the next time a person says something and that noise or effect is present, I can remove it, because I have previously 'modeled' it in some way?

Thank you again.

"Stephan M. Bernsee" <spam@dspdimension.com> wrote in message news:2trm9gF237igvU1@uni-berlin.de...

> [quoted message snipped]
Reply by ●October 22, 2004
I think it depends on the type of "noise or effect" being applied. First of all, is the voice on the two recordings the exact same thing? Not the same person saying the same thing (like an actor doing two different takes of the same material), but actually the same recording? If so, there are solutions. If the only difference is additive noise, there are noise reduction techniques that can remove it; the ones I'm familiar with operate in the frequency domain using "spectral subtraction". If the effect is a linear filter (including echo/reverb), then adaptive filters can estimate and compensate for the filter. This is commonly done in echo canceling applications.

"Jim Rex" <jimrex0@hotmail.com> wrote in message news:jsCdnXeOn7aAbuXcRVn-1w@rogers.com...

> [quoted message snipped]
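The adaptive-filter idea above can be sketched with a normalized LMS loop in NumPy. This is a toy system-identification demo, not a production echo canceller: the 4-tap "echo path" h is invented, and it assumes the clean reference and the effected recording are the same take, sample-aligned, with no added noise:

```python
import numpy as np

def nlms_identify(x, d, taps, mu=0.5, eps=1e-8):
    """Estimate FIR weights w so that d[n] ~ sum_k w[k] * x[n-k],
    using the normalized LMS (NLMS) adaptive algorithm."""
    w = np.zeros(taps)
    for n in range(taps - 1, len(x)):
        u = x[n - taps + 1:n + 1][::-1]   # newest sample first
        e = d[n] - w @ u                  # a-priori error
        w += mu * e * u / (u @ u + eps)   # normalized gradient step
    return w

rng = np.random.default_rng(1)
x = rng.standard_normal(20000)            # clean reference signal
h = np.array([0.8, 0.0, 0.3, -0.1])       # hypothetical echo path
d = np.convolve(x, h)[:len(x)]            # observed "effected" signal
w = nlms_identify(x, d, taps=4)           # w converges toward h
```

With a white reference and no measurement noise the weights converge essentially exactly; real echo cancellers face noise, long impulse responses, and double-talk, which is where the engineering effort goes.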
Reply by ●October 22, 2004
It's like an actor doing two different takes of the same material, but with the second one having some noise.

What I'm trying to do is take in someone's voice while he/she is close to the microphone, and another take when he or she is far away, and model this effect in some way. So the next time a person is far away from the microphone, I can apply an effect to make it sound as if the person is talking directly into the microphone in a crisp and clear voice.

"Jon Harris" <goldentully@hotmail.com> wrote in message news:2tsspnF236o1rU1@uni-berlin.de...

> [quoted message snipped]
Reply by ●October 22, 2004
Well, this obviously very simple question isn't quite that simple either - let me explain.

In the realm of digital signal processing, you define the desired output of your process to be the "signal" and the undesired output to be "noise". These definitions are somewhat arbitrary, but we'll use them without defining them exactly for now. In these terms, your problem essentially boils down to the question: does knowledge of the signal alone help to improve its signal-to-noise ratio?

Now, if you know your signal *exactly*, the answer is "yes", because all you need to do is a simple subtraction. Obviously, that case isn't very useful, because it practically never happens. The more general (and therefore useful) case is that your observed (i.e. noisy) signal is *similar* to what you are expecting. In that case, the numeric subtraction needs to be replaced by something that can be seen as a more "general" form of subtraction. Note that it is the difference between "exact" and "similar" that causes you a headache here, because "similar" can mean practically anything.

To get rid of the noise in the general case, one usually seeks to represent the observed data in a way that separates signal and noise from each other as much as possible, i.e. to use unique properties of the known signal to distinguish it from the noise (or vice versa). It is easy to see that if the noise "behaves" very differently from the signal, you might be lucky and achieve your goal. The more similar signal and noise are, the more ambiguity is introduced, and the less successful this will be.

Back to your observed data with and without an interfering "effect" of some sort. Since you're not exactly defining the type of noise (the "effect") you have to deal with in your application, it is difficult to say whether you can be successful. There's a whole world of possible interfering effects that can create all types of noise.

As a rule of thumb, don't jump to the conclusion that once you have a definition of the outcome of an interfering process, the effect of that process on your signal is orthogonal to the unprocessed signal and therefore easy to undo. In the majority of cases this is not possible, or at least not easy. Think about it: what if the effect was non-linear, or time-variant? Or if it changed the speed of your signal? In each case, removing the interference would require a different approach.

--
Stephan M. Bernsee
http://www.dspdimension.com

On 2004-10-22 13:46:33 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

> [quoted message snipped]
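The "exact vs. similar" distinction above is easy to see numerically. In this made-up example a sine stands in for the voice: subtracting the exactly-known noise recovers the signal perfectly, while subtracting a merely similar noise (a fresh draw from the same process) removes nothing and adds its own noise on top:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(4000) / 8000.0
signal = np.sin(2 * np.pi * 220 * t)          # stand-in for the voice
noise = 0.3 * rng.standard_normal(t.size)
observed = signal + noise

# Noise known exactly: a plain subtraction recovers the signal.
exact_err = np.std((observed - noise) - signal)

# Noise only known to be "similar" (same statistics, different samples):
# subtraction leaves the original noise and stacks new noise on top.
similar = 0.3 * rng.standard_normal(t.size)
similar_err = np.std((observed - similar) - signal)
```

The residual in the "similar" case is about sqrt(2) times the original noise level, which is why practical methods subtract in some transformed domain where signal and noise separate, rather than sample by sample.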
Reply by ●October 22, 2004
Ah, now I see where you're going. This is not noise reduction, but rather some kind of removal of the room impulse response that gets convolved with your signal. Google the keywords "deconvolution" and "echo cancelling"; they might lead you in the right direction, even though your application is still quite tricky to implement if you need it to work in the general case...

--
Stephan M. Bernsee
http://www.dspdimension.com

On 2004-10-22 19:43:26 +0200, "Jim Rex" <jimrex0@hotmail.com> said:

> [quoted message snipped]
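The deconvolution pointer also explains why the original aa./bb attempt fails: dividing one spectrum by another blows up wherever the denominator is near zero. A standard workaround is Wiener-style regularized division. Everything here (the 4-tap "room response", the regularization constant) is invented for the sketch, and it assumes the impulse response is known and time-invariant, which real rooms are not:

```python
import numpy as np

def wiener_deconvolve(y, h, n, lam=1e-6):
    """Recover x from y = conv(x, h) by regularized spectral division.
    Plain Y/H (the aa./bb idea) is unstable near zeros of H; the
    lam term keeps the division bounded there."""
    N = len(y) + len(h) - 1               # FFT length avoiding wraparound
    Y = np.fft.rfft(y, N)
    H = np.fft.rfft(h, N)
    X = Y * np.conj(H) / (np.abs(H) ** 2 + lam)
    return np.fft.irfft(X, N)[:n]

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)             # hypothetical dry signal
h = np.array([1.0, 0.4, 0.2, 0.1])        # hypothetical room response
y = np.convolve(x, h)                     # observed reverberant signal
x_hat = wiener_deconvolve(y, h, len(x))   # close to x
```

The catch for the close-mic problem is that h is not known (that would need blind deconvolution) and a real room response is thousands of taps long and changes as the talker moves.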
Reply by ●October 22, 2004
In that case, my suggestion about echo canceling does not apply. Spectral subtraction methods could still be used to remove "constant" noise*, as this technique only relies on having a bit of the recording where there is nothing but noise (so you wouldn't even need the clean recording). But this probably wouldn't be of much help with your scenario of making a distant-miked person sound close-miked. I don't think there is much hope for that today.

*Constant noise like a fan that is always on, or tape hiss - not traffic or wind noise that is changing.

"Jim Rex" <jimrex0@hotmail.com> wrote in message news:DZmdnRlwk8kt2-TcRVn-qg@rogers.com...

> [quoted message snipped]
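The spectral-subtraction idea above can be sketched crudely in NumPy: estimate the noise magnitude spectrum from a noise-only stretch, subtract it frame by frame, and keep the noisy phase. The frame size, signals, and levels are invented; a real implementation would use overlapping windowed frames and oversubtraction to tame the "musical noise" this produces:

```python
import numpy as np

def spectral_subtract(noisy, noise_only, frame=256):
    """Crude magnitude spectral subtraction (no overlap, no window)."""
    # Average noise magnitude per frequency bin from the noise-only part.
    k = len(noise_only) // frame
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise_only[i * frame:(i + 1) * frame]))
         for i in range(k)], axis=0)
    out = np.zeros(len(noisy))
    for i in range(len(noisy) // frame):
        seg = noisy[i * frame:(i + 1) * frame]
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # half-wave rectify
        out[i * frame:(i + 1) * frame] = np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)), frame)
    return out

rng = np.random.default_rng(4)
t = np.arange(8192) / 8000.0
clean = np.sin(2 * np.pi * 500 * t)              # bin-aligned test tone
noise = 0.3 * rng.standard_normal(t.size)
denoised = spectral_subtract(clean + noise, 0.3 * rng.standard_normal(4096))
```

This works here precisely because the noise is stationary (same statistics in the noise-only stretch and during speech), which is the "constant noise" caveat in the reply above.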
Reply by ●October 22, 2004
I think deconvolution applies (or perhaps dereverberation), but I'm not so sure about echo canceling, at least in the traditional teleconferencing usage. In that context, echo canceling requires a clean reference signal and an "effected" signal that can be derived from the reference by (primarily) linear processes. I think what Jim is probably after is blind, time-variant dereverberation, which is a very difficult nut to crack!

-Jon

"Stephan M. Bernsee" <spam@dspdimension.com> wrote in message news:2tt2d6F22iid5U1@uni-berlin.de...

> [quoted message snipped]
Reply by ●October 23, 2004
Subtractive noise reduction also tends to degrade the signal quite a bit, because signal and noise usually overlap in the spectral domain, so he will end up with something that may not be usable for his purpose...

--
Stephan M. Bernsee
http://www.dspdimension.com

On 2004-10-22 21:19:28 +0200, "Jon Harris" <goldentully@hotmail.com> said:

> [quoted message snipped]