
speech recognition

Started by RichD January 19, 2012
"RichD"  wrote in message 
news:4b7edabe-1efe-4e39-9869-fcab7035b613@pt5g2000pbb.googlegroups.com...
> http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2012/01/16/BU8C1MOO20.DTL
>
> He claims he can filter speech from background noise.
>
> I recall discussing this possibility years ago. Someone said, these
> filters already exist. They do - they're notch filters! It's close
> to brain dead, believing that constitutes 'voice filtering'.
>
> Dr. Watts has been working on this for years, so I was wondering
> what techniques he's using, how much is public domain. Anyone
> here know anything about the subject, or this product?
> Is it neural nets, DSP filters, or what?
>
> --
> Rich
Sorry, it generally is impossible, because what is "background noise" to one person may be a perfectly valid signal to another. I assume by background noise you mean other voices. If one wants a certain voice, then it had better have some property that allows it to be distinguished from the others. If not, it is no different and cannot be singled out.

If the voice to be "filtered" has the property of closer proximity, then it must be louder in the mix. There are many possible ways to improve its SNR, where by noise we mean the "background noise". One could use multiple mics. There are actually mics that essentially do this automatically: a unidirectional element facing away from the user picks up the background, and another unidirectional element faces towards the user. By subtracting the two you reduce the background noise (which will generally show up in both mics roughly equally).

Also, background voices tend to have a different frequency response, since high frequencies will be dampened by all the stuff in the room. So one could apply filters to keep the high end of the closest voice.

In general it is going to be very application-specific, and it requires more detail on the types of noise one is referring to. If you have 10 people in a circle around a single mic, all talking at the same intensity, there is not going to be any real way to process the output to get a single arbitrary voice with any impressive results.
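
A minimal numpy sketch of the front/back two-mic subtraction idea described above (the gains and stand-in signals are invented for illustration, not taken from any real product):

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 8000
    t = np.arange(fs) / fs                                   # 1 second at 8 kHz

    voice  = np.sin(2 * np.pi * 220 * t)                     # stand-in for the near talker
    babble = rng.normal(scale=1.0, size=fs)                  # stand-in for background noise

    # Front (user-facing) mic: strong voice, some background.
    # Rear (outward-facing) mic: weak voice leakage, nearly the same background.
    front = 1.00 * voice + 0.60 * babble
    rear  = 0.10 * voice + 0.55 * babble

    enhanced = front - rear      # background largely cancels, voice mostly survives

    def snr_db(voice_part, noise_part):
        return 10 * np.log10(np.sum(voice_part ** 2) / np.sum(noise_part ** 2))

    # Because the mix is synthetic we know exactly what survives the subtraction:
    # enhanced = 0.90*voice + 0.05*babble.
    print("front mic SNR (dB) :", snr_db(1.00 * voice, 0.60 * babble))
    print("subtracted SNR (dB):", snr_db(0.90 * voice, 0.05 * babble))

The improvement here relies entirely on the background being picked up nearly equally by both elements; if the two background gains differ a lot, the cancellation degrades accordingly.
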
On Jan 20, 7:24 pm, "Jeffery Tomas" <Jeffery_To...@Gmail.com> wrote:
> "RichD" &#4294967295;wrote in message > > news:4b7edabe-1efe-4e39-9869-fcab7035b613@pt5g2000pbb.googlegroups.com... > > > > > > >http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2012/01/16/BU8C1MOO2... > > >He claims he can filter speech from background noise. > > >I recall discussing this possibility years ago. &#4294967295;Someone said, these > >filters already exist. &#4294967295;They do - they're notch filters! &#4294967295;It's close > >to brain dead, believing that constitutes 'voice filtering'. > > >Dr. Watts has been working on this for years, so I was wondering > >what techniques he's using, how much is public domain. &#4294967295;Anyone > >here know anything about the subject, or this &#4294967295;product? > >Is it neural nets, DSP filters, or what? > > >-- > >Rich > > Sorry, it generally is impossible because what is "Background noise" to one > person maybe perfectly valid to another. I assume by background noise you > mean other voices. If one wants a certain voice then it better have certain > properties that allow it to be distinguished from the others. If not it is > no different and cannot be singled out. > > If the voice to be "filtered" has the property of closer proximity then it > must be louder in the mix. There are many possible ways to improve it's SNR > where by noise we mean the "background noise". One could use multiple mics. > There are actually mics that essentially do this automatically by having a > unidirectional mic pick up the background facing away from the user and a > unidirectional mic that faces towards the user. By subtracting the two > you'll reduce the background noise(which will generally show up in both mics > equally). > > Also, background voices tend to have a different frequency response since > high frequencies will be dampened by all the stuff in the room. So one could > apply filters to keep the high end of the closest voice. > > In general it's going to be very specific and it requires more detail on the > types of noises one is referring to. If you have 10 people all in a circle > with a single mic at the center talking at the same intensity there is not > going to be any real way to process the output to get a single arbitrary > voice with any impressive results.
But with 10 mics - maybe. ICA uses the PDF of speech (normally a Laplace distribution, i.e. non-Gaussian) as that extra information. It uses a distance measure between 2 or more PDFs - the Kullback-Leibler divergence. Essentially this is using higher-order statistics, and it cannot work under Gaussian conditions. Mind you, I have yet to hear a really good example of this working!!

Hardy
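
A quick numerical illustration of the non-Gaussianity point (not Hardy's code): the higher-order statistic that ICA-style methods exploit, excess kurtosis, is essentially zero for a Gaussian signal but clearly non-zero for a Laplace-distributed, speech-like one.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    n = 100_000

    gaussian = rng.normal(size=n)        # no higher-order structure to exploit
    laplace  = rng.laplace(size=n)       # heavy-tailed, speech-like amplitude PDF

    # Fisher definition: excess kurtosis of a Gaussian is 0; of a Laplace it is 3.
    print("Gaussian excess kurtosis:", kurtosis(gaussian))   # ~0
    print("Laplace  excess kurtosis:", kurtosis(laplace))    # ~3
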
Jamie wrote:
> RichD wrote:
>> http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2012/01/16/BU8C1MOO20.DTL
>>
>> He claims he can filter speech from background noise.
But what if the background noise is from a cocktail party?
>> [rest of RichD's post snipped]
>
> For whatever reason, I have never been able to get speech to text
> working here well enough with my voice to make it usable without
> detecting errors in misuse of words or, at times, totally incorrect
> words. But all of them seem to work well with women's voices from what
> I've seen.
You might benefit from a better microphone then. Men's voices have important low-frequency components that the average cheapo PC mike will tend to lose - although the sound system is capable of measuring them. Women's voices are a lot clearer in a noisy environment.

And these tools are basically up a gum tree when there are soundalikes and homophones that require knowledge of English grammar to sort out. "Rows and rows of roses" or "which witch", for instance.

It is even worse for a machine when, in natural speech, we tend to run adjacent words into each other with no clear gaps, creating new ambiguities that catch out even the best realtime subtitling kit. And then there is variable vowel length and/or tone, which may be significant or not depending on the spoken language.
> I understand the training cycle you need to perform in such tools to
> build a profile for your voice.
>
> The latest Dragon Speech does seem to work well; however, it is not so
> much just being able to correlate with my voice, it seems to have issues
> deciding what is to be taken as-is and what is meant as commands (CMDS).
>
> The technology has come a long way, and I can remember the first one I
> tried, which was for Windows 3.x, and found it to work amazingly well
> for such things back then. Even the speed response was good; however, it
> seems that as hardware speeds up, the software gets bloated
> proportionally as they add more things and use newer tools that just put
> more unwanted bloat in your code.
Converting natural fluid speech to accurate text is a hard problem - especially if it has to work for any voice as opposed to just the one or two that it has been trained on. Converting voice to something that sounds roughly alike is slightly easier. The strange mistakes they make on news subtitles show just how hard it really is.

Proper names tend to get creatively mangled.
> Maybe we'll stop adding layers to these tools one day.
Getting the error rate down to below 0.1% is really hard even with a grammar checker on the output. Until the error rate is at least comparable with typing, it is more of a novelty than a tool (though still useful in some circumstances, like subtitled news).

Regards,
Martin Brown
On Jan 19, 11:24 pm, "Jeffery Tomas" <Jeffery_To...@Gmail.com> wrote:
> "RichD" &#4294967295;wrote in message > > news:4b7edabe-1efe-4e39-9869-fcab7035b613@pt5g2000pbb.googlegroups.com... > > > > > > > > >http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2012/01/16/BU8C1MOO2... > > >He claims he can filter speech from background noise. > > >I recall discussing this possibility years ago. &#4294967295;Someone said, these > >filters already exist. &#4294967295;They do - they're notch filters! &#4294967295;It's close > >to brain dead, believing that constitutes 'voice filtering'. > > >Dr. Watts has been working on this for years, so I was wondering > >what techniques he's using, how much is public domain. &#4294967295;Anyone > >here know anything about the subject, or this &#4294967295;product? > >Is it neural nets, DSP filters, or what? > > >-- > >Rich > > Sorry, it generally is impossible because what is "Background noise" to one > person maybe perfectly valid to another. I assume by background noise you > mean other voices. If one wants a certain voice then it better have certain > properties that allow it to be distinguished from the others. If not it is > no different and cannot be singled out. > > If the voice to be "filtered" has the property of closer proximity then it > must be louder in the mix. There are many possible ways to improve it's SNR > where by noise we mean the "background noise". One could use multiple mics. > There are actually mics that essentially do this automatically by having a > unidirectional mic pick up the background facing away from the user and a > unidirectional mic that faces towards the user. By subtracting the two > you'll reduce the background noise(which will generally show up in both mics > equally). > > Also, background voices tend to have a different frequency response since > high frequencies will be dampened by all the stuff in the room. So one could > apply filters to keep the high end of the closest voice. > > In general it's going to be very specific and it requires more detail on the > types of noises one is referring to. If you have 10 people all in a circle > with a single mic at the center talking at the same intensity there is not > going to be any real way to process the output to get a single arbitrary > voice with any impressive results.
Your argument is rational, since the human brain doesn't seem to be able to do that very well either. Then again, maybe overcoming the cocktail party effect is difficult because it is an unusual, not a normal, situation; most people have two ears.

...it is tempting to think along different lines: apply AI to 'analyze' the sounds, matching the phonemes to the actual words/speech until, voila! it becomes possible to filter based not upon origin position or loudness, but on the true characteristics of the voice, much like a person 'recognizes' who is speaking, and by that recognition can begin to filter other speakers out, ...a bit.
On Jan 20, 1:26 am, Martin Brown <|||newspam...@nezumi.demon.co.uk> wrote:
> [Martin Brown's reply, quoted in full above; snipped]
Google's telephone/voice service, after recording an incoming voice message, provides 'voice to text' conversion which is then emailed to you. A bit of a chuckle at all the errors, EXCEPT call-back numbers are extremely accurate. Plus you can hit "Listen to the message" while reading the email to reinforce what you're reading.
HardySpicer <gyansorova@gmail.com> wrote:
> On Jan 20, 7:05 am, c...@kcwc.com (Curt Welch) wrote:
>> "Jesse F. Hughes" <je...@phiwumbda.org> wrote:
>>> Phil Hobbs <pcdhSpamMeSensel...@electrooptical.net> writes:
>>>> A friend of mine, Professor Dana Anderson of the University of
>>>> Colorado, Boulder, made a statistics-based digital filter that
>>>> could separate different kinds of music mixed together, as well as
>>>> music from noise. The demo was really striking--you mix together,
>>>> say jazz and classical music from two MP3 players, feed it through
>>>> the gizmo, and after (iirc) about 10 seconds of learning, classical
>>>> comes out of one speaker and jazz out of the other. Magic
>>>> stuff--published in IEEE Acoustics around 2006, I think.
>>>
>>> That sounds really impressive, if it works as well as you describe.
>>
>> Here's a great little web demo of ICA - Independent Component Analysis.
>> It can separate sources mixed together when recorded in different
>> "microphones" (I assume the demo is just a mathematical mixing and not
>> done by recording).
>>
>> http://research.ics.tkk.fi/ica/cocktail/cocktail_en.cgi
>>
>> This approach makes the assumption that the source signals are linearly
>> mixed together at different levels in each microphone recording (due to
>> the different distances each source is away from the microphone) but can
>> separate as many different sources as you have microphones.
>>
>> More info:
>>
>> http://en.wikipedia.org/wiki/Independent_component_analysis
>>
>> I would guess the telephone technology is using something similar since
>> they added a second microphone.
>>
>> The only statistical requirement for this to work is that the sources
>> must have a non-Gaussian distribution.
>>
>> BTW, this stuff is WAY past "notch filters" in complexity and power and
>> performance.
>
> Again, don't be so impressed! It depends how they are mixed! Simple
> constant-matrix mixing is quite easy to separate, whereas more realistic
> convolutive mixing is much harder. In real acoustic environments the
> mixing polynomial matrix is more often than not non-minimum-phase too,
> and of a very high order in some environments.
>
> Hardy
I'm sure the web example mixing was done mathematically with a constant matrix. But I've seen examples of real recordings that do work very well - however, the recording was probably done in a controlled, studio-like environment, which reduced interference from other sources and real-world effects such as room echo. I don't know how well these techniques hold up in a more complex, dynamic environment like that.

However, it would not surprise me to find out that applying these types of techniques to a simple phone application, where the goal was to increase voice volume and reduce background volume, could work fairly well.

--
Curt Welch                          http://CurtWelch.Com/
curt@kcwc.com                       http://NewsReader.Com/
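
For whatever it's worth, the constant-matrix ("instantaneous") case really is easy to reproduce. Here is a small sketch using scikit-learn's FastICA on two synthetic sources and an invented 2x2 mixing matrix; nothing in it addresses the convolutive case Hardy raises.

    import numpy as np
    from sklearn.decomposition import FastICA

    n = 20_000
    t = np.linspace(0, 8, n)

    # Two independent, non-Gaussian sources: a square wave and a sawtooth.
    s1 = np.sign(np.sin(2 * np.pi * 3 * t))
    s2 = ((t * 5) % 1.0) - 0.5
    S = np.column_stack([s1, s2])

    # Instantaneous (constant-matrix) mixing into two "microphones".
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])
    X = S @ A.T                         # each row: one time sample, two mics

    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)        # recovered sources, up to scale and order

    # Each estimated component should correlate strongly with exactly one source.
    corr = np.corrcoef(np.column_stack([S, S_est]).T)[:2, 2:]
    print(np.round(np.abs(corr), 3))
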
"Jeffery Tomas" <Jeffery_Tomas@Gmail.com> wrote:
> "Curt Welch" wrote in message > >news:20120119130524.248$HD@newsreader.com... > >"Jesse F. Hughes" <jesse@phiwumbda.org> wrote: > >> Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> writes: > >> > >> > A friend of mine, Professor Dana Anderson of the University of > >> > Colorado, Boulder, made a statistics-based digital filter that could > >> > separate different kinds of music mixed together, as well as music > >> > from noise. The demo was really striking--you mix together, say jazz > >> > and classical music from two MP3 players, feed it through the gizmo, > >> > and after (iirc) about 10 seconds of learning, classical comes out > >> > of one speaker and jazz out of the other. Magic stuff--published > >> > in IEEE Acoustics around 2006, I think. > >> > >> That sounds really impressive, if it works as well as you describe. > > > >Here's a great little web demo of ICA - Independent Component Analyses. > >It can separate sources mixed together when recorded in different > >"microphones" (I assume the demo is just a mathematical mixing and not > >done by recording). > > > >http://research.ics.tkk.fi/ica/cocktail/cocktail_en.cgi > > > >This approach makes the assumption that the source signals are linearly > >mixed together at different levels in each microphone recording (due to > >the different distances each source is away from the microphone) but can > >separate as many different sources as you have microphones. > > > >More info: > > > >http://en.wikipedia.org/wiki/Independent_component_analysis > > > >I would guess the telephone technology is using something similar since > >they added a second microphone. > > > >The only statistical requirement for this to work is that the sources > >must have a non-Gaussian distribution. > > > >BTW, this stuff is WAY past "notch filters" in complexity and power and > >performance. > > > > This doesn't seem very impressive. Essentially one has a linear > combination of the sounds > > B = Sum(a_k*A_k) > > where A_k is the original audio and a_k is how much it contributes to a > mic. Given n mic's we have n such equations > > B_i = Sum(a_(k,i)*A_k) > > All it takes is simple linear algebra to recover the original A_k's. The > a_(k,i)'s could easily be estimated by since they are in direct > proportion to the mic placement. My guess why the demo is sounds good is > because they use the exact coefficients use to create the mixed signals > in the first place. I would bet the real world scenario would be must > worse.
I've seen real-world dual-mike recording demos and they work just as well. But the real-world demo was probably carefully controlled. I don't know how well it will work, for example, if the signal sources were moving during the recording so that their mixing coefficients were dynamically changing. I'm fairly sure these techniques don't solve that problem, but it would be interesting to see how close they get to a "good" answer.

The mixing is done with a simple coefficient, but the separation is done without any information about the sources. It's a field generally known as blind source separation:

http://en.wikipedia.org/wiki/Blind_signal_separation

It's very cool that such things can be done mathematically. There's no "estimation" involved; it's precisely calculated from the mixed signals.

Here's an interesting paper that talks about these sorts of techniques:

http://www.mit.edu/~gari/teaching/6.222j/ICASVDnotes.pdf

I don't fully understand it, but I'm attempting to learn.

I'd like to see what results are produced if you mix three signals into two microphones and use this technique to produce two separations. What would it produce? Would two of the three signals be separated, but the third be mixed into both results? Or would it just fail to really "lock on" to any of the signals, so that both outputs would continue to be a mix of all three input sources?

--
Curt Welch                          http://CurtWelch.Com/
curt@kcwc.com                       http://NewsReader.Com/
"Jeffery Tomas" <Jeffery_Tomas@Gmail.com> wrote:
> "RichD" wrote in message > news:4b7edabe-1efe-4e39-9869-fcab7035b613@pt5g2000pbb.googlegroups.com... > > > >http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2012/01/16/BU8C1MOO20.D > >TL > > > > > >He claims he can filter speech from background noise. > > > >I recall discussing this possibility years ago. Someone said, these > >filters already exist. They do - they're notch filters! It's close > >to brain dead, believing that constitutes 'voice filtering'. > > > >Dr. Watts has been working on this for years, so I was wondering > >what techniques he's using, how much is public domain. Anyone > >here know anything about the subject, or this product? > >Is it neural nets, DSP filters, or what? > > > > > >-- > >Rich > > Sorry, it generally is impossible because what is "Background noise" to > one person maybe perfectly valid to another. I assume by background noise > you mean other voices. If one wants a certain voice then it better have > certain properties that allow it to be distinguished from the others. If > not it is no different and cannot be singled out. > > If the voice to be "filtered" has the property of closer proximity then > it must be louder in the mix. There are many possible ways to improve > it's SNR where by noise we mean the "background noise". One could use > multiple mics. There are actually mics that essentially do this > automatically by having a unidirectional mic pick up the background > facing away from the user and a unidirectional mic that faces towards the > user. By subtracting the two you'll reduce the background noise(which > will generally show up in both mics equally). > > Also, background voices tend to have a different frequency response since > high frequencies will be dampened by all the stuff in the room. So one > could apply filters to keep the high end of the closest voice. > > In general it's going to be very specific and it requires more detail on > the types of noises one is referring to. If you have 10 people all in a > circle with a single mic at the center talking at the same intensity > there is not going to be any real way to process the output to get a > single arbitrary voice with any impressive results.
I don't think that's actually true. I think it could be done. It would require the system to be able to model the voices (based on past experience of hearing many different voices). If it could match its models to the different voices, it could separate them. It's all a function of how good the models are.

This is no different from what would be done if each "speaker" were emitting a pure sine wave, each at a different frequency. We can isolate the different speakers in this case because we can model what the speakers should "be like" (i.e. each a perfect sine wave). This is just a very trivial, very well studied example of "voice separation" where each of the voices is a pure sine wave. The better the system can model what the voices are most likely to be, the better it will be able to separate the speakers in a given signal.

In effect, if we try some type of separation and the resulting voice doesn't sound like a single person talking, we know the separation is wrong. If you can tell whether a separation is right or wrong (which I'm sure we could as humans), then it should be possible (in theory) to build a system to automatically separate the signals, provided it has a good understanding of what a human voice should sound like.

--
Curt Welch                          http://CurtWelch.Com/
curt@kcwc.com                       http://NewsReader.Com/
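
The pure-sine-wave analogy is easy to make concrete. In this sketch the tone frequencies and filter design are invented: because we "know" what each source should look like, one band-pass filter per model pulls each "speaker" out of a single-mic mixture.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    fs = 8000
    t = np.arange(fs) / fs

    # Two "speakers", each modelled as a pure tone at a known frequency.
    speaker_a = np.sin(2 * np.pi * 300 * t)
    speaker_b = np.sin(2 * np.pi * 1200 * t)
    mixture = speaker_a + speaker_b                  # single microphone

    def bandpass(x, lo, hi, fs):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, x)

    # One filter per source model separates the mixture.
    a_hat = bandpass(mixture, 250, 350, fs)
    b_hat = bandpass(mixture, 1100, 1300, fs)

    print("speaker A recovery corr:", np.corrcoef(speaker_a, a_hat)[0, 1])
    print("speaker B recovery corr:", np.corrcoef(speaker_b, b_hat)[0, 1])

Real voices, of course, overlap heavily in frequency, which is exactly why the "model" has to be much richer than a band of the spectrum.
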

fatalist wrote:


> Audience's patents are available to anyone for viewing:
> Some critical ingredients are clearly missing in those patents, most
> notably "pitch detection"
Hi Dmitry Teres,

Perhaps the Earth is revolving around your pitch detection algorithm.
The idea is very basic.

The outputs (the mics) are a linear combination of the input sources. The
coefficients give the relative magnitude of each source's contribution to the
mic. If a source is further away it will contribute less to the output. In
fact, the coefficient should be 1/distance^2: if the distance is large,
1/distance^2 is small, and vice versa.

O_i = Sum_k (1/d_(i,k)^2) * S_k     (d_(i,k) = distance from source k to mic i)

or

O = A*S

where O is the vector of outputs (the mics), A is the "conversion matrix" and S
is the vector of sources (which is what we want to recover). From basic linear
algebra we know S = A^-1 * O.
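
A few lines of numpy showing the O = A*S, S = A^-1*O relationship with an invented mixing matrix; as stated above, the exact recovery only works because A is known exactly and constant:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Two hypothetical sources (rows) over n samples.
    S = np.vstack([np.sin(np.linspace(0, 50, n)),
                   rng.laplace(size=n)])

    # Invented mixing matrix: entry [i, k] is how strongly source k reaches mic i.
    A = np.array([[1.0, 0.3],
                  [0.5, 0.8]])

    O = A @ S                       # what the two mics record
    S_rec = np.linalg.inv(A) @ O    # exact recovery when A is known

    print(np.allclose(S, S_rec))    # True
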


There are statistical and deterministic methods to determine A, since it is
essentially the "transfer function". The important thing to recognize is: if A
is known, and the sources and sinks (mics) are time-independent (or at least
slowly changing), then we can recover the sources. The more accurately we know
A, the more accurate our decomposition will be.

I'll demonstrate for two sources and two sinks; without loss of generality I'll
write generic coefficients a, b, c, d (think of them as inverted distances):

O1 = a*S1 + b*S2
O2 = c*S1 + d*S2

Inverting gives

S1 = (d*O1 - b*O2)/(a*d - b*c)
S2 = -(c*O1 - a*O2)/(a*d - b*c)


Now the problem is to determine the coefficients.

This is simply a problem of plugging in values.

Analogy: y = a + bx + cx^2

How do we find a, b and c? Well, if we know 3 pairs of points we just plug them
in and solve the resulting linear system of equations.
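
The same plug-in-and-solve idea as runnable code (the curve and the three sample points are made up for the example):

    import numpy as np

    # Suppose the true curve is y = 2 + 3x - x^2, and we observe three points on it.
    xs = np.array([0.0, 1.0, 2.0])
    ys = 2 + 3 * xs - xs ** 2

    # Each point gives one equation a + b*x + c*x^2 = y; stack them and solve.
    M = np.column_stack([np.ones_like(xs), xs, xs ** 2])
    a, b, c = np.linalg.solve(M, ys)
    print(a, b, c)   # 2.0 3.0 -1.0
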

Unfortunately, we do not know the source vector. BUT if we "listen" for common
situations, we can easily narrow down the search and possibly arrive at the
correct coefficients.

Now, since we know the outputs, we know their ratio. Assume O1/O2 = x and
S1/S2 = y; then

y = -(dx - b)/(cx - a)

Now, this is like the polynomial sampling problem I gave above. We just don't
have our points (x, y) to determine the coefficients.

There are several ways to potentially get y, such as negative feedback,
guesstimating, Kalman filters, Monte Carlo, etc.

E.g., if we guess at the coefficients, we arrive at a certain geometrical
scenario that will either coincide with the real one or be "off" (producing
estimated outputs that do not jibe with what we are actually getting)... we can
perturb our coefficients until they "wiggle" their way to the correct values
using appropriate adaptive methods.
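
One way to make the "perturb until it jibes" idea concrete, as a rough sketch only: since we don't have the true sources to compare against, the plausibility check below is borrowed from the ICA posts earlier in the thread (the unmixed outputs should be strongly non-Gaussian and mutually uncorrelated). The sources, mixing matrix, step size and cost function are all invented for the illustration.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    n = 5_000

    # Synthetic heavy-tailed (speech-like) sources and an invented mixing matrix.
    S = rng.laplace(size=(2, n))
    A_true = np.array([[1.0, 0.6],
                       [0.4, 1.0]])
    O = A_true @ S                                    # the two mic signals

    def cost(W):
        # Lower is better: unmixed outputs should be non-Gaussian and uncorrelated.
        U = W @ O
        U = U / U.std(axis=1, keepdims=True)          # ignore overall scale
        nongauss = np.sum(np.abs(kurtosis(U, axis=1)))
        decorr = abs(np.corrcoef(U)[0, 1])
        return -nongauss + 10.0 * decorr

    # Guess, perturb, and keep a perturbation only if the outputs "jibe" better.
    W = np.eye(2)
    best = cost(W)
    for _ in range(10_000):
        W_try = W + 0.03 * rng.normal(size=(2, 2))
        c = cost(W_try)
        if c < best:
            W, best = W_try, c

    # If the search worked, W @ A_true should be close to a scaled permutation
    # matrix, i.e. each output is dominated by exactly one source.
    print(np.round(W @ A_true, 2))

This is just random hill-climbing; the "appropriate adaptive methods" mentioned above (LMS-style feedback, Kalman tracking, proper ICA updates) would do the same job far more efficiently.
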

If we know certain properties of the sources, then we can include those in the
algorithms to reduce the complexity and increase the effectiveness.

For example, knowing that S1 and S2 are uncorrelated, we can look for moments
when S2 = 0; then we end up with a one-source system, which is much easier to
solve. E.g., our equation reduces to

S1 = a*O1 + b*O2

(reusing a and b as generic coefficients), and instead of 4 coefficients we only
have to find 2. This results in a triangular geometry instead of a
quadrilateral. Moreover, by symmetry we can do the same for S2, and we end up
with another easy system to solve. This applies to any number of sources. With
uncorrelated sources we should be able to treat the system as just one source
at a time.

With such independence, the coefficients are directly related to the distances
and can be easily calculated from the mic intensities (a = 1/d1^2 and
b = 1/d2^2, where a is the level from O1 and b is the level from O2).
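
A small illustration of the one-source-at-a-time idea, with the large assumption that we can already identify stretches where only one talker is active (the signals and matrix below are synthetic):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6000

    # Two sources that take turns: first only S1, then only S2, then both.
    S = np.zeros((2, n))
    S[0, :2000]     = rng.laplace(size=2000)
    S[1, 2000:4000] = rng.laplace(size=2000)
    S[:, 4000:]     = rng.laplace(size=(2, 2000))

    A_true = np.array([[1.0, 0.3],
                       [0.5, 0.8]])
    O = A_true @ S

    # During a single-talker interval, the ratio of the two mic signals gives
    # that source's column of A up to an overall scale.
    seg1, seg2 = slice(0, 2000), slice(2000, 4000)
    ratio_1 = np.sum(O[1, seg1] * O[0, seg1]) / np.sum(O[0, seg1] ** 2)  # ~0.5/1.0
    ratio_2 = np.sum(O[1, seg2] * O[0, seg2]) / np.sum(O[0, seg2] ** 2)  # ~0.8/0.3
    print(ratio_1, ratio_2)

    # Columns recovered up to scale; scale doesn't matter for separation.
    A_hat = np.array([[1.0, 1.0],
                      [ratio_1, ratio_2]])
    S_hat = np.linalg.inv(A_hat) @ O

    # Diagonal (up to scale) means each recovered output contains one source.
    print(np.round(np.linalg.inv(A_hat) @ A_true, 3))
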

In general, though, it is not the case that the sources are uncorrelated or
time-independent, and more advanced tricks or general approximations may have
to be used.

The more I think about it, the more it seems like it might be quite accurate
for most signals and setups. One problem, though, is frequency response: we are
assuming lossless transmission, which is usually not the case and could cause
problems with some algorithmic approaches.