DSPRelated.com

The Human Voice has been widely characterized, correct?

Started by Ramon F Herrera January 11, 2015
On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
<ramon@conexus.net> wrote:

>Let me ask the question in another way:
>
>In addition to having a range from X Hertz to Y Kilohertz:
>
> http://en.wikipedia.org/wiki/Vocal_range
>
>What other characteristics does our voice have?

Plenty! Consider that voiced sounds are produced by a stream of glottal pulses exciting a multiply-resonant cavity. The glottal pulse rate and the specific resonant frequencies (formants) are variable. If you look at a speech spectrogram (see <http://www.daqarta.com/dw_sgram.htm>) you can see the harmonics of the glottal pulses as stacked roughly-horizontal streaks (upper image). The streaks are horizontal during prolonged vowels, but they swoop up or down on the consonants. Some may move up while others move down... that's what distinguishes 'ba' from 'da', for example.

The lower image on that page shows the same utterance at higher time resolution, but with reduced frequency resolution, such that the horizontal harmonic streaks are blurred together, but the individual glottal pulses are clearly seen.

The vertical streaks at the start of each syllable in both images are noise bursts, which are also shaped by the individual formant frequencies of the resonant filters. The example utterance doesn't include sibilants like "s", but those would be prolonged noise bursts. (A whisper is essentially broadband noise that has passed through the formant filters... no glottal pulses or harmonics.)

If your software can figure out where each syllable starts and stops, it can erase everything between syllables. To remove noise *during* the syllables, it would need to figure out which noise is "speech noise" and which is unwanted noise. That might be done during voiced sounds by erasing everything that's not near a harmonic of the glottal pulse. For the unvoiced stuff, you'd need to figure out the formant frequencies and leave the noise near those, but remove it elsewhere. But you don't want to remove the entire vertical streaks.

After you figure out how to do all that, the singing voice should be a piece of cake!

Best regards,

Bob Masta

DAQARTA v7.60
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, Pitch Track, Pitch-to-MIDI
FREE Signal Generator, DaqMusiq generator
Science with your sound card!
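
[Editor's sketch] The suggestion in the post above, erasing everything that is not near a harmonic of the glottal pulse during voiced sounds, can be pictured as a 0/1 mask over spectrum bins. This is only a minimal illustration, not Bob's or Daqarta's method: the name `harmonic_mask`, the 20 Hz tolerance, the 25 Hz bin spacing and the assumption that the pitch `f0_hz` is already known are all hypothetical choices made for the example.

```python
def harmonic_mask(bin_freqs_hz, f0_hz, tol_hz=20.0):
    """Return a 0/1 list: 1 where a bin lies within tol_hz of a harmonic of f0_hz."""
    mask = []
    for f in bin_freqs_hz:
        # Index of the nearest harmonic; never below the fundamental (k = 1).
        k = max(1, round(f / f0_hz))
        mask.append(1 if abs(f - k * f0_hz) <= tol_hz else 0)
    return mask


# Bins every 25 Hz from 0 to 500 Hz, with an assumed glottal pulse rate of 100 Hz.
bin_freqs = [25.0 * i for i in range(21)]
mask = harmonic_mask(bin_freqs, f0_hz=100.0, tol_hz=20.0)
kept = [f for f, m in zip(bin_freqs, mask) if m]
print(kept)  # [100.0, 200.0, 300.0, 400.0, 500.0]
```

Multiplying one spectrogram frame by such a mask zeroes the bins between harmonics. In real speech the pitch has to be tracked frame by frame, and unvoiced sounds need the separate formant-based treatment described in the post.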
On 1/12/2015 8:36 AM, Bob Masta wrote:
> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
> <ramon@conexus.net> wrote:
>
>> Let me ask the question in another way:
>>
>> In addition to having a range from X Hertz to Y Kilohertz:
>>
>> http://en.wikipedia.org/wiki/Vocal_range
>>
>> What other characteristics does our voice have?
>
> Plenty! Consider that voiced sounds are produced by a
> stream of glottal pulses exciting a multiply-resonant
> cavity. The glottal pulse rate and the specific resonant
> frequencies (formants) are variable. If you look at a
> speech spectrogram (see
> <http://www.daqarta.com/dw_sgram.htm>)
> you can see the harmonics of the glottal pulses as stacked
> roughly-horizontal streaks (upper image).

Thanks for your kind assistance, Bob.

What puzzles me is the demarcation line between:

(a) This is a human-produced sound (i.e., voice).
(b) This one is not. The human vocal system cannot possibly phonate this utterance.

In addition to the obvious frequency range, I came up with 2 possible parameters:

First, the fact that we breathe and therefore have to take pauses. Alas, this is a rather weak hook/discriminator, since some speakers may breathe during speech (see ventriloquists, etc.).

This is the interesting one: If you play a voice segment backwards you will end up with *exactly* the same spectral footprint as the original, but with a sound that belongs in category (b) above. There are some mini-explosions in speech which are irreversible.

-Ramon
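
[Editor's sketch] The point about playing a segment backwards can be checked numerically: reversing a real signal leaves the magnitude of its whole-signal Fourier transform unchanged, even though the waveform is clearly different. The example below is a minimal pure-Python check on a made-up decaying burst (a stand-in for a "mini-explosion"); the signal, its length and the helper `dft` are assumptions for illustration only.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


# A decaying burst: strongly asymmetric in time, like a plosive onset.
n = 64
x = [math.exp(-t / 8.0) * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
x_rev = x[::-1]

mag = [abs(c) for c in dft(x)]
mag_rev = [abs(c) for c in dft(x_rev)]

# Magnitude spectra agree to rounding error; the waveforms do not.
max_mag_diff = max(abs(a - b) for a, b in zip(mag, mag_rev))
max_wave_diff = max(abs(a - b) for a, b in zip(x, x_rev))
print(max_mag_diff, max_wave_diff)
```

Note the limits of this: it holds for the long-term magnitude spectrum. A spectrogram of the reversed clip is the time-mirrored image of the original, so a sharp onset followed by a decay becomes a slow rise followed by an abrupt stop, which is the kind of time-domain cue the post is describing.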
On 1/12/2015 8:36 AM, Bob Masta wrote:
> After you figure out how to do all that, the singing voice
> should be a piece of cake!

Me !!?? Why me???

You guys are the comp.dsp geniuses. You and all those professors, researchers and book authors.

As far as I know, so far, the total extent of material publicly available is the TIMIT database (6300 short sound clips). Is everything else locked up in Apple's and Dragon's software safe?

Some enterprising researcher who comes up with a model (preferably with source code!) of:

- What is a human voice
- What is not

should be admired, lavishly feted, pursued by Hollywood starlets, etc.

You are in essence telling me to replicate all the work that is probably out there.

BTW: I downloaded your Daqarta and am playing with it...

Regards,

-Ramon

"We stood on the shoulders of giants".
On Monday, January 12, 2015 at 5:27:03 PM UTC+13, Ramon F Herrera wrote:
> On 1/11/2015 2:18 PM, gyansorova@gmail.com wrote:
> > Normally we use classification methods, e.g. SVM - a bit like speech recognition. You need of course a database or corpus to train your system. There are a number of such corpora online, e.g. TIMIT.
>
> I found this one:
>
> https://catalog.ldc.upenn.edu/LDC93S1
>
> At $250 it is too rich for my blood. Even if I had the money, do I
> really need that?
>
> In theory, what I need is the UNION (in the set theory sense) of all the
> speeches ever uttered by human beings since the vocal cords were
> developed. You guys will notice that, being a lazy guy, I am taking the
> opposite (easy) route.
>
> That UNION is my universe of sounds that will be considered positive by
> my system. The rest (Boolean complement) is negative, ergo noise, and does
> not belong in the audio clipping that I am trying to clean.
>
> Let me ask the question in another way:
>
> In addition to having a range from X Hertz to Y Kilohertz:
>
> http://en.wikipedia.org/wiki/Vocal_range
>
> What other characteristics does our voice have?
>
> TIA,
>
> -Ramon

Not many individuals buy it, though universities use it a lot so that they can compare algorithms. I think it is free for academic use.
On Monday, January 12, 2015 at 10:27:44 UTC+1, Mac Decman wrote:
> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
> <ramon@conexus.net> wrote:
>
> <snip>
>
> >In addition to having a range from X Hertz to Y Kilohertz:
> >
> > http://en.wikipedia.org/wiki/Vocal_range
> >
> >What other characteristics does our voice have?
> >
> >TIA,
> >
> >-Ramon
>
> Hah, this is too funny.
>
> Mark

http://youtu.be/5hyI_dM5cGo

-Lasse
On 1/13/2015 3:31 PM, gyansorova@gmail.com wrote:
> Not many individuals buy it though Unis use it a lot so that they can compare algorithms. I think it is free for academic use.

Thanks again, gyansorova...

For a while, I have been looking all over the place and I cannot get an answer to my curiosity. My interest is not exactly voice recognition per se, but highly related.

I have been reading (glancing) about different VR systems, and one of their main desirable features is how they remain working in the presence of NOISE.

What I need is the *opposite*. I would like to tune a VR system in the other direction, so it breaks quickly and proclaims:

"That is NOT a human voice".

So it would be an expert at *Noise Recognition*.

Such a feature would be at the core of a Noise Removal System.

This is the target audio clip that I have in mind:

https://www.youtube.com/watch?v=a-EAqNbpcYE

I figure that if I can remove the noise from that one, I can remove it from almost anything.

Notice that it has:

- traffic
- breathing (wind?)
- a helicopter
- a multi-frequency screeching train
- a crying baby
- even a tweeting bird!

The good news is that the subject has a deep voice. Additionally, I don't care whether the application has to spend hours and hours proffering different variations (brute force guesses) until the VR part says:

"Now we are talking! -pun intended- That sounds like a human voice!"

Can you please issue an educated guess? Does my idea seem reasonable?

TIA,

-Ramon
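
[Editor's sketch] One very crude ingredient of a "that is NOT a human voice" test is a periodicity check: voiced speech is nearly periodic at the pitch rate, while broadband noise is not. The sketch below scores a short frame by the peak of its normalized autocorrelation over lags in an assumed pitch range. Everything here (the name `voicing_score`, the 60 to 400 Hz range, the 0.5 threshold used in the demo, the synthetic test signals) is an illustrative assumption, not a method taken from any VR system mentioned in the thread.

```python
import math
import random


def voicing_score(frame, fs, f_lo=60.0, f_hi=400.0):
    """Peak of the normalized autocorrelation over lags matching f_lo..f_hi."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    lag_min = int(fs / f_hi)
    lag_max = int(fs / f_lo)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


fs = 8000
n = 480  # a 60 ms frame

# Voiced-like test signal: five harmonics of an assumed 125 Hz pitch.
voiced = [sum(math.sin(2 * math.pi * 125 * k * t / fs) / k for k in range(1, 6))
          for t in range(n)]

# Noise-like test signal: white Gaussian noise with a fixed seed.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

score_voiced = voicing_score(voiced, fs)
score_noise = voicing_score(noise, fs)
print(score_voiced > 0.5, score_noise > 0.5)
```

This only detects periodicity in the pitch range. Rotor noise, a screeching train or a bird can be periodic too, and the crying baby in the clip is itself a human voice, so a score like this could at best be one feature among many, and it says nothing yet about how to remove what it flags.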
On Tue, 13 Jan 2015 13:52:35 -0800 (PST), langwadt@fonz.dk wrote:

>On Monday, January 12, 2015 at 10:27:44 UTC+1, Mac Decman wrote:
>> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
>> <ramon@conexus.net> wrote:
>>
>> <snip>
>>
>> >In addition to having a range from X Hertz to Y Kilohertz:
>> >
>> > http://en.wikipedia.org/wiki/Vocal_range
>> >
>> >What other characteristics does our voice have?
>> >
>> >TIA,
>> >
>> >-Ramon
>>
>> Hah, this is too funny.
>>
>> Mark
>
>http://youtu.be/5hyI_dM5cGo
>
>-Lasse

Nice!

Eric Jacobsen
Anchor Hill Communications
http://www.anchorhill.com
On Tue, 13 Jan 2015 16:15:25 -0600, Ramon F Herrera
<ramon@patriot.net> wrote:

>On 1/13/2015 3:31 PM, gyansorova@gmail.com wrote:
>> Not many individuals buy it though Unis use it a lot so that they can compare algorithms. I think it is free for academic use.
>
><snip>
>
>What I need is the *opposite*. I would like to tune a VR system in the
>other direction, so it breaks quickly and proclaims:
>
>"That is NOT a human voice".
>
>So it would be an expert at *Noise Recognition*.
>
>Such feature would be at the core of a Noise Removal System.
>
><snip>
>
>Can you please issue an educated guess? Does my idea seem reasonable?

The problem is that recognizing the presence or absence of a human voice (or even recognizing the words!) doesn't help with removing noise. It's not like you can say "aha, human voice... now I'll just subtract out everything that *isn't* a human voice."

Best regards,

Bob Masta
On Thursday, January 15, 2015 at 2:34:04 AM UTC+13, Bob Masta wrote:
> <snip>
>
> The problem is that recognizing the presence or absence of a
> human voice (or even recognizing the words!) doesn't help
> with removing noise. It's not like you can say "aha, human
> voice... now I'll just subtract out everything that *isn't*
> a human voice."

There are Gaussian Mixture Model (GMM) based approaches to removing noise of various types from speech. They work by using a codebook of speech and noise characterisations. When they find the nearest match, they apply a Wiener filter frame by frame; see variations of

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4518754&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4518754
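
[Editor's sketch] The last step mentioned above, applying a Wiener filter frame by frame once a speech/noise match has been found, comes down to a per-bin gain of S / (S + N), where S and N are the speech and noise power estimates for that bin. The snippet below shows only that gain step for a single frame. It does not implement the GMM or codebook search from the linked paper; the codebook entries, the frame values and the gain floor of 0.05 are made-up illustrative numbers.

```python
def wiener_gain(speech_psd, noise_psd, floor=0.05):
    """Per-bin Wiener gain S / (S + N), limited below by `floor`."""
    gains = []
    for s, n in zip(speech_psd, noise_psd):
        total = s + n
        g = s / total if total > 0.0 else 0.0
        gains.append(max(g, floor))
    return gains


# Hypothetical best-matching codebook entries (power per frequency bin).
speech_entry = [0.0, 9.0, 1.0, 0.0]
noise_entry = [1.0, 1.0, 1.0, 1.0]

# One made-up frame of a noisy short-time spectrum (complex bins).
noisy_frame = [1.0 + 0.0j, 3.0 + 1.0j, 1.0 - 1.0j, 0.5 + 0.5j]

gains = wiener_gain(speech_entry, noise_entry)
enhanced_frame = [g * z for g, z in zip(gains, noisy_frame)]
print(gains)  # [0.05, 0.9, 0.5, 0.05]
```

Bins where the matched speech entry dominates pass almost unchanged, and bins where only noise is expected are pushed down to the floor. Keeping a small non-zero floor instead of zero is a common practical safeguard against isolated residual tones.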
On 1/14/2015 7:34 AM, Bob Masta wrote:
> The problem is that recognizing the presence or absence of a
> human voice (or even recognizing the words!) doesn't help
> with removing noise. It's not like you can say "aha, human
> voice... now I'll just subtract out everything that *isn't*
> a human voice."
>
> Best regards,
>
> Bob Masta

I respectfully beg to differ, Bob. Watch this course:

====================================================================
http://www.macprovideo.com/tutorial/izotope-rx4-audio-repair-toolbox-2

(Free registration)

Chapter 39. Deconstructing Tone & Noise

The instructor claims: "It is as simple as it is amazing".

Watch the introduction as well:

http://play.macprovideo.com/izotope-rx4-audio-repair-toolbox-2/1

Jeff chose the analogy of salt getting into the food that he is cooking. That problem can *theoretically* be solved! If we can take one molecule at a time, we can leave the NaCl molecules out, and the rest of the delicious gumbo in.
====================================================================

In this particular musical case the problem is not too hard: a pure-note instrument (piano) PLUS (it is indeed an arithmetic addition) some low frequency hum.

Still: Sinusoids are added and subtracted.

I am sure that in the not too distant future we will be able to take the provided Dealey Plaza audio and inform the system:

- There is a helicopter, which sounds like this [...]
- There is a screeching train which sounds like this [...]
- There is a crying baby who sounds like this [...]
- Finally, there is the voice in which I am interested. [...]

My initial mistake was assuming that all "noise" was garbage that could be thrown out. That was a HUGE mistake! Any "atom" of helicopter rumble helps to identify and subtract all its sister helicopter sounds.

Every unit of sound there was placed by some source (IOW: God did not add some random noise, just to mess with us). All the system has to do is pick up every "atom" -one at a time- and figure out (yes, it is not quite easy) in which "plate" it belongs. The number of "plates" is finite and small.

At the end, the user will have a multi-channel file, nicely separated. In my case, I would throw away the n-1 channels, leaving the tourist guide fellow's voice.

-Ramon
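
[Editor's sketch] The "pure note plus low-frequency hum" case described above really is an arithmetic addition, and when the hum is a steady sinusoid of known frequency it can be fitted and subtracted. The example below is a minimal sketch of that easy case only; the function name `remove_sinusoid`, the 440 Hz note, the 60 Hz hum and its phase are assumptions for the demo, and this is not a description of how iZotope RX works.

```python
import math


def remove_sinusoid(x, fs, freq_hz):
    """Fit a*cos + b*sin at freq_hz by projection and subtract it from x.

    The projection is an exact least-squares fit only when len(x) spans a
    whole number of cycles of freq_hz; otherwise it is an approximation.
    """
    n = len(x)
    c = [math.cos(2 * math.pi * freq_hz * t / fs) for t in range(n)]
    s = [math.sin(2 * math.pi * freq_hz * t / fs) for t in range(n)]
    a = 2.0 / n * sum(xi * ci for xi, ci in zip(x, c))
    b = 2.0 / n * sum(xi * si for xi, si in zip(x, s))
    return [xi - a * ci - b * si for xi, ci, si in zip(x, c, s)]


fs = 8000
n = 800  # 0.1 s: 44 whole cycles of 440 Hz and 6 whole cycles of 60 Hz

note = [0.5 * math.sin(2 * math.pi * 440 * t / fs) for t in range(n)]
hum = [0.3 * math.sin(2 * math.pi * 60 * t / fs + 0.7) for t in range(n)]
mix = [a + b for a, b in zip(note, hum)]

cleaned = remove_sinusoid(mix, fs, 60.0)
residual = max(abs(c - v) for c, v in zip(cleaned, note))
print(residual)
```

The residual is at rounding-error level here because the hum is perfectly stationary, its frequency is known, and the two sinusoids are orthogonal over the chosen window. A helicopter, a train or a crying baby is broadband and changes over time, so the same subtraction does not carry over directly; that gap is what the "plates" idea would have to bridge.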