comp.dsp | The Human Voice has been widely characterized, correct?| page 2

Reply by Bob Masta ● January 12, 2015

On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
<ramon@conexus.net> wrote:

>Let me ask the question in another way:
>
>In addition to having a range from X Hertz to Y Kilohertz:
>
>    http://en.wikipedia.org/wiki/Vocal_range
>
>What other characteristics does our voice have?

Plenty!  Consider that voiced sounds are produced by a
stream of glottal pulses exciting a multiply-resonant
cavity.  The glottal pulse rate and the specific resonant
frequencies (formants) are variable.  If you look at a
speech spectrogram (see
<http://www.daqarta.com/dw_sgram.htm>)
you can see the harmonics of the glottal pulses as stacked
roughly-horizontal streaks (upper image).  

The streaks are horizontal during prolonged vowels, but they
swoop up or down on the consonants.  Some may move up while
others move down... that's what distinguishes 'ba' from
'da', for example.

The lower image on that page shows the same utterance at
higher time resolution, but with reduced frequency
resolution, such that the horizontal harmonic streaks are
blurred together, but the individual glottal pulses are
clearly seen.
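
The trade-off between the two images can be reproduced with a toy signal. The following is only a sketch, not how Daqarta computes its spectrograms: the "voice" is an idealized impulse train with no formant filtering, and the sample rate, pulse rate, and window lengths are all assumed values picked so the effect is easy to see.

```python
import numpy as np

def spectrogram(x, win_len, hop):
    """Magnitude STFT with a Hann window; rows are frames, columns are bins."""
    window = np.hanning(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([x[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000                  # sample rate in Hz (assumed)
f0 = 100                   # glottal pulse rate in Hz (assumed)
x = np.zeros(fs)           # one second of signal
x[::fs // f0] = 1.0        # idealized pulse train, one impulse per period

# Long window (40 ms, four pitch periods): harmonics are resolved,
# individual pulses are smeared together.
narrow = spectrogram(x, win_len=320, hop=20)

# Short window (5 ms, half a pitch period): each frame holds at most one
# pulse, so pulses are resolved but the harmonics blur together.
wide = spectrogram(x, win_len=40, hop=8)

# With 320-sample frames the bins are 25 Hz apart: bin 4 is the 100 Hz
# harmonic, bin 6 sits halfway between two harmonics.
print("long window, 100 Hz vs 150 Hz bin:", narrow[0, 4], narrow[0, 6])
frame_energy = wide.sum(axis=1)
print("short window, frame energy min/max:", frame_energy.min(), frame_energy.max())
```

With the long window the energy sits on the harmonic bins and barely changes from frame to frame (horizontal streaks). With the short window the energy pulses on and off in time, and the spectrum within a frame is flat (vertical streaks).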

The vertical streaks at the start of each syllable in both
images are noise bursts, which are also shaped by the
individual formant frequencies of the resonant filters.  The
example utterance doesn't include sibilants like "s", but
those would be prolonged noise bursts.  (A whisper is
essentially broadband noise that has passed through the
formant filters... no glottal pulses or harmonics.)

If your software can figure out where each syllable starts
and stops, it can erase everything between syllables.  To
remove noise *during* the syllables, it would need to figure
out which noise is "speech noise" and which is unwanted
noise. That might be done during voiced sounds by erasing
everything that's not near a harmonic of the glottal pulse.
For the unvoiced stuff, you'd need to figure out the
formant frequencies and leave the noise near those, but
remove it elsewhere.  But you don't want to remove the
entire vertical streaks.
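
The "erase everything that's not near a harmonic" idea for voiced sounds can be sketched as a comb-shaped mask on the FFT bins. This is a minimal illustration under strong assumptions: the pitch f0 is taken as known and constant (in practice it would have to come from a pitch tracker, frame by frame), the test signal is synthetic, and the function name and the 10 Hz half-width are made up for the example.

```python
import numpy as np

def harmonic_comb_mask(n_fft, fs, f0, half_width_hz):
    """0/1 mask over rfft bins: 1 within half_width_hz of a harmonic of f0."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    nearest = np.round(freqs / f0) * f0        # closest multiple of f0 per bin
    near = np.abs(freqs - nearest) <= half_width_hz
    near &= nearest > 0                        # DC is not a harmonic
    return near.astype(float)

fs, n_fft, f0 = 8000, 1024, 125.0              # assumed values
t = np.arange(n_fft) / fs

# Synthetic "voiced" sound: five harmonics of f0 with 1/k amplitudes.
voiced = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
rng = np.random.default_rng(0)
noisy = voiced + 0.5 * rng.standard_normal(n_fft)

mask = harmonic_comb_mask(n_fft, fs, f0, half_width_hz=10.0)
cleaned = np.fft.irfft(np.fft.rfft(noisy) * mask, n=n_fft)

err_before = np.mean((noisy - voiced) ** 2)
err_after = np.mean((cleaned - voiced) ** 2)
print("mean squared error before/after:", err_before, err_after)
```

Note that the mask still passes whatever noise lies close to a harmonic, so this reduces the noise rather than removing it, and it does nothing sensible for unvoiced sounds.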

After you figure out how to do all that, the singing voice
should be a piece of cake!

Best regards,

Bob Masta

              DAQARTA  v7.60
   Data AcQuisition And Real-Time Analysis
              www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
 Frequency Counter, Pitch Track, Pitch-to-MIDI 
   FREE Signal Generator, DaqMusiq generator    
          Science with your sound card!

Reply by Ramon F Herrera ● January 12, 2015

On 1/12/2015 8:36 AM, Bob Masta wrote:
> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
> <ramon@conexus.net> wrote:
>
>> Let me ask the question in another way:
>>
>> In addition to having a range from X Hertz to Y Kilohertz:
>>
>>     http://en.wikipedia.org/wiki/Vocal_range
>>
>> What other characteristics does our voice have?
>
> Plenty!  Consider that voiced sounds are produced by a
> stream of glottal pulses exciting a multiply-resonant
> cavity.  The glottal pulse rate and the specific resonant
> frequencies (formants) are variable.  If you look at a
> speech spectrogram (see
> <http://www.daqarta.com/dw_sgram.htm>)
> you can see the harmonics of the glottal pulses as stacked
> roughly-horizontal streaks (upper image).

Thanks for your kind assistance, Bob.

What puzzles me is the demarcation line between:

(a) This is a human-produced sound (i.e., voice)

(b) This one is not. The human vocal system cannot possibly phonate this 
utterance.

In addition to the obvious frequency range, I came up with 2 possible 
parameters: First, the fact that we breathe and therefore have to take 
pauses. Alas, this is a rather weak hook/discriminator, since some 
speakers may breathe during speech (see ventriloquists, etc).

This is the interesting one: If you play a voice segment backwards you 
will end up with *exactly* the same magnitude spectrum as the original 
(only the phase changes, and the spectrogram is mirrored in time), but 
with a sound that belongs in category (b) above.

There are some mini-explosions in speech which are irreversible.
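
The reversal point is easy to check numerically. In this sketch a decaying 1 kHz burst stands in (very crudely) for one of those mini-explosions; the sample rate, decay time, and the energy-centroid measure are all assumptions chosen for illustration.

```python
import numpy as np

fs = 8000
t = np.arange(400) / fs                        # 50 ms at 8 kHz
# Crude stand-in for a plosive: abrupt onset, then exponential decay.
burst = np.exp(-t / 0.005) * np.sin(2 * np.pi * 1000 * t)
reversed_burst = burst[::-1]

# Time reversal of a real signal leaves the magnitude spectrum unchanged.
mag_fwd = np.abs(np.fft.rfft(burst))
mag_rev = np.abs(np.fft.rfft(reversed_burst))
same_magnitude = np.allclose(mag_fwd, mag_rev)

def energy_centroid(x):
    """Sample index around which the signal's energy is concentrated."""
    e = x ** 2
    return np.sum(np.arange(len(x)) * e) / np.sum(e)

centroid_fwd = energy_centroid(burst)
centroid_rev = energy_centroid(reversed_burst)
print("same magnitude spectrum:", same_magnitude)
print("energy centroid forward/reversed:", centroid_fwd, centroid_rev)
```

The magnitude spectra match, but the forward burst has its energy at the start (sharp attack, slow decay) while the reversed one swells slowly and stops abruptly. So a discriminator built on this would have to look at the time envelope or phase, not at the spectrum of the whole clip.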

-Ramon

Reply by Ramon F Herrera ● January 12, 2015

On 1/12/2015 8:36 AM, Bob Masta wrote:
> After you figure out how to do all that, the singing voice
> should be a piece of cake!

Me !!??  Why me??? You guys are the comp.dsp geniuses. You and all those 
professors, researchers and book authors.

As far as I know, so far, the total extent of publicly available 
material is the TIMIT database (6300 short sound clips).  Is 
everything else locked up in Apple's and Dragon's software safe?

Some enterprising researcher who comes up with a model (preferably with 
source code!) of:

  - What is a human voice
  - What is not

should be admired, lavishly feted, pursued by Hollywood starlets, etc.

You are in essence telling me to replicate all the work that is probably 
out there.

BTW: I downloaded your Daqarta and am playing with it...

Regards,

-Ramon

"We stood on the shoulders of giants".

Reply by gyansorova@gmail.com ● January 13, 2015

On Monday, January 12, 2015 at 5:27:03 PM UTC+13, Ramon F Herrera wrote:
> On 1/11/2015 2:18 PM, gyansorova@gmail.com wrote:
> > Normally we use classification methods, e.g. SVM, a bit like speech recognition. You of course need a database or corpus to train your system. There are a number of such corpora online, e.g. TIMIT.
> 
> I found this one:
> 
>      https://catalog.ldc.upenn.edu/LDC93S1
> 
> At $250 it is too rich for my blood. Even if I had the money, do I 
> really need that?
> 
> In theory, what I need is the UNION (in the set theory sense) of all the 
> speech ever uttered by human beings since the vocal cords 
> developed. You guys will notice that, being a lazy guy, I am taking the 
> opposite (easy) route.
> 
> That UNION is my universe of sounds that will be considered positive by 
> my system. The rest (Boolean complement) is negative, ergo noise, and does 
> not belong in the audio clip that I am trying to clean.
> 
> Let me ask the question in another way:
> 
> In addition to having a range from X Hertz to Y Kilohertz:
> 
>     http://en.wikipedia.org/wiki/Vocal_range
> 
> What other characteristics does our voice have?
> 
> TIA,
> 
> -Ramon

Not many individuals buy it, though universities use it a lot so that they can compare algorithms. I think it is free for academic use.

Reply by langwadt@fonz.dk ● January 13, 2015

On Monday, January 12, 2015 at 10:27:44 UTC+1, Mac Decman wrote:
> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
> <ramon@conexus.net> wrote:
> 
> <snip>
> >
> >In addition to having a range from X Hertz to Y Kilohertz:
> >
> >    http://en.wikipedia.org/wiki/Vocal_range
> >
> >What other characteristics does our voice have?
> >
> >TIA,
> >
> >-Ramon
> >
> >
> 
> Hah, this is too funny.
> 
> Mark

http://youtu.be/5hyI_dM5cGo

-Lasse

Reply by Ramon F Herrera ● January 13, 2015

On 1/13/2015 3:31 PM, gyansorova@gmail.com wrote:
> Not many individuals buy it, though universities use it a lot so that they can compare algorithms. I think it is free for academic use.
>

Thanks again, gyansorova...

For a while, I have been looking all over the place and I cannot get an
answer to my curiosity. My interest is not exactly voice recognition per
se, but highly related.

I have been reading (glancing) about different VR systems, and one of
their main desirable features is how they remain working in the presence
of NOISE.

What I need is the *opposite*. I would like to tune a VR system in the
other direction, so it breaks quickly and proclaims:

"That is NOT a human voice".

So it would be an expert at *Noise Recognition*.

Such a feature would be at the core of a Noise Removal System.

This is the target audio clip that I have in mind:

https://www.youtube.com/watch?v=a-EAqNbpcYE

I figure that if I can remove the noise from that one, I can remove it
from almost anything.

Notice that it has:

- traffic
- breathing (wind?)
- a helicopter
- a multi-frequency screeching train
- a crying baby
- even a tweeting bird!

The good news is that the subject has a deep voice. Additionally, I
don't care whether the application has to spend hours and hours
proffering different variations (brute force guesses) until the VR part
says:

"Now we are talking! -pun intended- That sounds like a human voice!"

Can you please issue an educated guess? Does my idea seem reasonable?

TIA,

-Ramon

Reply by Eric Jacobsen ● January 13, 2015

On Tue, 13 Jan 2015 13:52:35 -0800 (PST), langwadt@fonz.dk wrote:

>On Monday, January 12, 2015 at 10:27:44 UTC+1, Mac Decman wrote:
>> On Sun, 11 Jan 2015 22:26:59 -0600, Ramon F Herrera
>> <ramon@conexus.net> wrote:
>> 
>> <snip>
>> >
>> >In addition to having a range from X Hertz to Y Kilohertz:
>> >
>> >    http://en.wikipedia.org/wiki/Vocal_range
>> >
>> >What other characteristics does our voice have?
>> >
>> >TIA,
>> >
>> >-Ramon
>> >
>> >
>> 
>> Hah, this is too funny.
>> 
>> Mark
>
>http://youtu.be/5hyI_dM5cGo
>
>-Lasse


Nice!


Eric Jacobsen
Anchor Hill Communications
http://www.anchorhill.com

Reply by Bob Masta ● January 14, 2015

On Tue, 13 Jan 2015 16:15:25 -0600, Ramon F Herrera
<ramon@patriot.net> wrote:

>On 1/13/2015 3:31 PM, gyansorova@gmail.com wrote:
>> Not many individuals buy it, though universities use it a lot so that they can compare algorithms. I think it is free for academic use.
>>
>
>
>Thanks again, gyansorova...
>
>For a while, I have been looking all over the place and I cannot get an 
>answer to my curiosity. My interest is not exactly voice recognition per 
>se, but highly related.
>
>I have been reading (glancing) about different VR systems, and one of 
>their main desirable features is how they remain working in the presence 
>of NOISE.
>
>What I need is the *opposite*. I would like to tune a VR system in the 
>other direction, so it breaks quickly and proclaims:
>
>"That is NOT a human voice".
>
>So it would be an expert at *Noise Recognition*.
>
>Such a feature would be at the core of a Noise Removal System.
>
>This is the target audio clip that I have in mind:
>
>   https://www.youtube.com/watch?v=a-EAqNbpcYE
>
>I figure that if I can remove the noise from that one, I can remove it 
>from almost anything.
>
>Notice that it has:
>
>  - traffic
>  - breathing (wind?)
>  - a helicopter
>  - a multi-frequency screeching train
>  - a crying baby
>  - even a tweeting bird!
>
>The good news is that the subject has a deep voice. Additionally, I 
>don't care whether the application has to spend hours and hours 
>proffering different variations (brute force guesses) until the VR part 
>says:
>
>"Now we are talking! -pun intended- That sounds like a human voice!"
>
>Can you please issue an educated guess? Does my idea seem reasonable?

The problem is that recognizing the presence or absence of a
human voice (or even recognizing the words!) doesn't help
with removing noise.   It's not like you can say "aha, human
voice... now I'll just subtract out everything that *isn't*
a human voice."

Best regards,


Bob Masta

Reply by ● January 14, 2015

On Thursday, January 15, 2015 at 2:34:04 AM UTC+13, Bob Masta wrote:
> The problem is that recognizing the presence or absence of a
> human voice (or even recognizing the words!) doesn't help
> with removing noise.   It's not like you can say "aha, human
> voice... now I'll just subtract out everything that *isn't*
> a human voice."

There are Gaussian Mixture Model (GMM) based approaches to removing noise of various types from speech. They work by using a codebook of speech and noise characterisations. When they find the nearest match, they apply a Wiener filter frame by frame. See variations of:


http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4518754&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4518754
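
To make the "nearest match, then Wiener filter" step concrete, here is a heavily simplified sketch. It is not the method of the linked paper: it uses a hard nearest-neighbour search over two tiny hand-made codebooks instead of a trained GMM, and the function names, the 8-bin spectra, and the log-spectral distance are all assumptions made for the example.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero

def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain S / (S + N) for given power spectra."""
    return speech_psd / (speech_psd + noise_psd + EPS)

def enhance_frame(noisy_psd, speech_codebook, noise_codebook):
    """Find the speech/noise codebook pair whose sum best matches the
    observed frame PSD (log-spectral distance), then return the Wiener
    gain for that pair together with the chosen indices."""
    best_pair, best_err = None, np.inf
    for i, s in enumerate(speech_codebook):
        for j, n in enumerate(noise_codebook):
            err = np.sum((np.log(noisy_psd + EPS) - np.log(s + n + EPS)) ** 2)
            if err < best_err:
                best_pair, best_err = (i, j), err
    i, j = best_pair
    return wiener_gain(speech_codebook[i], noise_codebook[j]), best_pair

# Hypothetical codebooks: each row is a power spectrum over 8 bins.
speech_codebook = np.array([
    [9.0, 9.0, 1.0, 1.0, 0.1, 0.1, 0.1, 0.1],   # energy in the low bins
    [0.1, 0.1, 9.0, 9.0, 1.0, 1.0, 0.1, 0.1],   # energy in the mid bins
])
noise_codebook = np.array([
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 4.0, 4.0],   # hiss-like, high bins
    [4.0, 4.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # rumble-like, low bins
])

# One observed frame: the second speech entry plus the first noise entry.
noisy_psd = speech_codebook[1] + noise_codebook[0]
gain, (speech_idx, noise_idx) = enhance_frame(noisy_psd, speech_codebook, noise_codebook)
print("chosen entries:", speech_idx, noise_idx)
print("gain per bin:", np.round(gain, 3))
```

The gain is close to 1 in the bins where the matched speech entry dominates and close to 0 where the matched noise entry dominates. Multiplying each frame's spectrum by its gain and resynthesizing gives the enhanced signal.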

Reply by Ramon F Herrera ● January 14, 2015

On 1/14/2015 7:34 AM, Bob Masta wrote:
> The problem is that recognizing the presence or absence of a
> human voice (or even recognizing the words!) doesn't help
> with removing noise.   It's not like you can say "aha, human
> voice... now I'll just subtract out everything that *isn't*
> a human voice."
>
> Best regards,
>
>
> Bob Masta
>

I respectfully beg to differ, Bob. Watch this course:

====================================================================

http://www.macprovideo.com/tutorial/izotope-rx4-audio-repair-toolbox-2 
(Free registration)

Chapter 39. Deconstructing Tone & Noise

The instructor claims: "It is as simple as it is amazing".

Watch the introduction as well:

http://play.macprovideo.com/izotope-rx4-audio-repair-toolbox-2/1

Jeff chose the analogy of salt getting into the food that he is cooking. 
That problem can *theoretically* be solved! If we can take one molecule 
at a time, we can leave the NaCl molecules out, and the rest of the 
delicious gumbo in.

====================================================================

In this particular musical case the problem is not too hard: a pure-note 
instrument (piano) PLUS (it is indeed an arithmetic addition) some low 
frequency hum.

Still: Sinusoids are added and subtracted.
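
For the easy case (a steady hum added to a note), the subtraction can be sketched with a least-squares fit. This is not what RX does internally, as far as I know; it is just the textbook approach. It assumes the hum frequency is known exactly and stays constant, and the function name and test values are invented for the example.

```python
import numpy as np

def remove_hum(x, fs, hum_hz):
    """Least-squares fit a sinusoid at hum_hz to x and subtract it."""
    t = np.arange(len(x)) / fs
    # Cosine and sine columns together cover any amplitude and phase.
    basis = np.column_stack([np.cos(2 * np.pi * hum_hz * t),
                             np.sin(2 * np.pi * hum_hz * t)])
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return x - basis @ coeffs

fs = 8000
t = np.arange(fs) / fs                               # one second
note = 0.8 * np.sin(2 * np.pi * 440 * t)             # the "piano" note
hum = 0.5 * np.sin(2 * np.pi * 60 * t + 0.7)         # mains hum, unknown phase
recovered = remove_hum(note + hum, fs, hum_hz=60.0)
residual = np.max(np.abs(recovered - note))
print("largest error after hum removal:", residual)
```

It works this cleanly only because the note and the hum do not overlap in frequency. A helicopter or a crying baby is neither a single sinusoid nor stationary, which is where the hard part starts.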

I am sure that in the not too distant future we will be able to take the 
provided Dealey Plaza audio, inform the system:

  - There is a helicopter, which sounds like this [...]
  - There is a screeching train which sounds like this [...]
  - There is a crying baby who sounds like this [...]
  - Finally, there is the voice in which I am interested. [...]

My initial mistake was assuming that all "noise" was garbage that could 
be thrown out. That was a HUGE mistake! Any "atom" of helicopter rumble 
helps to identify and subtract all its sister helicopter sounds.

Every unit of sound there was placed by some source (IOW: God did not 
add some random noise, just to mess with us). All the system has to do 
is pick up every "atom" -one at a time- and figure out (granted, this is 
not easy) on which "plate" it belongs. The number of "plates" is 
finite and small. At the end, the user will have a multi-channel file, 
nicely separated. In my case, I would throw away the n-1 channels, 
leaving the tourist guide fellow's voice.
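
Here is a toy sketch of the "atoms onto plates" idea, assigning each frequency bin to whichever source dominates it. It is deliberately rigged: the sources are pure tones that never overlap in frequency, and the per-source templates are taken from the isolated sources, which is exactly the information one does not have in a real recording. All names and values are hypothetical.

```python
import numpy as np

def split_by_templates(mix_spec, templates):
    """Give each bin of a complex spectrum to the source whose magnitude
    template is largest there; return one masked spectrum per source."""
    owner = np.argmax(templates, axis=0)       # dominant source per bin
    return np.array([np.where(owner == k, mix_spec, 0.0)
                     for k in range(len(templates))])

fs = 8000
t = np.arange(fs) / fs                         # one second
voice = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
helicopter = 0.8 * np.sin(2 * np.pi * 20 * t)
bird = 0.3 * np.sin(2 * np.pi * 3000 * t)
sources = [voice, helicopter, bird]
mix = sum(sources)

# Assumption: we somehow know each source's magnitude spectrum in advance.
templates = np.array([np.abs(np.fft.rfft(s)) for s in sources])

plates = split_by_templates(np.fft.rfft(mix), templates)
channels = np.fft.irfft(plates, n=fs, axis=1)  # one time signal per source

voice_error = np.max(np.abs(channels[0] - voice))
print("largest error in the voice channel:", voice_error)
```

Every bin goes to exactly one plate, so the channels add back up to the original mix. With real sources that overlap in both time and frequency, a single bin contains a sum of several sources, and deciding how to split it is the actual problem.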

-Ramon
