DSPRelated.com
Forums

Clean speech wav files

Started by doggie April 3, 2006
Randy Yates wrote:
> "doggie" <elusivetruelove2003@yahoo.com> writes:
>
>>[...]
>>http://www.ergonomics4schools.com/lzone/noise.htm
>
> This site states:
>
>   The human voice produces frequencies between 500Hz and 2,000Hz.
>
> I would say this is just plain wrong. As I stated before, the minimum
> that is accepted by industry is 300 Hz to 3400 Hz.
>
> This is one of the downsides of the internet - there is bad
> information out there as well as good.
The human voice produces frequencies between 800 and 1200 Hz. Of course,
it produces other frequencies too. The site isn't wrong, but it sure is
misleading. When I see crap like that in a book, I tend to suspect
everything else in it. It takes a lot of redeeming value to overcome
that prejudice.

Jerry
--
Engineering is the art of making what you want from things you can get.

"Phone systems (namely, landlines, and the GSM, AMPS,
and DAMPS cell systems) typically use a bandwidth of 300 Hz to 3400
Hz. "

The GSM Enhanced Full Rate (EFR) codec specifies a high-pass filter at
80Hz.

The original GSM Half Rate and standard Full Rate codecs have a
high-pass filter, but following the specification gives a broken filter
which has very little attenuation at dc.

This can have very bad consequences for IVR systems when the silences
between the prompts contain a steady dc level.

While you might think this is uncommon, it actually happens a lot when
systems designed in the USA or Canada are converted for use in Europe.
The prompts get changed, but nobody thinks to re-record the silences, so
u-law silence gets played out into an A-law system.  The resulting dc
level is about 30 dB below clipping on an ISDN system.  Because some
echo cancellers treat dc as a perfectly valid continuous tone, this
forces them into a doubletalk condition where the silence may be given
priority over the speech that immediately follows it.  This can have a
bad effect on digit recognition, where the first digit of a spoken
telephone number is sometimes lost.
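
The "about 30 dB below clipping" figure can be checked with a short sketch. It assumes the standard G.711 expansion rules on a 16-bit scale, and it assumes the u-law silence in question is the byte 0xFF (the u-law code for zero). The function names are illustrative and not taken from any particular library.

```python
import math

def alaw_to_linear(byte):
    """Expand one G.711 A-law byte to a 16-bit-scale linear sample."""
    byte ^= 0x55                       # undo the even-bit inversion
    mantissa = (byte & 0x0F) << 4
    segment = (byte & 0x70) >> 4
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law a set sign bit means a positive sample.
    return value if byte & 0x80 else -value

def ulaw_to_linear(byte):
    """Expand one G.711 mu-law byte to a 16-bit-scale linear sample."""
    byte = ~byte & 0xFF                # mu-law bytes are stored inverted
    magnitude = ((((byte & 0x0F) << 3) + 0x84) << ((byte & 0x70) >> 4)) - 0x84
    return -magnitude if byte & 0x80 else magnitude

MU_LAW_SILENCE = 0xFF                       # assumed silence byte
ALAW_FULL_SCALE = alaw_to_linear(0xAA)      # largest positive A-law value

dc = alaw_to_linear(MU_LAW_SILENCE)
level_db = 20 * math.log10(abs(dc) / ALAW_FULL_SCALE)
print(ulaw_to_linear(MU_LAW_SILENCE), dc, round(level_db, 1))
```

Under these assumptions the byte is a true zero when read as u-law, but read as A-law it becomes a steady 848 out of a full scale of 32256, roughly 31.6 dB below clipping, which is consistent with the figure quoted above.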

It is also worth remembering that if both ends of the call are
terminated with ISDN connections to a computer or digital switch the
bandwidth of the channel will be DC to 4kHz under most conditions.  (I
have found this experimentally between countries in Europe and even on
some transatlantic calls.)

John

Hmm.. then do you suggest I remove the DC component by doing a high-pass filter?
>doggie wrote:
>
>   ...
>
>> And it seems i can only filter out 0~20Hz.
>
>How does that follow? There's very little energy below 80 Hz even in a
>basso profundo's lowest notes. Removing 50 or 60 Hz (depending on where
>one lives) is often salutary. In general, removing those bands with the
>lowest SNR is helpful. Listening to noise is tiring. Think of the fan in
>your kitchen.
>
>Jerry
>--
>Engineering is the art of making what you want from things you can get.

OK, a few follow-up questions:

1) Does that mean I can only improve things with a high-pass filter whose cutoff is above 60 Hz?
2) Most of the wav files I have are sampled at 8 kHz, so I guess all the information above 4 kHz (formants and so on) has already been filtered away. Am I right?
3) As I am using mostly energy-based detection algorithms, I don't think I can filter out any band that may contain speech; otherwise the speech energy may drop and affect my algorithm. I guess I will try a high-pass filter to remove some of the noise at the 0~50 Hz end.
4) My clean speech is at -28 dB, as calculated by the formula Randy gave previously, so to get an SNR of 10 dB I add noise using wgn(length(cleanspeech),1,-38). Did I do anything wrong?

So far, the results at 10 dB and 5 dB SNR are still acceptable, except for some parts where noise is detected as speech. I intend to get rid of those by adding zero-crossing detection or something similar, or maybe by doing spectral subtraction before detection, though that may be too computationally heavy. I'll have to look further into this.

Thanks
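
The arithmetic in question 4 is noise power (dB) = speech power (dB) - SNR (dB), so -28 - 10 = -38. As far as I know, MATLAB's wgn() takes its power argument in dBW by default, so the call is consistent provided the -28 figure is a mean-square power on the same scale. Below is a minimal Python sketch of the same step, not the poster's code. Randy's formula is not shown in this part of the thread, so the mean-square definition in `power_db` is an assumption, and the function names and the test tone are hypothetical.

```python
import math
import random

def power_db(samples):
    """Mean-square power in dB (relative to a full scale of 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)

def add_white_noise(speech, snr_db, rng=random):
    """Return speech plus white Gaussian noise at the requested SNR."""
    noise_db = power_db(speech) - snr_db          # e.g. -28 - 10 = -38
    sigma = math.sqrt(10 ** (noise_db / 10))      # target noise RMS amplitude
    noise = [rng.gauss(0.0, sigma) for _ in speech]
    # Rescale so the realised noise power is exact, not just expected.
    actual = math.sqrt(sum(n * n for n in noise) / len(noise))
    return [s + n * sigma / actual for s, n in zip(speech, noise)]

# Stand-in for clean speech: a 200 Hz tone at 8 kHz (hypothetical test signal).
fs = 8000
speech = [0.05 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
noisy = add_white_noise(speech, snr_db=10, rng=random.Random(0))
noise_only = [y - s for y, s in zip(noisy, speech)]
measured_snr = power_db(speech) - power_db(noise_only)
print(round(measured_snr, 2))
```

One thing to keep in mind: if the speech power is averaged over the whole file, pauses included, the SNR during the active speech segments is higher than the nominal figure.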
"Hmm.. then do you suggest i  remove the DC component by doing a high
pass filter?"

It depends.  You may never see dc offsets large enough to cause you a
problem.  If there is an analogue stage in your recording chain, then
any offset is locally generated. You will not see those offsets coming
in from elsewhere because there is sure to be a transformer or coupling
capacitor in the path.  Normal telephone calls will be unlikely to have
this problem as the codecs at the exchange (central office) or built in
to a digital telephone will probably be good enough.

It is really only fully digital connections to IVR systems (and
possibly some mobile networks) that are likely to suffer.  Maybe also
VoIP systems, as they all need echo cancellers.  Even then the problem
is technically easy to fix once the system operators know about it.
(But it may be too much trouble...)

I mention the problem mainly because it can be so difficult to track
down if you are not expecting it.  There is nothing visible on an
analogue phone line except maybe a tiny spike which appears far too
small to be of any importance.

I spent months trying to track down why a particular IVR system was
performing badly - and this was one of the reasons.

If you think it could affect you, put in a filter.  It doesn't even
need to be a very good one.  -20 dB at dc, 1st order hp would be
enough.  A corner frequency of 80 Hz has a good feel to it.  Most
telecoms codecs have built-in notches at 50 and 60 Hz, so you don't
need to worry about those frequencies.  Their harmonics, on the other
hand, can be a problem.
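
A minimal sketch of such a filter, assuming an 8 kHz sample rate: a one-pole, one-zero dc blocker with the pole placed for a corner of roughly 80 Hz. This is one common first-order form and not necessarily the filter meant above; because its zero sits exactly at dc, it removes a steady offset completely instead of stopping at -20 dB. The function name, the parameter names and the offset value are illustrative.

```python
import math

def dc_block(samples, fs=8000, corner_hz=80.0):
    """First-order high-pass: y[n] = g*(x[n] - x[n-1]) + r*y[n-1]."""
    r = math.exp(-2 * math.pi * corner_hz / fs)   # pole just inside the unit circle
    g = (1 + r) / 2                               # unity gain at the Nyquist frequency
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = g * (x - prev_x) + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# A steady offset (hypothetical value) dies away within a few tens of ms.
settled = dc_block([848.0] * 800)
print(abs(settled[-1]) < 1e-6)

# A 1 kHz tone passes almost unchanged once the start-up transient is gone.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8000)]
tail = dc_block(tone)[-800:]
tone_power_ratio = (sum(v * v for v in tail) / len(tail)) / 0.5
print(round(tone_power_ratio, 3))
```

The corner is only approximately 80 Hz with this choice of `r`, which fits the spirit of "it doesn't even need to be a very good one".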

If you are having problems measuring the noise floor correctly and
maybe not segmenting speech accurately then bear this possible cause in
mind.

John

jrwalliker@gmail.com writes:

> "Hmm.. then do you suggest i remove the DC component by doing a high
> pass filter?"
>
> It depends.
> [...]

John,

I think you're missing some of doggie's context. He is adding white
noise digitally to his signal to simulate SNR. This necessarily means
there will be signal energy in all bands.
--
%  Randy Yates                  % "Though you ride on the wheels of tomorrow,
%% Fuquay-Varina, NC            %  you still wander the fields of your
%%% 919-577-9882                %  sorrow."
%%%% <yates@ieee.org>           % '21st Century Man', *Time*, ELO
http://home.earthlink.net/~yatescr

Well, actually I'm not sure that what I'm doing is correct. Basically, I have a wav file of clean speech. I add noise to it (should I be adding wgn?) and run detection at various SNRs to test the robustness of my algorithms. I then compare the results with speech frames hand-marked on the clean speech. That's basically what I'm trying to do. Please correct me if I did something wrong. I don't know how the clean speech was recorded, but I'll try filtering away the DC component just in case and see if it gives a better result.

"doggie" <elusivetruelove2003@yahoo.com> writes:

> Well,actually i'm not sure what i'm doing is correct. basically, i have a
> wav file of clean speech. So i add noise to it (Should i be adding wgn??).
> and do a detection at various SNR to test the robustness of my algorithms.
> After which i will compare the results with the hand marked speech frames
> using the clean speech. That's basically what i'm trying to do. Please
> correct me if i did something wrong. I don't know the nature of recording
> of the clean speech but i'll try to filter away the DC component just in
> case and see if it gives a better result.

That sounds like a good start to me. However, wgn() is the "easiest" kind of noise you could add (i.e., "white"). If you wanted to further characterize your algorithm, you might try colored noises, as well as trying to inject interferers (an interferer is a component that is deterministic or near-deterministic but not part of the signal). I don't know what your application is, but you could try injecting power supply (60 Hz) harmonics, bumblebee noise (Google), road noise, etc.
--
%  Randy Yates                  % "With time with what you've learned,
%% Fuquay-Varina, NC            %  they'll kiss the ground you walk
%%% 919-577-9882                %  upon."
%%%% <yates@ieee.org>           % '21st Century Man', *Time*, ELO
http://home.earthlink.net/~yatescr
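
One way to build the power-supply interferer mentioned above is to sum a few mains harmonics and scale the result to a chosen power. This is only a sketch: the relative harmonic amplitudes, the -38 dB target and the function names are made-up illustration values, and 50 Hz would replace 60 Hz in Europe.

```python
import math

def mains_hum(n_samples, fs=8000, fundamental_hz=60.0, amplitudes=(1.0, 0.5, 0.25)):
    """Sum of mains harmonics; amplitudes[k] scales harmonic k + 1."""
    return [
        sum(a * math.sin(2 * math.pi * fundamental_hz * (k + 1) * n / fs)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]

def scale_to_power_db(samples, target_db):
    """Rescale samples so their mean-square power equals target_db."""
    mean_square = sum(s * s for s in samples) / len(samples)
    gain = math.sqrt(10 ** (target_db / 10) / mean_square)
    return [gain * s for s in samples]

# Hum at -38 dB, i.e. 10 dB below speech measured at -28 dB.
hum = scale_to_power_db(mains_hum(8000), target_db=-38.0)
hum_db = 10 * math.log10(sum(s * s for s in hum) / len(hum))
print(round(hum_db, 2))
```

The scaled hum can then be added sample by sample to the clean speech in place of, or alongside, the white noise.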
Yes, I'm going to try speech recorded in a cafeteria, in a car, and so on, if that's what you mean by interferers. Thanks.

Would anybody care to comment on my other questions? Sorry if I'm asking too much. As far as possible, I have tried to implement things myself with my beginner's knowledge. Thanks again.

1) Most of the wav files I have are sampled at 8 kHz, so I guess all the information above 4 kHz (formants and so on) has been filtered away. Is it correct to say this?
2) As I am using mostly energy-based detection algorithms, I don't think I can filter out any band that may contain speech; otherwise the speech energy may drop and affect my algorithm. I tried some high-pass filters, and the noise level came out even higher than without the high-pass filtering. Why is this so? It seems I'm actually making things worse by doing pre-emphasis.
3) My clean speech is at -28 dB, as calculated by the formula Randy gave previously, so to get an SNR of 10 dB I add noise using wgn(length(cleanspeech),1,-38). Did I do anything wrong?
4) So far, some noise is still categorized as speech, so I'm thinking of doing some other processing to improve it, maybe adding a zero-crossing-rate criterion or doing spectral subtraction (SS) before detection. I've found a paper that does SS -> detection -> SS to get clean speech; I think I'll try it out.

Down to what SNR must an algorithm still detect the speech frames for it to be considered acceptable and good? Maybe someone can share their opinion?

Thanks a lot.

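
For reference, here is a minimal sketch of how an energy measure and a zero-crossing rate can be computed per frame and combined. It is not the poster's algorithm: the decision rule, both thresholds and the function names are hypothetical and would need tuning on real data.

```python
import math

def frame_features(samples, frame_len=160):
    """Yield (energy_db, zero_crossing_rate) for each full frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energy_db = 10 * math.log10(energy + 1e-12)   # floor avoids log(0)
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        yield energy_db, crossings / (frame_len - 1)

def is_speech(energy_db, zcr, energy_floor_db=-35.0, max_zcr=0.35):
    """Flag a frame as speech when it is loud enough and not noise-like."""
    return energy_db > energy_floor_db and zcr < max_zcr

# One 20 ms frame of a 200 Hz tone at 8 kHz stands in for voiced speech.
fs = 8000
tone = [0.1 * math.sin(2 * math.pi * 200 * n / fs) for n in range(160)]
energy_db, zcr = next(frame_features(tone))
print(is_speech(energy_db, zcr))
```

White noise changes sign on about half of its sample pairs, so a frame that is loud but has a zero-crossing rate near 0.5 is more likely noise than voiced speech. Unvoiced sounds such as fricatives also have a high rate, so a rule this simple would reject them too. Note also that `frame_len` is in samples: 160 samples is 20 ms at 8 kHz but only 10 ms at 16 kHz.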
My algorithm detects speech fairly well when I use wav files sampled at
8 kHz, though in some parts noise is taken for speech. But when I use wav
files sampled at 16 kHz, some speech frames are not detected even though
their amplitude is quite high. Why is this so?

Another thing: when I add wgn noise to get an SNR of 5 dB or even -5 dB, I
can still detect the speech frames, though in some parts speech is taken
for noise. At an SNR of -5 dB, though, I shouldn't be able to detect many
speech frames accurately at all, should I? Is there something wrong in my
algorithm? Please advise.

Thanks