Randy Yates wrote:
> "doggie" <elusivetruelove2003@yahoo.com> writes:
>
>> [...]
>> http://www.ergonomics4schools.com/lzone/noise.htm
>
> This site states:
>
>   The human voice produces frequencies between 500Hz and 2,000Hz.
>
> I would say this is just plain wrong. As I stated before, the minimum
> that is accepted by industry is 300 Hz to 3400 Hz.
>
> This is one of the downsides of the internet - there is bad
> information out there as well as good.

The human voice produces frequencies between 800 and 1200 Hz. Of course, it produces other frequencies too. The site isn't wrong, but it sure is misleading. When I see crap like that in a book, I tend to suspect everything else in it. It takes a lot of redeeming value to overcome that prejudice.

Jerry
--
Engineering is the art of making what you want from things you can get.
Clean speech wav files
Started by ●April 3, 2006
Reply by ●April 4, 2006
Reply by ●April 4, 2006
"Phone systems (namely, landlines, and the GSM, AMPS, and DAMPS cell systems) typically use a bandwidth of 300 Hz to 3400 Hz. " The GSM Enhanced Full Rate (EFR) codec specifies a high-pass filter at 80Hz. The original GSM Half Rate and standard Full Rate codecs have a high-pass filter, but following the specification gives a broken filter which has very little attenuation at dc. This can have very bad consequences for IVR systems when the silences between the prompts contain a steady dc level. While you might think this is uncommon, it actually happens a lot when systems designed in USA or Canada are converted for use in Europe. The prompts get changed, but nobody thinks to re-record the silences so u-law silence gets played out into an A-law system. The dc level is about 30dB below clipping on an ISDN system. Combined with the fact that some echo cancellers treat dc as a perfectly valid continuous tone it forces them into a doubletalk condition where the silence may be given priority over the speech that immediately follows it. This can have a bad effect on digit recognition where the first digit of a spoken telephone number is sometimes lost. It is also worth remembering that if both ends of the call are terminated with ISDN connections to a computer or digital switch the bandwidth of the channel will be DC to 4kHz under most conditions. (I have found this experimentally between countries in Europe and even on some transatlantic calls.) John
Reply by ●April 4, 2006
>"Phone systems (namely, landlines, and the GSM, AMPS, >and DAMPS cell systems) typically use a bandwidth of 300 Hz to 3400 >Hz. " > >The GSM Enhanced Full Rate (EFR) codec specifies a high-pass filter at >80Hz. > >The original GSM Half Rate and standard Full Rate codecs have a >high-pass filter, but following the specification gives a broken filter >which has very little attenuation at dc. > >This can have very bad consequences for IVR systems when the silences >between the prompts contain a steady dc level. > >While you might think this is uncommon, it actually happens a lot when >systems designed in USA or Canada are converted for use in Europe. The >prompts get changed, but nobody thinks to re-record the silences so >u-law silence gets played out into an A-law system. The dc level is >about 30dB below clipping on an ISDN system. Combined with the fact >that some echo cancellers treat dc as a perfectly valid continuous tone >it forces them into a doubletalk condition where the silence may be >given priority over the speech that immediately follows it. This can >have a bad effect on digit recognition where the first digit of a >spoken telephone number is sometimes lost. > >It is also worth remembering that if both ends of the call are >terminated with ISDN connections to a computer or digital switch the >bandwidth of the channel will be DC to 4kHz under most conditions. (I >have found this experimentally between countries in Europe and even on >some transatlantic calls.) > >John > >Hmm.. then do you suggest i remove the DC component by doing a high pass filter?
Reply by ●April 4, 2006
>doggie wrote:
>
>   ...
>
>> And it seems i can only filter out 0~20Hz.
>
>How does that follow? There's very little energy below 80 Hz even in a
>basso profundo's lowest notes. Removing 50 or 60 Hz (depending on where
>one lives) is often salutary. In general, removing those bands with the
>lowest SNR is helpful. Listening to noise is tiring. Think of the fan in
>your kitchen.
>
>Jerry

OK..

1) Does that mean I can only improve things with a high-pass filter whose cutoff is above 60 Hz?

2) Most of the wave files I've got are sampled at 8 kHz, so I guess all the information above 4 kHz (formants and so on) has been filtered away. Am I right?

3) As I am using mostly energy-based detection algorithms, I don't think I can filter out any portion that may contain speech; otherwise the speech energy may drop and affect my algorithm. I guess I will try a high-pass filter to remove some noise at the 0~50 Hz end.

4) My clean speech is at -28 dB, as calculated by the formula Randy gave previously, so to get an SNR of 10 dB I add noise using wgn(length(cleanspeech),1,-38). Did I do anything wrong?

So far, the results for 10 and 5 dB SNR are still acceptable, except for some parts where noise is detected as speech. I intend to get rid of those by adding zero-crossing detection or something like that. Or maybe do a spectral subtraction before detection, but it may be too computationally heavy. I'll have to look further into this.

Thanks
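The arithmetic in question 4 (speech at -28 dB, noise at -38 dB, so 10 dB SNR) can be sketched in Python as follows. This is not the MATLAB wgn() call itself, only a stand-in using the standard library. The test tone, the seed, and the function names are assumptions made for the example, and power is measured over the whole signal, silent gaps included.

```python
import math
import random

def power_db(x):
    """Mean power of a sequence in dB (relative to a full scale of 1.0)."""
    return 10 * math.log10(sum(v * v for v in x) / len(x))

def add_white_noise(clean, snr_db, rng=None):
    """Return clean + white Gaussian noise scaled to the requested SNR."""
    rng = rng or random.Random(0)          # fixed seed so runs are repeatable
    noise_db = power_db(clean) - snr_db    # e.g. -28 dB - 10 dB = -38 dB
    sigma = 10 ** (noise_db / 20)          # standard deviation = sqrt(power)
    return [v + rng.gauss(0.0, sigma) for v in clean]

# Stand-in for the clean speech: one second of a 200 Hz tone at 8 kHz.
fs = 8000
clean = [0.05 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
noisy = add_white_noise(clean, snr_db=10)

# Check the result by measuring the noise that was actually added.
noise = [b - a for a, b in zip(clean, noisy)]
measured_snr = power_db(clean) - power_db(noise)
print(round(power_db(clean), 1), round(measured_snr, 1))
```

Measuring the SNR after mixing, as in the last lines, is a cheap way to confirm that the speech level and the noise level use the same dB reference.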
Reply by ●April 4, 2006
"Hmm.. then do you suggest i remove the DC component by doing a high pass filter?" It depends. You may never see dc offsets large enough to cause you a problem. If there is an analogue stage in your recording chain, then any offset is locally generated. You will not see those offsets coming in from elsewhere because there is sure to be a transformer or coupling capacitor in the path. Normal telephone calls will be unlikely to have this problem as the codecs at the exchange (central office) or built in to a digital telephone will probably be good enough. It is really only fully digital connections to IVR systems (and possibly some mobile networks) that are likely to suffer. Maybe also voip systems as they all need echo cancellers. Even then the problem is technically easy to fix once the system operators know about it. (But it may be too much trouble...) I mention the problem mainly because it can be so difficult to track down if you are not expecting it. There is nothing visible on an analogue phone line except maybe a tiny spike which appears far to small to be of any importance. I spent months trying to track down why a particular IVR system was performing badly - and this was one of the reasons. If you think it could affect you, put in a filter. It doesn't even need to be a very good one . -20dB at dc, 1st order hp would be enough. A corner frequency of 80 Hz has a good feel to it. Most telecomms codecs have built-in notches at 50 and 60Hz, so you don't need to worry about those frequencies. Their harmonics on the other hand can be a problem. If you are having problems measuring the noise floor correctly and maybe not segmenting speech accurately then bear this possible cause in mind. John
Reply by ●April 4, 2006
jrwalliker@gmail.com writes:
> "Hmm.. then do you suggest i remove the DC component by doing a high
> pass filter?"
>
> It depends.
> [...]

John,

I think you're missing some of doggie's context. He is adding white noise digitally to his signal to simulate SNR. This necessarily means there will be signal energy in all bands.
--
% Randy Yates % "Though you ride on the wheels of tomorrow,
%% Fuquay-Varina, NC % you still wander the fields of your
%%% 919-577-9882 % sorrow."
%%%% <yates@ieee.org> % '21st Century Man', *Time*, ELO
http://home.earthlink.net/~yatescr
Reply by ●April 4, 2006
>jrwalliker@gmail.com writes:
>
>> "Hmm.. then do you suggest i remove the DC component by doing a high
>> pass filter?"
>>
>> It depends.
>> [...]
>
>John,
>
>I think you're missing some of doggie's context. He is adding white
>noise digitally to his signal to simulate SNR. This necessarly means
>there will be signal energy in all bands.

Well, actually I'm not sure that what I'm doing is correct. Basically, I have a wav file of clean speech. I add noise to it (should I be adding wgn?) and run detection at various SNRs to test the robustness of my algorithms. After that I compare the results with the speech frames hand-marked on the clean speech. That's basically what I'm trying to do. Please correct me if I did something wrong.

I don't know how the clean speech was recorded, but I'll try to filter away the DC component just in case and see if it gives a better result.
Reply by ●April 4, 2006
"doggie" <elusivetruelove2003@yahoo.com> writes:>>jrwalliker@gmail.com writes: >> >>> "Hmm.. then do you suggest i remove the DC component by doing a high >>> pass >>> filter?" >>> >>> It depends. > [...] >> >>John, >> >>I think you're missing some of doggie's context. He is adding white >>noise digitally to his signal to simulate SNR. This necessarly means >>there will be signal energy in all bands. >>-- >>% Randy Yates % "Though you ride on the wheels of > tomorrow, >>%% Fuquay-Varina, NC % you still wander the fields of your >>%%% 919-577-9882 % sorrow." >>%%%% <yates@ieee.org> % '21st Century Man', *Time*, ELO >>http://home.earthlink.net/~yatescr >> > > Well,actually i'm not sure what i'm doing is correct. basically, i have a > wav file of clean speech. So i add noise to it (Should i be adding wgn??). > > and do a detection at various SNR to test the robustness of my > algorithms. > After which i will compare the results with the hand marked speech frames > using the clean speech. That's basically what i'm trying to do. Please > correct me if i did something wrong. I don't know the nature of recording > of the clean speech but i'll try to filter away the DC component just in > case and see if it gives a better result.That sounds like a good start to me. However, wgn() is the "easiest" kind of noise you could add (i.e., "white"). If you wanted to further characterize your algorithm, you might try colored noises, as well as trying to inject interferers (an interferer is a component that is deterministic or near-deterministic but not part of the signal). I don't know what your application is, but you could try injecting power supply (60 Hz) harmonics, bumblebee noise (Google), road noise, etc. -- % Randy Yates % "With time with what you've learned, %% Fuquay-Varina, NC % they'll kiss the ground you walk %%% 919-577-9882 % upon." %%%% <yates@ieee.org> % '21st Century Man', *Time*, ELO http://home.earthlink.net/~yatescr
Reply by ●April 5, 2006
>That sounds like a good start to me. However, wgn() is the "easiest"
>kind of noise you could add (i.e., "white"). If you wanted to further
>characterize your algorithm, you might try colored noises, as well as
>trying to inject interferers (an interferer is a component that is
>deterministic or near-deterministic but not part of the signal). I
>don't know what your application is, but you could try injecting power
>supply (60 Hz) harmonics, bumblebee noise (Google), road noise, etc.

Yes, I'm going to try out speech recorded in a cafeteria, in a car, and more, if that's what you mean by interferers. Thanks.

Anybody care to comment on my other questions? Sorry if I'm asking too much. As far as possible, I have tried to implement things myself with my beginner's knowledge. Thanks again.

1) Most of the wave files I've got are sampled at 8 kHz, so I guess all the information above 4 kHz (formants and so on) has been filtered away. Is it correct to say this?

2) As I am using mostly energy-based detection algorithms, I don't think I can filter out any portion that may contain speech; otherwise the speech energy may drop and affect my algorithm. I tried using some high-pass filters, and the noise level is even higher than if I had not done the high-pass filtering. Why is this so? So I'm actually making things worse by doing pre-emphasis.

3) My clean speech is at -28 dB, as calculated by the formula Randy gave previously, so to get an SNR of 10 dB I add noise using wgn(length(cleanspeech),1,-38). Did I do anything wrong?

4) So far, there is still some noise categorized as speech, so I'm thinking of doing some other processing to improve it. Maybe adding a zero-crossing-rate criterion, or doing a spectral subtraction (SS) before detection. I've found a paper that does SS->detection->SS to get clean speech. I think I'll try it out.

Down to what SNR must an algorithm still detect the speech frames for it to be considered acceptable and good? Maybe someone can share their opinion?

Thanks a lot.
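The idea in point 4 of combining frame energy with a zero-crossing-rate criterion can be sketched as follows. This is only an illustration, not the poster's algorithm: the 20 ms frame, the -40 dB energy threshold, the 0.3 ZCR limit, and all function names are hypothetical. A rule like this will also reject high-ZCR unvoiced sounds such as fricatives, so the thresholds would need tuning on real data.

```python
import math

def frames(x, frame_len, hop):
    """Split x into overlapping frames, dropping any incomplete tail."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def energy_db(frame, floor=1e-12):
    """Mean frame power in dB; the floor avoids log(0) on digital silence."""
    return 10 * math.log10(sum(v * v for v in frame) / len(frame) + floor)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def is_speech(frame, energy_thresh_db, zcr_max):
    # Hypothetical rule: loud enough AND not noise-like in its sign changes.
    return (energy_db(frame) > energy_thresh_db
            and zero_crossing_rate(frame) < zcr_max)

fs = 8000
frame_len, hop = int(0.020 * fs), int(0.010 * fs)   # 20 ms frames, 10 ms hop
voiced = [0.1 * math.sin(2 * math.pi * 150 * n / fs) for n in range(frame_len)]
hiss = [0.1 * (-1) ** n for n in range(frame_len)]  # extreme case: flips every sample

print(is_speech(voiced, -40.0, 0.3), is_speech(hiss, -40.0, 0.3))
```

Both test frames pass the energy threshold, but only the low-frequency one passes the ZCR limit, which is the extra discrimination being asked about. The frame length is derived from a duration in seconds rather than fixed in samples, so the same settings carry over to a different sample rate.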
Reply by ●April 6, 2006
My algorithm is able to detect speech fairly well when I use wav files sampled at 8 kHz, though there are some parts where noise is taken for speech. But when I use wav files sampled at 16 kHz, some speech frames are not detected even though their amplitude is quite high. Why is this so?

Another thing: when I add wgn noise to get an SNR of 5 dB or even -5 dB, I can still detect the speech frames, though in some parts speech is taken for noise. However, at an SNR of -5 dB, I should not be able to detect many speech frames accurately at all, should I? Is there something wrong in my algorithm, or what?

Please advise. Thanks






