Reply by January 17, 2008
On Dec 27 2007, 8:38 am, Mandar Gokhale <stallo...@gmail.com> wrote:
> I am aiming to build a very basic speech recognition system around
> an 8-bit microcontroller (PIC / AVR), which is capable of
> 'recognizing' four to eight words (i.e., give a specific string
> output when it receives the corresponding input data through a mic).
>
> Someone told me that designing Butterworth filters for processing
> the input data and then sampling it at different points is a pretty
> good strategy. However, all I can find on the Net regarding this is
> a lot of highly obfuscated jargon. So, could anyone please direct me
> to a good, clear source explaining this (or any other) speech
> recognition algorithm in minute detail?
>
> Hope you people can throw some light on this.
>
> Thanks...
Mandar, you are mostly doing this for the IIT-B project, so here is a
suggestion. Since there are only four words to be recognized and they
are all distinct, use the DTW approach: it is much simpler to
implement on a microcontroller, can be implemented on a PSoC pretty
easily as well, and it won't require much RAM/ROM either. Use the
zero crossings of the speech signal in the time domain to build the
feature vector of the wave, then apply DTW to make it time-independent
and compare the resulting values. :D A rough sketch of the
zero-crossing feature follows below. Anyway, see you around on campus.

BITS - Pilani Goa Campus \m/ FTW \m/
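
A rough sketch of the zero-crossing feature in C (the 8 kHz rate and
20 ms frame size are assumptions, tune them for your hardware):

#include <stddef.h>
#include <stdint.h>

/* Count sign changes in one frame of 16-bit samples. */
static uint16_t zero_crossings(const int16_t *x, size_t n)
{
    uint16_t count = 0;
    for (size_t i = 1; i < n; i++)
        if ((x[i - 1] >= 0) != (x[i] >= 0))
            count++;
    return count;
}

/* Build a feature vector with one zero-crossing count per 20 ms
   frame (160 samples at an assumed 8 kHz rate). Returns the number
   of frames written. */
size_t zcr_features(const int16_t *x, size_t n_samples,
                    uint16_t *feat, size_t max_frames)
{
    const size_t frame = 160;
    size_t f = 0;
    for (size_t i = 0; i + frame <= n_samples && f < max_frames;
         i += frame)
        feat[f++] = zero_crossings(x + i, frame);
    return f;
}

The resulting per-frame sequence is what the DTW step would then
compare against each stored word template.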
Reply by dbell January 17, 2008
On Jan 17, 3:05 am, jnarino <jnar...@gmail.com> wrote:
> > Juan,
> >
> > I am not following what you are saying. What exactly are you going
> > to apply the DTW to, in order to normalize the lengths of the
> > waveforms, prior to cepstral analysis?
> >
> > Dirk
>
> Hi Dirk,
>
> You are correct. First you apply DTW to normalize the lengths of the
> waveforms prior to cepstral analysis.
>
> Juan
Juan,

By actually performing DTW on the waveform samples? The reason I
question this is that, if you mean what I think you mean, it seems
you would actually have to find a way to remove or insert pitch
periods to have any hope of producing similar waveforms. So what
exactly do you intend to align in the waveform while performing the
DTW to normalize the waveform lengths?

Dirk
Reply by jnarino January 17, 2008
On Jan 17, 8:07 am, dbell <bellda2...@cox.net> wrote:
> > Hi Dirk,
> >
> > When I say use DTW, I do not actually mean comparing the two
> > waveforms directly. DTW should be applied to normalize length, and
> > then cepstral coefficients should be obtained. In my opinion, if
> > you intend to use a very limited vocabulary, DTW with cepstral
> > analysis and neural networks can be an interesting option. Of
> > course there will be distortion from warping the samples, but for
> > a limited vocabulary a neural network can learn really well.
> > Please correct me if I am wrong.
> >
> > Regards
>
> Juan,
>
> I am not following what you are saying. What exactly are you going
> to apply the DTW to, in order to normalize the lengths of the
> waveforms, prior to cepstral analysis?
>
> Dirk
Hi Dirk,

You are correct. First you apply DTW to normalize the lengths of the
waveforms prior to cepstral analysis.

Juan
Reply by dbell January 16, 2008
On Jan 16, 3:13 am, jnarino <jnar...@gmail.com> wrote:
> Hi Dirk,
>
> When I say use DTW, I do not actually mean comparing the two
> waveforms directly. DTW should be applied to normalize length, and
> then cepstral coefficients should be obtained. In my opinion, if you
> intend to use a very limited vocabulary, DTW with cepstral analysis
> and neural networks can be an interesting option. Of course there
> will be distortion from warping the samples, but for a limited
> vocabulary a neural network can learn really well. Please correct me
> if I am wrong.
>
> Regards
Juan,

I am not following what you are saying. What exactly are you going to
apply the DTW to, in order to normalize the lengths of the waveforms,
prior to cepstral analysis?

Dirk
Reply by jnarino January 16, 2008
Hi Dirk,

When I say use DTW, I do not actually mean comparing the two
waveforms directly. DTW should be applied to normalize length, and
then cepstral coefficients should be obtained. In my opinion, if you
intend to use a very limited vocabulary, DTW with cepstral analysis
and neural networks can be an interesting option. Of course there
will be distortion from warping the samples, but for a limited
vocabulary a neural network can learn really well. Please correct me
if I am wrong.
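
For concreteness, a minimal DTW sketch in C over one-dimensional
per-frame features (a real system would compare cepstral vectors per
frame rather than scalars, and the fixed table size is just an
assumption for short command words):

#include <math.h>
#include <stddef.h>

#define MAXLEN 64   /* assumed upper bound on frames per word */
#define BIG    1e30

/* DTW distance between feature sequences a[0..n-1] and b[0..m-1].
   Fills a full (n+1) x (m+1) cost table, which is fine for short
   command words but too memory-hungry for long utterances on a
   small microcontroller. */
double dtw_distance(const double *a, size_t n,
                    const double *b, size_t m)
{
    static double D[MAXLEN + 1][MAXLEN + 1];
    if (n > MAXLEN || m > MAXLEN)
        return BIG;
    for (size_t i = 0; i <= n; i++)
        for (size_t j = 0; j <= m; j++)
            D[i][j] = BIG;
    D[0][0] = 0.0;
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            double cost = fabs(a[i - 1] - b[j - 1]);
            double best = D[i - 1][j];                   /* insertion */
            if (D[i][j - 1] < best)     best = D[i][j - 1];   /* deletion */
            if (D[i - 1][j - 1] < best) best = D[i - 1][j - 1]; /* match */
            D[i][j] = cost + best;
        }
    }
    return D[n][m];
}

At recognition time you would run this against the stored template
for each word and pick the smallest distance.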

Regards
Reply by dbell January 15, 2008
On Jan 15, 5:16 am, jnarino <jnar...@gmail.com> wrote:
> > Dynamic time warping on the time domain signal? Have you tried
> > that?
>
> DTW was fairly common in the 70s, at the beginning of speech
> recognition research. It is somewhat obsolete for Large Vocabulary
> Continuous Speech Recognition (LVCSR) and has been superseded by
> HMMs, but it is still used for simple commands like the original
> poster wants. For simple tasks, it is very effective.
>
> Regards
>
> Juan
Juan,

I am actually familiar with DTW. Do you think that it is appropriate
for an actual waveform, as opposed to a sequence of parameters derived
from such a waveform (like formants, ...)? Do you think that a 1 second
utterance of a word can be meaningfully brought into waveform alignment
with a 1.5 second utterance of the same word using DTW?

Dirk
Reply by jnarino January 15, 2008
On Jan 15, 4:36 am, dbell <bellda2...@cox.net> wrote:
> On Jan 14, 3:55 am, jnarino <jnar...@gmail.com> wrote:
> > However, it is not that simple, and I will explain why. First, as
> > somebody already said, you should do DTW (dynamic time warping) to
> > normalize the length of the utterance. Afterwards, you should do
> > cepstral analysis to obtain a feature vector to feed your neural
> > network. A simple PIC may not suffice.
>
> Dynamic time warping on the time domain signal? Have you tried that?
DTW was fairly common in the 70s, at the beginning of speech
recognition research. It is somewhat obsolete for Large Vocabulary
Continuous Speech Recognition (LVCSR) and has been superseded by
HMMs, but it is still used for simple commands like the original
poster wants. For simple tasks, it is very effective.

Regards

Juan
Reply by dbell January 14, 2008
On Jan 14, 3:55 am, jnarino <jnar...@gmail.com> wrote:
> However, it is not that simple, and I will explain why. First, as
> somebody already said, you should do DTW (dynamic time warping) to
> normalize the length of the utterance. Afterwards, you should do
> cepstral analysis to obtain a feature vector to feed your neural
> network. A simple PIC may not suffice.
Dynamic time warping on the time domain signal? Have you tried that?

Dirk
Reply by jnarino January 14, 2008
First, the one who should GFY is Vassily, for being such a rude
ignorant idiot. If you do not know anything about speech recognition,
just shut up.

This question on speech recognition has many possible answers.

Please first define the domain. I will assume you are only trying to
recognize a few words, so you will be doing limited-vocabulary
recognition. In this case, your best bet for speech recognition would
be neural networks. You train the neural network with a few samples
of the intended words.

However, it is not that simple, and I will explain why. First, as
somebody already said, you should do DTW (dynamic time warping) to
normalize the length of the utterance. Afterwards, you should do
cepstral analysis to obtain a feature vector to feed your neural
network. A simple PIC may not suffice.

The preprocessing stage, with the filters and such, is just for
increasing robustness and getting rid of the information we are not
interested in.

I recommend you read the introductory part of the HTK Book to
understand Hidden Markov Model based speech recognition. The book is
available for free (after a simple registration) at
http://htk.eng.cam.ac.uk/.

Another solution would be to look for those specialized ICs, but I
have not tried them, and they may not be cheap or readily available.

So basically your system should consist of the following stages, in
this order, connected in cascade:

Signal acquisition (microphone)
Bandpass filter (can be a Butterworth filter) between 100 Hz and
4000 Hz (the rest is redundant)
A/D converter, sampling at 8 kHz or higher (recommended)

Once the signal is in the microprocessor, the first thing you should
do is voice activity detection (VAD). There are several algorithms
for this; please google them.
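
As a starting point, here is a very crude energy-based VAD sketch in
C (the threshold is illustrative; in practice you would estimate it
from the background noise at startup and add some hangover logic):

#include <stddef.h>
#include <stdint.h>

/* Mark a frame as speech when its mean absolute amplitude exceeds a
   threshold. Intended for 16-bit samples in 20 ms frames. */
int frame_is_speech(const int16_t *x, size_t n, int32_t threshold)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (x[i] < 0) ? -(int32_t)x[i] : (int32_t)x[i];
    return (sum / (int32_t)n) > threshold;
}

/* Usage: declare the utterance started after a few consecutive
   speech frames, and ended after a few consecutive silence frames. */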

Once you have detected the beginning and the end of an utterance, you
should do dynamic time warping to normalize its length, so it can be
compared.

Then do framing and obtain cepstral coefficients.
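
If it helps, a minimal sketch of the framing step in C (frame and hop
sizes are just assumptions for 8 kHz input; the cepstral computation
would then run on each windowed frame):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN 200  /* 25 ms at an assumed 8 kHz */
#define FRAME_HOP 80   /* 10 ms hop */
#define PI 3.14159265358979f

/* Split the utterance into overlapping Hamming-windowed frames, the
   usual front end before computing cepstral coefficients per frame.
   Returns the number of frames written. */
size_t frame_signal(const int16_t *x, size_t n,
                    float frames[][FRAME_LEN], size_t max_frames)
{
    size_t f = 0;
    for (size_t s = 0; s + FRAME_LEN <= n && f < max_frames;
         s += FRAME_HOP) {
        for (size_t i = 0; i < FRAME_LEN; i++) {
            float w = 0.54f
                    - 0.46f * cosf(2.0f * PI * i / (FRAME_LEN - 1));
            frames[f][i] = w * (float)x[s + i];
        }
        f++;
    }
    return f;
}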

Feed your neural network and wait for the result.

Of course, first you will need to train the neural network.

If you have more doubts, do not hesitate to ask.

Regards

Juan Pablo
Reply by Le Chaud Lapin January 13, 2008
On Jan 12, 10:51 pm, Mandar Gokhale <stallo...@gmail.com> wrote:
>> "The trouble with using filters on their own will be that it response >> to bullshit commands with one on your list." > > Could you make that more clear?...I mean,I know the Fourier transform > of the correlation would give me the power spectral density...but > it'll be slightly different every time the word is spoken...right?...I > checked my voice saying the same words on a spectrum analysing > software called Audacity...and the frequency spectrum was slightly > different every time I said the same word........that's why I'm > looking for a method to recognize the commands properly.......
Disclaimer: What little I know about SR came from a girl I used to
date, so these are just suggestions. :D

He means that if you simply use a matched filter
(http://en.wikipedia.org/wiki/Matched_filter) or another technique
against the time-domain signals, you will have trouble because,
well, you're in the time domain. Two superposed utterances of the
input signal, x1[n] and x2[n], would look drastically different
depending on their relative phases, which is influenced by when you
start sampling. Even a small phase shift between x1[n] and x2[n]
will break your algorithm.

Yes, the spectrum will indeed be slightly different each time, never
exact, but that's OK, as you simply need to distinguish between the
utterances. There are many ways to do this. Perhaps the easiest is to
regard each |X[k]| of the DFT of the auto-correlation as a component
of a vector. There will be one vector associated with each utterance.
You would get the user to utter the same word several times to find,
more or less, the |X[k]| values for a single word. This would involve
normalizing each DFT based on energy content (yelling versus
whispering the same word), and finding X*[k], the signal that, when
regarded as a normalized vector among the other normalized vectors,
yields the minimum distance between itself and any of the other
vectors. Of course this is the distance formula in N-space among the
vectors.

After that, when a word is uttered, you run through your bank of X*[]
and yield the index of the one that provides the minimum distance.
That will be the index of the uttered word (hopefully). A minimal
sketch of that matching step appears below.

You can see that you will need to calculate the proper window for the
DFT correctly. If you simply tell the user, "OK, I'm ready, speak,"
and nothing is said until the user takes the gum out of his/her
mouth, you will start sampling prematurely and stop sampling
prematurely, so you will have to determine when significant energy
begins in the signal and when it ends.

If I were you, before using a PIC, I would write a few programs in
software to do your experiments. On Unix or Windows, there are plenty
of pre-installed tools to sample audio into a variety of formats, do
your processing, see what works, check error rates, etc. Once you
find something you are comfortable with, you can move to hardware
with an optimized algorithm.

-Le Chaud Lapin-
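
A minimal sketch in C of that minimum-distance matching, assuming the
|X[k]| magnitudes have already been computed (NBINS and NWORDS are
placeholder sizes):

#include <math.h>
#include <stddef.h>

#define NBINS  128  /* number of DFT magnitude bins kept (assumed) */
#define NWORDS 4    /* vocabulary size (assumed) */

/* Scale a magnitude vector to unit energy, so yelling and whispering
   the same word give comparable vectors. */
void normalize(double *mag, size_t n)
{
    double e = 0.0;
    for (size_t k = 0; k < n; k++)
        e += mag[k] * mag[k];
    e = sqrt(e);
    if (e > 0.0)
        for (size_t k = 0; k < n; k++)
            mag[k] /= e;
}

/* Return the index of the template whose normalized spectrum is
   closest, by Euclidean distance, to the (already normalized)
   input vector. */
int best_match(const double *mag, const double templates[NWORDS][NBINS])
{
    int best = 0;
    double best_d2 = 1e300;
    for (int w = 0; w < NWORDS; w++) {
        double d2 = 0.0;
        for (size_t k = 0; k < NBINS; k++) {
            double d = mag[k] - templates[w][k];
            d2 += d * d;
        }
        if (d2 < best_d2) { best_d2 = d2; best = w; }
    }
    return best;
}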