> If anyone else here in the group has additional suggestions or comments, I
> will be interested in hearing them as well!
>
> hth
>
> ikaro
Hi,
I had some additional thoughts regarding PCA today. The way I currently
classify patterns is with Pearson's product-moment coefficient, which
gives adequate results (it finds the best matching pattern out of
thousands). The product-moment coefficient measures correlation for
N-dimensional data sets, but it costs a lot of CPU and is the
bottleneck I am trying to resolve.
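To make the bottleneck concrete, the correlation search can be vectorized so that a single centered matrix product scores the sample against every pattern at once. A minimal numpy sketch (the shapes and sizes here are illustrative assumptions, not your actual parameters):

```python
import numpy as np

def pearson_scores(sample, patterns):
    """Pearson correlation of one flattened sample against many patterns.

    sample:   (T, N) matrix -- T frames of N MFCC coefficients.
    patterns: (P, T*N) matrix -- one flattened pattern per row.
    Returns a length-P vector of correlation coefficients."""
    x = sample.ravel()
    x = x - x.mean()
    p = patterns - patterns.mean(axis=1, keepdims=True)
    num = p @ x                               # all P dot products at once
    den = np.linalg.norm(p, axis=1) * np.linalg.norm(x)
    return num / den

# toy data: 1000 patterns, each 20 frames of 13 coefficients
rng = np.random.default_rng(0)
patterns = rng.normal(size=(1000, 20 * 13))
sample = patterns[42].reshape(20, 13)         # sample equals pattern 42
scores = pearson_scores(sample, patterns)
best = int(np.argmax(scores))                 # -> 42
```

Even vectorized, the cost is still proportional to P*T*N per iteration, which is why reducing the dimension matters.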
The idea I thought of is to use PCA to reduce the dimension of the
sampled data set: rather than using N-dimensional MFCC vectors, PCA
would extract the most significant dimensions, and only those
dimensions would then be used for correlation. In other words:
1. Patterns are represented by a data-set matrix built of a fixed
number of N-dimensional MFCC vectors
2. The sampled data has the same dimensions and length as the patterns
3. On each iteration the sampled data is correlated against the
thousands of patterns (the correlation is done on each of the N
dimensions)
To increase performance, the dimension of the data should be reduced.
PCA would be applied to each of the searched patterns (offline) and to
the sampled data (online), expressing the data with a fixed (and
smaller) number of the most significant dimensions. The transformed
(PCAed) runtime data would then be compared with the searched patterns
(which were transformed offline), dramatically reducing the number of
dimensions to be correlated.
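The offline/online split could look something like this minimal numpy sketch. Note one assumption I am making for illustration: a single PCA basis is fitted offline and reused for the online data, rather than fitting a fresh PCA at runtime (sizes are also illustrative):

```python
import numpy as np

def fit_pca(X, k):
    """Offline step: fit PCA on a (samples, N) training matrix.
    Returns (mean, components) where components is (N, k)."""
    mu = X.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k].T

def project(X, mu, components):
    """Online step: project new data onto the k principal directions."""
    return (X - mu) @ components

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 13))     # 13-dim MFCC-like training vectors
mu, W = fit_pca(X, k=4)            # keep the 4 most significant dims
Z = project(X, mu, W)              # reduced data: (500, 4)
```

Correlation is then done on k dimensions instead of N, at the cost of one (N, k) matrix multiply per frame online.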
Can you think of any problem with this algorithm?
One concern: for the sampled data the resulting PCA transformation may
differ, and even for the same data there may be two slightly different
transformations. Would this make the resulting (transformed) data a
bad candidate for correlation?
Any comment would be appreciated.
Nadav
Reply by Ikaro●March 16, 2007
Hi,
Here are some of my thoughts on this...
> One problem with this concept: I am searching for thousands of
> patterns in each 'iteration'.
I think what you are trying to do is match a test vector against
thousands of classes (i.e., patterns).
Thinking back on this, I did not realize you had that many classes.
Keep in mind that, in general, the more classes you have, the less
likely your feature space will be linearly separable (if your number
of classes is higher than your N features, then of course you can
forget about LDA).
> Representing each pattern as a data set will dramatically increase the
> dimensionality of the covariance matrix; this will reduce performance
> and makes using LDA in that manner impractical.
>
> Did I get LDA right? Are there any other ways to use LDA for pattern
> recognition with MFCC features?
Almost. You don't want an LDA for *every* class but for the whole
class space. In other words, each class is simply mapped to another
space through LDA (the LDA is the same and is computed with all
classes together). So in the end each class is not a separate dataset
but a vector of transformed features in the single LDA space (you use
the mean of each class to measure the distance from the test target).
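A minimal numpy sketch of this single shared LDA plus nearest-class-mean classification (the toy data, sizes, and the plain eigendecomposition are my own illustrative choices, not a production recipe):

```python
import numpy as np

def fit_lda(X, y, k):
    """Fit ONE LDA over the whole class space (not one per class).
    X: (samples, N) feature matrix; y: integer class labels.
    Returns the projection W (N, k) and the projected class means."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))                 # within-class scatter
    Sb = np.zeros((n, n))                 # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # generalized eigenproblem: Sb v = lambda Sw v
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]   # largest eigenvalues first
    W = vecs[:, order[:k]].real
    means = {int(c): X[y == c].mean(axis=0) @ W for c in classes}
    return W, means

def classify(x, W, means):
    """Nearest transformed class mean (Euclidean distance)."""
    z = x @ W
    return min(means, key=lambda c: np.linalg.norm(z - means[c]))

# toy data: three well-separated 5-D classes
rng = np.random.default_rng(3)
centers = np.array([[0, 0, 0, 0, 0], [8, 8, 0, 0, 0], [0, 8, 8, 0, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 5)) for c in centers])
y = np.repeat([0, 1, 2], 50)
W, means = fit_lda(X, y, k=2)
pred = classify(centers[1], W, means)     # a point at class 1's center
```

The key point is that W is computed once from all classes together; each class then costs only one k-dimensional mean vector at classification time.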
Because the number of classes you have seems so big (as mentioned
before), you may also consider vector quantization techniques (which
can be used in conjunction with LDA), where the mean (center) of each
class is stored as a vector of features in a codebook (by using
cluster analysis on the training dataset) and some sort of distance
measure is applied to determine which class in the codebook the test
pattern belongs to. The book "Fundamentals of Speech Recognition" by
Rabiner and Juang has some good background on vector quantization
techniques.
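A rough sketch of the codebook idea, as a bare-bones Lloyd/k-means pass (a real system would use a proper clustering library and far more data; this is only to show the codebook/quantize split):

```python
import numpy as np

def build_codebook(X, n_codes, iters=20, seed=0):
    """Cluster training vectors into a codebook of class centers
    (a minimal Lloyd / k-means sketch)."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), n_codes, replace=False)]
    for _ in range(iters):
        # assign each training vector to its nearest codeword
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_codes):
            if np.any(labels == j):   # skip update if a cluster empties
                codebook[j] = X[labels == j].mean(axis=0)
    return codebook

def quantize(x, codebook):
    """Index of the nearest codeword for a test vector."""
    return int(np.linalg.norm(codebook - x, axis=1).argmin())

# two obvious clusters; the codebook should settle on their centers
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
codebook = build_codebook(X, n_codes=2)
```

At runtime, matching a test vector then costs one distance per codeword instead of a full correlation per pattern.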
If anyone else here in the group has additional suggestions or
comments, I will be interested in hearing them as well!
hth
ikaro
Reply by Nadav●March 16, 2007
> The advantage of PCA or LDA over the original set of features (i.e.,
> mel features) is that you are now performing a search with orthogonal
> components, the idea being that all redundancies have been "squeezed
> out", allowing you to sort which components contribute the most, with
> the search in one dimension being independent of the search in
> another.
Hi Ikaro,
Thanks for your help. As you advised, I have been inspecting LDA. As
far as I understand, LDA derives the transformation that maximizes the
ratio of between-class separation to within-class spread. The result
of LDA is a collection of eigenvectors and eigenvalues; the vectors
whose corresponding eigenvalues are smallest are discarded (the lower
the eigenvalue, the lower the discriminative variance along its
dimension).
Transforming the sampled data set under inspection (the test vector)
with the transformation matrix resulting from LDA (the 'eigen matrix')
yields the most separable representation of the data; the class that
best describes the test vector is the one whose mean's Euclidean
distance from it is minimal.
With the above in mind: as stated before, the application I am working
on searches for patterns in sampled audio. The test vector and the
patterns are represented by N-dimensional MFCC feature vectors, and my
application has thousands of patterns being searched for.
Using LDA for classification would mean that:
1. Each searched audio pattern would be represented by a single
N-dimensional LDA data set (or a class)
2. The test vector would be the last T sampled N-dimensional MFCC
feature vectors (representing the last T*M seconds)
Using LDA I would be able to get the pattern most similar to the test
vector.
One problem with this concept: I am searching for thousands of
patterns in each 'iteration'. Representing each pattern as a data set
will dramatically increase the dimensionality of the covariance
matrix; this will reduce performance and makes using LDA in that
manner impractical.
Did I get LDA right? Are there any other ways to use LDA for pattern
recognition with MFCC features?
Any help would be appreciated.
Thanks,
Nadav Rubinstein
Reply by Ikaro●March 15, 2007
Hi Nadav,
> Assuming the data I use has 13 dimensions (13 mel-scale features),
> what benefit does PCA have over simply selecting the set of
> dimensions (out of the 13) whose variance is maximal?
> (Variance calculation is much cheaper than PCA)
The main idea here is that you want to represent your feature space
through an orthogonal set of components.
While PCA provides such an orthogonal set, there are other desirable
orthogonal transformations, such as Linear Discriminant Analysis (LDA,
or Fisher discriminant analysis). LDA is an orthogonal transformation
that attempts to maximize the variance across patterns while
minimizing the variance within a pattern (it can be close to optimal
as long as your patterns are linearly separable and you have a
representative set for performing the LDA). LDA might actually be a
better option than PCA for this task of yours.
The advantage of PCA or LDA over the original set of features (i.e.,
mel features) is that you are now performing a search with orthogonal
components, the idea being that all redundancies have been "squeezed
out", allowing you to sort which components contribute the most, with
the search in one dimension being independent of the search in
another.
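To make the "squeezed out" point concrete, here is a small numpy experiment on synthetic data of my own: two raw features can both have high variance yet carry nearly the same information, while PCA score components are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)   # nearly a copy of a: redundant
c = 0.5 * rng.normal(size=1000)       # independent but lower variance
X = np.column_stack([a, b, c])

# keeping the two highest-variance RAW features keeps a and b,
# which carry almost the same information
corr_raw = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# PCA score components, by contrast, are mutually uncorrelated
mu = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ vt.T
corr_pc = np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]
```

Here `corr_raw` comes out near 1 while `corr_pc` is essentially 0, which is why selecting raw max-variance dimensions is not a substitute for PCA.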
Reply by Nadav●March 15, 2007
Hi,
I am implementing an application for matching/searching audio patterns
in audio files (using MFCC), and I am looking for ways to increase the
efficiency of my detection algorithm.
Recently I have been investigating PCA (Principal Components
Analysis), as advised by Ikaro of this group. PCA expresses the data
by its most significant dimensions, discarding the less significant
ones (hence, compression). The idea is to use less data for pattern
matching (hence, increasing performance); this would be done by
matching the searched PCAed patterns against the PCA representation of
the data being sampled.
Assuming the data I use has 13 dimensions (13 mel-scale features),
what benefit does PCA have over simply selecting the set of dimensions
(out of the 13) whose variance is maximal?
(Variance calculation is much cheaper than PCA)
Any help would be appreciated
Thanks,
Nadav Rubinstein