
PCA (Principal Components Analysis) Is it really adequate

Started by Nadav March 15, 2007
Hi,

I am implementing an application for matching/searching audio patterns
in audio files (using MFCC), and I am looking for ways to increase the
efficiency of my detection algorithm.
Recently I have been investigating PCA (Principal Components Analysis),
as advised by Ikaro of this group. PCA is used to express the data by
its most significant dimensions, discarding the less significant ones
(hence, compression). The idea is to use less data for pattern matching
(hence, increasing performance); this would be done by matching the
PCA-transformed search patterns against the PCA representation of the
data being sampled.

Assuming the data I use has 13 dimensions (13 Mel-scale features), what
benefits does PCA have over simply selecting the subset of those 13
dimensions whose variance is maximal? (Variance calculation is much
cheaper than PCA.)
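
To make the comparison concrete, here is a rough, untested sketch of
the two options I have in mind (assuming NumPy and scikit-learn; the
data is only a placeholder):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for a matrix of 13-dimensional MFCC frames
X = np.random.randn(1000, 13)
k = 5  # number of dimensions to keep

# Option 1: keep the k raw mel features with the largest variance
top_k = np.argsort(X.var(axis=0))[-k:]
X_maxvar = X[:, top_k]

# Option 2: project onto the k principal components
pca = PCA(n_components=k).fit(X)
X_pca = pca.transform(X)

# The principal components are decorrelated and sorted by explained
# variance; the raw mel features generally remain correlated
print(pca.explained_variance_ratio_)
print(np.corrcoef(X_maxvar, rowvar=False).round(2))
print(np.corrcoef(X_pca, rowvar=False).round(2))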

Any help would be appreciated

Thanks,
   Nadav Rubinstein

Hi Nadav,

> Assuming the data I use has 13 dimensions (13 Mel-scale features),
> what benefits does PCA have over simply selecting the subset of those
> 13 dimensions whose variance is maximal? (Variance calculation is
> much cheaper than PCA.)
The main idea here is that you want to represent your feature space
through an orthogonal set of components. While PCA provides such an
orthogonal set, there are other desirable orthogonal transformations as
well, such as Linear Discriminant Analysis (LDA, also known as Fisher
discriminant analysis). LDA is an orthogonal transformation that
attempts to maximize the variance across patterns while minimizing the
variance within a pattern (it can be close to optimal as long as your
patterns are linearly separable and you have a representative set for
performing the LDA). LDA might actually be a better option than PCA for
this task of yours.

The advantage of PCA or LDA over the original set of features (i.e.,
the mel features) is that you are now performing a search with
orthogonal components. The idea is that all redundancies have been
"squeezed out", allowing you to sort which components contribute the
most, and making the search in one dimension independent of the search
in another dimension.
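
To make this concrete, here is a rough, untested sketch of what such a
transform looks like (assuming scikit-learn; the frames and labels are
only placeholders):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder training data: MFCC frames with a known pattern label each
X = np.random.randn(600, 13)             # (n_frames, 13 mel features)
y = np.random.randint(0, 4, size=600)    # pattern / class label per frame

# Fit the LDA on the labelled set; it yields at most
# (n_classes - 1) discriminant components
lda = LinearDiscriminantAnalysis(n_components=3).fit(X, y)
X_lda = lda.transform(X)                 # frames in the discriminant space

# Any new frame is mapped with the same transform before matching
new_frame = np.random.randn(1, 13)
projected = lda.transform(new_frame)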
> The advantage of PCA or LDA over the original set of features (i.e.,
> the mel features) is that you are now performing a search with
> orthogonal components. The idea is that all redundancies have been
> "squeezed out", allowing you to sort which components contribute the
> most, and making the search in one dimension independent of the
> search in another dimension.
Hi Ikaro,

Thanks for your help. As you advised, I have been looking into LDA. As
far as I understand it, LDA derives the transformation that maximizes
the inter-class separation relative to the within-class spread. The
result of LDA is a collection of eigenvectors and eigenvalues; the
vectors whose corresponding eigenvalues are smallest are discarded (the
smaller the eigenvalue, the less discriminative that dimension is).
Transforming the test vector with the transformation matrix resulting
from the LDA (the matrix of retained eigenvectors) gives the most
separable representation of the data, and the class that best describes
the test vector is the one whose Euclidean distance to it is minimal.

Taking the above in mind: as stated before, the application I am
working on searches for patterns in sampled audio. The test vector and
the patterns are represented by N-dimensional MFCC feature vectors, and
my application has thousands of patterns being searched for. Using LDA
for classification would mean that:

1. Each searched audio pattern is represented by a single N-dimensional
   LDA dataset (i.e., a class).
2. The test vector is the last T sampled N-dimensional MFCC feature
   vectors (representing the last T*M seconds).

Using LDA I would then be able to get the pattern most similar to the
test vector.

One problem with this concept: I am searching for thousands of patterns
in each 'iteration'. Representing each pattern as a separate dataset
will dramatically increase the dimensionality of the covariance matrix,
which will reduce performance and hence makes usage of LDA in that
manner impractical.

Did I get LDA right? Are there any other ways to use LDA for pattern
recognition with MFCC features?

Any help would be appreciated.

Thanks,
   Nadav Rubinstein
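
P.S. To check that I understand the classification step, here is a
rough, untested sketch of what I have in mind (assuming scikit-learn;
the data is only a placeholder):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder: MFCC frames labelled with the pattern they belong to
X_train = np.random.randn(900, 13)
y_train = np.random.randint(0, 3, size=900)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
Z_train = lda.transform(X_train)

# Mean of each class in the discriminant space
classes = np.unique(y_train)
class_means = np.array([Z_train[y_train == c].mean(axis=0)
                        for c in classes])

# Classify a test vector as the class whose mean is closest (Euclidean)
z_test = lda.transform(np.random.randn(1, 13))
distances = np.linalg.norm(class_means - z_test, axis=1)
best_class = classes[np.argmin(distances)]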
Hi,

Here are some of my thoughts on this...

> One problem with this concept: I am searching for thousands of
> patterns in each 'iteration'.
I think what you are trying to do is match a test vector against
thousands of classes (i.e., patterns). Thinking back on this, I did not
realize you had that many classes. Keep in mind that, in general, the
more classes you have, the less likely your feature space will be
linearly separable (and if the number of classes is higher than your N
features, then of course you can forget about LDA).
> Representing each pattern as a separate dataset will dramatically
> increase the dimensionality of the covariance matrix, which will
> reduce performance and hence makes usage of LDA in that manner
> impractical.
>
> Did I get LDA right? Are there any other ways to use LDA for pattern
> recognition with MFCC features?
Almost. You don't want a separate LDA for *every* class, but a single
LDA over the whole class space. In other words, each class is simply
mapped to another space through the LDA (the LDA is the same for all
classes and is computed from all of them together). So in the end each
class is not a separate dataset but a vector of transformed features in
the one LDA space (you use the mean of each class to measure the
distance from the test target).

Because the number of classes you have seems so big (as mentioned
before), you may also consider vector quantization techniques (which
can be used in conjunction with LDA). The mean (center) of each class
is stored as a vector of features in a codebook (by using cluster
analysis on the training dataset), and some sort of distance measure is
applied to determine which class in the codebook the test pattern
belongs to. The book "Fundamentals of Speech Recognition" by Rabiner
and Juang has some good background on vector quantization techniques.

If anyone else here in the group has additional suggestions and
comments, I would be interested in hearing them as well!

hth

ikaro
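
P.S. A rough, untested sketch of the codebook idea (assuming NumPy and
scikit-learn; all names here are placeholders):

import numpy as np
from sklearn.cluster import KMeans

# Placeholder training data: LDA- (or PCA-) transformed feature vectors
Z_train = np.random.randn(5000, 8)

# Offline: cluster the training vectors and keep only the cluster
# centres as codewords
codebook = KMeans(n_clusters=256, n_init=10).fit(Z_train).cluster_centers_

# Online: map an incoming vector to the index of its nearest codeword
def quantize(z, codebook):
    distances = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(distances))

z_new = np.random.randn(8)
codeword_index = quantize(z_new, codebook)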
> If anyone else here in the group has additional suggestions and
> comments, I would be interested in hearing them as well!
>
> hth
>
> ikaro
Hi,

I had some additional thoughts regarding PCA. Today, the way I classify
patterns is using Pearson's product-moment coefficient; this gives me
adequate results (it finds the best matching pattern out of thousands).
The product-moment coefficient is used to measure correlation between
the N-dimensional datasets; this costs a lot of CPU and is the
bottleneck I am trying to resolve.

The idea I thought of is to use PCA to reduce the dimension of the
sampled data: rather than using N-dimensional MFCC vectors, PCA will be
used to extract the most significant dimensions, and only these
dimensions will be used for correlation. In other words, today:

1. Patterns are represented by a dataset matrix built of a fixed number
   of N-dimensional MFCC vectors.
2. The sampled data is of the same dimension and length as the
   patterns.
3. On each iteration the sampled data is correlated against the
   thousands of patterns (the correlation is done on each of the N
   dimensions).

To increase performance, the dimension of the data should be reduced.
PCA would be applied to each of the searched patterns offline,
expressing each pattern with a fixed (and smaller) number of its most
significant dimensions; the same step would be applied online to the
data being sampled, and then the transformed (PCAed) data would be
compared with the (offline-transformed) patterns. This will
dramatically reduce the number of dimensions to be correlated.

Can you think of any problem with this algorithm? For the sampled data
the resulting PCA transformation may differ, and for the same data
there may be two transformations slightly different from each other;
would this make the resulting (transformed) data a bad candidate for
correlation?

Any comment would be appreciated.

Nadav
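
P.S. Here is a rough, untested sketch of what I mean, written with a
single projection matrix fitted offline on the pattern library and
reused online (assuming NumPy and scikit-learn; all names and sizes are
placeholders):

import numpy as np
from sklearn.decomposition import PCA

# Placeholders: a library of patterns and one sampled window, each a
# (n_frames, 13) matrix of MFCC vectors
rng = np.random.default_rng(0)
patterns = [rng.standard_normal((50, 13)) for _ in range(1000)]
sampled = rng.standard_normal((50, 13))

# Offline: fit one PCA on the frames pooled from all patterns, then
# store each pattern projected onto its k most significant components
k = 4
pca = PCA(n_components=k).fit(np.vstack(patterns))
patterns_reduced = [pca.transform(p).ravel() for p in patterns]

# Online: project the sampled window with the same transform and
# correlate it (Pearson's r) against every reduced pattern
sampled_reduced = pca.transform(sampled).ravel()
scores = [np.corrcoef(sampled_reduced, p)[0, 1] for p in patterns_reduced]
best_match = int(np.argmax(scores))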