I have an audio recording (wav file) that is 2 seconds long. That is my sample and it needs to be classified as [class_A] or [class_B].
To extract MFCC features, I divided the sample into frames (183 frames to be exact) and I've gotten MFCCs (13 coefficients) from each frame.
Now I have 183 feature vectors, each of length 13.
My question is: how exactly do you use those vectors with a classifier (k-NN or SVM, for example)? I have 183 vectors that represent 1 sample. I know how to work with 1 vector for 1 sample, but I don't know what to do if I have 183 of them.
Should I concatenate the MFCC vectors from each audio file, transforming a 13x182 matrix into a 1x2366 vector (so that I have 2366 MFCC features for every audio file instead of 13 for every frame)? Or should I use the mean?
Apologies in advance if I missed an obvious explanation.
I would try both approaches and see if there is a performance difference using the same type of classifier.
First, train the classifier using the 183 MFCC feature vectors by concatenating them into a single input vector as suggested by CFELTON. In this case you'd have one input vector and one output class for each of the sentences in your WAV-file dataset.
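To make the concatenation concrete, and to set it beside the per-coefficient mean raised in the question, here is a minimal sketch using NumPy. The MFCC matrix is random stand-in data, and the names `flat`, `mean_vec`, and `mean_std` are just illustrative; adding the standard deviation to the mean is my own assumption about a common summary, not something required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one clip's MFCCs: 183 frames x 13 coefficients (random, illustration only).
mfcc = rng.normal(size=(183, 13))

# Option 1: concatenate all frames into one long vector (frame order is preserved).
flat = mfcc.reshape(-1)                      # length 183 * 13

# Option 2: summarize each coefficient over time (frame order is discarded).
mean_vec = mfcc.mean(axis=0)                 # length 13
mean_std = np.concatenate([mfcc.mean(axis=0), mfcc.std(axis=0)])  # length 26

print(flat.shape, mean_vec.shape, mean_std.shape)
```

Either result is a single fixed-length vector per file, which is the form a k-NN or SVM implementation typically expects.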
Second, train the classifier using the 183 13-coefficient MFCC vectors as separate input frames of the same class. I am not sure whether this is a viable approach; I think it would depend in large part on the spectro-temporal differences between the speech samples in Class A and Class B. However, it is an interesting idea to pursue.
There are, of course, alternative approaches that combine multiple feature sets composed of different measures of the speech stimuli. Also, as I understand it, recurrent neural networks are used with stimuli such as speech because they exploit the information contained in the time dependency between frames. Here are two papers that may provide additional ideas: Chen_and_Wang_2016.pdf and Delfarah_and_Wang_Iinterspeech_2016.pdf.
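One way this frame-level idea could look in practice is sketched below: every frame inherits the label of its file, each frame of a new file is classified on its own, and the file's label is decided by majority vote. The voting step, the hand-rolled 1-NN, and the synthetic data from the hypothetical helper `fake_clip` are all my assumptions for illustration, not an established recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_COEFFS = 183, 13

def fake_clip(center):
    """Synthetic stand-in for one clip's MFCC matrix (frames x coefficients)."""
    return rng.normal(loc=center, scale=1.0, size=(N_FRAMES, N_COEFFS))

# Training set: 5 clips per class; every frame inherits its clip's label.
train_clips = [fake_clip(0.0) for _ in range(5)] + [fake_clip(3.0) for _ in range(5)]
clip_labels = [0] * 5 + [1] * 5              # 0 = class_A, 1 = class_B
X_train = np.vstack(train_clips)             # (10 * 183, 13)
y_train = np.repeat(clip_labels, N_FRAMES)   # one label per frame

def predict_clip(clip):
    """Label each frame by its nearest training frame (1-NN), then majority-vote."""
    # Squared Euclidean distance between every test frame and every training frame.
    d2 = ((clip[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    frame_preds = y_train[d2.argmin(axis=1)]
    return int(np.bincount(frame_preds, minlength=2).argmax())

test_clip = fake_clip(3.0)                   # drawn from the class_B distribution
print(predict_clip(test_clip))
```

The synthetic classes here are deliberately well separated; with real speech the per-frame predictions would be much noisier, which is exactly why the vote (or some other aggregation) is needed.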
Yes, simply concatenate them into one flat input. The "machine" doesn't know what the data is, other than that it is a collection of features of something. You just need to make sure it is the same feature set each time. You can even mix time and frequency features in one input feature set.
Whether you should use frame-based or full-length features depends on what you are trying to achieve. Since you want to classify the full audio segment, I would start with the one large vector as input. You might be able to get by with fewer "features".
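Keeping "the same feature set each time" also means every file must yield a vector of the same length, which concatenation alone does not guarantee when recordings differ in duration. A minimal sketch of one way to handle that is shown here; the helper `to_fixed_length`, the zero-padding choice, and the target of 100 frames are all hypothetical.

```python
import numpy as np

def to_fixed_length(mfcc, n_frames):
    """Truncate or zero-pad a (frames x coeffs) matrix to n_frames, then flatten."""
    mfcc = mfcc[:n_frames]                   # drop extra frames, if any
    missing = n_frames - mfcc.shape[0]
    if missing > 0:
        # Append rows of zeros at the end; leave the coefficient axis untouched.
        mfcc = np.pad(mfcc, ((0, missing), (0, 0)), mode="constant")
    return mfcc.reshape(-1)

short_clip = np.ones((88, 13))               # stand-in for a shorter recording
long_clip = np.ones((183, 13))               # stand-in for a longer recording
a = to_fixed_length(short_clip, 100)
b = to_fixed_length(long_clip, 100)
print(a.shape, b.shape)
```

Summary statistics such as the per-coefficient mean sidestep the problem entirely, since their length does not depend on the number of frames.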
You will need to look at the software/library you are using to determine if you can input a multi-dimensional feature array.
Disclaimer: it has been many years since I did this, but I completed a similar project successfully with an SVM.
Thank you very much for your insights. I will try both approaches (though I need to dig more into the second because I’ve never done something like that before).
For the first approach, for an audio file of 44100 samples (and 88 frames of 13 MFCC coefficients), going from a 13x88 matrix to a 1x1144 vector: do I necessarily need to do feature selection/dimensionality reduction?
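For reference, if reduction does turn out to be worth trying, one common option is PCA. Below is a minimal sketch using a plain SVD in NumPy; the dataset size of 40 files, the random data, and the choice of `k = 10` components are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 40 files, each flattened from 88 frames x 13 coefficients.
X = rng.normal(size=(40, 88 * 13))           # (40, 1144)

# PCA via SVD of the centered data matrix.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 10                                       # components to keep (a tunable assumption)
X_reduced = (X - mean) @ Vt[:k].T            # (40, 10)

# Fraction of the total variance retained by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```

The projection should be fitted on training files only and then applied unchanged to test files, so that both are compared in the same reduced space.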