Hi everybody,

I am new to this group and can see that many of the members know a lot about speaker identification. I am a graduate student doing my final-year project on speaker recognition. The problem is to perform speaker recognition on movie clips. I have no previous experience with speech processing.

I have been able to remove silence, environment sounds, etc. from the audio signal to a satisfactory extent. The next step is MFCC calculation and classification. I have tried hard to get this working, but the MFCC vectors I get do not seem to be correct (I am not actually sure). The vectors for different speakers do not seem to be distinguishably different, and those belonging to the same speaker don't even seem to be sufficiently similar. Has anybody run into this problem? Any suggestions are very welcome.

I read somewhere that we should perform some cepstral normalization on the MFCC vectors, but I don't know how. Can anybody help, please? Or is it the case that we cannot judge the similarity or dissimilarity of MFCC vectors just by looking at them? Anybody with MFCC experience, please help.

Thanks in advance,
jasine
mfcc calculation - help please
Started by ●November 8, 2005
Reply by ●November 8, 2005
Mel-Frequency Cepstral Coefficients are not magic -- they are one relatively compact way of describing a power spectrum. The short-time spectrum of speech is primarily determined by the phonetics, the speech sound over that duration -- whether an "ah", an "s", etc. Speaker recognition depends on detecting the differences in how different speakers make all the various speech sounds, which is a second-order effect. If you need to recognize speakers text-independently, as your problem statement suggests, then you need a relatively large amount of speech, enough to average out the phonetics. That's not so easy.

In addition to the speech sound and the speaker, the speech spectrum is also affected by several other factors, an important one of which is the transmission channel characteristics. The practice of cepstral normalization is used in speech and speaker recognition to minimize this channel effect. Estimate a long-term average (MFCC) cepstrum from each speech sample and subtract it from each short-term cepstral vector. If the signal has been filtered by a constant channel, its spectrum is multiplied by the response of that filter. Taking the log turns that multiplication into an addition, so the channel shows up as a constant additive offset in the cepstral domain, and subtracting the estimated mean discards such a constant effect -- the channel (and also some speaker information, so it's a tradeoff).

cheers,
jerry wolf
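For anyone who wants to see the mean subtraction described above in code, here is a minimal sketch in plain Python (standard library only; NumPy would be the usual choice for real data). It is an illustration under stated assumptions, not the method used by any particular toolkit. The helper names `dct_ii`, `cepstral_mean_normalize` and `max_abs_diff` are made up for this example, and the filterbank energies and channel gains are random numbers standing in for real speech and a real channel. The cepstra are computed as an unnormalized DCT-II of the log filterbank energies, which is one common convention.

```python
import math
import random

def dct_ii(log_energies, n_coeffs):
    """Unnormalized DCT-II: log filterbank energies -> cepstral coefficients."""
    n_bands = len(log_energies)
    return [
        sum(x * math.cos(math.pi * k * (n + 0.5) / n_bands)
            for n, x in enumerate(log_energies))
        for k in range(n_coeffs)
    ]

def cepstral_mean_normalize(frames):
    """Subtract the long-term mean cepstrum from every short-term vector."""
    n_frames = len(frames)
    mean = [sum(column) / n_frames for column in zip(*frames)]
    return [[c - m for c, m in zip(frame, mean)] for frame in frames]

def max_abs_diff(a, b):
    """Largest absolute coefficient difference between two frame sequences."""
    return max(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))

random.seed(0)
N_FRAMES, N_BANDS, N_COEFFS = 50, 20, 12  # illustrative sizes only

# Made-up positive filterbank energies, one row per frame.
energies = [[random.uniform(0.1, 10.0) for _ in range(N_BANDS)]
            for _ in range(N_FRAMES)]
# Made-up constant channel gain per band, identical for every frame.
channel_gain = [random.uniform(0.2, 5.0) for _ in range(N_BANDS)]

# Cepstra without the channel, and with the channel multiplying the spectrum.
clean = [dct_ii([math.log(e) for e in frame], N_COEFFS) for frame in energies]
filtered = [
    dct_ii([math.log(e * g) for e, g in zip(frame, channel_gain)], N_COEFFS)
    for frame in energies
]

diff_before = max_abs_diff(clean, filtered)
diff_after = max_abs_diff(cepstral_mean_normalize(clean),
                          cepstral_mean_normalize(filtered))
print("max difference before normalization:", diff_before)
print("max difference after normalization: ", diff_after)
```

Before normalization the two sets of cepstra differ by the same offset vector in every frame; after normalization they agree to within floating-point rounding. With real recordings the mean is estimated per speech sample, so it also removes whatever is constant about the speaker over that sample, which is the tradeoff mentioned above.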