Hello! I have a project in mind, part of which relies on comparing how similar some utterances are.
Basically, I have a reference audio file of how, say, a word should be pronounced, and I want to compare a user's voice input against it.
For example, if the reference is "helooOOOOo" and the user input is "heloOOo", I want to be able to say that the user more or less pronounced the word correctly.
Off the top of my head, what it seems that I need to do is:
1. Get rid of the silent parts from the user input
2. Downsample/upsample the user input to match the reference audio
3. Transform both the user input and the reference audio to some representation that does away with voice deepness (deep, sharp, ...), so that I am left with the "essence" of the utterance
4. Use some means to compare the resulting signals
I honestly have no background in signal processing, so it would be cool if you could tell me whether you think I have left out any steps, and what tools we have, or algorithms we know of, that can perform these operations.
Hi, one approach would be to convert the audio clip (after removing silence, resampling to the reference sampling rate fs, etc.) to a mel spectrogram. For example, suppose the reference clip is 3 seconds long at fs = 16 kHz. With a frame length of 256 samples, choose the overlap between frames so that the spectrogram comes out with the aspect ratio you need.
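To make the numbers above concrete, here is a rough NumPy-only sketch of a log-mel spectrogram. The frame length of 256 and fs = 16 kHz come from the post; the hop of 128 samples (50% overlap), the 40 mel bands, the Hann window, and all function names are my own assumptions for illustration, not a recommended configuration.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula (one of several conventions).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters evenly spaced on the mel scale.
    Returns an array of shape (n_mels, n_fft // 2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, fs=16000, frame_len=256, hop=128, n_mels=40):
    """Returns an array of shape (n_frames, n_mels).
    hop and n_mels are assumed values; tune them for your sounds."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row k of idx holds the sample indices of frame k.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, frame_len, fs).T
    return np.log(mel + 1e-10)  # small offset avoids log(0)

# A 3 second clip at 16 kHz has 48000 samples, so with frame_len = 256 and
# hop = 128 there are 1 + (48000 - 256) // 128 = 374 frames.
t = np.arange(3 * 16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

A smaller hop gives more frames (a wider spectrogram) for the same clip, which is the knob for getting the aspect you want.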
Once you have a spectrogram, you can decide the similarity in several ways. A straightforward approach is to train a binary classifier that labels a pair of clips as "same" or "different". For this you need labeled sound samples.
You can also use a pre-trained model for similarity. You may want to tune the spectrogram parameters for your (expected) sound characteristics.
Good luck!
If you have no signal processing background, these tasks will be very hard for you.
Alternatively, if you have deep learning knowledge, you could use a neural network approach for this type of voice recognition application.
Is there any implicit assumption that the utterances came from the same person?
If so, then some of what you have suggested might work. You will probably need some dynamic time warping algorithm to get the utterances lined up.
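For reference, a minimal sketch of textbook dynamic time warping in NumPy. It takes two sequences of feature frames (for example spectrogram rows), allows each frame to be stretched or compressed in time, and returns the accumulated frame distance along the best alignment. The function names, the Euclidean frame distance, and the normalization by total length are assumptions I picked for illustration.

```python
import numpy as np

def _as_frames(x):
    # Treat a 1-D sequence as frames with a single feature each.
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, 1) if x.ndim == 1 else x

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping.
    a: (n, d) and b: (m, d) feature sequences (or 1-D sequences).
    Returns the accumulated cost of the best alignment divided by n + m."""
    a, b = _as_frames(a), _as_frames(b)
    n, m = len(a), len(b)
    # cost[i, j] = Euclidean distance between frame i of a and frame j of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # a advances, b frame is repeated
                acc[i, j - 1],      # b advances, a frame is repeated
                acc[i - 1, j - 1],  # both advance
            )
    return acc[n, m] / (n + m)

# Toy stand-in for "helooOOOOo" vs "heloOOo": same shape, different durations.
reference = [1, 2, 3, 3, 3, 3, 4]
user = [1, 2, 3, 3, 4]
other = [4, 3, 2, 1]
print(dtw_distance(reference, user))   # stretched copy of the same pattern
print(dtw_distance(reference, other))  # a different pattern
```

You would still need to pick a threshold on the distance by experiment, and the double loop is slow for long clips, so treat this as a starting point only.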
If they are from different people then:
I think that you need to get into some fairly deep stuff for it to work right.
You will need to make it independent of the pitch, and probably should look at speech prediction techniques used for speech coding and voice recognition.
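One way to read "speech prediction techniques" is linear prediction (LPC), which is an assumption on my part about what is meant here. LPC fits a small all-pole filter to each frame; the filter roughly captures the spectral envelope (the vocal tract shape), while the pitch stays mostly in the prediction residual. Below is a hedged sketch of the autocorrelation method with the Levinson-Durbin recursion; the function name, the Hamming window, and the AR(2) test signal are illustrative choices.

```python
import numpy as np

def lpc(frame, order):
    """Linear prediction coefficients by the autocorrelation method.
    Returns (a, err) with a = [1, a1, ..., ap] such that
    x[n] + a1*x[n-1] + ... + ap*x[n-p] is the prediction residual,
    and err is the residual energy of the windowed frame."""
    x = np.asarray(frame, dtype=float)
    x = x * np.hamming(len(x))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    if r[0] == 0.0:  # silent frame: nothing to predict
        return a, 0.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for step i of Levinson-Durbin.
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Sanity check on a synthetic signal with a known all-pole model:
# x[n] = 1.3*x[n-1] - 0.6*x[n-2] + noise, so a should be near [1, -1.3, 0.6].
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
signal = np.zeros(4000)
for n in range(2, 4000):
    signal[n] = 1.3 * signal[n - 1] - 0.6 * signal[n - 2] + noise[n]
coeffs, residual_energy = lpc(signal, 2)
print(coeffs)
```

For real speech you would run this on short frames (a few tens of milliseconds) with a higher order, then compare the envelopes or features derived from them rather than the raw waveforms.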