I'd like to try various non-intrusive features (parameters) that could be calculated from speech recordings (no clean reference) to eliminate bad recordings before doing speech recognition.

What set of parameters (features) could help classify between good speech recordings that will probably be recognized well and degraded recordings that will not be recognized properly because of various distortions (reverberation, various background noises, clicks, pops, too quiet, too loud, a bad microphone, wrong endpoints, distant talk, nonlinear distortions, ...)?

I guess in the end a combination of several features will be needed to get the answer...

Thanks for any advice and hints (pointers to implementations in Matlab and/or Python would also be very welcome). I plan to start with various SNR-based measures, then a clean-to-reverberant energy ratio, ...
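As a starting point, a few of the simpler non-intrusive features mentioned above (a percentile-based SNR proxy, a clipping indicator, and a speech-activity fraction for catching too-quiet recordings or wrong endpoints) can be computed with plain NumPy. This is only a rough sketch under assumed conventions (float audio normalized to [-1, 1], 16 kHz sampling, 25 ms frames with 10 ms hop); the function names and thresholds are my own, not from any standard toolkit:

```python
import numpy as np

def frame_energies(x, frame_len=400, hop=160):
    """Short-time log energies (dB): 25 ms frames, 10 ms hop at 16 kHz."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)

def percentile_snr_db(x):
    """Crude non-intrusive SNR estimate: energy of loud frames (speech
    proxy) minus energy of quiet frames (noise-floor proxy)."""
    e = frame_energies(x)
    return float(np.percentile(e, 95) - np.percentile(e, 15))

def clipping_fraction(x, thresh=0.999):
    """Fraction of samples at or near digital full scale, assuming the
    signal is normalized float in [-1, 1] (hard-clipping indicator)."""
    return float(np.mean(np.abs(x) >= thresh))

def activity_fraction(x, margin_db=15.0):
    """Fraction of frames within margin_db of the loudest frames; very low
    values suggest a mostly silent recording or wrong endpoints."""
    e = frame_energies(x)
    return float(np.mean(e > np.percentile(e, 95) - margin_db))
```

The idea would be to stack these (plus spectral and reverberation-related measures) into a feature vector per recording and train a simple good/bad classifier on recordings labeled by their downstream recognition accuracy.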

Thanks in advance, regards, Bulek...

Reply by jbrower, March 17, 2021


We use the EVS encoder to classify incoming audio as voice, background noise, or sound of some type, plus obtain additional confidence metrics for voice.

For speech recognition we use Kaldi, but not Kaldi's "online decoding", which can't handle real-world packet audio (