Hi guys! I'm a student and right now I'm following the Digital Signal Processing course at the university. I have to do this exercise using Python:
Analysis of vocal traces with the STFT, trying to identify the parts of the signal related to harmonic (vocalized) and non-harmonic (consonant) sounds. In the case of non-harmonic sounds, you can try to estimate the AR model and then filter some random noise through the same model. On listening, you should be able to recognize the original sound.
I'm having a hard time figuring out what to do! I don't really know where to start! What is the difference between harmonic (vocalized) and non-harmonic (consonant) sounds? What is an example of each type? What topics do I need to know to do this exercise? Are there any recommended books or sites for studying? Can anyone explain this exercise to me, please?
Is that talking about voiced vs voiceless consonants? https://campusweb.howardcc.edu/ehicks/YE618/Master...
Or harmonic vs non-harmonic content? The question is confusing to me because I don't know anything about speech recognition.
If it's harmonic vs non-harmonic, you should be able to find the harmonics easily, and if not, then it must be non-harmonic. If it's voiced vs voiceless, then use those terms for searching - there must be a lot of info on speech recognition now. The hard part is figuring out what people call the terms. Once you find the key words, you should be able to figure out the way to answer the question.
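One rough way to "find the harmonics" frame by frame is sketched below. This is only an illustration under my own assumptions, not the method the exercise prescribes: it takes the STFT with scipy.signal.stft and computes the spectral flatness of each frame (geometric mean over arithmetic mean of the power spectrum). A harmonic frame has a few strong lines and so a flatness near 0, while a noise-like frame is much flatter. The test signal, the names frame_flatness and THRESHOLD, and the threshold value 0.1 are all made up for the example.

```python
import numpy as np
from scipy.signal import stft

def frame_flatness(x, fs, nperseg=512):
    """Spectral flatness (geometric mean / arithmetic mean of power) per STFT frame."""
    _, t, Z = stft(x, fs=fs, nperseg=nperseg)   # Hann window, 50% overlap by default
    power = np.abs(Z) ** 2 + 1e-12              # small floor avoids log(0)
    geometric = np.exp(np.mean(np.log(power), axis=0))
    arithmetic = np.mean(power, axis=0)
    return t, geometric / arithmetic

# Synthetic stand-in for a recording: 1 s of a vowel-like tone, then 1 s of noise.
fs = 8000
n = np.arange(fs)
rng = np.random.default_rng(0)
tone = sum(np.sin(2 * np.pi * 150 * k * n / fs) / k for k in range(1, 6))
hiss = rng.standard_normal(fs)
x = np.concatenate([tone, hiss])

t, flatness = frame_flatness(x, fs)
THRESHOLD = 0.1                     # assumed value, would need tuning on real speech
looks_harmonic = flatness < THRESHOLD
```

On a real recording the threshold and the window length would have to be tuned by looking at the spectrogram, and frames at the boundary between two sounds will be ambiguous.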
In speech (say, a single spoken word), put simply, the vowels are the vocalized (harmonic) parts, while the consonants are plosives, fricatives, nasals, liquids, and semivowels. You may want to look these up.
It is easiest to think of this in terms of the basic LPC model for speech.
Consonants come in pairs which share the same AR synthesis filter (the poles of which correspond to the "formants") but have different excitations.
For voiced consonants the excitation is a pulse train (the pulse rate being called the "pitch") and thus the signal has a line spectrum. For unvoiced consonants the excitation is a noise signal and thus there is a continuous spectrum.
That's the basic theory, but of course in practice it isn't so simple. There are sounds that are partially voiced (e.g., zh), people can whisper, in which case there is no voicing, and it can be really hard to find the pitch and formants for rapid speech.
It's all explained in chapter 19 of DSPCSP.
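For the second half of the exercise (estimate the AR model of an unvoiced segment, then drive it with noise), here is a minimal sketch. It assumes the autocorrelation (Yule-Walker) method, which is one common choice and not necessarily the one the course expects. The function names estimate_ar and resynthesize are made up for the example, and the "segment" is synthetic noise shaped by a known filter, standing in for a real consonant cut out of a recording.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def estimate_ar(segment, order):
    """Yule-Walker fit of A(z) = 1 + a1 z^-1 + ... + ap z^-p, plus the noise gain."""
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    # Biased autocorrelation estimates r[0], ..., r[order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)]) / len(x)
    a_tail = solve_toeplitz(r[:order], -r[1:])     # solves the Toeplitz system R a = -r
    a = np.concatenate(([1.0], a_tail))
    gain = np.sqrt(r[0] + np.dot(a_tail, r[1:]))   # standard deviation of the excitation
    return a, gain

def resynthesize(a, gain, n_samples, rng):
    """Drive the all-pole filter gain / A(z) with fresh white noise."""
    return lfilter([gain], a, rng.standard_normal(n_samples))

# Stand-in for an unvoiced segment: white noise through a known AR(2) filter.
rng = np.random.default_rng(1)
a_true = np.array([1.0, -1.5, 0.9])
segment = lfilter([1.0], a_true, rng.standard_normal(20000))

a_est, gain = estimate_ar(segment, order=2)
synthetic = resynthesize(a_est, gain, len(segment), rng)
```

For real speech a higher order is normally used (something around 10 to 12 at 8 kHz sampling is a common rule of thumb), and the model should be fitted on short frames rather than on a whole consonant at once. To listen to the result, the synthetic samples can be normalized and written to a WAV file, for example with scipy.io.wavfile.write.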
Hi, the best way to analyze human speech, other than the traditional FFT-based spectrum and spectrogram (spectrum vs. time), is the cepstrum, which makes the difference between voiced sounds and consonants clearly evident. In simple words, the cepstrum is the spectrum of the log-magnitude spectrum of a signal (in practice, the inverse FFT of the log of the FFT magnitude). On voiced sounds the cepstrum shows a peak corresponding to the pitch of the voice, i.e. it shows the periodicity of the vowel spectrum. On the contrary, the cepstrum of a consonant shows no such peak, because the consonant spectrum has no periodicity.
The cepstrum domain is a strange domain where the traditional terms are replaced by anagrams of the spectrum terms: cepstrum (spectrum), quefrency (frequency), rahmonic (harmonic), lifter (filter), etc.
The cepstrum is well worth exploring for new analysis functions.
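To make the cepstrum idea concrete, here is a small sketch using only NumPy. It computes the real cepstrum as the inverse FFT of the log-magnitude spectrum and looks for the largest peak in the quefrency range of plausible pitch periods. The names real_cepstrum and cepstral_pitch, the 60 to 400 Hz search range, and the two synthetic frames (a 100 Hz pulse train as the "voiced" frame, white noise as the "unvoiced" one) are assumptions made for the illustration.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log-magnitude spectrum of a Hann-windowed frame."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return (pitch estimate in Hz, height of the cepstral peak)."""
    c = real_cepstrum(frame)
    qmin = int(fs / fmax)            # shortest pitch period searched, in samples
    qmax = int(fs / fmin)            # longest pitch period searched, in samples
    peak = qmin + np.argmax(c[qmin:qmax])
    return fs / peak, c[peak]

fs = 8000
frame_len = 1024
voiced = np.zeros(frame_len)
voiced[::80] = 1.0                   # pulse every 80 samples, i.e. a 100 Hz pitch
rng = np.random.default_rng(2)
unvoiced = rng.standard_normal(frame_len)

pitch_v, peak_v = cepstral_pitch(voiced, fs)
pitch_u, peak_u = cepstral_pitch(unvoiced, fs)
```

The pitch value returned for the noise frame is meaningless; what matters there is that its cepstral peak is much lower than the voiced one. Comparing the peak height against a threshold chosen on real data gives a simple voiced/unvoiced decision per frame.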