DSPRelated.com
Forums

Better algorithms for speech spectrograms?

Started by waywardgeek December 29, 2010
Dale has posted several good links in the few days I've been reading this
forum.  The constant Q DFT paper looks like it might have some value in
generating speech spectrograms.  Does anyone have experience with this? 
Are there other DFT algorithms I should be looking at?

The common approach used in the open source packages I've read (Praat,
libsnack) is to use a short-time FFT with a window function like Hamming.
 I've found significant benefit in time-aliasing two adjacent pitch periods
of vowels as a pre-process to an FFT, which seems to give better results. 
Are there additional steps that could be taken to improve a speech
spectrogram?
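For reference, the short-time FFT approach can be sketched roughly like this (the window and hop sizes are illustrative only, not Praat's or libsnack's actual defaults):

```python
import numpy as np

def stft_spectrogram(x, n_fft=512, hop=128):
    """Magnitude spectrogram from overlapping Hamming-windowed FFTs."""
    win = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
t = np.arange(fs) / fs                   # 1 second of samples
x = np.sin(2 * np.pi * 440 * t)          # 440 Hz test tone
S = stft_spectrogram(x)
print(S.shape)                           # (122, 257): frames x rfft bins
```

With fs = 16000 and n_fft = 512 the bin spacing is 31.25 Hz, so the 440 Hz tone peaks near bin 14.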

Bill
On Dec 29, 9:26 am, "waywardgeek" <waywardgeek@n_o_s_p_a_m.gmail.com>
wrote:
> [snip]
Google pitch-synchronous speech processing.

A good pitch + voiced/unvoiced speech detector is a prerequisite.

Better yet, forget about the Fourier transform altogether. Why? The time-frequency resolution tradeoff, the linear frequency scale leading to poor spectral resolution at low frequencies, etc. etc. etc.

Google "Fast Cochlea Transform":
http://www.audience.com/download_files/Instantaneous%20Noise%20Suppression.pdf
fatalist wrote:
> [snip]
Neither that article, nor any of the Google links I checked, defines an FCT. Saying that it "is not an FFT" is exactly as true and as useful as saying "a mackintosh is not an apple" or "a cobbler has little to do with boots".
On Dec 29, 10:23 pm, Richard Owlett <rowl...@pcnetinc.com> wrote:
> [snip]
Typo. It is Fast Cochlear Transform (Cochlear with an r), or otherwise a specific filter bank.

Chris
========================
Chris Bore
BORES Signal Processing
www.bores.com
On Dec 30, 2:10 am, Chris Bore <chris.b...@gmail.com> wrote:
> [snip]
>
> typo. It is Fast Cochlear Transform (Cochlear with an r) or otherwise
> a specific filter bank.
>
> Chris
Audience is consistent in their use of cochlea, not cochlear, in their references to their 'proprietary Fast Cochlea Transform', but it's still just advertising copy.

In fact, they claim a trademark on 'Fast Cochlea Transform'.

Dale B Dalrymple
On Dec 30, 5:10 am, Chris Bore <chris.b...@gmail.com> wrote:
> [snip]
>
> typo. It is Fast Cochlear Transform (Cochlear with an r) or otherwise
> a specific filter bank.
>
> Chris
More like a filter cascade with several hundred logarithmically spaced filters performing progressive low-pass filtering and downsampling.
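A rough, non-authoritative sketch of that cascade structure (Audience's actual Fast Cochlea Transform is proprietary; the stage count, the one-pole filters, and every parameter here are invented purely for illustration):

```python
import numpy as np

def one_pole_lowpass(x, alpha):
    """y[n] = alpha * x[n] + (1 - alpha) * y[n-1], a crude low-pass."""
    y = np.empty_like(x)
    acc = 0.0
    for n, v in enumerate(x):
        acc = alpha * v + (1.0 - alpha) * acc
        y[n] = acc
    return y

def cascade_analysis(x, n_stages=4, alpha=0.25):
    """At each stage: low-pass, emit the high-frequency residual as that
    stage's band output, then downsample by 2 before the next, lower stage."""
    bands = []
    sig = np.asarray(x, dtype=float)
    for _ in range(n_stages):
        lp = one_pole_lowpass(sig, alpha)
        bands.append(sig - lp)   # band output for this stage
        sig = lp[::2]            # progressive downsampling
    return bands

rng = np.random.default_rng(0)
bands = cascade_analysis(rng.standard_normal(1024))
print([len(b) for b in bands])   # [1024, 512, 256, 128]
```

Because each stage halves the sample rate, the lower-frequency bands come out at progressively lower rates, which is where the log-frequency character comes from.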
dbd wrote:
> [snip]
May I reinsert what I said? I will anyway ;/
>> Neither that article, nor any of the Google links I checked, defines an FCT.
>> Saying that it "is not an FFT" is exactly as true and as useful as saying
>> "a mackintosh is not an apple" or "a cobbler has little to do with boots".
BTW, "mackintosh" is *not* misspelled above.
>> typo. It is Fast Cochlear Transform (Cochlear with an r) or otherwise
>> a specific filter bank.
>>
>> Chris
>
> Audience is consistent in their use of cochlea, not cochlear, in their
> references to their 'proprietary Fast Cochlea Transform', but it's
> still just advertising copy.
>
> In fact, they claim a trademark on 'Fast Cochlea Transform'.
>
> Dale B Dalrymple
Snicker.
On Dec 30, 12:08 pm, dbd <d...@ieee.org> wrote:
> [snip]
>
> Audience is consistent in their use of cochlea, not cochlear, in their
> references to their 'proprietary Fast Cochlea Transform', but it's
> still just advertising copy.
>
> In fact, they claim a trademark on 'Fast Cochlea Transform'.
>
> Dale B Dalrymple
That's actually quite funny. Isn't cochlea (without an r) a noun, while cochlear (with an r) means 'having to do with the cochlea'? As in the cochlea (the organ in the ear) but the cochlear nerve, duct, etc. And does that mean that the patents on the Fast Cochlear Transform (with an r) do not violate those on the Fast Cochlea Transform (without an r)?
Originally, I had thought a logarithmic transform DFT, the Fast Cochlea
Transform, would be superior to a linear transform like the FFT.  I've
since become more convinced that traditional FFT based spectrograms are
closer to the right way to go.

The main reason is that vowels are highly harmonic, and any spectrogram
that samples in between harmonics of the fundamental pitch will show deep
horizontal valleys.  Tracking formant frequencies becomes harder, as they
travel down through these valleys and over the ridges of the harmonics,
independently of the fundamental pitch.

Another problem in tracking formants is the vertical ridges created by
short-time FFTs, which are in sync with the fundamental pitch.

What I'm doing instead is to first determine the pitch, and then do a
2-block time-aliased FFT of two pitch periods, currently with a Hann
window.  This does several things for me.  I need to track the fundamental
pitch anyway, as it contains word boundary information and other prosodic
information.  It eliminates the valleys created by sampling in between
harmonics.  It eliminates the vertical ridges created by the short-time
FFT.  It also eliminates the common repeating pin-hole patterns in
short-time spectrograms.  As much of the energy is harmonic, and harmonics
cause zero spectral leakage, spectral noise is reduced.  Finally,
computation is dramatically reduced, as the FFT window is much shorter than
the usual 25-ish ms short-time FFT window, and I take full pitch period
sized steps.
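Concretely, one frame looks roughly like this (here I'm reading "time-aliasing" as folding the Hann-windowed 2P-sample block onto itself and adding, with the pitch period P already estimated and rounded to an integer sample count; both of those are interpretive assumptions):

```python
import numpy as np

def pitch_synchronous_frame(x, start, period):
    """Fold-and-add two Hann-windowed pitch periods, then FFT one period.
    The P-point FFT of the folded block samples the 2P-point spectrum at
    every other bin, i.e. exactly at the harmonics of the pitch."""
    block = x[start:start + 2 * period] * np.hanning(2 * period)
    folded = block[:period] + block[period:]   # time-alias into one period
    return np.abs(np.fft.rfft(folded))

period = 100                          # integer pitch period, in samples
n = np.arange(4 * period)
x = np.sin(2 * np.pi * n / period)    # perfectly periodic "vowel"
mag = pitch_synchronous_frame(x, 0, period)
print(np.argmax(mag))                 # 1: all energy lands on harmonic 1
```

Time-domain aliasing is equivalent to decimating the spectrum, which is why a perfectly periodic input puts all of its energy exactly on the harmonic bins with no leakage in between.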

I think all of this will help me match spectrograms in speech recognition
and TTS systems.  However, I'm still new to this field, and would
appreciate any tips on cleaning up voice spectrograms.

Bill
On Jan 2, 4:23 am, "waywardgeek" <waywardgeek@n_o_s_p_a_m.gmail.com>
wrote:
> [snip]
How can you do an FFT of two full pitch periods (without signal resampling or zero-padding) when each pitch period contains a variable number of signal samples (it can be a fractional number)?

What is a "pitch period"? Where does it start and where does it end? These are not simple questions.

And BTW, the current "state-of-the-art" statistical speech recognition paradigm is crap anyway, so you don't have to worry too much about it...