Dale has posted several good links in the few days I've been reading this forum. The constant-Q DFT paper looks like it might have some value in generating speech spectrograms. Does anyone have experience with this? Are there other DFT algorithms I should be looking at?

The common approach in the open-source packages I've read (Praat, libsnack) is to use a short-time FFT with a window function like Hamming. I've found significant benefit in time-aliasing two adjacent pitch periods of vowels as a pre-process to an FFT, which seems to give better results. Are there additional steps that could be taken to improve a speech spectrogram?

Bill
Better algorithms for speech spectrograms?
Started by ●December 29, 2010
Reply by ●December 29, 2010
On Dec 29, 9:26 am, "waywardgeek" <waywardgeek@n_o_s_p_a_m.gmail.com> wrote:
> Are there additional steps that could be taken to improve a speech
> spectrogram?

Google "pitch-synchronous speech processing". A good pitch detector plus a voiced/unvoiced speech detector is a prerequisite.

Better yet, forget about the Fourier Transform altogether. Why? The time-frequency resolution tradeoff, the linear frequency scale leading to poor spectral resolution at low frequencies, etc., etc., etc.

Google "Fast Cochlea Transform":
http://www.audience.com/download_files/Instantaneous%20Noise%20Suppression.pdf
Reply by ●December 29, 2010
fatalist wrote:
> Google "Fast Cochlea Transform"
> http://www.audience.com/download_files/Instantaneous%20Noise%20Suppression.pdf

Neither that article, nor any of the Google links I checked, defines an FCT. Saying that it "is not an FFT" is exactly as true and as useful as saying "a mackintosh is not an apple" or "a cobbler has little to do with boots".
Reply by ●December 30, 2010
Richard Owlett wrote:
> Neither that article, nor any of the Google links I checked
> define an FCT.

Typo: it is Fast Cochlear Transform (Cochlear, with an r), or otherwise a specific filter bank.

Chris
========================
Chris Bore
BORES Signal Processing
www.bores.com
Reply by ●December 30, 2010
Chris Bore wrote:
> typo. It is Fast Cochlear Transform (Cochlear with an r) or otherwise
> a specific filter bank.

Audience is consistent in their use of "cochlea", not "cochlear", in their references to their 'proprietary Fast Cochlea Transform', but it's still just advertising copy. In fact, they claim a trademark on 'Fast Cochlea Transform'.

Dale B Dalrymple
Reply by ●December 30, 2010
On Dec 30, 5:10 am, Chris Bore <chris.b...@gmail.com> wrote:
> typo. It is Fast Cochlear Transform (Cochlear with an r) or otherwise
> a specific filter bank.

More like a filter cascade with several hundred logarithmically spaced filters performing progressive low-pass filtering and downsampling.
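The cascade structure described above (progressive low-pass filtering with downsampling at each stage) can be sketched as follows. This is only illustrative: Audience's proprietary FCT reportedly uses several hundred finely spaced filters, whereas this crude octave-spaced version exists just to show the shape of the computation; the function name and the 3-tap smoother are my own choices.

```python
import numpy as np

def cochlea_like_cascade(x, stages=6):
    """Octave-spaced analysis by repeated smoothing and decimation.

    At each stage: low-pass the signal (a simple 3-tap moving average
    here; a real design would use a proper filter), keep the difference
    as a crude band-pass output for that octave, then downsample the
    low-pass result by 2 and repeat.
    """
    bands = []
    cur = np.asarray(x, dtype=float)
    for _ in range(stages):
        low = np.convolve(cur, [0.25, 0.5, 0.25], mode="same")
        bands.append(cur - low)   # band-pass residue for this octave
        cur = low[::2]            # progressive downsampling
    bands.append(cur)             # final low-pass residue
    return bands
```

Each successive band covers a lower octave at half the sample rate, which is where the logarithmic frequency spacing comes from.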
Reply by ●December 30, 2010
May I reinsert what I said? I will anyway ;/

> Neither that article, nor any of the Google links I checked define an FCT.
> Saying that it "is not an FFT" is exactly as true and as useful as saying
> "a mackintosh is not an apple" or "a cobbler has little to do with boots".

BTW, "mackintosh" is *not* misspelled above.

dbd wrote:
> Audience is consistent in their use of cochlea not cochlear in their
> references to their 'proprietary Fast Cochlea Transform', but it's
> still just advertising copy.
>
> In fact, they claim a trade mark on 'Fast Cochlea Transform'.

Snicker.
Reply by ●December 31, 2010
dbd wrote:
> Audience is consistent in their use of cochlea not cochlear in their
> references to their 'proprietary Fast Cochlea Transform', but it's
> still just advertising copy.
>
> In fact, they claim a trade mark on 'Fast Cochlea Transform'.

That's actually quite funny.

Isn't cochlea (without an r) a noun, while cochlear (with an r) means 'having to do with the cochlea'? As in the cochlea (the organ in the ear), but the cochlear nerve, duct, etc. And does that mean that the patents on the Fast Cochlear Transform (with an r) do not violate those on the Fast Cochlea Transform (without an r)?
Reply by ●January 2, 2011
Originally, I had thought a logarithmic-frequency DFT like the Fast Cochlea Transform would be superior to a linear transform like the FFT. I've since become more convinced that traditional FFT-based spectrograms are closer to the right way to go.

The main reason is that vowels are highly harmonic, and any spectrogram that samples in between harmonics of the fundamental pitch will show deep horizontal valleys. Tracking formant frequencies becomes harder, as they travel down through these valleys and over the ridges of the harmonics, independently of the fundamental pitch.

Another problem in tracking formants is the vertical ridges created by short-time FFTs, which are in sync with the fundamental pitch.

What I'm doing instead is to first determine the pitch, and then do a 2-block time-aliased FFT of two pitch periods, currently with a Hann window. This does several things for me. I need to track the fundamental pitch anyway, as it contains word-boundary information and other prosodic information. It eliminates the valleys created by sampling in between harmonics. It eliminates the vertical ridges created by the short-time FFT. It also eliminates the common repeating pin-hole patterns in short-time spectrograms. As much of the energy is harmonic, and harmonics cause zero spectral leakage, spectral noise is reduced. Finally, computation is dramatically reduced, as the FFT window is much shorter than the usual 25-ish ms short-time FFT window, and I take full pitch-period-sized steps.

I think all of this will help me match spectrograms in speech recognition and TTS systems. However, I'm still new to this field, and would appreciate any tips on cleaning up voice spectrograms.

Bill
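To make the 2-block time-aliased FFT concrete, here is a minimal numpy sketch. It assumes the pitch period is already known and is an integer number of samples; the function name and the exact window placement are my own choices, not necessarily Bill's implementation.

```python
import numpy as np

def pitch_synchronous_spectrum(x, start, period):
    """Time-alias two adjacent pitch periods, then FFT the folded block.

    x      : 1-D signal array
    start  : sample index where the first pitch period begins
    period : pitch period length in samples (assumed integer here)
    """
    seg = x[start:start + 2 * period]
    # Hann window over the two-period span
    seg = seg * np.hanning(2 * period)
    # Fold (time-alias) the second period onto the first
    folded = seg[:period] + seg[period:]
    # FFT bins of the folded block fall exactly on harmonics of the
    # fundamental, so a perfectly periodic vowel leaks nothing
    return np.abs(np.fft.rfft(folded))
```

Because the transform length equals one pitch period, bin k sits exactly at the k-th harmonic, which is what removes the inter-harmonic valleys described above.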
Reply by ●January 3, 2011
On Jan 2, 4:23 am, "waywardgeek" <waywardgeek@n_o_s_p_a_m.gmail.com> wrote:
> What I'm doing instead is to first determine the pitch, and then do a
> 2-block time-aliased FFT of two pitch periods, currently with a Hann
> window.

How can you do an FFT of two full pitch periods (without signal re-sampling or zero-padding) when each pitch period contains a variable number of signal samples (it can be a fractional number)?

What is a "pitch period"? Where does it start and where does it end?

These are not simple questions.

And BTW, the current "state-of-the-art" statistical speech recognition paradigm is crap anyway, so you don't have to worry too much about it...
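The re-sampling the reply mentions is one standard way around the fractional-period objection: interpolate the two-period span onto a fixed-length grid before folding, so the FFT length stays constant. This sketch (mine, not something proposed in the thread) uses simple linear interpolation; a production system would likely use band-limited resampling instead.

```python
import numpy as np

def resample_two_periods(x, start, period, n=256):
    """Map two pitch periods onto a fixed n-sample grid.

    period may be fractional: the span [start, start + 2*period)
    is sampled at n evenly spaced points via linear interpolation,
    so downstream folding and FFT always see the same length.
    """
    t = start + np.arange(n) * (2.0 * period) / n
    return np.interp(t, np.arange(len(x)), x)
```

This sidesteps the integer-period assumption, though it does nothing about the harder question of where a pitch period actually begins and ends.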