
Pitch detection

Started by altmeyermartin March 21, 2005
robert bristow-johnson wrote:
   ...

> whatever linear operation you do in the frequency domain can be constructed
> as an equivalent time-domain operation.
>
> i guess we need to talk a little about what "pitch detection" means.  we
> have a perceptual meaning that is hard to describe for sounds in general.
> if i recorded a fart into a sampling keyboard and then played on the keys a
> recognizable melody (say "Mary had a little lamb"), you might likely hear a
> sense of pitch for each note, but i would have trouble defining clearly how
> that fart gives you a sense of pitch for that note.
>
> but highly tonal instruments are different.  *then* we are pretty clear that
> the pitch of the note is directly related to the fundamental frequency, f0,
> of the quasi-periodic function that is the note's waveform (which is the
> reciprocal of the period).
The B-flat woodpecker woke me this morning by tapping out a Personals ad
on my furnace vent. With all that din, it was still B-flat.

   ...

Jerry
--
Engineering is the art of making what you want from things you can get.
In article <1112887101.613550.274560@z14g2000cwz.googlegroups.com>,
Andor <an2or@mailcircuit.com> wrote:

> I'm also no speech processing guy (perhaps you should go over to
> comp.speech or comp.speech.research for more qualified responses). I
> know that speech processing people like to estimate the frequency
> components of speech indirectly by computing LPC coefficients (that is
> what vocoders such as MELP, CELP etc. do). These coefficients do not
> give you frequency information directly (you have to factor the
> polynomial first), but it seems that certain vocal tract parameters can
> be deduced directly from these coefficients (without having to take it
> from the frequency domain). I think modeling the vocal tract should be
> more interesting than frequency estimation for speech recognition.
hmm, that's interesting -- i don't know what LPC is, nor what vocal tract parameters might be exactly -- but it's interesting that you're saying speech might be processed, right at the start, in a fundamentally different way than other sounds. as i said, i know there's much more to speech recognition, but i thought that "much more" would come after the initial processing of the sound -- i thought that whether you're processing music or speech or whatever, you'd probably (basically and fundamentally, that is -- not exactly; you'd make modifications) handle the sound initially in much the same way. so thanks for pointing out that that may not be the case. something to look into.

although, i like the way the human ear deals with music and speech and other things, and that is basically a frequency splitter-upper, isn't it? i know there's nothing dictating that you should copy nature and not take short cuts where you see them, but anyway... i'd also like not to restrict what i'm doing to just speech recognition -- not cut off other possibilities -- although speech recognition is the main goal at the moment. but if there's other information to be gleaned by using something other than frequency detection, then it should be used, for sure.

in fact, about what you're talking about -- LPC and vocal tract parameters and maybe other things: do you think that if i were to use only frequency analysis (splitting the sound up into frequencies etc.), i'd miss information that those methods would give? or are they a short cut, a more efficient way to get information you could still get by splitting sound into frequencies? probably something to ask on the speech recognition list, but i'd be interested in any opinions on that here. (speech recognition people are probably more likely to "sell" their own ways, so it's nice to get other opinions.) if the ear does basically just split sound into frequencies, then i reckon what you're talking about is a short cut, and not something that'd give extra info that couldn't be got from frequency splitting -- but i'm guessing.
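for reference: LPC (linear predictive coding) fits an all-pole filter to the signal so that each sample is predicted from a weighted sum of the previous few samples; the roots of the resulting polynomial tend to sit near the vocal tract resonances (the formants). below is a minimal sketch of the textbook autocorrelation/Levinson-Durbin recursion, in Python with NumPy -- a generic illustration, not code from anyone in this thread, and order=12 is just a typical choice for 8 kHz speech:

import numpy as np

def lpc(x, order=12):
    # autocorrelation method: r[0..order]
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Levinson-Durbin recursion: reflection coefficient for order i
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] += k * a[:i + 1][::-1]   # update predictor coefficients
        err *= 1.0 - k * k                 # remaining prediction error
    return a

# rough formant frequencies from the polynomial roots, e.g. at fs = 8000.0:
# roots = np.roots(lpc(frame))
# formants = np.angle(roots[np.imag(roots) > 0]) * fs / (2 * np.pi)

this is the sense in which LPC gives frequency information only indirectly: the coefficients come straight from the waveform, and you have to factor the polynomial to see the resonant frequencies.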
> If that is not the case, pure frequency estimation can be done via FFT
> --- consider windowing, averaging and overlapping to improve the raw
> FFT data.
yes, they're the kind of details i was skipping over when i said "without going into the details" (although oversampling isn't something i know about yet). as well as what you mention, i thought of different-length windows for different tones (short for high-pitched, long for bass-like) -- so, different fourier transforms specifically for particular (small) ranges of tones. that would generally allow more accurate time info to be extracted, i think (although it makes no difference for the lowest tone). and lots of overlapping, as you said -- both time-wise (shifting the windows along by a small amount each time) and frequency-wise. cool, thanks.

ben.
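to make the windowing / averaging / overlapping recipe concrete, here is a minimal sketch in Python with NumPy (the frame length and hop size are arbitrary illustrative choices, not values from this thread):

import numpy as np

def averaged_spectrum(x, fs, frame_len=2048, hop=512):
    # average windowed FFT magnitudes over overlapping frames
    # (Welch-style smoothing of the raw FFT data)
    assert len(x) >= frame_len
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    acc = np.zeros(frame_len // 2 + 1)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len] * window
        acc += np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, acc / n_frames

shortening frame_len buys time resolution at the cost of frequency resolution, which is essentially ben's different-length-windows-for-different-ranges idea.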
In article <BE7AC7B8.600A%rbj@audioimagination.com>, robert
bristow-johnson <rbj@audioimagination.com> wrote:

> in article 070420051129510384%x@x.x, ben at x@x.x wrote on 04/07/2005 06:30:
>
> > In article <BE78340D.5EE0%rbj@audioimagination.com>, robert
> > bristow-johnson <rbj@audioimagination.com> wrote:
> >
> >> i wouldn't do that.  you still need to deal with the possibility of missing
> >> or weak harmonics (inc. fundamental).  i agree with Dmitry Terez about not
> >> using FFT.
> >
> > ...[to detect pitch]
> >
> > is this correct? when you say not to use fft i take it you mean fourier
> > transforms in general not just the fast fourier transform specifically?
> > it's a little bit of a shock to me this -- i asked elsewhere about
> > pitch detection (which is the same thing as frequency or tone analysis
> > / extraction right?) and was told 'fourier transform' by numerous
> > people -- that seemed to be *the* one and only answer.
>
> whatever linear operation you do in the frequency domain can be constructed
> as an equivalent time-domain operation.
>
> i guess we need to talk a little about what "pitch detection" means.  we
> have a perceptual meaning that is hard to describe for sounds in general.
> if i recorded a fart into a sampling keyboard and then played on the keys a
> recognizable melody (say "Mary had a little lamb"), you might likely hear a
> sense of pitch for each note, but i would have trouble defining clearly how
> that fart gives you a sense of pitch for that note.
>
> but highly tonal instruments are different.  *then* we are pretty clear that
> the pitch of the note is directly related to the fundamental frequency, f0,
> of the quasi-periodic function that is the note's waveform (which is the
> reciprocal of the period).
i'm not quite sure i really see the difference -- we're talking about the size of gaps between each spike for both things (the fart sample and the musical instrument) -- so what's the difference? the frequency of spikes makes a pitch, right? lots of spikes per second -- high-pitched sound. sure, the fart sample is a bit rougher and maybe gappier, but it's still the same situation, isn't it? i'm definitely not seeing the difference between frequency and pitch, but i don't think i'm that fussed about the difference (although it is very interesting -- certainly got me thinking). maybe it is important, though; i'm not sure.

anyway, when i said i asked elsewhere about pitch detection, i actually asked elsewhere about frequency detection -- it's just since seeing this thread that the issue of the difference between pitch and frequency has arisen, and the two got merged together because i wasn't aware of the difference. i think it's frequency detection i'm interested in, although seeing as i don't understand the difference, i'm pretty unsure. (i want to pre-process sound -- split it into its various frequencies -- in order to go on and process it further for speech recognition.)
> now, in the spectrum, you will see spikes that are equally spaced and
> integer multiples of that fundamental frequency.
this is the raw data that you get from a sound, right? time going horizontally and amplitude vertically, as it's usually illustrated.
> each spike represents a
> harmonic and the height of it is the strength of that harmonic.  now, we
> could use a comb filter to isolate those spikes.  there are two basic kinds
> of comb filters, one that puts in a null every f1 Hz and one that puts in a
> peak every f1 Hz.  now if we use the first one and vary f1 until it happens
> upon f0 or a submultiple of f0, then the output of that comb filter will be
> minimum.  that is essentially what the AMDF or ASDF algorithm does in the
> time domain.  it's the same thing but in two different domains.  and
> autocorrelation is directly related to the ASDF.
i thought fourier transforms were what's used to transform those spikes (i think of them as fence posts, but i'm silly) into something more comprehensible / recognisable? how are AMDF and ASDF different from / related to a fourier transform? are they completely different? or are they based on, maybe even variants of, fourier transforms?

to continue with the description of sound and extracting frequencies, which i find *very* helpful: the width of the window necessary to be able to see a frequency is at least the width of two fence posts together (maybe more) for the particular frequency you're looking at, so you can't tell the frequency/tone from just one fence post (because it's the relation between fence posts that makes a frequency -- a single fence post isn't a frequency -- hence the multiple-toothed comb in your explanation). so the number of fence posts in a particular stretch gives the frequency/tone, and the height of the fence posts gives the volume. and one main problem, i can imagine, is that various styles of fences (frequencies) overlap in most sounds -- they occur within the same stretch of time. that's when you need more than two fence posts to be able to tell which fence posts belong to which fence -- to be able to see/determine the continuation. the more repetition over a longer stretch, the easier it is to be sure you're correctly ascertaining the frequency and not just going off a mixture of frequencies and getting incorrect frequency data.

yes, the literal explanation you give of how to go about extracting frequencies completely tallies with how i imagined, logically, you might get frequencies out of raw audio data -- but what gets me is: where on earth does the fourier transform come into this? (i don't have a nice simple logical understanding of how a fourier transform does what it does at all -- unless it happens to be what's just been described, maybe? the comb etc.?) i thought time/amplitude data >>> frequency data was the fourier transform's territory. the ft transforms back and forth between raw data (time by amplitude, without apparent frequency data) and spectrum data (not sure on that phrase -- frequency by amplitude, without apparent time data).

so should i drop reading and learning about fourier transforms (bearing in mind i want to extract the various frequencies that occur in sound, mainly but not entirely for speech recognition) and concentrate on AMDF and ASDF? or are they much the same / similar things anyway?
> well, splitting a sound into its frequencies certainly *is* a topic
> regarding the Fourier Transform (in one of its forms).
so i reckon AMDF and ASDF are versions of the fourier transform?

thanks very much for the reply,

ben.
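for the record, AMDF and ASDF are not versions of the fourier transform -- they live entirely in the time domain, as robert says above. a minimal sketch of both in Python with NumPy (function names and the search range are illustrative only):

import numpy as np

def amdf(x, max_lag):
    # Average Magnitude Difference Function: near zero at lags equal
    # to the period (and at integer multiples of the period)
    return np.array([np.mean(np.abs(x[lag:] - x[:-lag]))
                     for lag in range(1, max_lag + 1)])

def asdf(x, max_lag):
    # Average Squared Difference Function: same idea with squared
    # error; directly related to the autocorrelation of x
    return np.array([np.mean((x[lag:] - x[:-lag]) ** 2)
                     for lag in range(1, max_lag + 1)])

# crude period estimate: the lag of the deepest null, e.g. at fs = 8000.0:
# f0 = fs / (np.argmin(amdf(x, 400)) + 1)

note there is no transform anywhere in there: each function just compares the waveform against a delayed copy of itself, and dips toward zero when the lag hits the period.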
In article <KPidnYdVc-g908jfRVn-tg@rcn.net>, Jerry Avins <jya@ieee.org>
wrote:

> ben wrote:
>
>   ...
>
> > is this correct? when you say not to use fft i take it you mean fourier
> > transforms in general not just the fast fourier transform specifically?
>
> To touch on one point only. An FFT is just a fast way to compute a
> Fourier transform. If its result weren't identical to all the other ways
> of computing it, it wouldn't be a Fourier transform.
right, i see. i did think they were very similar -- i didn't know they gave exactly the same results. i was just being careful and making sure. thanks,

ben.
ben wrote:

   ...

> i'm not quite sure i really see the difference -- we're talking about
> the size of gaps between each spike for both things (the fart sample
> and the musical instrument) -- so what's the difference? the frequency
> of spikes makes a pitch, right? lots of spikes per second -- high-pitched
> sound. sure, the fart sample is a bit rougher and maybe gappier, but
> it's still the same situation, isn't it? i'm definitely not seeing the
> difference between frequency and pitch, but i don't think i'm that
> fussed about the difference (although it is very interesting --
> certainly got me thinking). maybe it is important, though; i'm not sure.
>
> anyway, when i said i asked elsewhere about pitch detection, i actually
> asked elsewhere about frequency detection -- it's just since seeing
> this thread that the issue of the difference between pitch and
> frequency has arisen, and the two got merged together because i wasn't
> aware of the difference. i think it's frequency detection i'm
> interested in, although seeing as i don't understand the difference,
> i'm pretty unsure.
Jon Harris wrote earlier:

<quote>
"robert bristow-johnson" <rbj@audioimagination.com> wrote:
...
>> i think your brain sorta fills in the missing fundamental if there is
>> a 2nd, 3rd, 4th, etc harmonic of a tone.  try it with MATLAB or the
>> code of your choice.

Or check out these examples on the web:
http://www.ee.calpoly.edu/~jbreiten/audio/missfund/
http://physics.mtsu.edu/~wmr/julianna.html
<endquote>

Did you listen to those examples, particularly the ones at the second
URL? Some clearly illustrate a *pitch* an octave lower than the lowest
*frequency* in the audio.

   ...

Jerry
--
Engineering is the art of making what you want from things you can get.
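the effect is also easy to synthesize yourself. a minimal sketch in Python with NumPy/SciPy (the note, duration, and harmonic count are arbitrary illustrative choices): it builds a tone from harmonics 2 through 5 of 110 Hz with nothing at all at 110 Hz, so the lowest *frequency* present is 220 Hz, yet most listeners hear a *pitch* an octave below that:

import numpy as np
from scipy.io import wavfile

fs = 44100
t = np.arange(2 * fs) / fs                     # two seconds
f0 = 110.0                                     # the absent fundamental
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k  # harmonics 2..5 only
           for k in range(2, 6))
tone /= np.max(np.abs(tone))                   # normalize to avoid clipping
wavfile.write("missing_fundamental.wav", fs, (tone * 32767).astype(np.int16))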

Jerry Avins wrote:
> ben wrote:
>
> ...
>
>> is this correct? when you say not to use fft i take it you mean fourier
>> transforms in general not just the fast fourier transform specifically?
>
> To touch on one point only. An FFT is just a fast way to compute a
> Fourier transform.
To be more specific, a finite length, discrete time Fourier transform. :-)

Bob
--
"Things should be described as simply as possible, but no simpler."
                                                         A. Einstein
Bob Cain wrote:

> Jerry Avins wrote:
>
> > To touch on one point only. An FFT is just a fast way to compute a
> > Fourier transform.
>
> To be more specific, a finite length, discrete time Fourier
> transform. :-)
Implemented in a computationally efficient manner.

Otherwise, "finite length" etc. just says DFT, not necessarily FFT.

Ciao,

Peter K.
Bob Cain wrote:
>
> Jerry Avins wrote:
>
>> ben wrote:
>>
>> ...
>>
>>> is this correct? when you say not to use fft i take it you mean fourier
>>> transforms in general not just the fast fourier transform specifically?
>>
>> To touch on one point only. An FFT is just a fast way to compute a
>> Fourier transform.
>
> To be more specific, a finite length, discrete time Fourier transform. :-)
>
> Bob
True. To be even more specific, a finite length, discrete time Fourier
transform with quantized results and some round-off error. Is there
another kind of Fourier transform that can be performed digitally?

Jerry
--
Engineering is the art of making what you want from things you can get.
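that identity is easy to check numerically: a direct O(N^2) evaluation of the DFT definition and an FFT agree to round-off error. a minimal sketch in Python with NumPy:

import numpy as np

def naive_dft(x):
    # direct evaluation of X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)
    n = len(x)
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) @ x

x = np.random.randn(256)
print(np.max(np.abs(naive_dft(x) - np.fft.fft(x))))  # ~1e-12: same transform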
In article <1112764006.059594.279510@z14g2000cwz.googlegroups.com>,
robert bristow-johnson <rbj@audioimagination.com> wrote:
> i don't get it.  why would getting the higher harmonics and then
> dividing the frequency down to an implied missing fundamental be better
> than determining the period of some quasi-periodic signal for measure
> of pitch?
If the spectral energy peaks are very closely but not exactly harmonically
related (which the physics of some real-world resonators can produce),
a sub-multiple of the lowest frequency might be what a human would call
the approximate pitch, but a sub-multiple of an even higher frequency
present might be what a musician would call the exact pitch relative to
other simultaneous musical notes present.

A good example might be a spinet piano, where a slightly flat low-A
(say 109.8 Hz) played through a telco quality circuit might only have
frequency content above 200 Hz, but would still be heard as a low-A,
two octaves below concert-A, in appropriate context, even with little
spectral energy in that range. But if the near 4th harmonic peaked at
440.8 Hz, and this waveform was played against a simultaneous exact 440
Hz concert-A flute tone, thus producing a noticeable beat, the low-A
piano note might be perceived as slightly sharp in pitch, not flat.

Humans may also be more sensitive to pitch errors in the middle of
the audio spectrum, versus in the lower or higher frequency ranges.
Thus the pitch in the above situation, to a piano tuner, might be best
considered as closer to 440.8/4 = 110.2 Hz, and neither, say, at 220 Hz,
where there might be the highest absolute spectral peak (according to
an FFT maximum), nor at the fundamental 109.8 Hz string resonance that
started off this overtone sequence (and which an AMDF or autocorrelation
algorithm might hunt and find).

IMHO. YMMV.
--
Ron Nicholson   rhn AT nicholson DOT com   http://www.nicholson.com/rhn/
#include <canonical.disclaimer>   // only my own opinions, etc.
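a quick way to audition Ron's scenario -- a sketch in Python with NumPy, with made-up amplitudes; the partials are a slightly sharpened overtone series over the 109.8 Hz string, nothing below 200 Hz, with the near-4th harmonic at the 440.8 Hz he quotes, beaten against an exact 440 Hz tone:

import numpy as np

fs = 44100
t = np.arange(3 * fs) / fs                     # three seconds
partials = [220.0, 330.3, 440.8, 551.5]        # sharpened overtones, nothing at 109.8 Hz
piano = sum(np.sin(2 * np.pi * f * t) / (i + 1)
            for i, f in enumerate(partials))
flute = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # exact concert-A
mix = (piano + flute) / np.max(np.abs(piano + flute))

the roughly 0.8 Hz beat between the 440.8 Hz partial and the 440 Hz flute tone is clearly audible in the mix.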
in article d39gmo$vqr$1@blue.rahul.net, Ronald H. Nicholson Jr. at
rhn@mauve.rahul.net wrote on 04/09/2005 17:16:
 
> If the spectral energy peaks are very closely but not exactly harmonically
> related (which the physics of some real-world resonators can produce),
> a sub-multiple of the lowest frequency might be what a human would call
> the approximate pitch, but a sub-multiple of an even higher frequency
> present might be what a musician would call the exact pitch relative to
> other simultaneous musical notes present.
>
> A good example might be a spinet piano, where a slightly flat low-A
> (say 109.8 Hz)
what is 109.8 Hz? is it the frequency of the bottom overtone (often called the fundamental)? or is it the reciprocal of the period? especially in the situation you describe below, they are not exactly the same thing. the AMDF or ASDF measures the period.
> played through a telco quality circuit might only have
> frequency content above 200 Hz, but would still be heard as a low-A,
> two octaves below concert-A, in appropriate context, even with little
> spectral energy in that range.
yup. and the measured period will be about 1000/109.8 milliseconds. but possibly not exactly.
> But if the near 4th harmonic peaked at 440.8 Hz,
you mean there's a formant (or resonance) at around 440 Hz making the 4th harmonic particularly loud compared to others? that will increase its influence on the measured period.
> and this waveform was played against a simultaneous exact 440
> Hz concert-A flute tone, thus producing a noticeable beat, the low-A
> piano note might be perceived as slightly sharp in pitch, not flat.
that may be true, but i am not sure that the AMDF will see it any differently. especially if the 109.8 Hz component was killed by an HPF, then the fundamental *will* be determined as the greatest common divisor of the remaining harmonic frequencies (and the period as its reciprocal), and if they are sharper than their integer harmonic index times the 109.8 Hz component, the AMDF will arrive at a pitch that is higher than 109.8.
> Humans may also be more sensitive to pitch errors in the middle of
> the audio spectrum, versus in the lower or higher frequency ranges.
that may be, but it is still not the issue. just like for a VU meter, you could run the audio through something like an A-weighting filter to emphasize frequency components in the 2 to 5 kHz range and de-emphasize components in the highest and lowest octaves before the AMDF algorithm sees it.
> Thus the pitch in the above situation, to a piano tuner, might be best
> considered as closer to 440.8/4 = 110.2 Hz, and neither, say, at 220 Hz,
> where there might be the highest absolute spectral peak (according to
> an FFT maximum), nor at the fundamental 109.8 Hz string resonance that
> started off this overtone sequence (and which an AMDF or autocorrelation
> algorithm might hunt and find).
no. the AMDF or ASDF will find the best fit for the period, which is influenced by all of the harmonics, and the harmonics greater in amplitude will influence the measure more. the reciprocal of that would be called the fundamental frequency, but it might not be exactly the same frequency as the 1st harmonic. as in the case above, if there was zero amplitude at 109.8 Hz (i dunno what meaning that precise frequency would have) but a decent amount of energy at 220, 330.3, 440.8, 551.5, the AMDF will not measure a period of 1/109.8; the period will be shorter than 1/110 because of the other harmonics.

i know about sharpened harmonics in many fixed-string instruments with increasing harmonic number (due to stiffness at the string termination that effectively shortens the string, particularly for high-amplitude hits). i know that piano tuners may very well tune higher notes slightly sharp, in comparison to their mathematical value in an equally tempered scale, to line up octaves to power-of-2 harmonics from lower notes. for 12-note/octave equal temperament, we don't line up the other harmonics -- say, the 3rd to exactly 19 semitones up -- because 3 does not exactly equal 2^(19/12). i know about some tones possibly having a missing fundamental (and possibly other missing harmonics). it's also possible that the fundamental, even when it is there, does not exactly equal the reciprocal of the measured period, because of the aggregate influence of the other harmonics. that doesn't change anything. tonal musical notes are quasi-periodic and, for those kinds of notes, our most salient cue for pitch will be the reciprocal of the period, and the AMDF or ASDF is designed to best estimate that period.

now there are problems. there is the classic "octave problem" (it could happen with other harmonic intervals too, but most often, if there is an ambiguity, it's about an octave). this comes from the fact that a 110 Hz note that is added to a *very* quiet 55 Hz note (say, at -80 dB relative to the 110 Hz note) will look like a 55 Hz note mathematically, but will sound like a 110 Hz note. then there needs to be a little brains built into the AMDF analysis to reject the null at 1/55 sec just because it is ever so slightly lower than the null at 1/110 sec. so somehow you want to choose the first really good looking null, even if the null at twice the lag is very slightly better. that's the main problem with AMDF or ASDF.

i don't see the situation you described as being a problem. if you have a good (and short) sound file of a note, or even just a collection of amplitudes and frequencies that you think would fool this, i might want to try it with a MATLAB kludge to see if it does.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
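for what it's worth, a minimal sketch of that "little brains" in Python with NumPy (the 0.9 depth threshold is an illustrative guess, not a value from this thread): rather than taking the global minimum of the AMDF, accept the *first* null that is nearly as deep as the global minimum, so a barely-better null at twice the lag can't drag the estimate down an octave:

import numpy as np

def pitch_from_amdf(x, fs, min_lag, max_lag, tol=0.9):
    # AMDF over the candidate lag range (min_lag must be >= 1)
    lags = np.arange(min_lag, max_lag + 1)
    d = np.array([np.mean(np.abs(x[lag:] - x[:-lag])) for lag in lags])
    lo, hi = d.min(), d.max()
    # a null "counts" if it reaches at least tol of the way down
    # from the worst value toward the deepest null
    good = d <= hi - tol * (hi - lo)
    first_good_lag = lags[np.argmax(good)]   # argmax returns the first True
    return fs / first_good_lag

with tol = 1.0 this degenerates to the plain global minimum; lowering tol trades octave errors for more sensitivity to noise.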