DSPRelated.com
Forums

Sound analysis

Started by Shimon M September 2, 2009
Hello, Everyone,

I am designing software for a course, and it is supposed to do the
following:
1. Get a wave file.
2. Analyze it - into notes and the instruments that play them.
3. Enable replacing one instrument with another.

Yes, I know these are all things that are under research, and that there
is no magic solution.

Yet, since this is for a course, it doesn't have to be robust - if I can
get it to work only on a small group of samples, that would be OK. So I
chose GuitarPro (which I hope you know). It basically allows one to write
notes for different instruments and play them. After I write the music, I
export it as a MIDI file and then I use a MIDI to WAVE converter. This is
how I get my samples.

Now, I am using an FT (Fourier transform) algorithm that appears at:
http://jvalentino2.tripod.com/dft/index.html (it uses a plain DFT, not
even an FFT).
The major problem I am having is that when I create just a simple "C" note
played by a flute or a guitar -> then export it as MIDI -> convert it to WAVE
-> use the code above on it: I don't get 216.626 Hz as the dominant
frequency!!! Sometimes, 216.626 even gets a very small amplitude.

The code is fairly basic and uses the theoretical FT pretty much as it is,
so I don't think that that's the problem.

So Question 1: --------------------------------------------------
What am I doing wrong? I expected that the "C" frequency would be one of
the leading ones, but the results I am getting seem almost random.
Is there any problem, do you think, with the MIDI export or the WAVE
conversion? Or maybe it is the code I am using after all?

Maybe there is nothing wrong - and this is the way it is supposed to be;
if so, could you give me any advice on how to somehow still identify
the note played?
-----------------------------------------------------------------
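One thing worth checking before blaming the MIDI export or the WAVE conversion: whether the transform output is being read by raw bin index instead of in Hz. Bin k of an N-point transform corresponds to k * sampleRate / N Hz, so 261.626 Hz will usually fall between two bins. A minimal sketch, assuming the WAVE data has already been decoded into a mono double[] (the `samples` and `sampleRate` names here are illustrative) and using the plain O(N^2) DFT for clarity:

    // Find the dominant frequency of a block of decoded mono samples with a
    // plain O(N^2) DFT. Bin k corresponds to k * sampleRate / N Hz, so a
    // 261.626 Hz tone rarely lands exactly on one bin.
    static double dominantFrequency(double[] samples, double sampleRate) {
        int n = samples.length;
        int bestBin = 0;
        double bestMag = 0.0;
        for (int k = 1; k < n / 2; k++) {            // skip DC, use first half only
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += samples[t] * Math.cos(angle);
                im -= samples[t] * Math.sin(angle);
            }
            double mag = Math.hypot(re, im);
            if (mag > bestMag) { bestMag = mag; bestBin = k; }
        }
        return bestBin * sampleRate / n;             // bin index -> Hz
    }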

If you could just answer question number 1, I'd be very grateful, but if
someone can continue reading, that would be fantastic:

I thought I could overcome the problem above - I assumed each instrument
has its own fingerprint - that is, when a guitar plays C it may also produce
E and G at different energies - and the list of energies at each frequency is
some sort of fingerprint of the guitar.

So I designed an algorithm that:
1. Divides a song (wave file) into segments, each of less than 0.2 seconds.
I want to analyze each such segment.
2. In a certain segment, let's say A, some instruments could be playing
some notes.
3. On segment A, the algorithm uses an FFT and finds the significant
Fourier coefficients. Let's call this set of notes: notes(A).
4. I assumed that at least some of these Fourier coefficients are "real",
that is, they represent actual notes played.
5. Now, I also need to support only 3 instruments (guitar, piano,
flute)!!! So, I retrieve from a bank of samples how each of the 3
instruments plays each of notes(A):
this forms a matrix:

(let's assume notes(A) is {C,E}, so the matrix looks like this:

     Guitar_on_C | Guitar_on_E | Flute_on_C | Flute_on_E | Piano_on_C | Piano_on_E
C    from bank     from bank     from bank    from bank    from bank    from bank
E    from bank     from bank     from bank    from bank    from bank    from bank

that is, the cell [Guitar_on_C, C] contains the bank's energy at the C
frequency when C is played by a guitar; the cell [Guitar_on_C, E] contains
the energy at the E frequency when the guitar plays C, and so on.)

Then, I add a few more rows for, let's say, notes B, A#, F, G that were not
found significant - this is to make the matrix a square matrix (6 columns,
6 rows).

And finally, let's call the matrix M, and the sampled values at the
frequencies <C,E,B,A#,F,G> we shall call Y, so:
I look for the solution x of the linear equation system Mx = Y.

If all my assumptions were correct, I could deduce from the solution x the
contribution of each instrument & note to the segment I am analyzing.
I could discard those that have a very small contribution - calling it a
numerical error - and those that have a big contribution are probably really
playing.
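For what it's worth, the Mx = Y step described above is a standard 6x6 linear solve. A rough sketch under the post's own assumptions (M's columns are the bank spectra Guitar_on_C, Guitar_on_E, ..., Y is the measured spectrum of the segment; both are hypothetical placeholders here), using plain Gaussian elimination:

    // Solve M x = y for the contribution vector x, using Gaussian elimination
    // with partial pivoting. m is the 6x6 bank matrix described above and y
    // holds the segment's measured values at the 6 chosen frequencies.
    static double[] solve(double[][] m, double[] y) {
        int n = y.length;
        double[][] a = new double[n][n + 1];          // augmented matrix [M | y]
        for (int i = 0; i < n; i++) {
            System.arraycopy(m[i], 0, a[i], 0, n);
            a[i][n] = y[i];
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;                          // partial pivoting
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= f * a[col][c];
            }
        }
        double[] x = new double[n];                   // back-substitution
        for (int i = n - 1; i >= 0; i--) {
            double s = a[i][n];
            for (int c = i + 1; c < n; c++) s -= a[i][c] * x[c];
            x[i] = s / a[i][i];
        }
        return x;
    }

In practice an overdetermined system (more measured frequencies than unknowns, solved by least squares) would be more forgiving of noise than forcing the matrix to be exactly square.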

So, Question 2: ------------------------------------------
What do you think of the division made at (1) into segments of 0.2
seconds? This was an idea of a professor of mine (he isn't in the field of
DSP), in order to avoid needing to identify where each note begins and ends.
-------------------------------------------------------
Question 3:
what do you think of the algorithm - could it work?
-------------------------------------------------------
Question 4:
The algorithm is based on the idea that if I play a C note at volume 16, and
at the same time an E note at volume 17, then the Fourier transform of
this is just like the sum of 2 Fourier transforms:

1. C note on guitar at volume 16.
2. E note on guitar at volume 17.

Is this a right assumption?
-------------------------------------------------------

Thank you for all your help!
I will be extremely grateful to anyone who responds, even to only some of the
questions.


On 2 Sep, 13:46, "Shimon M" <shimon.ma...@gmail.com> wrote:
> Hello, Everyone, > > I am designing a software for a course, and it is suppose to do the > following: > 1. get a wave file > 2. analyze it - to notes and instruments that play them > 3. enable changing an instrument with another. > > Yes, I know these are all things that are under research, and that there > is no magic solution. > > Yet, since this is for a course, it doesn't have to be robust - if I can > get it to work only on a small group of samples, that would be OK.
When something is 'subject of current research', even this limited goal might be too optimistic.
> So I > chose GuitarPro (which I hope you know). It basically allows one to write > notes for different instruments and play them. After I write the music, I > export it as a MIDI file and then I use a MIDI to WAVE converter. This is > how I get my samples. > > Now, I am using an FT (fourier transform) algorithm that appears on: http://jvalentino2.tripod.com/dft/index.html (it uses a regular FT, not > even FFT)
Use the FFT. The FFT is a standard tool whose workings everybody understands.
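For reference, a minimal recursive radix-2 FFT along these lines (input length must be a power of two; the re/im arrays are overwritten with the spectrum). In practice a well-tested library is the safer choice, as suggested later in the thread, but a small version like this is enough to check results against the plain DFT:

    // Recursive radix-2 Cooley-Tukey FFT. re and im hold the real and
    // imaginary parts of the input (length a power of two) and are replaced
    // in place by the transform.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 1) return;
        double[] evenRe = new double[n / 2], evenIm = new double[n / 2];
        double[] oddRe  = new double[n / 2], oddIm  = new double[n / 2];
        for (int i = 0; i < n / 2; i++) {
            evenRe[i] = re[2 * i];     evenIm[i] = im[2 * i];
            oddRe[i]  = re[2 * i + 1]; oddIm[i]  = im[2 * i + 1];
        }
        fft(evenRe, evenIm);                           // transform the two halves
        fft(oddRe, oddIm);
        for (int k = 0; k < n / 2; k++) {
            double ang = -2.0 * Math.PI * k / n;       // twiddle factor W_n^k
            double wr = Math.cos(ang), wi = Math.sin(ang);
            double tr = wr * oddRe[k] - wi * oddIm[k];
            double ti = wr * oddIm[k] + wi * oddRe[k];
            re[k]         = evenRe[k] + tr;  im[k]         = evenIm[k] + ti;
            re[k + n / 2] = evenRe[k] - tr;  im[k + n / 2] = evenIm[k] - ti;
        }
    }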
> The major problem I am having is that when I create just a simple "C" note > played by a flute or a guitar-> then export it as MIDI -> convet it to WAVE > -> use the code above on it: I don't get 216.626 Hz as the dominant > frequency!!! Sometimes, 216.626 gets even a very small amplitude.
This can have any number of explanations. If you had used the FFT, you would have one less dubious factor to worry about. As is, it could be a problem with the FT algorithm you chose.
> The code is fairly basic and uses the theoratical FT pretty much as it is, > so i don't think that that's the problem.
Don't 'think' it's not the problem. Make sure. Use the FFT.
> So Question 1:-------------------------------------------------- > What am I doing wrong? - i expected that the "C" frequency would be one of > the leading ones, but the results I am getting seem almost random. > Is there any problem, you think, with the MIDI export or the WAVE > coversion. Maybe it is the code i am using after all?
The FT? Could well be. Or it could be that MIDI uses some format that is tailored to the human auditory system. If you had used the FFT from the start, you would have had one less uncertain factor on your list.
> Maybe there is nothing wrong - and this is the way it is supposed to be, > and if so could you give me and advice on how to somehow still understand > the note played. > -----------------------------------------------------------------
If you mean 'how the human auditory system works,' that's anybody's guess.
> If you could just answer question number 1, I'd be very greatful, but if > some can continue reading, that would be fantastic: > > I thought I could overcome the problem above - I assumed each instrument > as its own fingerprint - that is when a guitar plays C it may also play E, > G at different energies - and the list of energies for each frequency is > some sort of fingerprint of a guitar.
Again an *assumption* on your part. Don't assume. Make sure where you can; verify what remains.
> [...the algorithm and matrix as described above...] > So, Question 2: What do you think of the division made at (1) into segments of 0.2 seconds. This was an idea of a professor of mine (he isn't in the field of DSP), in order to avoid needing to identify where the note begins and ends.
It's an arbitrary choice. What's wrong with 0.25? Or 0.15?
> Question 3: > what do you think of the algorithm - could it work? > -------------------------------------------------------
Depends on what your objective is. It would probably work if the objective is to demonstrate that the spectra of different instruments that play the same notes in isolation are different. If you want to extract the contributions from each instrument to a song, you might be out of luck.
> Question 4: > The algorithm is based on the idea that if I play C note at volume 16, and > at the same time E note at volume 17 - then the fourier transform over for > this - Is just like: > > the sum of 2 fourier transfroms: > 1. C note on guitar at volume 16. > 2. E note on huitar at volume 17. > > Is this a right assumption? > -------------------------------------------------------
Sure. The DFT is linear. Rune
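A quick numerical check of that linearity, for anyone who wants to convince themselves: for any bin k, the DFT of (16*c + 17*e) equals 16*DFT(c) + 17*DFT(e). The c and e arrays here are hypothetical equal-length sample blocks of the two notes:

    // Verify DFT linearity at one bin k:
    // DFT(16c + 17e)[k] == 16*DFT(c)[k] + 17*DFT(e)[k] up to rounding.
    static double[] dftBin(double[] x, int k) {
        double re = 0.0, im = 0.0;
        for (int t = 0; t < x.length; t++) {
            double angle = 2.0 * Math.PI * k * t / x.length;
            re += x[t] * Math.cos(angle);
            im -= x[t] * Math.sin(angle);
        }
        return new double[] { re, im };
    }

    static boolean linearityHolds(double[] c, double[] e, int k) {
        double[] mix = new double[c.length];
        for (int t = 0; t < c.length; t++) mix[t] = 16.0 * c[t] + 17.0 * e[t];
        double[] a = dftBin(c, k), b = dftBin(e, k), m = dftBin(mix, k);
        double tol = 1e-6 * (1.0 + Math.abs(m[0]) + Math.abs(m[1]));   // relative tolerance
        return Math.abs(m[0] - (16.0 * a[0] + 17.0 * b[0])) < tol
            && Math.abs(m[1] - (16.0 * a[1] + 17.0 * b[1])) < tol;
    }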

Shimon M wrote:
> Hello, Everyone, > > I am designing a software for a course, and it is suppose to do the > following: > 1. get a wave file > 2. analyze it - to notes and instruments that play them > 3. enable changing an instrument with another. >
<gulp> What is the pedagogical purpose of this? What academic level? (2) (and (3) from it) is a formidable problem for polyphonic sources (see MPEG-7, Blind Source Separation etc)
>.. > The major problem I am having is that when I create just a simple "C" note > played by a flute or a guitar-> then export it as MIDI -> convet it to WAVE > -> use the code above on it: I don't get 216.626 Hz as the dominant > frequency!!! Sometimes, 216.626 gets even a very small amplitude. >
Hope that's a typo - it should be 261.626. Taking the lowest partial of a tone as the fundamental pitch works a lot of the time, but not all the time. Many oboe notes, for example, have a 2nd harmonic stronger than the fundamental.

No time to answer everything else just now; but I wonder why you feel the need to design all this yourself. How much time do you have? The new academic year has virtually started already. For example, does this software not do almost all of what you need (and it is GPL too)? http://www.sonicvisualiser.org

Or even the new Melodyne, which claims polyphonic note separation among other things (I don't have it and haven't tested it).

Richard Dobson
Thank you both for your comments.

>Rune said: >Use the FFT. The FFT is a standard tool that everybody >understand how works. >
Thank you. I just used the FFT on www.fftcalculator.com and I got similar results :(
However, the way I produce data from a wave file is AudioInputStream and SourceDataLine - they produce an array of int(s)!!! So, the data I performed the FFT on looks like this: 10503 45207 15203 310752 ... Is this the way it is supposed to be? Maybe this is part of the problem?

>If you mean 'how the human auditory system works,' that's >anybody's guess. >

No, I meant: how do I deduce the note played from the FFT results, when the FFT as I see it doesn't give me any real information - because the frequency I expected to have a high value has only an average value.
>Again an *assumption* on your part. Don't assume. Make sure >where you can; verify what remains. >
You are correct. I meant the word "assumption" as a general "what should be" in terms of theory. But as we all know - theory != reality. So in theory, a guitar's C note differs from a piano's C note because of its other by-product notes.
>It's an arbitrary choise. What's wrong with 0.25? Or 0.15?
There is nothing wrong - I am supposed to find an optimal size via trial and error, which I will do as soon as the above parts work. :)
>Sure. The DFT is linear.
:) Yes, but is music linear? I assumed it was, because music is a wave, and two waves with the same phase and the same period add in amplitude.

Thank you very much for your comments, Rune.
-------------------------------
>Richard Dobson wrote: >What is the pedagogical purpose of this? What academic level? (2) (and >(3) from it) is a formidable problem for polyphonic sources (see MPEG-7,
>Blind Source Separation etc)
I am studying for a BSc in CS, and this is my final project to submit. It is due in a few weeks. I can use some library classes, but I need to come up with algorithms myself and implement them myself.
>Hope that's a typo - should be 261.626. Taking the lowest partial of a >tone as the fundamental pitch works a lot of the time, but not all the >tiume. Many oboe notes for example have a 2nd harmonic stronger than the
>fundamental. >
Yes, typo. Thank you. Other than that, the results I am getting through the FFT don't indicate that 261.626 has any importance! Some frequencies get amplitude 1500 and 261.626 sometimes gets only about 500! So it is not the first and not even the second!

Thank you for your comments, Mr. Dobson.
---------------
Thank you both.
Shimon M wrote:
> Thank you both for your comments. > >> Rune said: >> Use the FFT. The FFT is a standard tool that everybody >> understand how works. >> > > Thank you. I just used the FFT on www.fftcalculator.com > I got similar results :(
Such things are worse than useless for serious work of the kind you are contemplating. That work requires a deep understanding of how the FFT works, and in particular the significance of (a) the sample rate of the audio being analysed and (b) the minimum FFT length (window size) needed to resolve low fundamentals and close frequency components.

Go here first: http://www.dspdimension.com/admin/dft-a-pied

Then go here and read as much as you can: http://ccrma.stanford.edu/~jos/sasp
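To put numbers on point (b): the bin spacing of an N-point FFT is sampleRate / N, so separating C4 (261.63 Hz) from its neighbour C#4 (277.18 Hz), about 15.5 Hz apart, at 44100 Hz needs N of at least roughly 2840, i.e. 4096 as the next power of two, and a window in practice pushes that higher still. A back-of-envelope sketch:

    // Smallest power-of-two FFT length whose bin spacing (sampleRate / N)
    // is finer than the frequency difference we need to resolve. A window's
    // main-lobe width means the practical length is larger again.
    static int minFftLength(double sampleRate, double deltaHz) {
        int n = 1;
        while (sampleRate / n > deltaHz) n <<= 1;
        return n;
    }

    // minFftLength(44100.0, 277.18 - 261.63) -> 4096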
> However, the way I produce data from a wave file is AudioInputStream and > SourceDataLine - they produce an array of int(s)!!!
Unfortunate it is you appear to be dependent on using Java. <flamebait> Don't use Java. </flamebait> There is a huge amount of free high-quality audio processing code on the net, none of it written in Java. It is all C or C++ (mostly C).

The site you cited with a Java FFT is at the very least suspect, as it talks variously about "an array of bytes" but then also discusses endianness (not very well!) and casually mentions that audio data comes in floats (well, it may do in Java, but real audio files come in all sorts of formats). The most usual cause of pain in processing audio is mixing up bytes, ints, floats and so forth. ...
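Since the byte/int/float confusion is the likeliest source of garbage in, here is a sketch of that decoding step using the same Java classes mentioned earlier in the thread, for the common case of 16-bit signed little-endian mono PCM (check the AudioFormat before trusting these assumptions; readAllBytes() needs Java 9+):

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import java.io.File;

    // Decode a 16-bit signed little-endian mono WAVE file into doubles in
    // [-1, 1]. This normalized array - not the raw bytes or unscaled ints -
    // is what the DFT/FFT should be fed.
    static double[] readMono16(File wav) throws Exception {
        AudioInputStream in = AudioSystem.getAudioInputStream(wav);
        AudioFormat fmt = in.getFormat();
        if (fmt.getSampleSizeInBits() != 16 || fmt.getChannels() != 1 || fmt.isBigEndian())
            throw new IllegalArgumentException("expected 16-bit little-endian mono PCM");
        byte[] raw = in.readAllBytes();
        double[] samples = new double[raw.length / 2];
        for (int i = 0; i < samples.length; i++) {
            int lo = raw[2 * i] & 0xFF;              // low byte, treated as unsigned
            int hi = raw[2 * i + 1];                 // high byte keeps the sign
            samples[i] = ((hi << 8) | lo) / 32768.0;
        }
        in.close();
        return samples;
    }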
>> Richard Dobson wrote: >> What is the pedagogical purpose of this? What academic level? (2) (and >> (3) from it) is a formidable problem for polyphonic sources (see MPEG-7, > >> Blind Source Separation etc) > > I am studying for a BSC in CS, and this is my final project to submit. it > is due in a few weeks. I can use some library classes, but I need to come > up with algorithms myself and inmplement them myself. >
In that case, you are up against it. Abandon the idea of separating polyphonic sources in one soundfile. One useful thing you just ~might~ be able to do in the time is stick to analyses of single instrument tones, and characterize them in terms of their ~time-varying~ spectra (which of course requires your FFT to work). You need much finer resolution than frames every 0.2 seconds. Try every 25 msecs.

Real musical instruments do not have a static spectrum that you can simply characterize in terms of strengths of harmonics. The guitar is the most obvious example - a harmonically complex attack (almost chaotic indeed), with a very rapidly decaying envelope, with high frequencies decaying faster than low ones (as is the usual case). This is therefore an elementary exercise in ~partial tracking~ (see the Julius Smith ref above). The C++ CLAM library and tools support this and much else besides: http://clam-project.org

Once you have an FFT working (use a good existing library for that!), you might consider creating a sonogram display (spectrum/time). Try the one in Audacity (www.audacity.org) to see what this is. The different characters of guitar v piano v flute etc. will show up very well. And it will help you empirically understand the relationship between FFT size, sample rate, and frequency resolution. And the use of windowing.

Also look at Csound (www.csounds.com) - which has tools for partial tracking and all manner of spectral analysis, including basic FFTs.

But I fear that if the FFT is as opaque to you as your comments suggest, you may have approached this project far too late in the day. One basic issue is that different FFT implementations scale amplitudes in different ways - you need to find out what those ways are for the FFT you are using.

Richard Dobson
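A sketch of the framing step being suggested, hopping roughly every 25 ms and applying a Hann window to each frame before analysis (the frame and hop sizes here are illustrative, not prescribed values):

    // Split decoded mono samples into overlapping Hann-windowed frames,
    // hopping about 25 ms at a time. Each returned frame can then be fed to
    // an FFT to build a time-varying spectrum (sonogram).
    static double[][] hannFrames(double[] samples, double sampleRate) {
        int frameLen = 2048;                                // ~46 ms at 44.1 kHz
        int hop = (int) Math.round(0.025 * sampleRate);     // ~25 ms hop
        int count = Math.max(0, (samples.length - frameLen) / hop + 1);
        double[][] frames = new double[count][frameLen];
        for (int f = 0; f < count; f++) {
            for (int i = 0; i < frameLen; i++) {
                double hann = 0.5 - 0.5 * Math.cos(2.0 * Math.PI * i / (frameLen - 1));
                frames[f][i] = samples[f * hop + i] * hann;
            }
        }
        return frames;
    }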
Thank you for the resources. Also:
The algorithm we had to implement for the project appears on page 52
(pseudo code on page 54):
http://www.math.ias.edu/~akavia/AkaviaPhDThesis.pdf

If you could just glance at it and tell me if it is good. It is called an
SFT (short time Fourier transform).

>Unfortunate it is you appear to be dependent on using Java.
I guess I could try to incorporate C code with Java.
>The site you cited with a java FFT is at the very least suspect, as it >talks variously about "an array of bytes" but then also discusses >Endianness (not very well!) and casually mentions that audio data come >is floats (well it may do in java, but real audio files come in all >sorts of formats). The most usual cause of pain in processing audio is >mixing up bytes, ints, floats and so forth.
>resolution than frames every 0.2 seconds. Try every 25msecs.
thank you.
> >Real musical instruments do not have a static spectrum that you can >simply characterize in terms of strengths of harmonics. the guitar is >the most obvious example - harmonically complex attack (almost chaotic >indeed), with a very rapidy decaying envelope, with high frequencies >decaying faster than low ones (as it the usual case). > >This is therefore an elementary exercise in ~partial tracking~ (see the >Julius Smith ref above) . The C++ CLAM library and tools support this >and much else besides: >
I understand that a guitar has a dynamic spectrum - but if I perform an FFT (or SFT) on a basic sample of a guitar playing C, will it not somehow resemble the FFT I will later run on a file that plays the same guitar C? So, I understand it is not perfect - but will it be approximately the best match out of, let's say, 2 other instruments and their C "fingerprints"? That is all I need.
Shimon M wrote:



> I understand that a guitar has a dynamic spectrum - but if I preform FFT > )or SFT) on a basic sample of a guitar playing C, will it not somehow > resemble to the FFT I will later use on a file that plays the same guitar > C? So, I understand it is not perfect - but will it approximatley be most > matching out of, let's say, 2 other instruments and their C "fingerprints"? > That is all I need.
What kind of guitar? Acoustic will differ markedly from electric, especially with the kinds of amplifier treatment common in rock. A sustained tone won't resemble the plucked sound at all. You don't need an FFT to know that. Your ears can tell you. Listen before forming a theory.

Jerry
--
Engineering is the art of making what you want from things you can get.
Shimon M wrote:
> Thank you for the resources - Also, > The algorithm we had to implement for the project appears in page 52 > (psudeo code at page 54) > http://www.math.ias.edu/~akavia/AkaviaPhDThesis.pdf > > If you could just glance at it and tell me if it is good. It is called a > SFT (Short time fourier transfrom). >
That is usually abbreviated to STFT (and it is very relevant to your project, e.g. to make spectrograms/sonograms). But that isn't what the paper describes. They call it a "Significant Fourier Transform", in effect a faster, data-reduced version of the FFT in the context of data encryption. Now, there may be an advanced research project in there, to see if their "SFT" has any special application for audio (I have no idea), but that probably needs rather more than two weeks. Just use the "bog-standard" FFT.

If (from what you say) you already have a working implementation of their SFT, then you will have to demonstrate that it meets the requirements of the project, and justify that choice other than on the basis that it was what you had lying around at the time.
..
> I understand that a guitar has a dynamic spectrum - but if I preform FFT > )or SFT) on a basic sample of a guitar playing C, will it not somehow > resemble to the FFT I will later use on a file that plays the same guitar > C? So, I understand it is not perfect - but will it approximatley be most > matching out of, let's say, 2 other instruments and their C "fingerprints"? > That is all I need.
Instrument notes can be quite long (and electric guitar ones especially so!). Are you proposing to analyse a brief moment from the whole (50 msecs?), or take a single large FFT of the whole file?

There are standard techniques for describing static spectra - such as the spectral centroid (q.v.), a weighted average of the frequencies (where the amplitudes are the weights) - which gives a general sense of the brightness or otherwise of a sound. In essence: if you nailed the spectrum loosely to a wall at its midpoint, would it topple to the left or to the right, or just balance exactly?

If your sounds are sufficiently timbrally distinct, you should be able to, well, distinguish them. But that is not really much of a project. You are testing a technique dependent on having strongly dissimilar sources, to demonstrate that you can distinguish dissimilar sources! You are looking at a basic level of timbre classification. Given the constraints above it will probably work. Not sure how many brownie points you could reasonably get for it though.

What would be ~interesting~ would be to be able to tell what guitar string was being used to generate a given note "C" (or whichever). Middle C is a relatively high note for a guitar - by my reckoning at least five strings can be used to play it - they will be similar but different. Assuming all your FFTs are of the same length, you could extract the (static) spectral envelope (as used to identify vocal formants, for example), and simply find their differences, and then determine how useful that is in distinguishing one from another.

Music researchers use the idea of a "timbre space" (usually illustrated on 3 axes, but I think higher dimensions are used too; MPEG-7 defines 17 primary sound descriptors); if you can classify your test sounds in terms of such a timbre space, then you have the basis for a useful project.

Richard Dobson
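The spectral centroid mentioned above is a one-liner once magnitudes are available. A sketch, assuming `mags` holds one frame's FFT bin magnitudes up to Nyquist and binHz is sampleRate / fftLength:

    // Magnitude-weighted average of bin frequencies: a rough measure of
    // spectral "brightness". Returns 0 for a silent frame.
    static double spectralCentroid(double[] mags, double binHz) {
        double weighted = 0.0, total = 0.0;
        for (int k = 0; k < mags.length; k++) {
            weighted += k * binHz * mags[k];
            total += mags[k];
        }
        return total > 0.0 ? weighted / total : 0.0;
    }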
> Richard Dobson wrote: > Once you have an FFT working (use a good existing library for that!), > you might consider creating a sonogram display (spectrum/time). Try the
> one in Audacity (www.audacity.org) to see what this is. The different > characters of guitar v piano v flute etc will show up very well. And it
> will help you empirically understand about the relationship between FFT
> size, sample rate, and freqency resolution. And the use of Windowing.
> Also look at Csound (www.csounds.com) - which has tools for partial > tracking and all manner of spectral analysis, including basic FFTs.
I read your "FFT in a day" and also tried the above links - but I was still baffled. I tried Audacity's program on a wave file produced like this: a recurring C4 (261.626 Hz), and then tried Analyze > Plot spectrum and got: 255.706787 -54.295128 258.398438 -48.651611 261.090088 -40.968891 (no apparent importance) 263.781738 -39.088898 266.473389 -44.641678 269.165039 -54.939251 but then - I tried changing the axis to "log frequency" and got: apperant peaks at 130 and 261 - so I am stasified. So, I conclude the problem is this "log axis" - I shall now research, and try to incorporate it with the SFT or FFT, implement it myself and proceed with my other tasks in the project. Thank you all. You've been a great help.
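For comparing values like those above, it helps to remember that Plot Spectrum shows decibels, where less negative means more energy. A rough sketch of the conversion from linear FFT magnitudes (the reference level and normalization here are arbitrary, so the absolute numbers will not match Audacity's exactly):

    // Convert linear magnitudes to decibels relative to a chosen reference.
    // Differences of a few dB (e.g. -39.1 vs -41.0) are small; real peaks
    // stand out by tens of dB.
    static double[] toDecibels(double[] mags, double reference) {
        double[] db = new double[mags.length];
        for (int k = 0; k < mags.length; k++) {
            double m = Math.max(mags[k], 1e-12);     // avoid log10(0)
            db[k] = 20.0 * Math.log10(m / reference);
        }
        return db;
    }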