Hi DSP experts!
I have a question that I hope some of you can help me with. ;)

What I want to do is: given two audio clips, calculate a score for
how similar they sound.

I assume I have to apply the Fourier transform to the two clips and
somehow analyze the resulting frames (for example by comparing peaks)
to see how similar they are.

How should I do this?

I will be eternally grateful for any pointers: ideas, explanations,
references to literature, websites, etc.

Let me just note that I'm very inexperienced with digital signal
processing - I don't know much about it or about DSP terminology. All
I've had is a CS course, "Introduction to Digital Audio", where I
wrote an FFT algorithm, some filters and a time-stretcher.

-----------------------------------------------------

Now that I've asked the question, maybe I should briefly explain what
I'm going to use it for, to give you some idea of what I'm after. This
may be boring to you - in that case, just skip the rest of this
post :).

I'm trying to make a program that takes a normal audio clip as input
(a wav file) and then approximates the input sound with simple
waveforms (triangle, sawtooth, pulse, noise). Why? So that the
approximation of the input sound can be played on an old computer,
which can play 3 voices of these simple waveforms but is incapable of
playing digitized sounds.

I use the FFT with windowing, and thus only approximate a small part
of the input sound with the 3 waveforms at a time (not the entire
sound - it would of course be impossible to approximate anything but
the simplest input sound with 3 simple waveforms if the 3 waveforms
didn't vary over time).
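
To make the windowing step concrete, here is a minimal Python sketch of
what I mean. It is not my actual code: the names fft, hann and
frame_spectra, the Hann window, and the frame size and hop are all
assumptions chosen for illustration.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def frame_spectra(samples, size=1024, hop=512):
    """Yield the magnitude spectrum (bins 0..size/2) of each windowed burst."""
    window = hann(size)
    for start in range(0, len(samples) - size + 1, hop):
        burst = samples[start:start + size]
        spectrum = fft([s * w for s, w in zip(burst, window)])
        yield [abs(c) for c in spectrum[:size // 2 + 1]]
```

Each yielded list is one frame F in the sense used below. Any trailing
samples that do not fill a whole burst are simply dropped in this
sketch.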

There are some parameters of the 3 waveforms I can vary (frequency,
volume, etc.). For the frame F of each burst B of the input sound, I
run through all values of these parameters to find the set that best
approximates the input sound. For each set of parameters, I generate
the samples for the 3 waveforms and run the FFT on them to get the
frame F'. So now I have two frames, F and F' (one for the input sound
and one for the generated sound). What I want to do is compare these
two frames and get a score for how similar they are, so that I can
find the set of parameters that best approximates the burst B.
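
The search described above could be sketched roughly like this in
Python. This is a simplified illustration, not my real code: it handles
a single voice only, leaves out the noise waveform, uses a naive DFT in
place of the FFT to stay short, and the log-spectral distance is just
one candidate score that I am assuming here. All function names
(render, spectrum, distance, best_match) are hypothetical.

```python
import cmath
import itertools
import math

def render(shape, freq, volume, size, rate):
    """Generate `size` samples of one idealized voice."""
    out = []
    for t in range(size):
        p = (freq * t / rate) % 1.0          # phase in [0, 1)
        if shape == "sawtooth":
            s = 2.0 * p - 1.0
        elif shape == "triangle":
            s = 4.0 * abs(p - 0.5) - 1.0
        elif shape == "pulse":
            s = 1.0 if p < 0.5 else -1.0     # 50% duty cycle is an assumption
        else:
            raise ValueError(shape)
        out.append(volume * s)
    return out

def spectrum(samples):
    """Hann-windowed magnitude spectrum via a naive DFT (stand-in for an FFT)."""
    n = len(samples)
    x = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
         for i, s in enumerate(samples)]
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def distance(mags_a, mags_b):
    """RMS difference of log-compressed magnitudes; 0 means identical."""
    diffs = [(math.log1p(a) - math.log1p(b)) ** 2
             for a, b in zip(mags_a, mags_b)]
    return math.sqrt(sum(diffs) / len(diffs))

def best_match(target_mags, size, rate, freqs, volumes,
               shapes=("triangle", "sawtooth", "pulse")):
    """Try every (shape, freq, volume) and keep the closest spectrum."""
    best = None
    for shape, freq, vol in itertools.product(shapes, freqs, volumes):
        d = distance(target_mags, spectrum(render(shape, freq, vol, size, rate)))
        if best is None or d < best[3]:
            best = (shape, freq, vol, d)
    return best
```

Here target_mags plays the role of F and each spectrum(render(...)) is
one F'. With 3 voices the number of combinations grows as the cube of
the per-voice count, so an exhaustive search like this gets expensive
quickly.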

I have made a simple comparator to compare the two frames F and F',
just to test that the rest of the code works. It simply returns a
score for how well the peaks in F match the peaks in F', and ignores
everything but the peaks. (The way it compares the peaks is also a
bit too naive and simple.)
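
For reference, a peak comparator of this general kind might look like
the Python sketch below. It is an illustration rather than my exact
code: the names find_peaks and peak_score, the relative threshold
`floor`, and the bin `tolerance` are assumptions.

```python
def find_peaks(mags, floor=0.1):
    """Return (bin, magnitude) for local maxima of at least
    `floor` times the largest magnitude in the frame."""
    if not mags:
        return []
    limit = floor * max(mags)
    return [(k, mags[k]) for k in range(1, len(mags) - 1)
            if mags[k] > mags[k - 1]
            and mags[k] >= mags[k + 1]
            and mags[k] >= limit]

def peak_score(mags_a, mags_b, tolerance=1):
    """Score in [0, 1]: magnitude-weighted share of the peaks in A
    that have some peak in B within `tolerance` bins."""
    peaks_a = find_peaks(mags_a)
    bins_b = [k for k, _ in find_peaks(mags_b)]
    total = sum(m for _, m in peaks_a)
    if total == 0:
        return 0.0
    matched = sum(m for k, m in peaks_a
                  if any(abs(k - kb) <= tolerance for kb in bins_b))
    return matched / total
```

Note that this score is not symmetric (extra peaks in the second frame
are not penalized) and it ignores how loud the matching peak is, which
are two of the ways in which such a comparison is naive.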

This simple method works to some extent (it can often follow tones),
but as I said, it's just naive, sloppy work to see whether the rest
works. Before I put too much work into improving it, it might be best
to find out what other people have done. Is comparing peaks the best
approach to comparing two frames? If so, could you point me to some
literature (websites, etc.) about it? If not, how should I compare
the two frames instead?