Reply by July 22, 2012
On Sunday, July 22, 2012 8:52:20 AM UTC+12, niarn wrote:
> Hi rbj,
>
> Sorry about the delay. Don't know if you're still interested in this. I
> have found a reference (a book) that describes exactly the procedure I had
> planned to write down. The reference is very easy to follow and I have used
> the book myself back when I was a student. The book is "Neural Networks for
> Pattern Recognition" by Christopher M. Bishop. Chapter 6 is called "Error
> functions" and it starts out by giving an easy to follow outline of the
> principle of Maximum Likelihood (ML). 4 pages into chapter 6 on page 194,
> second paragraph, I quote
>
> "We have derived the sum-of-squares error function from the principle of
> maximum likelihood on the assumption of Gaussian distributed target data.
> Of course the use of a sum-of-squares error does not require the target
> data to have a Gaussian distribution. Later in this chapter we shall
> consider the least-squares solution for an example problem with a strongly
> non-Gaussian distribution. However, as we shall see, if we use a
> sum-of-squares error, then the results we obtain cannot distinguish between
> the true distribution and any other distribution having the same mean and
> variance."
>
> Maybe that chapter is of interest to you.
If you start from maximum likelihood and use a Laplace distribution, you
won't get ordinary LMS; you'll get sign() LMS.  Of course, you could argue
that this derivation isn't LMS in the first place.
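A minimal numeric sketch of that claim (the coefficients, step size, and signal here are my own toy choices, not from the thread): gradient descent on the sum of absolute errors, which is what maximum likelihood under a Laplace error distribution produces, yields the sign-error update h <- h + mu*sign(e[n])*x[n-i].

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sketch (my own setup): gradient descent on sum |e[n]| -- the
# maximum-likelihood criterion under a Laplace error distribution --
# gives the sign-error update  h <- h + mu * sign(e[n]) * x[n-i].
g = np.array([0.5, -0.3, 0.2, 0.1])   # hidden target coefficients
N = len(g)
h = np.zeros(N)
mu = 0.005

x = rng.standard_normal(20000)
for n in range(N, len(x)):
    xv = x[n:n-N:-1]                  # x[n], x[n-1], ..., x[n-N+1]
    e = g @ xv - h @ xv               # e[n] = d[n] - y[n]
    h += mu * np.sign(e) * xv         # only the sign of e[n] is used

print(np.round(h, 2))                 # h should land close to g
```

With a Gaussian assumption, the same derivation would give the usual mu*e[n]*x[n-i] update instead.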
Reply by niarn July 21, 2012
Hi rbj,

Sorry about the delay. Don't know if you're still interested in this. I
have found a reference (a book) that describes exactly the procedure I had
planned to write down. The reference is very easy to follow and I have used
the book myself back when I was a student. The book is "Neural Networks for
Pattern Recognition" by Christopher M. Bishop. Chapter 6 is called "Error
functions" and it starts out by giving an easy to follow outline of the
principle of Maximum Likelihood (ML). 4 pages into chapter 6 on page 194,
second paragraph, I quote 

"We have derived the sum-of-squares error function from the principle of
maximum likelihood on the assumption of Gaussian distributed target data.
Of course the use of a sum-of-squares error does not require the target
data to have a Gaussian distribution. Later in this chapter we shall
consider the least-squares solution for an example problem with a strongly
non-Gaussian distribution. However, as we shall see, if we use a
sum-of-squares error, then the results we obtain cannot distinguish between
the true distribution and any other distribution having the same mean and
variance." 


Maybe that chapter is of interest to you.
Reply by niarn July 14, 2012
> i think so.  i just presented a formulation, so i might ask you the
> same, do you understand and accept it as how i presented?
Yes, I do. I first learned about LMS in a book by S. Haykin, where LMS is
couched, if not exactly, then in a way very similar to how it is couched
here. The motivation for this other formulation is that it paves the way
for another level of statistical treatment.
>> Typically, if I remember correct this is sometimes written in the
>> following way, e[n] ~ N(0,sigma)
>>
>> Then we have: d[n] ~ N(y[n],sigma)
>
> that does not directly follow.  what if all of your h[n,i] are zero
> except one (say h[n,0]) and x[n] is not normal?
x[n] is given to us; it is not considered a random variable in this model.
The coefficients h[n,m] are unknown but deterministic. They CAN be
considered as random variables. In fact, this is what they do in the code
I pasted previously. There the coefficients were assumed to evolve
according to first-order Markov processes, but that takes us into a
Bayesian setup. It is important to note that x[n] are given to us and that
we regard h[n,m] as being deterministic.
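As a sketch of what that first-order Markov assumption on the taps might look like (the transition coefficient a and noise variance q are my own symbols; the previously pasted code is not reproduced here):

```latex
h[n+1,m] \;=\; a\, h[n,m] \;+\; w[n,m],
\qquad w[n,m] \sim \mathcal{N}(0,\, q)
```

with a = 1 giving a pure random walk; estimating the taps under this prior leads to a Kalman-filter-style Bayesian update rather than plain LMS.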
>> Does this make any sense?
>
> sorta, it's stated in a manner that i think i know what you're saying
> mathematically, but the assumption of normal p.d.f. is not necessary,
> is it?

e[n] is a random variable, and we are free to choose any distribution for
it. If we have some prior knowledge about it, such as that it is heavy
tailed, we might choose a different (heavy-tailed) distribution for e[n].
But we take it to be normal for now. Unfortunately, I don't have internet
access for a week (such places do exist). Maybe Hardy can take over;
otherwise I hope we can take the next 2 or so steps a week from now.
Reply by robert bristow-johnson July 14, 2012
On 7/14/12 3:52 PM, niarn wrote:
>>> Using your notation can you then accept this formulation
>>> as a formulation of adaptive FIR filtering:
>>> "Given d[n] and x[n] and given some assumptions about e[n], then in some
>>> sense estimate/compute h[n,m]."
>
>> apart from the doubletalk issue, yes, the problem statement is to do
>> something about h[n,i] so that e[n] is reduced to a minimum in a
>> mean-square sense.  that is what the "LMS" is about.
>
> I assume then that you accept the formulation of the adaptive FIR
> filtering.
i think so.  i just presented a formulation, so i might ask you the same:
do you understand and accept it as how i presented?  you see the
difference between e[n] and v[n]?  i am still a little more careful to
present d[n] as a real signal that has its own independent origin, and
that origin has *no* e[n].  *you* (or your LMS filter) in practice define
e[n] in terms of y[n] and d[n].  d[n] is supplied as an input.
> The next step is then to write the expression for the PDF of d[n], written
> as p(d[n];h[n,m]). Note that this is not a conditional PDF, the semi colon
> is there to note/express that the PDF of d[n] is parameterized by h[n,m].
> In our model
>
>          N-1
>   d[n] = SUM{ h[n,i] * x[n-i] } + e[n] = y[n] + e[n]
>          i=0
> we now take e[n] to be normally distributed with zero mean and variance
> sigma.
in general, least-square fitting *does* make an assumption that the error is zero mean and finite variance, but there is *no* assumption of normal or gaussian p.d.f.
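That point can be checked numerically. A small sketch (my own toy setup, not from the thread): an ordinary least-squares fit recovers FIR taps even when the error is uniform rather than Gaussian, since only zero mean and finite variance matter.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy check (my own setup): fit 3 FIR taps by ordinary least squares
# when the error e[n] is uniform -- zero mean, finite variance, but
# decidedly non-Gaussian -- and the taps are still recovered.
g = np.array([1.0, -0.5, 0.25])       # hidden "true" coefficients
x = rng.standard_normal(5000)
# rows of X are [x[n], x[n-1], x[n-2]]; drop the first rows that
# np.roll wrapped around
X = np.column_stack([np.roll(x, i) for i in range(3)])[3:]
e = rng.uniform(-0.5, 0.5, size=X.shape[0])
d = X @ g + e                         # d[n] = SUM g[i] x[n-i] + e[n]

h_hat, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.round(h_hat, 2))             # h_hat should be close to g
```

The Gaussian assumption only enters if one wants to *derive* the squared-error criterion from maximum likelihood, as in the Bishop passage quoted earlier in the thread.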
> Typically, if I remember correct this is sometimes written in the
> following way, e[n] ~ N(0,sigma)
>
> Then we have: d[n] ~ N(y[n],sigma)
that does not directly follow. what if all of your h[n,i] are zero except one (say h[n,0]) and x[n] is not normal?
> or if we want to write the detailed expression
>
>   p(d[n];h[n,m]) = sqrt(2 pi sigma)^{-1} exp( -0.5 (d[n] - y[n])^2 / sigma )
>
> Does this make any sense?
sorta, it's stated in a manner that i think i know what you're saying
mathematically, but the assumption of normal p.d.f. is not necessary, is
it?

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 14, 2012
>> Using your notation can you then accept this formulation
>> as a formulation of adaptive FIR filtering:
>> "Given d[n] and x[n] and given some assumptions about e[n], then in some
>> sense estimate/compute h[n,m]."
>
> apart from the doubletalk issue, yes, the problem statement is to do
> something about h[n,i] so that e[n] is reduced to a minimum in a
> mean-square sense.  that is what the "LMS" is about.
I assume then that you accept the formulation of the adaptive FIR
filtering.

The next step is then to write the expression for the PDF of d[n],
written as p(d[n];h[n,m]). Note that this is not a conditional PDF; the
semicolon is there to note/express that the PDF of d[n] is parameterized
by h[n,m]. In our model

         N-1
  d[n] = SUM{ h[n,i] * x[n-i] } + e[n] = y[n] + e[n]
         i=0

we now take e[n] to be normally distributed with zero mean and variance
sigma. Typically, if I remember correctly, this is sometimes written in
the following way, e[n] ~ N(0,sigma)

Then we have: d[n] ~ N(y[n],sigma)

or if we want to write the detailed expression

  p(d[n];h[n,m]) = sqrt(2 pi sigma)^{-1} exp( -0.5 (d[n] - y[n])^2 / sigma )

Does this make any sense?
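The step this sets up can be made explicit (my notation: M independent samples, sigma denoting the variance as above). Summing the log of the per-sample PDF gives

```latex
\log p(d; h)
  \;=\; \sum_{n=0}^{M-1} \log p(d[n]; h[n,m])
  \;=\; -\frac{M}{2}\,\log(2\pi\sigma)
        \;-\; \frac{1}{2\sigma}\sum_{n=0}^{M-1} \bigl(d[n] - y[n]\bigr)^2
```

so maximizing the likelihood over the coefficients is exactly minimizing the sum-of-squares error; sigma does not affect the argmax.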
Reply by robert bristow-johnson July 13, 2012
i made a mistake about the Normalized LMS ...

On 7/13/12 2:22 AM, robert bristow-johnson wrote:
...
> h[n+1,i] = h[n,i] - mu/2 * d(e[n]^2)/d(h[n,i])
>
>          = h[n,i] + mu * ( d[n] - y[n] )*( x[n-i] )
>
>          = h[n,i] + mu*e[n]*x[n-i]
>
> this moves down the slope of the e[n]^2 vs. h[n,i] (w.r.t. various i).
> the speed that it moves down that slope is proportional to mu, which is
> called the "adaptation gain".  for the normalized LMS, you want to scale
> mu to be proportional to 1/(e[n]*x[n-i]) which is sorta proportional to
> 1/(x[n]^2).  so normalized LMS looks like
>
> h[n+1,i] = h[n,i] + (mu/(x[n])^2)*e[n]*x[n-i]
this should say:

for the normalized LMS, you want to scale mu to be proportional to
1/mean{ (x[n])^2 } where

                     N-1
  mean{ (x[n])^2 } = SUM{ (x[n-i])^2 * w[i] }
                     i=0

for some window function w[i].  so

  h[n+1,i] = h[n,i] + mu/mean{(x[n])^2} * e[n]*x[n-i]

there is a sorta adaptation trick to track 1/mean{ (x[n])^2 } without
doing division, if this were in a DSP where division is hard or
expensive.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
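A runnable sketch of that normalized update (all parameter choices are mine; the power estimate uses a one-pole average as the window w[i], and eps guards the division):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy setup: adapt a 4-tap FIR toward hidden coefficients g using the
# normalized LMS update  h[n+1,i] = h[n,i] + mu/mean{x^2} * e[n]*x[n-i]
g = np.array([0.4, 0.25, -0.1, 0.05])
N = len(g)
h = np.zeros(N)
mu, eps = 0.5, 1e-8
pwr = 1.0            # running mean{(x[n])^2}; seeded at the input power

x = rng.standard_normal(5000)
for n in range(N, len(x)):
    xv = x[n:n-N:-1]                      # x[n], x[n-1], ..., x[n-N+1]
    e = g @ xv - h @ xv                   # e[n] = d[n] - y[n], no noise
    pwr = 0.99 * pwr + 0.01 * x[n] ** 2   # one-pole "window" on x^2
    h += (mu / (N * pwr + eps)) * e * xv  # normalized LMS step

print(np.round(h, 3))                     # h should sit close to g
```

The normalization makes the effective step size independent of the input level, which is the point of NLMS over plain LMS.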
Reply by robert bristow-johnson July 13, 2012
On 7/13/12 3:59 PM, niarn wrote:
>> niarn, it helps when doing "ASCII math", to set your font to a
>> mono-spaced font, so you can line things up (like the limits to the
>> summation) better for everyone to see.  otherwise i have to guess at
>> which font you're seeing things with.
>
> I'm not sure I can control the font, I'm using a simple text edit control
> on the dsprelated website. But I will do what I can.
>
>>> This is the 'FIR model'
>>>
>>>         N-1
>>>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>>>         i=0
>>
>> that is *not* the FIR model.  there is no e[n] in the FIR model.  but
>> there *is* a "desired signal", d[n], that is based on the same x[n]
>> going into the FIR and that we're trying to match the FIR to.  very
>> similar to "system identification".
>
> There must be an e[n] in there. That is the error that accounts for the
> model mismatch. Then let's call it observation model or signal model. I
> think we mean exactly the same it is just a matter of aligning what we
> call things. Using your equations, you write
>
>> e[n] = d[n] - y[n]
>
> Let us rearrange to
>
>   d[n] = y[n] + e[n]
>
> Then substitute your expression for y[n]
>
>          N-1
>   d[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>          i=0
>
> This is exactly what I before referred to as the 'FIR model' but now with
> different symbols.
right, i had y[n] for the actual FIR output.  i just wanted to show that
the so-called "desired" signal, d[n], has two very similar expressions,
both:

         +inf
  d[n] =  SUM{ g[i] * x[n-i] }  +  v[n]
          i=0

(notice i changed something...) and

  d[n] = y[n] + e[n]

         N-1
       = SUM{ h[n,i] * x[n-i] }  +  e[n]
         i=0

forgive me for referring to the speaker feedback canceling app, but it's
handy for illustration.  the first d[n] is what arrives at the microphone
of the speakerphone; it is the remote signal x[n] (the other party)
getting processed by the loudspeaker and the acoustic transfer function
(the room and reflections) from loudspeaker to microphone.  v[n] is the
present speaker speaking.  the g[i] coefs are due to the physical
situation, you have no control of them, but you might like knowing what
they are so you can subtract the feedback signal and have only v[n] left.
*but* from the POV of estimating that remote signal and transfer
function, v[n] is considered "interference".

now, let's say that you need not worry about g[i] for i >= N.  if
h[n,i] = g[i], then this error signal e[n] that you compute is identical
to v[n].  we're done and we need to do no more adapting and
h[n+1,i] = h[n,i] and the system is static.  but if the coefficients that
you *can* adjust, h[n,i], do not match the target coefs, g[i], then the
error signal e[n] has two components:

         N-1
  e[n] = SUM{ (g[i] - h[n,i]) * x[n-i] }  +  v[n]
         i=0

it has a component that the adaptation algorithm should force to zero and
it has a component that is completely independent.  "doubletalk" is the
problem with this e[n] when *both* x[n] and v[n] are not zero (over a
period of time).  when both signals are hot, then trying to reduce e[n]
to zero might not be what you want.  you really just want that first term
going to zero in some sense.
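The two-component decomposition above is an exact algebraic identity, which a few lines can confirm (all numbers here are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# numeric check (my own toy numbers) of the decomposition above:
# with d[n] = SUM g[i] x[n-i] + v[n] and y[n] = SUM h[i] x[n-i],
# the computed error e[n] = d[n] - y[n] equals
# SUM (g[i] - h[i]) x[n-i] + v[n], term for term.
N = 3
g = np.array([0.7, -0.2, 0.1])        # "true" acoustic path
h = np.array([0.5,  0.0, 0.1])        # mismatched adaptive coefs
x = rng.standard_normal(100)          # remote signal
v = rng.standard_normal(100)          # near-end talker ("interference")

for n in range(N, 100):
    xv = x[n:n-N:-1]                  # x[n], x[n-1], x[n-2]
    e = (g @ xv + v[n]) - (h @ xv)    # e[n] = d[n] - y[n]
    assert abs(e - ((g - h) @ xv + v[n])) < 1e-12
```

Only the first term depends on the coefficient mismatch, which is why adaptation can drive it to zero while leaving v[n] untouched.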
> Using your notation can you then accept this formulation
> as a formulation of adaptive FIR filtering:
> "Given d[n] and x[n] and given some assumptions about e[n], then in some
> sense estimate/compute h[n,m]."
apart from the doubletalk issue, yes, the problem statement is to do
something about h[n,i] so that e[n] is reduced to a minimum in a
mean-square sense.  that is what the "LMS" is about.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 13, 2012
> niarn, it helps when doing "ASCII math", to set your font to a
> mono-spaced font, so you can line things up (like the limits to the
> summation) better for everyone to see.  otherwise i have to guess at
> which font you're seeing things with.
I'm not sure I can control the font, I'm using a simple text edit control on the dsprelated website. But I will do what I can.
>> This is the 'FIR model'
>>
>>         N-1
>>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>>         i=0
>
> that is *not* the FIR model.  there is no e[n] in the FIR model.  but
> there *is* a "desired signal", d[n], that is based on the same x[n]
> going into the FIR and that we're trying to match the FIR to.  very
> similar to "system identification".
There must be an e[n] in there. That is the error that accounts for the
model mismatch. Then let's call it an observation model or signal model.
I think we mean exactly the same; it is just a matter of aligning what we
call things. Using your equations, you write
> e[n] = d[n] - y[n]
Let us rearrange to

  d[n] = y[n] + e[n]

Then substitute your expression for y[n]

         N-1
  d[n] = SUM{ h[n,i] * x[n-i] } + e[n]
         i=0

This is exactly what I before referred to as the 'FIR model' but now with
different symbols. Using your notation, can you then accept this
formulation as a formulation of adaptive FIR filtering:

"Given d[n] and x[n] and given some assumptions about e[n], then in some
sense estimate/compute h[n,m]."
Reply by robert bristow-johnson July 13, 2012
niarn, it helps when doing "ASCII math", to set your font to a 
mono-spaced font, so you can line things up (like the limits to the 
summation) better for everyone to see.  otherwise i have to guess at 
which font you're seeing things with.

On 7/13/12 10:14 AM, niarn wrote:
>> now, Hardy, can you relate this to anything that you're talking about?
>
> Hi rbj, Let's do it in steps.
>
> This is the 'FIR model'
>
>         N-1
>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>         i=0
that is *not* the FIR model. there is no e[n] in the FIR model. but there *is* a "desired signal", d[n], that is based on the same x[n] going into the FIR and that we're trying to match the FIR to. very similar to "system identification".
> Do you accept this formulation as a formulation of adaptive FIR filtering:
> "Given y[n] and x[n] and given some assumptions about e[n], then in some
> sense estimate/compute h[n]."
a time-invariant FIR is, of course

         N-1
  y[n] = SUM{ h[i] * x[n-i] }
         i=0

this is LTI, and h[i] can be thought of as the unchanging FIR
coefficients, or h[n] can be thought of as the impulse response of the
FIR.  (we'll define "n" as discrete "time".)

a time-variant FIR would be

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] }
         i=0

the impulse response is not well-defined in this case (because it's
different for impulses applied at different times), but h[n,i] can be
thought of as the FIR coefficients at time n, or h[m,n] is the impulse
response of an impulse applied at time m.

we are modeling the desired signal, d[n], as being some *other* filter
(could be FIR but doesn't have to be) acting on the same input, x[n].
so, for shits and grins, let's say it *is* an FIR (same length, N):

         N-1
  d[n] = SUM{ g[i] * x[n-i] }
         i=0

but the coefficients, g[i], are hidden and considered fixed (or *very*
slowly varying), and we have to figger out what they are by looking at
the input, x[n], and output, d[n], and we have access to both signals.
this would be easy (non-blind deconvolution: use the DFT and iDFT to get
g[i]) but there is also interference, which i'll call v[n].  now this
interference *might* be exactly the signal you're looking for (it *is*
in speakerphone echo/feedback cancellation) and the above d[n] (the
feedback signal from speaker into microphone) is what you are trying to
get rid of.  so we add this interference signal:

         N-1
  d[n] = SUM{ g[i] * x[n-i] }  +  v[n]
         i=0

now, if

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] }
         i=0

and if

  e[n] = d[n] - y[n]

and we can compute this e[n] in real time because we compute y[n] and
d[n] is given, then *only* if h[n,i] has converged to g[i] will this
apparent error e[n] be the same as v[n].

now, going back to the speakerphone application, when there is no
"doubletalk", then v[n] should be close to zero, and the only reason
that e[n] is not zero is because h[n,i] is not (yet) the same as the
target coefficients g[i].
the idea of the simple LMS adaptive filter is to slowly get h[n,i] to converge to g[i] (for 0 <= i < N), and then when v[n] is not zero, *and* hopefully h[n,i] has already converged to near g[i], then e[n] = v[n]. a good LMS application will have some other system that will detect or estimate doubletalk and will reduce the adaptation gain (so h[n,i] remains constant with n) during doubletalk so that it doesn't screw up the adaptation.
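The whole scheme, doubletalk gating included, fits in a few lines; a toy sketch (the coefficients, step size, and the crude oracle "detector" are all my own choices, not a real doubletalk detector):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy sketch (my own numbers throughout): plain LMS converges toward
# the hidden g[i] while v[n] is silent; during a "doubletalk" burst
# the adaptation gain is set to zero so h[n,i] stays put and the
# error output e[n] is essentially v[n] itself.
g = np.array([0.6, -0.3, 0.15, -0.05])
N = len(g)
h = np.zeros(N)
mu = 0.02

x = rng.standard_normal(8000)                 # remote signal
v = np.zeros(8000)
v[6000:] = rng.standard_normal(2000)          # doubletalk starts here

for n in range(N, 8000):
    xv = x[n:n-N:-1]                          # x[n], ..., x[n-N+1]
    e = (g @ xv + v[n]) - h @ xv              # e[n] = d[n] - y[n]
    gain = 0.0 if n >= 6000 else mu           # oracle doubletalk gate
    h += gain * e * xv                        # gated LMS update

print(np.round(h, 3))                         # should sit close to g
```

A real canceller would of course have to *detect* doubletalk from the signals themselves (e.g. by comparing signal powers) rather than knowing when it starts.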
> I could also have written (but for simplicity only consider the above
> formulation for now):
> "Given y[n] and x[n] and given some assumptions about e[n] and h[n], then
> in some sense estimate/compute h[n]."
you also need a separate desired signal d[n].  you're not given y[n],
but you can figure it out from x[n] and the *current* FIR coefficients
h[n,i].  so you have inputs x[n] and d[n], and you're trying to get y[n]
to match d[n], and "then in some sense [you] estimate/compute h[i]" to
be whatever those coefs have to be to get y[n] to best match d[n].

a typical simple LMS filter has *two* inputs, x[n] and d[n], and two
outputs, y[n] and e[n], and often it's v[n] that we're really looking
for, which, when the adaptation settles down, is the same as e[n].  if
the application is system identification, then what we're looking for
are the coefficients, g[i], which should be converged upon by h[n,i].

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 13, 2012
>now, Hardy, can you relate this to anything that you're talking about?
Hi rbj, Let's do it in steps.

This is the 'FIR model'

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
         i=0

Do you accept this formulation as a formulation of adaptive FIR
filtering:

"Given y[n] and x[n] and given some assumptions about e[n], then in some
sense estimate/compute h[n]."

I could also have written (but for simplicity only consider the above
formulation for now):

"Given y[n] and x[n] and given some assumptions about e[n] and h[n], then
in some sense estimate/compute h[n]."