Reply by July 22, 2012
On Sunday, July 22, 2012 8:52:20 AM UTC+12, niarn wrote:
> Hi rbj,
>
> Sorry about the delay. Don't know if you're still interested in this. I
> have found a reference (a book) that describes exactly the procedure I had
> planned to write down. The reference is very easy to follow and I have used
> the book myself back when I was a student. The book is "Neural Networks for
> Pattern Recognition" by Christopher M. Bishop. Chapter 6 is called "Error
> functions" and it starts out by giving an easy to follow outline of the
> principle of Maximum Likelihood (ML). 4 pages into chapter 6 on page 194,
> second paragraph, I quote
>
> "We have derived the sum-of-squares error function from the principle of
> maximum likelihood on the assumption of Gaussian distributed target data.
> Of course the use of a sum-of-squares error does not require the target
> data to have a Gaussian distribution. Later in this chapter we shall
> consider the least-squares solution for an example problem with a strongly
> non-Gaussian distribution. However, as we shall see, if we use a
> sum-of-squares error, then the results we obtain cannot distinguish between
> the true distribution and any other distribution having the same mean and
> variance."
>
> Maybe that chapter is of interest to you.
If you start from maximum likelihood and use a Laplace distribution, you
won't get ordinary LMS; you'll get sign() LMS.  Of course, you could argue
that this derivation isn't LMS in the first place.
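A minimal numeric sketch of that claim (the coefficients, step size, and signal here are my own toy choices, not from the thread): gradient descent on the sum of absolute errors, which is what maximum likelihood under a Laplace error distribution produces, yields the sign-error update h <- h + mu*sign(e[n])*x[n-i].

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sketch (my own setup): gradient descent on sum |e[n]| -- the
# maximum-likelihood criterion under a Laplace error distribution --
# gives the sign-error update  h <- h + mu * sign(e[n]) * x[n-i].
g = np.array([0.5, -0.3, 0.2, 0.1])   # hidden target coefficients
N = len(g)
h = np.zeros(N)
mu = 0.005

x = rng.standard_normal(20000)
for n in range(N, len(x)):
    xv = x[n:n-N:-1]                  # x[n], x[n-1], ..., x[n-N+1]
    e = g @ xv - h @ xv               # e[n] = d[n] - y[n]
    h += mu * np.sign(e) * xv         # only the sign of e[n] is used

print(np.round(h, 2))                 # h should land close to g
```

With a Gaussian assumption, the same derivation would give the usual mu*e[n]*x[n-i] update instead.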
Reply by niarn July 21, 2012
Hi rbj,

Sorry about the delay. Don't know if you're still interested in this. I
have found a reference (a book) that describes exactly the procedure I had
planned to write down. The reference is very easy to follow and I have used
the book myself back when I was a student. The book is "Neural Networks for
Pattern Recognition" by Christopher M. Bishop. Chapter 6 is called "Error
functions" and it starts out by giving an easy to follow outline of the
principle of Maximum Likelihood (ML). 4 pages into chapter 6 on page 194,
second paragraph, I quote 

"We have derived the sum-of-squares error function from the principle of
maximum likelihood on the assumption of Gaussian distributed target data.
Of course the use of a sum-of-squares error does not require the target
data to have a Gaussian distribution. Later in this chapter we shall
consider the least-squares solution for an example problem with a strongly
non-Gaussian distribution. However, as we shall see, if we use a
sum-of-squares error, then the results we obtain cannot distinguish between
the true distribution and any other distribution having the same mean and
variance." 


Maybe that chapter is of interest to you.
Reply by niarn July 14, 2012
> i think so.  i just presented a formulation, so i might ask you the
> same, do you understand and accept it as how i presented?
Yes, I do. I first learned about LMS in a book by S. Haykin, where LMS is
couched, if not exactly, then in a way very similar to how it is couched
here. The motivation for this other formulation is that it paves the way
for another level of statistical treatment.
>> Typically, if I remember correct this is sometimes written in the
>> following way, e[n] ~ N(0,sigma)
>>
>> Then we have: d[n] ~ N(y[n],sigma)
>
> that does not directly follow.  what if all of your h[n,i] are zero
> except one (say h[n,0]) and x[n] is not normal?
x[n] is given to us; it is not considered a random variable in this model.
The coefficients h[n,m] are unknown but deterministic. They CAN be
considered as random variables. In fact, this is what they do in the code
I pasted previously. There the coefficients were assumed to evolve
according to first-order Markov processes, but that takes us into a
Bayesian setup. It is important to note that x[n] are given to us and that
we regard h[n,m] as being deterministic.
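As a sketch of what that first-order Markov assumption on the taps might look like (the transition coefficient a and noise variance q are my own symbols; the previously pasted code is not reproduced here):

```latex
h[n+1,m] \;=\; a\, h[n,m] \;+\; w[n,m],
\qquad w[n,m] \sim \mathcal{N}(0,\, q)
```

with a = 1 giving a pure random walk; estimating the taps under this prior leads to a Kalman-filter-style Bayesian update rather than plain LMS.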
>> Does this make any sense?
>
> sorta, it's stated in a manner that i think i know what you're saying
> mathematically, but the assumption of normal p.d.f. is not necessary,
> is it?

e[n] is a random variable, and we are free to choose any distribution for
it. If we have some prior knowledge about it, such as that it is heavy
tailed, we might choose a different (heavy-tailed) distribution for e[n].
But we take it to be normal for now. Unfortunately, I don't have internet
access for a week (such places do exist). Maybe Hardy can take over;
otherwise I hope we can take the next 2 or so steps a week from now.
Reply by robert bristow-johnson July 14, 2012
On 7/14/12 3:52 PM, niarn wrote:
>>> Using your notation can you then accept this formulation
>>> as a formulation of adaptive FIR filtering:
>>> "Given d[n] and x[n] and given some assumptions about e[n], then in some
>>> sense estimate/compute h[n,m]."
>
>> apart from the doubletalk issue, yes, the problem statement is to do
>> something about h[n,i] so that e[n] is reduced to a minimum in a
>> mean-square sense.  that is what the "LMS" is about.
>
> I assume then that you accept the formulation of the adaptive FIR
> filtering.
i think so.  i just presented a formulation, so i might ask you the same:
do you understand and accept it as how i presented?  you see the
difference between e[n] and v[n]?  i am still a little more careful to
present d[n] as a real signal that has its own independent origin, and
that origin has *no* e[n].  *you* (or your LMS filter) in practice define
e[n] in terms of y[n] and d[n].  d[n] is supplied as an input.
> The next step is then to write the expression for the PDF of d[n], written
> as p(d[n];h[n,m]). Note that this is not a conditional PDF, the semi colon
> is there to note/express that the PDF of d[n] is parameterized by h[n,m].
> In our model
>
>          N-1
>   d[n] = SUM{ h[n,i] * x[n-i] } + e[n] = y[n] + e[n]
>          i=0
> we now take e[n] to be normally distributed with zero mean and variance
> sigma.
in general, least-square fitting *does* make an assumption that the error is zero mean and finite variance, but there is *no* assumption of normal or gaussian p.d.f.
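That point can be checked numerically. A small sketch (my own toy setup, not from the thread): an ordinary least-squares fit recovers FIR taps even when the error is uniform rather than Gaussian, since only zero mean and finite variance matter.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy check (my own setup): fit 3 FIR taps by ordinary least squares
# when the error e[n] is uniform -- zero mean, finite variance, but
# decidedly non-Gaussian -- and the taps are still recovered.
g = np.array([1.0, -0.5, 0.25])       # hidden "true" coefficients
x = rng.standard_normal(5000)
# rows of X are [x[n], x[n-1], x[n-2]]; drop the first rows that
# np.roll wrapped around
X = np.column_stack([np.roll(x, i) for i in range(3)])[3:]
e = rng.uniform(-0.5, 0.5, size=X.shape[0])
d = X @ g + e                         # d[n] = SUM g[i] x[n-i] + e[n]

h_hat, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.round(h_hat, 2))             # h_hat should be close to g
```

The Gaussian assumption only enters if one wants to *derive* the squared-error criterion from maximum likelihood, as in the Bishop passage quoted earlier in the thread.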
> Typically, if I remember correct this is sometimes written in the
> following way, e[n] ~ N(0,sigma)
>
> Then we have: d[n] ~ N(y[n],sigma)
that does not directly follow. what if all of your h[n,i] are zero except one (say h[n,0]) and x[n] is not normal?
> or if we want to write the detailed expression
>
>   p(d[n];h[n,m]) = sqrt(2 pi sigma)^{-1} exp( -0.5 (d[n] - y[n])^2 / sigma )
>
> Does this make any sense?
sorta, it's stated in a manner that i think i know what you're saying
mathematically, but the assumption of normal p.d.f. is not necessary, is
it?

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 14, 2012
>> Using your notation can you then accept this formulation
>> as a formulation of adaptive FIR filtering:
>> "Given d[n] and x[n] and given some assumptions about e[n], then in some
>> sense estimate/compute h[n,m]."
>
> apart from the doubletalk issue, yes, the problem statement is to do
> something about h[n,i] so that e[n] is reduced to a minimum in a
> mean-square sense.  that is what the "LMS" is about.
I assume then that you accept the formulation of the adaptive FIR
filtering.

The next step is then to write the expression for the PDF of d[n],
written as p(d[n];h[n,m]). Note that this is not a conditional PDF; the
semicolon is there to note/express that the PDF of d[n] is parameterized
by h[n,m]. In our model

         N-1
  d[n] = SUM{ h[n,i] * x[n-i] } + e[n] = y[n] + e[n]
         i=0

we now take e[n] to be normally distributed with zero mean and variance
sigma. Typically, if I remember correctly, this is sometimes written in
the following way, e[n] ~ N(0,sigma)

Then we have: d[n] ~ N(y[n],sigma)

or if we want to write the detailed expression

  p(d[n];h[n,m]) = sqrt(2 pi sigma)^{-1} exp( -0.5 (d[n] - y[n])^2 / sigma )

Does this make any sense?
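The step this sets up can be made explicit (my notation: M independent samples, sigma denoting the variance as above). Summing the log of the per-sample PDF gives

```latex
\log p(d; h)
  \;=\; \sum_{n=0}^{M-1} \log p(d[n]; h[n,m])
  \;=\; -\frac{M}{2}\,\log(2\pi\sigma)
        \;-\; \frac{1}{2\sigma}\sum_{n=0}^{M-1} \bigl(d[n] - y[n]\bigr)^2
```

so maximizing the likelihood over the coefficients is exactly minimizing the sum-of-squares error; sigma does not affect the argmax.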
Reply by robert bristow-johnson July 13, 2012
i made a mistake about the Normalized LMS ...

On 7/13/12 2:22 AM, robert bristow-johnson wrote:
...
> h[n+1,i] = h[n,i] - mu/2 * d(e[n]^2)/d(h[n,i])
>
>          = h[n,i] + mu * ( d[n] - y[n] )*( x[n-i] )
>
>          = h[n,i] + mu*e[n]*x[n-i]
>
> this moves down the slope of the e[n]^2 vs. h[n,i] (w.r.t. various i).
> the speed that it moves down that slope is proportional to mu, which is
> called the "adaptation gain".  for the normalized LMS, you want to scale
> mu to be proportional to 1/(e[n]*x[n-i]) which is sorta proportional to
> 1/(x[n]^2).  so normalized LMS looks like
>
> h[n+1,i] = h[n,i] + (mu/(x[n])^2)*e[n]*x[n-i]
this should say:

for the normalized LMS, you want to scale mu to be proportional to
1/mean{ (x[n])^2 } where

                     N-1
  mean{ (x[n])^2 } = SUM{ (x[n-i])^2 * w[i] }
                     i=0

for some window function w[i].  so

  h[n+1,i] = h[n,i] + mu/mean{(x[n])^2} * e[n]*x[n-i]

there is a sorta adaptation trick to track 1/mean{ (x[n])^2 } without
doing division, if this were in a DSP where division is hard or
expensive.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
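A runnable sketch of that normalized update (all parameter choices are mine; the power estimate uses a one-pole average as the window w[i], and eps guards the division):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy setup: adapt a 4-tap FIR toward hidden coefficients g using the
# normalized LMS update  h[n+1,i] = h[n,i] + mu/mean{x^2} * e[n]*x[n-i]
g = np.array([0.4, 0.25, -0.1, 0.05])
N = len(g)
h = np.zeros(N)
mu, eps = 0.5, 1e-8
pwr = 1.0            # running mean{(x[n])^2}; seeded at the input power

x = rng.standard_normal(5000)
for n in range(N, len(x)):
    xv = x[n:n-N:-1]                      # x[n], x[n-1], ..., x[n-N+1]
    e = g @ xv - h @ xv                   # e[n] = d[n] - y[n], no noise
    pwr = 0.99 * pwr + 0.01 * x[n] ** 2   # one-pole "window" on x^2
    h += (mu / (N * pwr + eps)) * e * xv  # normalized LMS step

print(np.round(h, 3))                     # h should sit close to g
```

The normalization makes the effective step size independent of the input level, which is the point of NLMS over plain LMS.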
Reply by robert bristow-johnson July 13, 2012
On 7/13/12 3:59 PM, niarn wrote:
>> niarn, it helps when doing "ASCII math", to set your font to a
>> mono-spaced font, so you can line things up (like the limits to the
>> summation) better for everyone to see.  otherwise i have to guess at
>> which font you're seeing things with.
>
> I'm not sure I can control the font, I'm using a simple text edit control
> on the dsprelated website. But I will do what I can.
>
>>> This is the 'FIR model'
>>>
>>>         N-1
>>>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>>>         i=0
>>
>> that is *not* the FIR model.  there is no e[n] in the FIR model.  but
>> there *is* a "desired signal", d[n], that is based on the same x[n]
>> going into the FIR and that we're trying to match the FIR to.  very
>> similar to "system identification".
>
> There must be an e[n] in there. That is the error that accounts for the
> model mismatch. Then let's call it observation model or signal model. I
> think we mean exactly the same it is just a matter of aligning what we
> call things. Using your equations, you write
>
>> e[n] = d[n] - y[n]
>
> Let us rearrange to
>
>   d[n] = y[n] + e[n]
>
> Then substitute your expression for y[n]
>
>          N-1
>   d[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>          i=0
>
> This is exactly what I before referred to as the 'FIR model' but now with
> different symbols.
right, i had y[n] for the actual FIR output.  i just wanted to show that
the so-called "desired" signal, d[n], has two very similar expressions,
both:

         +inf
  d[n] =  SUM{ g[i] * x[n-i] }  +  v[n]
          i=0

(notice i changed something...) and

  d[n] = y[n] + e[n]

         N-1
       = SUM{ h[n,i] * x[n-i] }  +  e[n]
         i=0

forgive me for referring to the speaker feedback canceling app, but it's
handy for illustration.  the first d[n] is what arrives at the microphone
of the speakerphone; it is the remote signal x[n] (the other party)
getting processed by the loudspeaker and the acoustic transfer function
(the room and reflections) from loudspeaker to microphone.  v[n] is the
present speaker speaking.  the g[i] coefs are due to the physical
situation, you have no control of them, but you might like knowing what
they are so you can subtract the feedback signal and have only v[n] left.
*but* from the POV of estimating that remote signal and transfer
function, v[n] is considered "interference".

now, let's say that you need not worry about g[i] for i >= N.  if
h[n,i] = g[i], then this error signal e[n] that you compute is identical
to v[n].  we're done and we need to do no more adapting and
h[n+1,i] = h[n,i] and the system is static.  but if the coefficients that
you *can* adjust, h[n,i], do not match the target coefs, g[i], then the
error signal e[n] has two components:

         N-1
  e[n] = SUM{ (g[i] - h[n,i]) * x[n-i] }  +  v[n]
         i=0

it has a component that the adaptation algorithm should force to zero and
it has a component that is completely independent.  "doubletalk" is the
problem with this e[n] when *both* x[n] and v[n] are not zero (over a
period of time).  when both signals are hot, then trying to reduce e[n]
to zero might not be what you want.  you really just want that first term
going to zero in some sense.
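The two-component decomposition above is an exact algebraic identity, which a few lines can confirm (all numbers here are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# numeric check (my own toy numbers) of the decomposition above:
# with d[n] = SUM g[i] x[n-i] + v[n] and y[n] = SUM h[i] x[n-i],
# the computed error e[n] = d[n] - y[n] equals
# SUM (g[i] - h[i]) x[n-i] + v[n], term for term.
N = 3
g = np.array([0.7, -0.2, 0.1])        # "true" acoustic path
h = np.array([0.5,  0.0, 0.1])        # mismatched adaptive coefs
x = rng.standard_normal(100)          # remote signal
v = rng.standard_normal(100)          # near-end talker ("interference")

for n in range(N, 100):
    xv = x[n:n-N:-1]                  # x[n], x[n-1], x[n-2]
    e = (g @ xv + v[n]) - (h @ xv)    # e[n] = d[n] - y[n]
    assert abs(e - ((g - h) @ xv + v[n])) < 1e-12
```

Only the first term depends on the coefficient mismatch, which is why adaptation can drive it to zero while leaving v[n] untouched.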
> Using your notation can you then accept this formulation
> as a formulation of adaptive FIR filtering:
> "Given d[n] and x[n] and given some assumptions about e[n], then in some
> sense estimate/compute h[n,m]."
apart from the doubletalk issue, yes, the problem statement is to do
something about h[n,i] so that e[n] is reduced to a minimum in a
mean-square sense.  that is what the "LMS" is about.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 13, 2012
> niarn, it helps when doing "ASCII math", to set your font to a
> mono-spaced font, so you can line things up (like the limits to the
> summation) better for everyone to see.  otherwise i have to guess at
> which font you're seeing things with.
I'm not sure I can control the font, I'm using a simple text edit control on the dsprelated website. But I will do what I can.
>> This is the 'FIR model'
>>
>>         N-1
>>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>>         i=0
>
> that is *not* the FIR model.  there is no e[n] in the FIR model.  but
> there *is* a "desired signal", d[n], that is based on the same x[n]
> going into the FIR and that we're trying to match the FIR to.  very
> similar to "system identification".
There must be an e[n] in there. That is the error that accounts for the
model mismatch. Then let's call it an observation model or signal model.
I think we mean exactly the same; it is just a matter of aligning what we
call things. Using your equations, you write
> e[n] = d[n] - y[n]
Let us rearrange to

  d[n] = y[n] + e[n]

Then substitute your expression for y[n]

         N-1
  d[n] = SUM{ h[n,i] * x[n-i] } + e[n]
         i=0

This is exactly what I before referred to as the 'FIR model' but now with
different symbols. Using your notation, can you then accept this
formulation as a formulation of adaptive FIR filtering:

"Given d[n] and x[n] and given some assumptions about e[n], then in some
sense estimate/compute h[n,m]."
Reply by robert bristow-johnson July 13, 2012
niarn, it helps when doing "ASCII math", to set your font to a 
mono-spaced font, so you can line things up (like the limits to the 
summation) better for everyone to see.  otherwise i have to guess at 
which font you're seeing things with.

On 7/13/12 10:14 AM, niarn wrote:
>> now, Hardy, can you relate this to anything that you're talking about?
>
> Hi rbj, Let's do it in steps.
>
> This is the 'FIR model'
>
>         N-1
>  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
>         i=0
that is *not* the FIR model. there is no e[n] in the FIR model. but there *is* a "desired signal", d[n], that is based on the same x[n] going into the FIR and that we're trying to match the FIR to. very similar to "system identification".
> Do you accept this formulation as a formulation of adaptive FIR filtering:
> "Given y[n] and x[n] and given some assumptions about e[n], then in some
> sense estimate/compute h[n]."
a time-invariant FIR is, of course

         N-1
  y[n] = SUM{ h[i] * x[n-i] }
         i=0

this is LTI, and h[i] can be thought of as the unchanging FIR
coefficients, or h[n] can be thought of as the impulse response of the
FIR.  (we'll define "n" as discrete "time".)

a time-variant FIR would be

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] }
         i=0

the impulse response is not well-defined in this case (because it's
different for impulses applied at different times), but h[n,i] can be
thought of as the FIR coefficients at time n, or h[m,n] is the impulse
response of an impulse applied at time m.

we are modeling the desired signal, d[n], as being some *other* filter
(could be FIR but doesn't have to be) acting on the same input, x[n].
so, for shits and grins, let's say it *is* an FIR (same length, N):

         N-1
  d[n] = SUM{ g[i] * x[n-i] }
         i=0

but the coefficients, g[i], are hidden and considered fixed (or *very*
slowly varying), and we have to figger out what they are by looking at
the input, x[n], and output, d[n], and we have access to both signals.
this would be easy (non-blind deconvolution: use the DFT and iDFT to get
g[i]) but there is also interference, which i'll call v[n].  now this
interference *might* be exactly the signal you're looking for (it *is*
in speakerphone echo/feedback cancellation) and the above d[n] (the
feedback signal from speaker into microphone) is what you are trying to
get rid of.  so we add this interference signal:

         N-1
  d[n] = SUM{ g[i] * x[n-i] }  +  v[n]
         i=0

now, if

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] }
         i=0

and if

  e[n] = d[n] - y[n]

and we can compute this e[n] in real time because we compute y[n] and
d[n] is given, then *only* if h[n,i] has converged to g[i] will this
apparent error e[n] be the same as v[n].

now, going back to the speakerphone application, when there is no
"doubletalk", then v[n] should be close to zero, and the only reason
that e[n] is not zero is because h[n,i] is not (yet) the same as the
target coefficients g[i].
the idea of the simple LMS adaptive filter is to slowly get h[n,i] to converge to g[i] (for 0 <= i < N), and then when v[n] is not zero, *and* hopefully h[n,i] has already converged to near g[i], then e[n] = v[n]. a good LMS application will have some other system that will detect or estimate doubletalk and will reduce the adaptation gain (so h[n,i] remains constant with n) during doubletalk so that it doesn't screw up the adaptation.
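The whole scheme, doubletalk gating included, fits in a few lines; a toy sketch (the coefficients, step size, and the crude oracle "detector" are all my own choices, not a real doubletalk detector):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy sketch (my own numbers throughout): plain LMS converges toward
# the hidden g[i] while v[n] is silent; during a "doubletalk" burst
# the adaptation gain is set to zero so h[n,i] stays put and the
# error output e[n] is essentially v[n] itself.
g = np.array([0.6, -0.3, 0.15, -0.05])
N = len(g)
h = np.zeros(N)
mu = 0.02

x = rng.standard_normal(8000)                 # remote signal
v = np.zeros(8000)
v[6000:] = rng.standard_normal(2000)          # doubletalk starts here

for n in range(N, 8000):
    xv = x[n:n-N:-1]                          # x[n], ..., x[n-N+1]
    e = (g @ xv + v[n]) - h @ xv              # e[n] = d[n] - y[n]
    gain = 0.0 if n >= 6000 else mu           # oracle doubletalk gate
    h += gain * e * xv                        # gated LMS update

print(np.round(h, 3))                         # should sit close to g
```

A real canceller would of course have to *detect* doubletalk from the signals themselves (e.g. by comparing signal powers) rather than knowing when it starts.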
> I could also have written (but for simplicity only consider the above
> formulation for now):
> "Given y[n] and x[n] and given some assumptions about e[n] and h[n], then
> in some sense estimate/compute h[n]."
you also need a separate desired signal d[n].  you're not given y[n],
but you can figure it out from x[n] and the *current* FIR coefficients
h[n,i].  so you have inputs x[n] and d[n], and you're trying to get y[n]
to match d[n], and "then in some sense [you] estimate/compute h[i]" to
be whatever those coefs have to be to get y[n] to best match d[n].

a typical simple LMS filter has *two* inputs, x[n] and d[n], and two
outputs, y[n] and e[n], and often it's v[n] that we're really looking
for, which, when the adaptation settles down, is the same as e[n].  if
the application is system identification, then what we're looking for
are the coefficients, g[i], which should be converged upon by h[n,i].

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by niarn July 13, 2012
>now, Hardy, can you relate this to anything that you're talking about?
Hi rbj, Let's do it in steps.

This is the 'FIR model'

         N-1
  y[n] = SUM{ h[n,i] * x[n-i] } + e[n]
         i=0

Do you accept this formulation as a formulation of adaptive FIR
filtering:

"Given y[n] and x[n] and given some assumptions about e[n], then in some
sense estimate/compute h[n]."

I could also have written (but for simplicity only consider the above
formulation for now):

"Given y[n] and x[n] and given some assumptions about e[n] and h[n], then
in some sense estimate/compute h[n]."