
Diff b/w Expectation and Entropy algorithms

Started by Vimal March 12, 2004
Hi,

Can someone explain or give some URLs on the 'difference' between
expectation-maximization and entropy maximization.

To me both seem to maximize E(log(p(x))), where p(x) is the pdf, although
they originate from different theories.

Thanks,

Cheers,
Vimal
Vimal wrote:
> Hi,
>
> Can someone explain or give some URLs on the 'difference' between
> expectation-maximization and entropy maximization.
>
> To me both seem to maximize E(log(p(x))), where p(x) is the pdf, although
> they originate from different theories.
>
> Thanks,
>
> Cheers,
> Vimal
I think what you say is not quite right...

Let p(x) be the true pdf and q(x) a candidate for what might be the true
pdf. Then "expectation-maximization" seeks to maximise E(log(q(x)))
(where E derives from p(x), usually when p(x) is unknown but information
derives from a data sample; p(x) is treated as fixed, while q(x) is
variable).

... and "entropy maximization" seeks to maximise E(log(p(x))) (where E
derives from p(x), and both instances of p(x) are the same and variable).

David Jones
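(A minimal numerical sketch of this point, with made-up numbers: with p(x)
fixed and known only through a sample, maximum likelihood searches over
candidate densities q(x) to maximise the sample estimate of E(log(q(x))).)

# Minimal numerical sketch: p(x) is fixed and represented by a data sample;
# maximum likelihood picks the candidate q(x) maximising E_p[log q(x)].
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=5000)   # data drawn from the (unknown) p(x)

def avg_log_q(mu, x, sigma=1.0):
    """Sample estimate of E_p[log q(x)] for a candidate Gaussian q with mean mu."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

candidates = np.linspace(0.0, 4.0, 81)
scores = [avg_log_q(mu, sample) for mu in candidates]
best_mu = candidates[int(np.argmax(scores))]
print(best_mu)   # close to 2.0: the q(x) nearest the fixed p(x) wins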
Thanks, David, for the explanation.

Can I say that:
1) For EM, E-step: \int p(x) \log(q(x)) dx,
and in the maximization w.r.t. some parameter, we can take p(x) as fixed, i.e.
M: \int p(x) (\log q(x))' dx

and

2) Entropy = -\int p(x) \log(p(x)) dx,
and in the maximization we use differentiation (the product rule) with
respect to some parameter, getting
M: -\int p(x)' \log(p(x)) dx

Cheers,
Vimal
Vimal wrote:
> Thanks, David, for the explanation.
>
> Can I say that:
> 1) For EM, E-step: \int p(x) \log(q(x)) dx,
> and in the maximization w.r.t. some parameter, we can take p(x) as fixed, i.e.
> M: \int p(x) (\log q(x))' dx
>
The earlier discussion was really just for "ordinary" maximum likelihood ... if you want to think about the EM algorithm specifically, then I think you need to extend the notation to explicitly distinguish between the observed and "missing" data, since the "expectation" step applies only to the missing part of the data.
> and
>
> 2) Entropy = -\int p(x) \log(p(x)) dx,
> and in the maximization we use differentiation (the product rule) with
> respect to some parameter, getting
> M: -\int p(x)' \log(p(x)) dx
>
> Cheers,
> Vimal
"Vimal" wrote:
> Can someone explain or give some URLs on the 'difference' between
> expectation-maximization and entropy maximization.
>
> To me both seem to maximize E(log(p(x))), where p(x) is the pdf, although
> they originate from different theories.
Watch out. Maximizing log-likelihood (and EM is a particular approach to
this) is similar to *minimizing* the cross or relative entropy
(Kullback-Leibler divergence).

Maximum entropy principle is about the idea that given a number of
equivalent prospective models, you should pick the one with the highest
entropy.

--
mag. Aleks Jakulin
http://ai.fri.uni-lj.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science, University of Ljubljana.
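(A quick numerical check of this relationship, using an arbitrary pair of
discrete pdfs: the cross-entropy H(p, q) equals H(p) + D_KL(p || q), so with
p fixed, maximising E_p(log q) is the same as minimising the KL divergence.)

# Numerical check for discrete pdfs: cross-entropy = entropy + KL divergence.
import numpy as np

p = np.array([0.5, 0.3, 0.2])          # "true" pdf
q = np.array([0.4, 0.4, 0.2])          # candidate model

entropy_p     = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl_pq         =  np.sum(p * np.log(p / q))

print(np.isclose(cross_entropy, entropy_p + kl_pq))   # True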
Hi,

If, in EM, I assume that p(x) is the known/complete pdf and q(x) is the pdf
of the incomplete data, then

Can I say that:
1) For EM, E-step: \int p(x) \log(q(x)) dx,
and in the maximization w.r.t. some parameter, we can take p(x) as known, then
M: \int p(x) (\log q(x))' dx

Thanks,

Cheers,
Vimal
Thanks, Aleks, for the correction.
Yes, I also found out that it should be entropy minimization.

About entropy maximization, 
> Maximum entropy principle is about the idea that given a number of
> equivalent prospective models, you should pick the one with the
> highest entropy.
I guess EM also tries to do the same, i.e. it tries to maximize the
likelihood that the incomplete pdf matches the complete pdf, iteratively,
for a given parameter. Can you please elaborate on the difference? I am a
newbie to information-theoretic approaches.

"Aleks Jakulin" <jakulin@@ieee.org> wrote in message
news:<c3f59f$nor$1@planja.arnes.si>...
> "Vimal" wrote: > > Can someone explain or give some urls on 'difference' between > > expectation-maximization and entropy maximization. > > > > To me both seems to maximize the E(log(p(x))) where p(x) is the pdf, > > although both originate from different theories. > > Watch out. Maximizing log-likelihood (and EM is a particular approach > to this) is similar to *minimizing* the cross or relative entropy > (Kullback-Leibler divergence). > > Maximum entropy principle is about the idea that given a number of > equivalent prospective models, you should pick the one with the > highest entropy.
Vimal wrote:
> Hi,
>
> If, in EM, I assume that p(x) is the known/complete pdf and q(x) is the
> pdf of the incomplete data, then
>
> Can I say that:
> 1) For EM, E-step: \int p(x) \log(q(x)) dx,
No, I think you need to improve your notation...

Let p(x,y), q(x,y) be the true and modelled densities of joint random
variables, where x is observed and y is unobserved. Then the expectation
step evaluates E(log(q(y|x))), where the expectation is with respect to the
density q(y|x) (i.e. the conditional density of y given x):

E-step: l*(x) = \int q(y|x) \log(q(y|x)) dy

Then the maximisation step is to maximise l*(x) (i.e. using the sample value
of x)... but this is notionally an approximation to the real target of
maximising the expectation of l*(x), where the expectation is with respect
to the marginal distribution of x:

M-step: maximise \int p(x) l*(x) dx

This would probably make more sense if further notation were introduced
relating to the parameters being estimated.

David Jones
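(A compact, illustrative EM sketch along these lines, using an assumed
two-component Gaussian mixture in which x is observed and the component
label y is the missing data; the data and initial values are made up.)

# EM for a two-component Gaussian mixture (my illustration, not David's
# notation).  E-step: responsibilities q(y|x).  M-step: re-estimate the
# parameters maximising the expected complete-data log-likelihood.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# initial guesses for mixing weight, means, and (shared) variance
w, mu1, mu2, var = 0.5, -1.0, 1.0, 1.0

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: posterior probability that each x came from component 1
    p1 = w * normal_pdf(x, mu1, var)
    p2 = (1 - w) * normal_pdf(x, mu2, var)
    r = p1 / (p1 + p2)

    # M-step: weighted maximum-likelihood updates
    w = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    var = np.sum(r * (x - mu1)**2 + (1 - r) * (x - mu2)**2) / len(x)

print(w, mu1, mu2, var)   # roughly 0.3, -2, 3, 1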
> > Maximum entropy principle is about the idea that given a number of
> > equivalent prospective models, you should pick the one with the
> > highest entropy.
>
> I guess EM also tries to do the same, i.e. it tries to maximize the
> likelihood that the incomplete pdf matches the complete pdf, iteratively,
> for a given parameter.
You misunderstand maximum entropy.

Assume the true pdf p(x) and the modelled pdf q(x). The purpose of
cross-entropy minimization is to find a q(x) so that the KL divergence
between the two pdfs will be minimal:

q := argmin_{q'} D(p || q')

Assume that q is a parametric model with parameter t. Then q(x) = p(x | t).
The purpose of maximum likelihood (ML) is to pick the model (or parameters)
given which the data are likeliest:

t := argmax_{t'} p(x | t')
q(x) := p(x | t)

Although these two look similar, they are *not* the same!

Let's now imagine that we have a number of models, q1(x), q2(x), q3(x). The
purpose of MaxEnt is to pick the one with the highest entropy H(q):

q := argmax_{q'} H(q')

H(q) is the entropy of a particular pdf q:

H(q) := -\int q(x) \log q(x) dx

So there is very little similarity between MaxEnt and ML. Again, EM is only
a particular approach to ML. There are some connections, but they're a tad
more intricate (Csiszar & Tusnady 1984).

--
mag. Aleks Jakulin
http://ai.fri.uni-lj.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science, University of Ljubljana.
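(A tiny sketch of the MaxEnt selection rule with three made-up candidate
models: compute H(q) for each and keep the one with the largest entropy.)

# Maximum entropy principle: among acceptable candidates q1, q2, q3,
# pick the one with the highest entropy H(q).
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float)
    return -np.sum(q * np.log(q))

candidates = {
    "q1": [0.70, 0.20, 0.10],
    "q2": [0.50, 0.30, 0.20],
    "q3": [1/3, 1/3, 1/3],      # uniform: the least committal choice
}

chosen = max(candidates, key=lambda name: entropy(candidates[name]))
print(chosen)   # "q3" -- the uniform distribution has the highest entropy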
Dear Aleks Jakulin,

Thanks a lot for your invaluable discussion on EM and entropy. The paper by
Csiszar & Tusnady (1984) is very useful and addresses exactly what I am/was
confused about.

In a nutshell, KL = -(joint differential entropy) + \sum (marginal entropies).
Now if I assume that my data are iid, then KL = 0; these definitions
basically originate from the BSS and ICA school.

And, according to Csiszar, via an application of alternating minimization,
the second term in the KL expansion can be considered as EM.

Thanks again for pointing to Csiszar's paper :-)

Cheers,
Vimal
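(A discrete sanity check of that identity, with a made-up joint pdf: the KL
divergence between a joint distribution and the product of its marginals
equals the sum of marginal entropies minus the joint entropy, and it
vanishes when the components are independent.)

# KL(joint || product of marginals) = sum(marginal entropies) - joint entropy.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# a dependent joint distribution over two binary variables
joint = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

kl = entropy(px) + entropy(py) - entropy(joint.ravel())
print(kl)   # > 0: the variables are dependent

independent = np.outer(px, py)          # same marginals, but independent
kl_indep = entropy(px) + entropy(py) - entropy(independent.ravel())
print(np.isclose(kl_indep, 0.0))        # True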