# Why minimise in the mean-square-error sense?

Started December 2, 2003
```Hello group.

Quote from Simon Haykin's book "Adaptive Filter Theory", 4th edition, on the
Wiener filter (chap. 2):

"We now summarize the essence of the filtering problem with the following
statement:

Design a linear discrete-time filter whose output y(n) provides an estimate
of a desired response d(n), given a set of input samples u(0), u(1),
u(2),..., such that the MEAN-SQUARE VALUE of the estimation error e(n),
defined as the difference between the desired response d(n) and the actual
response y(n), is minimized."

Why is it that we use the minimized mean-square value of e(n) that gives the
optimum filter? Or more precisely, why use the quantity "mean-square value"
of a value - why not some, eg. mean-squareroot or whatever one's creativity
can come up with. What is the consequence of using the mean-square value?

--
Mvh / Best regards,
Jack, Copenhagen

The email address is for real. :)

```
```In article bw9zb.50644$jf4.2789643@news000.worldonline.dk, Jack L. at
jack_nospam@nospam.dk wrote on 12/02/2003 18:47:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives the
> optimum filter? Or more precisely, why use the quantity "mean-square value"
> of a value - why not some, eg. mean-squareroot or whatever one's creativity
> can come up with. What is the consequence of using the mean-square value?

very good and astute question.  sometime you should needle a prof with it (i
have).  i can think of two, maybe three answers:

1.  it isn't always the best.  in any optimization, you are minimizing the
norm of some error vector.  mean-square (or sum of squares) is often called
the Euclidean norm.  sum of absolute values is sometimes called the "taxicab
norm".  and there is also an important norm that simply chooses the element
of the vector that is the largest in magnitude sometimes called the
Chebyshev norm.  (minimizing that is often called a "minimax" or min-max
problem.)  the Parks-McClellan (or remez exchange) algorithm is there to
minimize the maximum error.  so not all problems are there for minimizing
the square error.

2.  deciding what is the best norm requires (human) knowledge about what is
important and what is less important.  if you are trying to minimize the
*energy* of the error, then you want the sum of square (or mean-square)
error to express the total effect of error.  perhaps you need to make sure
the maximum error is bounded, then you choose the Chebyshev norm.

3.  the last reason the sum of squared error is so often minimized is that
it is the easiest and most tractable mathematically:

error^2 = (result(parameters) - desired)^2

sum that expression over whatever domain you are looking at, take the
derivative with respect to each parameter, set all the derivatives to zero
and you will have an equal number of equations and unknowns that you can
often solve with linear algebra.

now try it with the taxicab or Chebyshev norm.  MUCH harder to do.

r b-j

```
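r b-j's third point can be sketched numerically: setting the partial derivatives of the summed squared error to zero yields the linear "normal equations", solvable with plain linear algebra. A minimal illustration, assuming NumPy; the data points are invented for the example:

```python
import numpy as np

# Fit d ~ slope*x + intercept by minimizing the sum of squared errors.
# Setting the partials of sum((A @ w - d)**2) to zero gives the linear
# normal equations  A^T A w = A^T d  -- no iteration needed.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # noisy samples of roughly 2x + 1

A = np.column_stack([x, np.ones_like(x)]) # one column per free parameter
w = np.linalg.solve(A.T @ A, A.T @ d)     # the normal equations

slope, intercept = w                      # close to 2 and 1
```

Doing the same with the taxicab or Chebyshev norm has no such closed form; those require iterative schemes (linear programming or exchange algorithms), which is r b-j's point about tractability.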
```Jack L. wrote:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives the
> optimum filter? Or more precisely, why use the quantity "mean-square value"
> of a value - why not some, eg. mean-squareroot or whatever one's creativity
> can come up with. What is the consequence of using the mean-square value?

Squaring the error does two things: it treats negative and positive
errors identically, and it weights large errors more heavily than small
ones. Using absolute value accomplishes the first, but not the second.
Weighting large errors more heavily than small ones tends to minimize
the worst-case error achieved with a given number of degrees of freedom.
For simple minimization, the mean square is fine. To estimate goodness
of fit, the square root of that is often used. That preserves the
original dimensions of the variable and removes the distortion
introduced by squaring. The same criteria are used with linear and
higher-order regression, and with many other statistical treatments of
estimation and error analysis.

Jerry
--
Engineering is the art of making what you want from things you can get.

```
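Jerry's two properties of squaring can be seen in a tiny numerical comparison (a made-up example, assuming NumPy): two error sequences with the same total absolute error score very differently in the mean-square sense.

```python
import numpy as np

# Squaring treats +/- errors alike and punishes large errors
# disproportionately: same taxicab size, different mean-square score.
spread_out = np.array([0.5, -0.5, 0.5, -0.5])  # four small errors
one_spike  = np.array([2.0,  0.0, 0.0,  0.0])  # one big error

def mse(e):
    return np.mean(e ** 2)

# Same sum-of-absolute-values (both 2.0)...
same_l1 = np.sum(np.abs(spread_out)) == np.sum(np.abs(one_spike))
# ...but the single spike is 4x worse in the mean-square sense:
# mse(spread_out) = 0.25, mse(one_spike) = 1.0
```

This is why least squares tends to trade one large error for many small ones, bounding the worst case for a given number of degrees of freedom.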
```"Jack L." <jack_nospam@nospam.dk> wrote in message
news:bw9zb.50644$jf4.2789643@news000.worldonline.dk...
> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?

Hello Jack,

There are many reasons for minimizing mean squared error.  Before I start on
the list, note that it's the same problem as minimizing the total squared
error for a fixed number of sample points.

- Firstly, it's really easy to do for linear systems -- just find the point
at which all the partial derivatives of the error are zero.  That boils down
to an easily solvable system of linear equations.

- It favours many small errors over a few big ones that add up to the same
value linearly, which is usually what you want.

- The total squared error over N points is the squared Euclidean
distance in N dimensions, which makes it an intuitively attractive metric.

- Mean squared error is the error "variance", which grows monotonically with
standard deviation, so minimizing squared error also minimizes these
commonly used statistical measures.

- If the error is a voltage, the mean squared error is the power of the
error signal, and total squared error is the energy of the error signal,
which corresponds nicely to the way engineers measure lots of things.

- If you measure a vector with any set of orthogonal basis vectors and
reconstruct it, what you get is the approximation that minimizes the squared
error, so if you make that your goal, you're done.

- Orthogonal transforms that preserve energy, like the Fourier transform,
also preserve squared error.  So if your goal is to minimize squared error,
you can do it in the time or frequency domain and it will mean the same
thing.  You can, in fact, combine time- and frequency-based goals and
minimize the error w.r.t. both simultaneously without difficulty.

- Usually, what we call the "average" anything is the estimate that
minimizes the squared error if you use the same measuring system that you
used to calculate the average.  So when we intuitively want the "average"
something-more-complicated, we tend to find the estimate that minimizes the
squared error, unless we have a reason to set a different goal.

Finally, note that in digital filter design, mean-squared-error is actually
*not* the most commonly used metric ;-)  In this case, the goal is usually
to minimize the maximum absolute error over the frequencies of interest.

```
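Matt's point about energy-preserving transforms is Parseval's relation, which is easy to verify numerically (a minimal check, assuming NumPy; note NumPy's unnormalized FFT needs a 1/N factor to be energy-preserving):

```python
import numpy as np

# Parseval: an energy-preserving orthogonal transform leaves squared
# error unchanged, so time- and frequency-domain MSE agree.
rng = np.random.default_rng(0)
e = rng.standard_normal(64)                   # an arbitrary error signal

time_energy = np.sum(e ** 2)
E = np.fft.fft(e)
freq_energy = np.sum(np.abs(E) ** 2) / len(e) # 1/N for NumPy's convention

# The two energies match to floating-point precision.
match = np.isclose(time_energy, freq_energy)
```

So a squared-error goal stated in the time domain and one stated in the frequency domain can be added into a single objective without unit mismatch.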
```
"Jack L." wrote:
>
> Hello group.
>
> Quote from Simon Haykin's book "Adaptive Filter Theory", 4th edition, on the
> Wiener filter (chap. 2):
>
> "We now summarize the essence of the filtering problem with the following
> statement:
>
> Design a linear discrete-time filter whose output y(n) provides an estimate
> of a desired response d(n), given a set of input samples u(0), u(1),
> u(2),..., such that the MEAN-SQUARE VALUE of the estimation error e(n),
> defined as the difference between the desired response d(n) and the actual
> response y(n), is minimized."
>
> Why is it that we use the minimized mean-square value of e(n) that gives the
> optimum filter? Or more precisely, why use the quantity "mean-square value"
> of a value - why not some, eg. mean-squareroot or whatever one's creativity
> can come up with. What is the consequence of using the mean-square value?

The mathematical reason for using the mean-square error term is that
minimizing the target function yields a linear system of equations. That
makes the solution really nice and easy. Most other definitions of the
error term lead to a nonlinear system of equations, which is difficult
to solve.

The physical reason is that the square represents an energy term, and
nearly everything around us is measured in quantities proportional to
energy.

DSP and Mixed Signal Design Consultant

http://www.abvolt.com
```
MSE is used because it is easy to analyze on paper and gives nice
closed-form solutions.
You can refer to the papers that minimize the entropy of the error,
thereby making it more "white".

"Jack L." <jack_nospam@nospam.dk> wrote in message
news:bw9zb.50644$jf4.2789643@news000.worldonline.dk...
> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?

```
```Jack L. wrote:
> Hello group.
>
> Why is it that we use the minimized mean-square value of e(n) that
> gives the optimum filter? Or more precisely, why use the quantity
> "mean-square value" of a value - why not some, eg. mean-squareroot or
> whatever one's creativity can come up with. What is the consequence
> of using the mean-square value?

I thank you all for the good answers. :)

--
Mvh / Best regards,
Jack, Copenhagen

The email address is for real. :)

```
```
Matt Timmermans wrote:
>
> - Orthogonal transforms that preserve energy, like the Fourier transform,
> also preserve squared error.  So if your goal is to minimize squared error,
> you can do it in the time or frequency domain and it will mean the same
> thing.  You can, in fact, combine time and frequency-based goals  and
> minimize the error w.r.t. both simultaneously without difficulty.
>

Matt, thanks for the great list of attributes.  I wonder if
you could expand on this one a bit.  I'm trying to determine
whether it is better to form a transfer-function impulse
response either by complex division of the transforms of
the numerator and denominator impulse responses followed by
the inverse transform, or by doing the division in the time
domain with a Toeplitz solver (which minimizes the squared
error in the time domain at the expense of an O(N^2)
calculation).  If you have insight on the pros and cons of
the two approaches or can say more about how to combine
methods for better or faster results I'd greatly appreciate
it.

Bob
--

"Things should be described as simply as possible, but no
simpler."

A. Einstein
```
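The two routes Bob describes can be sketched on a toy problem (a minimal illustration assuming NumPy and SciPy; the impulse responses are invented). Both recover f from g = h * f, and the FFT route is exact here only because g carries the full linear convolution; with a truncated g, circular wraparound would bias it, while the time-domain solve would not.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

h = np.array([1.0, 0.5, 0.25])        # "denominator" impulse response
f = np.array([1.0, 2.0, 3.0, 4.0])    # the answer we hope to recover
g = np.convolve(h, f)                 # "numerator" impulse response

# Route 1: complex division of transforms, then the inverse transform.
N = len(g)
f_fft = np.real(np.fft.ifft(np.fft.fft(g) / np.fft.fft(h, N)))[:len(f)]

# Route 2: division in the time domain as a lower-triangular Toeplitz
# solve (the O(N^2) Levinson route Bob mentions).
col = np.concatenate([h, np.zeros(len(f) - len(h))])  # first column
row = np.zeros(len(f))
row[0] = h[0]                                         # first row
f_toep = solve_toeplitz((col, row), g[:len(f)])
```

Both `f_fft` and `f_toep` match `f` to floating-point precision on this toy problem; the interesting differences appear when h has near-zeros on the unit circle (FFT division blows up) or when only a truncated g is available.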
```
"Jack L." wrote:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives the
> optimum filter? Or more precisely, why use the quantity "mean-square value"
> of a value - why not some, eg. mean-squareroot or whatever one's creativity
> can come up with. What is the consequence of using the mean-square value?

I have seen many other criteria - for instance E[e^4(t)], modulus of error,
and of course H-infinity optimisation.
For a DC-free error, the mean-squared value is the average power of the
error, so it makes some engineering sense.
(It's also easy to get an answer!)
Tom

```