Why minimise in the mean-square-error sense?

Started by Jack L. December 2, 2003
Hello group.

Quote from Simon Haykin's book "Adaptive Filter Theory", 4th edition, on the
Wiener filter (chap. 2):

"We now summarize the essence of the filtering problem with the following
statement:

Design a linear discrete-time filter whose output y(n) provides an estimate
of a desired response d(n), given a set of input samples u(0), u(1),
u(2),..., such that the MEAN-SQUARE VALUE of the estimation error e(n),
defined as the difference between the desired response d(n) and the actual
response y(n), is minimized."

Why is it that we use the minimized mean-square value of e(n) that gives the
optimum filter? Or more precisely, why use the quantity "mean-square value"
of a value - why not some, eg. mean-squareroot or whatever one's creativity
can come up with. What is the consequence of using the mean-square value?

--
Mvh / Best regards,
Jack, Copenhagen

The email address is for real. :)



In article bw9zb.50644$jf4.2789643@news000.worldonline.dk, Jack L. at
jack_nospam@nospam.dk wrote on 12/02/2003 18:47:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
Very good and astute question. Sometime you should needle a prof with it (I have). I can think of two, maybe three answers:

1. It isn't always the best. In any optimization, you are minimizing the norm of some error vector. Mean-square (or sum of squares) is often called the Euclidean norm. Sum of absolute values is sometimes called the "taxicab norm". And there is also an important norm that simply chooses the element of the vector that is the largest in magnitude, sometimes called the Chebyshev norm. (Minimizing that is often called a "minimax" or min-max problem.) The Parks-McClellan (or Remez exchange) algorithm is there to minimize the maximum error. So not all problems call for minimizing the squared error.

2. Deciding what is the best norm requires (human) knowledge about what is important and what is less important. If you are trying to minimize the *energy* of the error, then you want the sum-of-squares (or mean-square) error to express the total effect of the error. Perhaps you need to make sure the maximum error is bounded; then you choose the Chebyshev norm.

3. The last reason minimizing the sum of squared error is so often done is that it is the easiest and most tractable mathematically:

    error^2 = (result(parameters) - desired)^2

Sum that expression over whatever domain you are looking at, take the derivative with respect to each parameter, set all the derivatives to zero, and you will have an equal number of equations and unknowns that you can often solve with linear algebra. Now try it with the taxicab or Chebyshev norm. MUCH harder to do.

r b-j
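The tractability point can be seen numerically. The sketch below (illustrative only, not from the original post) fits a line by least squares: because the cost is quadratic, setting the partial derivatives to zero gives the linear "normal equations", solved in one step. It then reports the Euclidean, taxicab, and Chebyshev norms of the residual; minimizing the latter two has no comparable closed form.

```python
import numpy as np

# Fit a line y = a*x + b to noisy data under the Euclidean (L2) norm.
# Setting the partial derivatives of sum((a*x + b - y)^2) to zero gives
# a linear system (the normal equations), solvable directly.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)

A = np.column_stack([x, np.ones_like(x)])   # design matrix
a, b = np.linalg.solve(A.T @ A, A.T @ y)    # normal equations

e = A @ np.array([a, b]) - y                # residual vector
l2 = np.sum(e**2)                           # Euclidean norm (squared)
l1 = np.sum(np.abs(e))                      # "taxicab" norm
linf = np.max(np.abs(e))                    # Chebyshev norm

# The solve minimizes l2; minimizing l1 or linf instead requires
# iterative methods (e.g. linear programming or exchange algorithms).
print(a, b, l2, l1, linf)
```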
Jack L. wrote:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
Squaring the error does two things: it treats negative and positive errors identically, and it weights large errors more heavily than small ones. Using the absolute value accomplishes the first, but not the second. Weighting large errors more heavily than small ones tends to minimize the worst-case error achieved with a given number of degrees of freedom.

For simple minimization, the mean square is fine. To estimate goodness of fit, the square root of that is often used. That preserves the original dimensions of the variable and removes the distortion introduced by squaring.

The same criteria are used with linear and higher-order regression, and with many other statistical treatments of estimation and error analysis.

Jerry
--
Engineering is the art of making what you want from things you can get.
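Jerry's weighting point shows up in the simplest possible estimation problem. In the sketch below (illustrative, not from the post), a constant estimate of a data set is chosen by brute-force search: the squared-error criterion lands on the mean, which an outlier pulls strongly, while the absolute-error criterion lands on the median.

```python
import numpy as np

# For a constant estimate c of a data set, sum((data - c)^2) is minimized
# by the mean, while sum(|data - c|) is minimized by the median. Squaring
# weights the large outlier error heavily and drags the estimate toward it.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one outlier

c = np.linspace(0.0, 100.0, 100001)            # candidate estimates
sq = ((data[:, None] - c) ** 2).sum(axis=0)    # squared-error cost
ab = np.abs(data[:, None] - c).sum(axis=0)     # absolute-error cost

best_sq = c[np.argmin(sq)]   # mean(data) = 22.0
best_ab = c[np.argmin(ab)]   # median(data) = 3.0
print(best_sq, best_ab)
```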
"Jack L." <jack_nospam@nospam.dk> wrote in message
news:bw9zb.50644$jf4.2789643@news000.worldonline.dk...
> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
Hello Jack,

There are many reasons for minimizing mean squared error. Before I start on the list, note that it's the same problem as minimizing the total squared error for a fixed number of sample points.

- Firstly, it's really easy to do for linear systems -- just find the point at which all the partial derivatives of the error are zero. That boils down to an easily solvable system of linear equations.

- It favours many small errors over a few big ones that add up to the same value linearly, which is usually what you want.

- The total squared error over N points is the square of the Euclidean distance in N dimensions, which makes it an intuitively attractive metric.

- Mean squared error is the error "variance", which grows monotonically with the standard deviation, so minimizing squared error also minimizes these commonly used statistical measures.

- If the error is a voltage, the mean squared error is the power of the error signal, and the total squared error is the energy of the error signal, which corresponds nicely to the way engineers measure lots of things.

- If you measure a vector with any set of orthogonal basis vectors and reconstruct it, what you get is the approximation that minimizes the squared error, so if you make that your goal, you're done.

- Orthogonal transforms that preserve energy, like the Fourier transform, also preserve squared error. So if your goal is to minimize squared error, you can do it in the time or frequency domain and it will mean the same thing. You can, in fact, combine time- and frequency-based goals and minimize the error w.r.t. both simultaneously without difficulty.

- Usually, what we call the "average" anything is the estimate that minimizes the squared error if you use the same measuring system that you used to calculate the average. So when we intuitively want the "average" something-more-complicated, we tend to find the estimate that minimizes the squared error, unless we have a reason to set a different goal.
Finally, note that in digital filter design, mean-squared-error is actually *not* the most commonly used metric ;-) In this case, the goal is usually to minimize the maximum absolute error over the frequencies of interest.
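The transform-domain property Matt mentions is Parseval's relation, and it is easy to check numerically. A small numpy sketch (illustrative, not from the post): the squared error between a desired signal and an estimate comes out the same whether it is summed in the time domain or across DFT bins.

```python
import numpy as np

# Parseval's relation: an energy-preserving orthogonal transform such as
# the DFT preserves squared error, so MSE means the same thing in either
# the time or the frequency domain.
rng = np.random.default_rng(1)
d = rng.standard_normal(64)               # desired signal
y = d + 0.05 * rng.standard_normal(64)    # estimate with a small error

e_time = np.sum(np.abs(d - y) ** 2)
# numpy's FFT is unnormalized, so divide by N to match the time-domain sum.
e_freq = np.sum(np.abs(np.fft.fft(d) - np.fft.fft(y)) ** 2) / len(d)

print(e_time, e_freq)   # agree to machine precision
```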

"Jack L." wrote:
> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
The mathematical reason for using the mean-square error term is that minimizing the target function yields a linear system of equations. That makes the solution really nice and easy. Almost any other definition of the error term leads to a nonlinear system of equations, which is difficult to solve.

The physical reason is that the square represents an energy term, and nearly everything around us is proportional to energy.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
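A minimal sketch of that linear-system point, on a made-up toy identification problem (the signal names and filter length are assumptions, not from the thread): because the MSE cost is quadratic in the filter taps, setting its gradient to zero gives the linear Wiener-Hopf equations R w = p, solved directly.

```python
import numpy as np

# Identify a 3-tap FIR system from its input u and desired output d by
# minimizing mean-square error. The quadratic cost gives the linear
# Wiener-Hopf equations R w = p (R: input autocorrelation matrix,
# p: cross-correlation between desired output and input).
rng = np.random.default_rng(2)
h = np.array([0.5, -0.3, 0.2])             # "unknown" system taps
u = rng.standard_normal(4000)              # white input
d = np.convolve(u, h)[: len(u)]            # desired response

M = 3
n = len(u)
# Time-average estimates of the correlations.
r = np.array([np.dot(u[: n - k], u[k:]) for k in range(M)]) / n
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
p = np.array([np.dot(d[k:], u[: n - k]) for k in range(M)]) / n

w = np.linalg.solve(R, p)                  # Wiener solution, w ~ h
print(w)
```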
MSE is used because it is easy to analyze on paper and gives nice closed-form solutions.
You can also refer to the papers that minimize the entropy of the error, thereby
making it more "white".


"Jack L." <jack_nospam@nospam.dk> wrote in message
news:bw9zb.50644$jf4.2789643@news000.worldonline.dk...
> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
>
> --
> Mvh / Best regards,
> Jack, Copenhagen
Jack L. wrote:
> Hello group.
>
> Why is it that we use the minimized mean-square value of e(n) that
> gives the optimum filter? Or more precisely, why use the quantity
> "mean-square value" of a value - why not some, eg. mean-squareroot or
> whatever one's creativity can come up with. What is the consequence
> of using the mean-square value?
I thank you all for the good answers. :)

--
Mvh / Best regards,
Jack, Copenhagen

The email address is for real. :)

Matt Timmermans wrote:
> - Orthogonal transforms that preserve energy, like the Fourier transform,
> also preserve squared error. So if your goal is to minimize squared error,
> you can do it in the time or frequency domain and it will mean the same
> thing. You can, in fact, combine time and frequency-based goals and
> minimize the error w.r.t. both simultaneously without difficulty.
Matt, thanks for the great list of attributes. I wonder if you could expand on this one a bit.

I'm trying to determine whether it is better to form a transfer function impulse response by the complex division of the transforms of the numerator and denominator impulse responses followed by the inverse transform, or by doing the division in the time domain with a Toeplitz solver (which minimizes the squared error in the time domain at the expense of an O(N^2) calculation).

If you have insight on the pros and cons of the two approaches, or can say more about how to combine methods for better or faster results, I'd greatly appreciate it.

Bob
--
"Things should be described as simply as possible, but no simpler." A. Einstein
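For concreteness, here is a toy sketch of the two approaches Bob describes (the names g, den, num and all sizes are illustrative assumptions, not from the thread): an impulse response g is recovered from num = conv(den, g) either by zero-padded complex division of DFTs, or by a least-squares solve of the tall Toeplitz convolution system.

```python
import numpy as np

# Toy deconvolution: recover g from num = conv(den, g) and den.
rng = np.random.default_rng(3)
g = np.array([1.0, -0.5, 0.25, -0.125])      # "unknown" impulse response
den = rng.standard_normal(16)
num = np.convolve(den, g)                    # length 16 + 4 - 1 = 19

# 1) Frequency domain: complex division, then inverse transform,
#    O(N log N). Zero-pad past the support of the linear convolution.
N = 64
g_fft = np.fft.ifft(np.fft.fft(num, N) / np.fft.fft(den, N)).real[: len(g)]

# 2) Time domain: least-squares solve of the Toeplitz convolution
#    system T g = num, which minimizes the time-domain squared error
#    at roughly O(N^2) cost.
T = np.zeros((len(num), len(g)))             # tall convolution matrix
for j in range(len(g)):
    T[j : j + len(den), j] = den
g_ls, *_ = np.linalg.lstsq(T, num, rcond=None)

# When den's zero-padded DFT has no near-zero bins, both recover g;
# near-zero bins make the division route numerically fragile.
print(g_fft, g_ls)
```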

"Jack L." wrote:

> [...]
>
> Why is it that we use the minimized mean-square value of e(n) that gives
> the optimum filter? Or more precisely, why use the quantity "mean-square
> value" of a value - why not some, eg. mean-squareroot or whatever one's
> creativity can come up with. What is the consequence of using the
> mean-square value?
I have seen many other criteria, for instance E[e^4(t)], the modulus of the error, and of course H-infinity optimisation. For a DC-free error, the mean-square value is the average power of the error, so it makes some engineering sense. (It's also easy to get an answer!)

Tom