Reply by navaide April 20, 2016
The following is the optimum weighted least-squares (WLS) solution, valid whether or not the components of "e" have uniform variance and whether or not they are correlated. Let "U" be the inverse square root of the covariance matrix of "e"; then whiten the model to Uy = UXc + Ue, and the WLS solution is c = (UX)#Uy, where (UX)# is the Moore-Penrose pseudoinverse, also called the generalized inverse, of the product UX.
The generalized inverse of a real matrix "A" is the limit, as "z" approaches zero, of the product [inverse{A'A + zI}]A', where A' is the transpose of "A" -- so much for the good news.
Now it's time to say what a generalized inverse can and can't do well. It provides a solution even when X is singular (in which case it estimates the observable part of "c" and leaves the rest alone). However, I don't recommend using it to evaluate that solution's performance. The covariance matrix of the error in that estimate is [inverse{(UX)'(UX)}] -- obviously that exists only if the matrix within { } is nonsingular. Furthermore, even when it is nonsingular, replacing that inverse by a generalized inverse deteriorates as the conditioning becomes poorer. In the limit it gives totally unwarranted optimism. Think about it -- the generalized inverse of zero is zero.
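The whitening-plus-pseudoinverse recipe above can be sketched in a few lines of numpy; X, c, and the noise covariance R below are purely illustrative, with `np.linalg.pinv` playing the role of the generalized inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
X = rng.standard_normal((N, M))          # illustrative design matrix
c_true = np.array([1.0, -2.0, 0.5])

# Illustrative correlated noise covariance R (symmetric positive definite)
A = 0.02 * rng.standard_normal((N, N))
R = A @ A.T + 0.01 * np.eye(N)
e = rng.multivariate_normal(np.zeros(N), R)
y = X @ c_true + e

# U = inverse square root of the noise covariance (the whitening matrix)
w, V = np.linalg.eigh(R)
U = V @ np.diag(w ** -0.5) @ V.T

# WLS solution: c = (UX)# Uy, with # the Moore-Penrose pseudoinverse
c_hat = np.linalg.pinv(U @ X) @ (U @ y)

# Error covariance of the estimate: inverse{(UX)'(UX)}
P_c = np.linalg.inv((U @ X).T @ (U @ X))
```

Here X is full rank, so the inverse in the last line exists; the caveat in the reply is exactly that this last step has no trustworthy generalized-inverse substitute when (UX)'(UX) is singular or badly conditioned.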
Reply by April 20, 2016
On Wednesday, April 20, 2016 at 11:43:07 AM UTC-7, Peter Mairhofer wrote:

(snip)

> >> I have the following setup:
> >> y = Xc + e
> >> where y is an N-dimensional observation vector, X an NxM matrix, c an M
> >> dimensional vector of parameters to be identified and e an N dimensional
> >> noise vector.
> >> I assume e to be iid white noise. Hence LS and ML are equivalent.
> >> Assuming M=1 and X1=X=1 (i.e., y = c + e) it is trivial to show:
> >> Var{c} = 1/N Var{e}
> >> Meaning: The accuracy of the estimate is indirect proportional to the
> >> number of measurements I take.
> >> Is it possible to derive something similar for multiple parameters, M>1?
Again, I am still not sure about your question, but generalizing to M>1 you have covariance: the off-diagonal terms of the covariance matrix. It is covariance that makes the whole problem interesting, but I think that if the data are iid, the covariance should be zero.
Reply by Peter Mairhofer April 20, 2016
Hi Tim,

On 2016-04-20 9:48, Tim Wescott wrote:
> On Wed, 20 Apr 2016 02:19:29 -0700, Peter Mairhofer wrote:

(snip)

> There's something about your posts that always leads me to ask "what are
> you really doing?"
>
> Is X exactly known? Is it square? You mention condition number, so I
> assume that you're not sure if it is of full rank. If X is
> overdetermined, is it consistent?
Sorry if I have been unclear:

1.) As written above, X is not square; clearly N > M (more observations than unknowns).
2.) I am sure it's full rank, but no other assumptions on condition. The matrix may have a high condition number.
3.) Since N > M, X is clearly overdetermined.

To summarize my question:

* LS gives me a *worst case* bound for this setup: SNR_estimate >= SNR_observations - 20*log(condition_number). To first order, this is independent of the number of *observations*.
* ML for M=1 and X=1 gives me the *average* case (which I'm interested in): Var(c) = 1/N * Var(e) (trivial to show).
* I have not been able to generalize ML for arbitrary X.
* My idea was to use the LS interpretation to come up with an expression containing the number of observations. This may generally not be the right approach.
> If you were paying me money to come up with an answer I'd hit my
> statistics book and do some studying -- this would not be a bad thing for
> you to do.
This is in progress but I haven't been successful yet. Generally I am not sure if I should take the linear algebra approach (LS) or the statistical approach (ML). In the former I can find discussions for arbitrary X, but it lacks statistics (hence I think it derives worst-case bounds rather than the average error). For the latter I always find trivial setups with one variable and no discussion of the average error, and I am unable to generalize this to arbitrary X.
> If X is exactly known and well-conditioned, but over-determined,
X is exactly known and over-determined, but may be poorly conditioned (e.g. some Vandermonde structure).
> then I
> think that -- taking X^{-1} as the pseudo-inverse of X -- that the answer
> I gave to Glen is pretty close:
>
>        hat{c} = X^{-1}y
>
> where hat{c} is your estimate of c,
Yes, I use the pseudoinverse to solve it. It's full rank, despite being poorly conditioned. I *think* that a poor condition number does not necessarily mean noise amplification but only increases the worst-case error. It may be that X is poorly conditioned but hat{c} = X^{-1}y still gives good results on average, if c does not lie in directions with small singular values.
> and the covariance of hat{c} is
>
>        P_c = X^{-1}P_e(X^{-1})^T
Thank you, I think this gives me some ideas for further analysis!! It is also consistent with the result of the ML analysis: Suppose P_e = I*sigma^2 (i.e., the noise is zero-mean iid Gaussian); then

       P_c = (X^T * X)^-1 * sigma^2

Now assume M=1 and X = [ 1 1 ... 1 ]^T; then

       P_c = diag(Var(c)) = 1/N * sigma^2

as expected :-)

If M=1 (still one parameter) but X=x is an arbitrary N-length vector, then

       P_c = diag(Var(c)) = 1/norm(x)^2 * sigma^2

i.e., the error depends on the "SNR" between the input data and the noise, which indirectly encodes N. This makes sense because by making x big, sigma^2 can be made relatively small. Now assuming x is itself iid Gaussian with variance sigma_x^2, then

       Var(c) = 1/N * sigma^2/sigma_x^2

which again includes my sought 1/N term.

Further notes to myself:

1.) The reason why I was not able to match the ML result and the LS result (SNR_c >= SNR_y) as written in the OP is that ML does average-error analysis while the LS formula I used does worst-case analysis.
2.) For a general matrix X, it seems impossible to describe Var(c) directly in terms of 1/N, the number of measurements. It is implicitly encoded in (X^T*X)^-1.
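The two M=1 special cases worked out above can be confirmed numerically in a few lines (all numbers illustrative):

```python
import numpy as np

sigma2 = 4.0  # Var(e), noise assumed zero-mean iid

# Case 1: M=1, X = [1 1 ... 1]^T  ->  Var(c_hat) = sigma^2 / N
N = 10
X_ones = np.ones((N, 1))
P_c_ones = np.linalg.inv(X_ones.T @ X_ones) * sigma2   # = sigma2 / N

# Case 2: M=1, arbitrary regressor x  ->  Var(c_hat) = sigma^2 / ||x||^2
x = np.array([[1.0], [2.0], [-3.0], [0.5]])
P_c_x = np.linalg.inv(x.T @ x) * sigma2                # = sigma2 / ||x||^2
```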
> If X is not well-conditioned, then unless you can change it, the variance
> of c is just going to go up, in the directions where X is close to
> singular.
Sure.
> If X is _not_ known exactly, then you're going to get into
> ever-more-funny distributions for c.
Fortunately I know X precisely.

The one other setup I want to investigate next is the case of multiplicative noise: instead of y = Xc + e I have

       y = X(c o (1+e)) = Xc + X(c o e) = Xc + n

where the "noise" is now n = X(c o e) ("o" denotes the Hadamard product, i.e., component-wise multiplication). In this case the actual noise is correlated with X and c (and e will be colored as well). If you (or someone else) have any pointers for this case, they would be much appreciated.

Thanks, your response(s) were helpful.

Peter
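Under the reading where e is an M-vector of relative coefficient errors (which is what the n = X(c o e) term implies), the covariance of the effective noise follows directly, and it is indeed colored. A sketch with purely illustrative sizes and values:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 30, 3
X = rng.standard_normal((N, M))         # illustrative system matrix
c = np.array([1.0, -0.5, 2.0])
var_e = 0.01                            # Var of each component of e, assumed iid

# n = X (c o e)  =>  Cov(n) = X diag(c) Cov(e) diag(c) X^T
# (colored, and dependent on X and c)
D = np.diag(c)
P_n = var_e * (X @ D @ D @ X.T)

# Monte Carlo check of the formula
trials = 100_000
E = rng.normal(0.0, np.sqrt(var_e), size=(trials, M))
samples = (c * E) @ X.T                 # each row is one draw of n = X(c o e)
P_n_emp = np.cov(samples, rowvar=False)
```

Since Cov(n) is known (given X, c, and Var(e)), the whitening approach from the WLS reply applies, with U the inverse square root of P_n; the complication is that P_n depends on the unknown c, suggesting an iterative scheme.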
Reply by April 20, 2016
On Wednesday, April 20, 2016 at 2:19:32 AM UTC-7, Peter Mairhofer wrote:

> I have the following setup:
> y = Xc + e
> where y is an N-dimensional observation vector, X an NxM matrix, c an M
> dimensional vector of parameters to be identified and e an N dimensional
> noise vector.
As above, I am not so sure which question you are actually trying to answer, but you might look at: http://home.gwu.edu/~hgrie/lectures/nupa-II15/nupa-II15.slides11.pdf, which covers the case of a non-diagonal covariance matrix.
Reply by Tim Wescott April 20, 2016
On Wed, 20 Apr 2016 02:19:29 -0700, Peter Mairhofer wrote:

(snip)
There's something about your posts that always leads me to ask "what are
you really doing?"

Is X exactly known?  Is it square?  You mention condition number, so I
assume that you're not sure if it is of full rank.  If X is
overdetermined, is it consistent?

If you were paying me money to come up with an answer I'd hit my
statistics book and do some studying -- this would not be a bad thing for
you to do.

If X is exactly known and well-conditioned, but over-determined, then I
think that -- taking X^{-1} as the pseudo-inverse of X -- the answer I
gave to Glen is pretty close:

       hat{c} = X^{-1}y

where hat{c} is your estimate of c, and the covariance of hat{c} is

       P_c = X^{-1}P_e(X^{-1})^T

If X is not well-conditioned, then unless you can change it, the variance
of c is just going to go up in the directions where X is close to
singular.

If X is _not_ known exactly, then you're going to get into ever-more-funny
distributions for c.

If the statistics group is alive, expect this all to be rather strenuously
clarified.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by Tim Wescott April 20, 2016
On Wed, 20 Apr 2016 04:19:41 -0700, herrmannsfeldt wrote:

(snip)

> I am not so sure that I understand your question, but it reminds me of
> Chi square.
>
> As well as I know it, Chi square works in n dimensional problems in a
> similar way that a Gaussian works in 1 dimensional problems.
>
> Also, for n dimensions you have an n by n covariance matrix.
I'm surprised, Glen.  For a square, non-singular X,

       c = X^{-1}(y - e),    hat{c} = X^{-1}y

The expected error in c is Gaussian, with a covariance matrix equal to

       P_c = X^{-1}P_e(X^{-1})^T

with "^T" meaning "transpose", and P_e being the covariance of the error
vector.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
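This propagation formula is easy to check by simulation; a sketch with an illustrative well-conditioned square X and noise covariance P_e:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3
X = rng.standard_normal((M, M)) + 5.0 * np.eye(M)  # illustrative, well-conditioned
c = np.array([1.0, 2.0, 3.0])
P_e = np.diag([0.1, 0.2, 0.3])                     # illustrative noise covariance

Xinv = np.linalg.inv(X)
P_c_pred = Xinv @ P_e @ Xinv.T   # predicted covariance of hat{c} = X^{-1} y

# Monte Carlo check: draw many noise vectors, estimate c each time, and
# compare the sample covariance of the estimates with the prediction
trials = 200_000
E = rng.multivariate_normal(np.zeros(M), P_e, size=trials)   # (trials, M)
Y = (X @ c)[None, :] + E                                     # noisy observations
C_hat = np.linalg.solve(X, Y.T).T                            # hat{c} per trial
P_c_emp = np.cov(C_hat, rowvar=False)
```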
Reply by Peter Mairhofer April 20, 2016
Hello,

On 2016-04-20 4:19, herrmannsfeldt@gmail.com wrote:
(snip)

> I usually say that SD(c) = SD(e)/sqrt(N), where SD is the Standard
> Deviation, square root of Variance (depending on where you put your N-1).
>
> Since SD(c) has the same units as c and e have, it is an estimate of
> the uncertainty in c, where Var(c) is for c squared.
Sure, both are perfectly valid.
(snip)

> I am not so sure that I understand your question,
For y = Xc + e as defined above, when I have N observations (X has N rows) and M unknowns (M columns), the variances of the error won't depend on N only. At least it must include M, but I think it must also include X itself. From least squares we know that the worst-case error is given by the input SNR multiplied by the condition number (see original post). So what about the "average"?

For the linear transform: Suppose I use the DFT together with a random permutation matrix to transform the system to Y = Pc + E, where the SNR between Y and E is the same as between y and e. If I now sweep N for both cases and plot norm(c-cest)/norm(c) (i.e., the relative reconstruction error), I can observe that the curves behave completely differently: the error in the second variant drops much faster, due to the smaller condition number of P compared to X.

Btw, both cases are *not* proportional to 1/N. When I set c=1, it follows the 1/N as discussed above for the ML estimator.
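The effect described here can be reproduced without the DFT machinery: compare the average relative reconstruction error for an orthonormal system (kappa = 1) against a system with the same column span but uneven column scaling (here kappa = 200). All matrices and values below are illustrative, not Peter's actual setup:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 40
c = np.array([1.0, -1.0, 0.5, 2.0])
sigma = 0.1  # observation-noise standard deviation

# Two system matrices with identical column spans but different conditioning:
# Q has orthonormal columns (kappa = 1); X_bad scales them very unevenly.
Q, _ = np.linalg.qr(rng.standard_normal((N, M)))
X_bad = Q @ np.diag([1.0, 0.5, 0.05, 0.005])   # kappa = 200

def avg_rel_err(X, trials=2000):
    """Average relative reconstruction error of the pseudoinverse estimate."""
    Xpinv = np.linalg.pinv(X)
    y0 = X @ c
    errs = np.empty(trials)
    for t in range(trials):
        y = y0 + sigma * rng.standard_normal(N)
        errs[t] = np.linalg.norm(Xpinv @ y - c) / np.linalg.norm(c)
    return errs.mean()

err_good = avg_rel_err(Q)
err_bad = avg_rel_err(X_bad)
```

Note that the *average* error of the ill-conditioned system is dominated by the directions with small singular values, matching the (X^T X)^{-1} covariance expression rather than the worst-case kappa bound.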
> but it reminds me of Chi square.
>
> As well as I know it, Chi square works in n dimensional problems in
> a similar way that a Gaussian works in 1 dimensional problems.
>
> Also, for n dimensions you have an n by n covariance matrix.
Can you elaborate what specifically you mean? Do you mean something like: if I start from

       y1 = x11*c1 + x12*c2 + e1
       y2 = x21*c1 + x22*c2 + e2
       y3 = x31*c1 + x32*c2 + e3
       y4 = x41*c1 + x42*c2 + e4
       y5 = x51*c1 + x52*c2 + e5

can't I just use the likelihood of the multivariate Gauss distribution? How would I obtain the expression for the variances of c (the estimation error) then?

Generally I miss how the condition number (or something related) would enter the expression, but maybe this is not necessary ...

Thanks,
Peter
Reply by April 20, 2016
On Wednesday, April 20, 2016 at 2:19:32 AM UTC-7, Peter Mairhofer wrote:

> I have the following setup:
> y = Xc + e
> where y is an N-dimensional observation vector, X an NxM matrix, c an M
> dimensional vector of parameters to be identified and e an N dimensional
> noise vector.
> I assume e to be iid white noise. Hence LS and ML are equivalent.
> Assuming M=1 and X1=X=1 (i.e., y = c + e) it is trivial to show:
> Var{c} = 1/N Var{e}
> Meaning: The accuracy of the estimate is indirect proportional to the
> number of measurements I take.
I usually say that SD(c) = SD(e)/sqrt(N), where SD is the Standard Deviation, the square root of the Variance (depending on where you put your N-1).

Since SD(c) has the same units as c and e, it is an estimate of the uncertainty in c, whereas Var(c) is in units of c squared.
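The SD(c) = SD(e)/sqrt(N) scaling is quick to confirm by simulation (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
sd_e = 2.0         # standard deviation of the noise e
trials = 100_000

sd_c = {}
for N in (4, 16, 64):
    # hat{c} is the sample mean of N noisy observations of c = 0;
    # its standard deviation should be close to sd_e / sqrt(N)
    c_hat = rng.normal(0.0, sd_e, size=(trials, N)).mean(axis=1)
    sd_c[N] = c_hat.std()
```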
> Is it possible to derive something similar for multiple parameters, M>1?
> I know it will include the condition number of X (or something about X).
> Assuming I know it. How much I can trade the condition of X vs. required
> numbers of observations for a desired Var{c}? E.g.: I could solve for
> the unknowns using a badly conditioned system and take a certain amount
> of observations. Alternatively, I could use a linear transform such that
> the condition number of the system matrix decreases and hence in turn I
> need fewer observations. (I can see that empirically!)
(snip)

I am not so sure that I understand your question, but it reminds me of Chi square.

As well as I know it, Chi square works in n-dimensional problems in a similar way that a Gaussian works in 1-dimensional problems.

Also, for n dimensions you have an n by n covariance matrix.
Reply by Peter Mairhofer April 20, 2016
Hello,

I have the following setup:

       y = Xc + e

where y is an N-dimensional observation vector, X an NxM matrix, c an M
dimensional vector of parameters to be identified and e an N dimensional
noise vector.

I assume e to be iid white noise. Hence LS and ML are equivalent.

Assuming M=1 and X1=X=1 (i.e., y = c + e) it is trivial to show:

       Var{c} = 1/N Var{e}

Meaning: The accuracy of the estimate is inversely proportional to the
number of measurements I take.

Is it possible to derive something similar for multiple parameters, M>1?


I know it will include the condition number of X (or something about X).
Assuming I know it. How much I can trade the condition of X vs. required
numbers of observations for a desired Var{c}? E.g.: I could solve for
the unknowns using a badly conditioned system and take a certain amount
of observations. Alternatively, I could use a linear transform such that
the condition number of the system matrix decreases and hence in turn I
need fewer observations. (I can see that empirically!)


What confuses me on the way: Using the LS interpretation, the following
error bound is well known:

      norm(cest-c)^2/norm(c)^2 <= kappa^2 * norm(e)^2/norm(Xc)^2

where cest are the estimated coefficients using Moore-Penrose Pseudo
Inverse, c the actual coefficients without noise, kappa the condition
number of X.

Assuming M=1, kappa=1 (this is the case for any vector!) we have:

      norm(cest-c)^2/norm(c)^2 <= norm(e)^2/norm(Xc)^2

or in SNR [dB]:

      SNR_c >= SNR_y

This tells me that the SNR of my estimates (if I just interpret them as
a vector) is at least the SNR of my observations but they are
independent of N! This does not exclude the case from ML where SNR_c =
SNR_y + 20*log(N) but it is still independent from N.

Why?

Maybe the LS approach does not help me here ...

Assumption: ML approach gives me expected outcomes (i.e. average).
LS approach derives worst-case bound.

But there's gotta be a way to express the expected error on c as a
function of the variance of the observation noise and X. HOW?


Thanks,
Peter

PS: X'post to sci.stat.math & comp.dsp