COVID and statistics

Started by March 30, 2020
```I hope everyone is well.  This one is long and possibly boring.

Since we are now apparently all epidemiologists, I thought I'd give
it a try along with everyone else.  I have three technical
questions below.

USA total, and USA by state statistics are available at the COVID
Tracking Project among other places.  I've seen a lot of actors
draw conclusions from such data using pure chartology - no
statistical analysis.

Typically they want to show that the epidemic is growing at different
rates in different locations (which it is).  But since the epidemic also
got started at different times in different locations, they use a
heuristic "zero day".

For example, the number of test positives reached 1.75 per million
population on March 7 in California, but not until March 9 in USA
overall.  Using this two day offset, the chartist can slide the curves
under comparison on top of each other for visual discussion.

(Before we get sidetracked here, I know there are huge numbers
of confounders, my questions are limited to just the mathematics
at hand.)

(1) First question.  Is there a more principled way of selecting this
"zero day" than an arbitrarily chosen heuristic?

Taking the above data and dates, I tested the hypothesis that
the test positives have been growing at a different exponential rate
in California than in USA overall.  I did this by first taking the
log of the cumulative test positives, then taking the first-order difference
of this, and plugging the result into a plain vanilla Student's t-test.

The result is that, with 99.4% probability, the California curve is
slower than the USA curve, at 0.087 decades per day as opposed to

(2) Second question.  A friend who works in health sciences suggests
that a Student's t-test is not a good choice and that I should use an ARIMA
model to "correct for autocorrelation on the regressions".
For this simple hypothesis I'm not seeing it, but is my
friend's observation valid?

(Meta-question: is it common to whiten the data before performing
a statistical test, and if so, why?)

(Other meta-question, does not an ARIMA tool, after it's done munging
and massaging the data series, perform a statistical test such as Student t,
or a more generalized statistical test, anyway?)

On to the next question: I do want to improve upon my result.  One
problem I have is that the early part of the series is noisier than
the later part, since there is more noise averaging
over 6000 positive tests (reported in California on March 29) than over 70
positives (reported on March 7).  (This assumes there is a
fixed noise component on each individual test result.)

What I'd like to do is some form of Maximal Ratio Combining to fix this.
I have not done this yet, but I can figure it out.

(3) Third question: is there a standard way of doing a statistical
test when the SNR in the data series is evolving over time?

Thanks much if you have gotten this far.

Steve
```
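For readers who want to try the procedure described in the post, here is a minimal Python sketch of it: take log10 of the cumulative positives, difference it to get growth in decades per day, and compare two regions with a two-sample t statistic. The Welch (unequal-variance) form is my choice, since the post does not say which variant was used. The last function is one possible reading of the Maximal Ratio Combining idea: an inverse-variance weighted mean, under the assumption that the variance of each log count scales as 1/N and that consecutive days are independent, which is crude for cumulative data. All function names and the counts are made up for illustration; they are not COVID Tracking Project figures.

```python
import math
from statistics import mean, variance

def log10_growth(cumulative):
    """Daily growth in decades per day: first difference of log10(cumulative)."""
    logs = [math.log10(c) for c in cumulative]
    return [b - a for a, b in zip(logs, logs[1:])]

def welch_t(x, y):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def weighted_growth(cumulative):
    """Inverse-variance weighted mean growth rate.

    Assumption (not from the post): Var[log N] is proportional to 1/N and
    consecutive days are independent, so each difference gets weight
    1 / (1/N_prev + 1/N_next).
    """
    rates = log10_growth(cumulative)
    weights = [1.0 / (1.0 / a + 1.0 / b)
               for a, b in zip(cumulative, cumulative[1:])]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)

# Made-up cumulative counts for two hypothetical regions, NOT real data.
region_a = [70, 85, 105, 128, 160, 195, 240, 300]
region_b = [60, 80, 108, 145, 190, 255, 340, 450]

t, df = welch_t(log10_growth(region_a), log10_growth(region_b))
print(f"t = {t:.2f}, df = {df:.1f}")
print(f"weighted growth A = {weighted_growth(region_a):.4f} decades/day")
```

Turning t and df into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(x, y, equal_var=False)`, which computes the same Welch statistic. The weighting leans on later, larger counts, which is the effect the post is after; whether the 1/N noise model is right for this data is a separate question.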
```On Tuesday, March 31, 2020 at 6:42:00 AM UTC+9, Steve Pope wrote:
> [...]

I think a Generalized Additive Model on the log of the data could help to determine whether the traces of two different states are different or not. You could incorporate the autocorrelation in the model and also compute the derivatives to see if it's getting slower or not. Check this out:
http://jacolienvanrij.com/Tutorials/GAMM.html#summed-effects-with-without-random-effects
```
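On the autocorrelation point raised in both posts, a quick diagnostic that stops well short of a GAMM or ARIMA fit is to look at the lag-1 autocorrelation of the daily growth rates and the effective sample size it implies. The sketch below uses the standard AR(1) approximation n_eff = n(1 - r)/(1 + r); it is not the method from the linked tutorial, the function names are mine, and the growth rates are invented for illustration.

```python
from statistics import mean

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    m = mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

def effective_n(x):
    """Effective sample size under an AR(1) approximation: n * (1 - r) / (1 + r)."""
    r = lag1_autocorr(x)
    return len(x) * (1 - r) / (1 + r)

# Made-up daily growth rates in decades per day, NOT real data.
rates = [0.130, 0.125, 0.118, 0.110, 0.104, 0.095, 0.091, 0.084]

r1 = lag1_autocorr(rates)
n_eff = effective_n(rates)
print(f"r1 = {r1:.2f}, n_eff = {n_eff:.1f} of {len(rates)}")
```

If the growth rates drift steadily, as in this invented series, r1 is positive and the effective sample size is smaller than the number of days, so a plain t-test that treats the days as independent would overstate its confidence. That is roughly the concern behind the suggestion to model the autocorrelation explicitly.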