I hope everyone is well. This one is long and possibly boring.

Since we are now apparently all epidemiologists, I thought I'd give it a try along with everyone else. I have three technical questions below.

USA total and USA by-state statistics are available at the COVID Tracking Project, among other places. I've seen a lot of actors draw conclusions from such data using pure chartology, with no statistical analysis.

Typically they want to show that the epidemic is growing at different rates in different locations (which it is). But since the epidemic also got started at different times in different locations, they use a heuristic "zero day".

For example, the number of test positives reached 1.75 per million population on March 7 in California, but not until March 9 in the USA overall. Using this two-day offset, the chartist can slide the curves under comparison on top of each other for visual discussion.

(Before we get sidetracked here: I know there are huge numbers of confounders. My questions are limited to just the mathematics at hand.)

(1) First question. Is there a more scientific way of selecting this "zero day" than an arbitrarily chosen heuristic?

Taking the above data and dates, I tested the hypothesis that the test positives have been growing at a different exponential rate in California than in the USA overall. I did this by taking the log of the cumulative test positives, taking the first-order difference of that, and plugging the result into a plain vanilla Student's t-test.

The result is that with 99.4% probability, the California curve is slower than the USA curve, at 0.087 decades per day as opposed to 0.120 decades per day.

(2) Second question. A friend who works in health sciences suggests that a Student's t-test is not a good choice and that I should use an ARIMA model to "correct for autocorrelation on the regressions". Myself, for this simple hypothesis, I'm not seeing this, but is my friend's observation valid?

(Meta-question: is it common to whiten the data before performing a statistical test, and if so, why?)

(Other meta-question: does not an ARIMA tool, after it's done munging and massaging the data series, perform a statistical test such as Student's t, or a more generalized statistical test, anyway?)

On to the next question. I do want to improve upon my result. One problem I have is that the early part of the series is noisier than the later part, since there is more noise averaging over 6000 positive tests (reported in California on March 29) than over 70 positives (reported on March 7). (This is under the assumption that there is a fixed noise component on each individual test result.)

What I'd like to do is some form of Maximal Ratio Combining to fix this. I have not yet done it, but I can figure it out.

(3) Third question. Is there a standard way of doing a statistical test when the SNR in the data series is evolving over time?

Thanks much if you have gotten this far.

Steve
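For readers who want to try the procedure described above, here is a minimal sketch under stated assumptions. It takes the log10 of cumulative counts, differences it to get growth in decades per day, and computes Welch's t statistic (the unequal-variance form of the two-sample t-test, which may or may not be the variant used in the post). The helper names and the two series are made up for illustration and are not the COVID Tracking Project numbers. The weighting function is one possible reading of the Maximal Ratio Combining idea: it assumes the variance of a log count scales as 1/count and that days are independent, both of which are simplifications.

```python
import math
from statistics import mean, variance


def synthetic_series(start, daily_rates):
    """Build made-up cumulative counts from a start value and log10 growth rates."""
    counts = [start]
    for rate in daily_rates:
        counts.append(round(counts[-1] * 10 ** rate))
    return counts


def log10_growth_rates(cumulative):
    """First difference of log10(cumulative): growth in decades per day."""
    logs = [math.log10(c) for c in cumulative]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


def welch_t(x, y):
    """Welch's t statistic and its Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def weighted_mean_rate(cumulative):
    """Mean growth rate with inverse-variance weights.

    Assumption: var(log count) is proportional to 1/count, so a one-day
    difference between counts a and b has variance proportional to
    1/a + 1/b, giving it a weight of a*b/(a + b).
    """
    rates = log10_growth_rates(cumulative)
    weights = [a * b / (a + b) for a, b in zip(cumulative, cumulative[1:])]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)


# Made-up inputs for illustration only, NOT real case counts.
slow = synthetic_series(70, [0.09, 0.08, 0.10, 0.07, 0.09, 0.08, 0.10, 0.09])
fast = synthetic_series(70, [0.12, 0.13, 0.11, 0.12, 0.14, 0.11, 0.12, 0.13])

slow_rates = log10_growth_rates(slow)
fast_rates = log10_growth_rates(fast)
t_stat, df = welch_t(slow_rates, fast_rates)

print(f"mean rates: {mean(slow_rates):.3f} vs {mean(fast_rates):.3f} decades/day")
print(f"Welch t = {t_stat:.2f} on {df:.1f} degrees of freedom")
print(f"weighted slow rate: {weighted_mean_rate(slow):.3f} decades/day")
```

Turning the statistic into a p-value needs a t-distribution CDF, which the standard library lacks; `scipy.stats.ttest_ind(x, y, equal_var=False)` computes both in one call. Note that this test treats the daily differences as independent samples, which is exactly the point the second question raises.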
COVID and statistics
Started by ●March 30, 2020
Reply by ●March 30, 2020
On Tuesday, March 31, 2020 at 6:42:00 AM UTC+9, Steve Pope wrote:
> I hope everyone is well. This one is long and possibly boring.
>
> Since now we are apparently all epidemiologists I thought I'd give
> it a try along with everyone else. And I have three technical
> questions in the below.
>
> USA total, and USA by state statistics are available at the COVID
> Tracking Project among other places. I've seen a lot of actors
> draw conclusions from such data using pure chartology - no
> statistical analysis.
>
> Typically they want to show that the epidemic is growing at different
> rates in different locations (which it is). But since the epidemic also
> got started at different times in different locations, they use a
> heuristic "zero day".
>
> For example, the number of test positives reached 1.75 per million
> population on March 7 in California, but not until March 9 in USA
> overall. Using this two day offset, the chartist can slide the curves
> under comparison on top of each other for visual discussion.
>
> (Before we get sidetracked here, I know there are huge numbers
> of confounders, my questions are limited to just the mathematics
> at hand.)
>
> (1) First question. Is there a more scientific way of selecting this
> "zero day" other than by a randomly-chosen heuristic?
>
> Taking the above data and dates, I tested the hypothesis of whether
> the test positives have been growing at a different exponential rate
> in California than in USA overall. I did this by first taking the
> log of the cumulative test positives; taking the first order difference
> of this; and plugging this into a plain vanilla student t test.
>
> The result is that with 99.4% probability, The California curve is
> slower than the USA curve, at 0.087 decades per day as opposed to
> 0.120 decades per day.
>
> (2) Second question. A friend who works in health sciences suggests
> a student T test is not a good choice and I should use an ARIMA
> model to "correct for autocorrelation on the regressions".
> Myself, for this simple hypothesis, I'm not seeing this, but is my
> friend's observation valid?
>
> (Meta-question: is it common to whiten the data before performing
> a statistical test, and if so, why?)
>
> (Other meta-question, does not an ARIMA tool, after it's done munging
> and massaging the data series, perform a statistical test such as Student t,
> or a more generalized statistical test, anyway?)
>
> On to the next question: I do want to improve upon my result. One
> problem I have is that the early part of the series is noisier than
> the later part of the series since there is more noise averaging
> over 6000 positive tests (reported in California on March 29) than on 70
> positives (reported on March 7). (Under the assumption there is
> fixed noise component on each individual test result.)
>
> What I'd like to do is some form of Maximal Ratio Combining to fix this.
> Which I have not yet done, but I can figure it out.
>
> (3) Third question: is there a standard way of doing a statistical
> test when the SNR in the data series is evolving over time?
>
> Thanks much if you have gotten this far.
>
> Steve

I think a Generalized Additive Model on the log of the data could help determine whether the traces of two different states differ. You could incorporate the autocorrelation in the model and also compute the derivatives to see whether growth is slowing.

Check this out: http://jacolienvanrij.com/Tutorials/GAMM.html#summed-effects-with-without-random-effects
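As a much cruder stand-in for the two ideas in this reply (it is not a GAM, and it is not taken from the linked tutorial), the sketch below estimates the lag-1 autocorrelation of a series of daily growth rates, converts it into an approximate effective sample size under an AR(1) assumption, and fits a straight-line trend to the rates as a rough "derivative". The function names and the input rates are hypothetical and chosen only for illustration. Clamping negative autocorrelation to zero is a conservative choice made here, not a standard requirement.

```python
from statistics import mean


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    m = mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


def effective_sample_size(x):
    """Approximate n_eff = n * (1 - rho) / (1 + rho), assuming AR(1) errors.

    rho is clamped to [0, 0.99] so that n_eff never exceeds n and the
    formula stays finite (an assumption made for this sketch).
    """
    rho = min(max(lag1_autocorr(x), 0.0), 0.99)
    return len(x) * (1 - rho) / (1 + rho)


def trend_slope(x):
    """Least-squares slope of x against the day index 0, 1, 2, ..."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = mean(x)
    num = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


# Made-up daily growth rates in decades per day, for illustration only.
rates = [0.14, 0.13, 0.13, 0.12, 0.11, 0.11, 0.10, 0.09]

rho = lag1_autocorr(rates)
n_eff = effective_sample_size(rates)
slope = trend_slope(rates)

print(f"lag-1 autocorrelation: {rho:.2f}")
print(f"effective sample size: {n_eff:.1f} of {len(rates)}")
print(f"trend in growth rate: {slope:.4f} decades/day per day")
```

A positive autocorrelation shrinks the effective sample size, which is why a t-test that treats the days as independent tends to overstate significance. In a series like this one, much of the autocorrelation comes from the downward trend itself, which is part of the argument for fitting a smooth model with an explicit error structure instead.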