To estimate the prevalence of the COVID-19 virus in a given population, a simple strategy would be to take a random sample and test everyone. Usually the cost of sampling is proportional to the number of individuals in the sample. However, in the current scenario, there are two different cost aspects: sample collection and the actual testing. Given the limited availability of test kits / reagents, a possible strategy is to combine several individual samples (swabs) and test them together.
The following analysis assumes a very simple setup: that the cost of collecting samples (swabs) is negligible, but the cost of testing is not. In practice, the considerations will be more complex, but the same ideas should go through.
Assuming that samples (swabs) are randomly mixed before being tested, the estimation calculations are fairly simple. Suppose that $k$ swabs are pooled together and tested as a single group, and that $m$ such groups are tested, for a total of $n = km$ individuals sampled.

Here the idea is that a pooled test will be positive if and only if at least one of the $k$ swabs in the group comes from an infected individual.

Suppose the true proportion of positive cases in the population is $p$. Then each pooled test is positive with probability $q = 1 - (1 - p)^k$, and the number of positive tests $T$ among the $m$ groups has a $\mathrm{Binomial}(m, q)$ distribution.

The obvious point estimates then are
$$
\hat{q} = \frac{T}{m}, \qquad \hat{p} = 1 - (1 - \hat{q})^{1/k}.
$$

Ideally, we want to choose $k$ as large as possible, so that relatively few tests are needed, without sacrificing too much precision in estimating $p$.
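As a minimal sketch, the point estimate can be computed as follows (the helper name est.p is ad hoc and not part of the original code):

## Point estimate of p from t positive pooled tests, with k swabs per
## group and m groups (illustrative helper, not from the original analysis)
est.p <- function(t, k, m)
{
    qhat <- t / m           # estimated probability that a pooled test is positive
    1 - (1 - qhat)^(1/k)    # back-transformed estimate of the prevalence p
}
## e.g., 95 positive results out of m = 1000 pools of k = 10 swabs each
est.p(95, k = 10, m = 1000)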
The following plot shows simulated values of $\hat{p}$ for $k = 1, \dots, 70$, with $m = 1000$ pooled tests, true prevalence $p = 0.01$, and 500 simulation replicates for each value of $k$.
library(lattice)
library(latticeExtra)
M <- 1000  # number of pooled tests (groups)
P <- 0.01  # true prevalence
g <- expand.grid(m = M, k = 1:70, p = P, rep = 1:500)
g <- within(g,
{
    ## number of positive pooled tests, and the corresponding estimates
    t <- rbinom(nrow(g), size = m, prob = 1 - (1-p)^k)
    qhat <- t / m
    phat <- 1 - (1-qhat)^(1/k)
})
xyplot(phat ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       abline = list(h = P), ylab = expression(hat(p)))
Clearly, increasing $k$ does not hurt the precision of $\hat{p}$ here; if anything, the estimates become noticeably less variable as $k$ increases.
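To quantify this, an approximate expression for the variance (and hence, since $\hat{p}$ is approximately unbiased, the mean squared error) of $\hat{p}$ can be obtained by the delta method. A sketch of the calculation, using $1 - q = (1 - p)^k$:
$$
\mathrm{Var}(\hat{q}) = \frac{q(1 - q)}{m}, \qquad
\hat{p} = h(\hat{q}) \text{ where } h(q) = 1 - (1 - q)^{1/k}, \quad
h'(q) = \frac{1}{k} (1 - q)^{\frac{1}{k} - 1},
$$
so that
$$
\mathrm{Var}(\hat{p}) \approx \bigl[ h'(q) \bigr]^2 \, \mathrm{Var}(\hat{q})
= \frac{q \, (1 - q)^{\frac{2}{k} - 1}}{m k^2}
= \frac{\bigl( 1 - (1 - p)^k \bigr) (1 - p)^{2 - k}}{m k^2}.
$$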
The following plot shows the (simulation) mean squared error in estimating $p$ as a function of $k$, with this theoretical approximation overlaid as a smooth curve.
## empirical MSE by k, with the theoretical (delta method) approximation overlaid
xyplot(tapply((phat - p)^2, k, mean) ~ tapply(k, k, unique),
       data = g, grid = TRUE, type = "o", pch = 16,
       xlab = "k",
       ylab = expression("Mean Square Error of " * hat(p))) +
    layer_(panel.curve((1 - (1-P)^x) * (1-P)^(2-x) / (M*x^2), from = 1))
This confirms our visual impression: when $p = 0.01$, increasing $k$ (at least up to $k = 70$) actually improves the precision of $\hat{p}$, essentially because the total number of individuals sampled, $n = km$, grows with $k$ while the number of tests remains fixed at $m$.
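As a quick numerical check (an illustration; the helper name mse.theo is ad hoc), the theoretical expression can be evaluated directly:

## theoretical (delta method) MSE of phat for given k, p, and number of tests m
mse.theo <- function(k, p, m) (1 - (1-p)^k) * (1-p)^(2-k) / (m * k^2)
## with p = 0.01 and m = 1000, pooling reduces the MSE substantially
mse.theo(k = c(1, 10, 70), p = 0.01, m = 1000)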
The following plot again shows simulated values of $\hat{p}$ for $k = 1, \dots, 70$, but now with a higher true prevalence of $p = 0.1$.
M <- 1000
P <- 0.1  # higher true prevalence
g <- expand.grid(m = M, k = 1:70, p = P, rep = 1:500)
g <- within(g,
{
    t <- rbinom(nrow(g), size = m, prob = 1 - (1-p)^k)
    qhat <- t / m
    phat <- 1 - (1-qhat)^(1/k)
})
xyplot(phat ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       abline = list(h = P), ylab = expression(hat(p)))
The problem here is that when too many samples are combined, the per-test positive probability $q = 1 - (1 - p)^k$ becomes very close to 1, and often all $m$ pooled tests turn out positive, in which case $\hat{q} = 1$ and hence $\hat{p} = 1$.
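This can be checked directly from the simulated replications (a quick illustration, not part of the original code):

## proportion of simulated replications in which all m pooled tests are
## positive (so that phat = 1), shown for k = 10, 20, ..., 70
with(g, tapply(phat == 1, k, mean))[seq(10, 70, by = 10)]
## under the binomial model, this probability is q^m = (1 - (1-p)^k)^m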
Restricting to $k \leq 40$ and dropping the replications where $\hat{p} = 1$, we get the following plot.
xyplot(phat ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       subset = (k <= 40) & (phat < 1),
       abline = list(h = P), ylab = expression(hat(p)))
Here, increasing $k$ beyond a point clearly increases the variability of $\hat{p}$.
The plot of the corresponding (empirical and theoretical) mean squared error suggests an optimal range of $k$, with precision deteriorating when $k$ is either too small or too large.
xyplot(tapply((phat - p)^2, k, mean) ~ tapply(k, k, unique),
       data = g, grid = TRUE, type = "o", pch = 16,
       subset = (k <= 40) & (phat < 1),
       xlab = "k",
       ylab = expression("Mean Square Error of " * hat(p))) +
    layer(panel.curve((1 - (1-P)^x) * (1-P)^(2-x) / (M*x^2), from = 1))
The following plot shows how the theoretical precision (mean squared error) changes with $k$ for several values of $p$, keeping $m = 1000$ tests fixed.
mse.p <- expand.grid(m = 1000, k = 1:30, p = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.3))
mse.p <- transform(mse.p,
                   MSE = ((1 - (1-p)^k) * (1-p)^(2-k) / (m*k^2)),
                   P = factor(p, labels = sprintf("p = %g", sort(unique(p)))))
xyplot(MSE ~ k | P, mse.p, type = "l", as.table = TRUE, grid = TRUE,
       scales = list(y = "free", rot = 0, x = list(alternating = 3)),
       between = list(y = 1))
Given the risk of making all pooled tests positive (in which case $\hat{p} = 1$) when $k$ is too large, it seems prudent to err on the side of a smaller $k$.
The actual optimization problem will have to be formulated depending on the practical constraints. But ultimately it boils down to choosing an optimal value of $k$, given the number of tests $m$ that can be afforded and a working guess for the prevalence $p$.
The optimal choice of $k$ naturally depends on $p$: the lower the prevalence, the more swabs can usefully be pooled per test. For a given $p$ and $m$, the optimal $k$ can be obtained by minimizing the approximate mean squared error given above.
The explicit expression for the mean squared error, although valid only asymptotically, seems to be in good agreement with simulated results. This makes the optimization exercise fairly simple.
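For instance, reusing the ad hoc mse.theo() helper sketched earlier, the minimization could be carried out numerically along these lines (an illustration, not part of the original code):

## numerically find the k minimizing the approximate MSE for a given p,
## with m = 1000 tests; in practice k would be rounded to a nearby integer
optimal.k <- function(p, m = 1000, kmax = 500)
    optimize(mse.theo, interval = c(1, kmax), p = p, m = m)$minimum
sapply(c(0.01, 0.02, 0.05, 0.1, 0.2, 0.3), optimal.k)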
Simulation is still a useful tool to anticipate extreme behaviour not reflected by the mean squared error alone (such as all pooled tests turning out to be positive, which makes $\hat{p} = 1$).
Of course, we will not know $p$ in advance, so the choice of $k$ will have to be based on a preliminary guess, and should be reasonably robust to that guess being wrong.
Once a value of $k$ is chosen and the $m$ pooled tests are performed, the resulting point estimate $\hat{p}$ can be supplemented by a confidence bound for $p$.
Although not directly relevant for choosing $k$, a lower confidence limit for $p$ can be obtained either from a normal approximation for $\hat{q} = T/m$, or from the exact binomial distribution of $T$.
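Explicitly, with confidence level $1 - \alpha$, the normal approximation used below computes
$$
\hat{q}_{\mathrm{lo}} = \hat{q} - z_{1 - \alpha} \sqrt{\frac{\hat{q} (1 - \hat{q})}{m}},
\qquad
\hat{p}_{\mathrm{lo}} = 1 - (1 - \hat{q}_{\mathrm{lo}})^{1/k},
$$
while the exact version uses uniroot() to find the value of $p_0$ at which the binomial tail probability $P(T > t \mid p_0)$ equals $\alpha$.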
An R implementation of both is given below for reference.
normal.lcl <- function(t, k = 1, m = 100, level = 0.95)
{
    ## Approximate (Gaussian) lower confidence limit for p when
    ## - k swabs tested in each group
    ## - m such groups (for a total of n = k*m individuals sampled)
    qhat <- t / m
    qlo <- t / m + qnorm(1-level) * sqrt(qhat * (1-qhat) / m)
    1 - (1-qlo)^(1/k)
}
exact.lcl <- function(t, k = 1, m = 100, level = 0.95)
{
    ## Exact (binomial) lower confidence limit for p when
    ## - k swabs tested in each group
    ## - m such groups (for a total of n = k*m individuals sampled)
    pval.shifted <- function(p0)
    {
        pbinom(t, m, 1 - (1 - p0)^k, lower.tail = FALSE) - (1-level)
    }
    lcl <- try(uniroot(pval.shifted, c(0, t/m))$root, silent = TRUE)
    if (inherits(lcl, "try-error")) NA else lcl
}
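As a usage sketch (the inputs here are arbitrary illustrative numbers, not data from the analysis):

## say 95 of m = 1000 pooled tests, each pooling k = 10 swabs, are positive
normal.lcl(95, k = 10, m = 1000)  # approximate 95% lower confidence limit for p
exact.lcl(95, k = 10, m = 1000)   # exact (binomial) 95% lower confidence limit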
The following plot shows the approximate lower confidence limits in the same setup as above (but with $k$ restricted to $1, \dots, 40$), for $p = 0.01$. The red curve traces the empirical 10% quantile of the simulated lower limits for each $k$.
M <- 1000
P <- 0.01
g <- expand.grid(m = M, k = 1:40, p = P, rep = 1:500)
g <- within(g,
{
    t <- rbinom(nrow(g), size = m, prob = 1 - (1-p)^k)
    lcl.approx <- normal.lcl(t, k = k, m = m, level = 0.95)
})
xyplot(lcl.approx ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       abline = list(h = P), ylab = expression("LCL for p")) +
    layer(panel.average(x, y, fun = function(x) quantile(x, 0.1),
                        horizontal = FALSE, col = "red"))
A similar plot for $p = 0.1$:
M <- 1000
P <- 0.1
g <- expand.grid(m = M, k = 1:40, p = P, rep = 1:500)
g <- within(g,
{
    t <- rbinom(nrow(g), size = m, prob = 1 - (1-p)^k)
    lcl.approx <- normal.lcl(t, k = k, m = m, level = 0.95)
})
xyplot(lcl.approx ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       abline = list(h = P), ylab = expression("LCL for p")) +
    layer(panel.average(x, y, fun = function(x) quantile(x, 0.1),
                        horizontal = FALSE, col = "red"))
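For completeness, the exact limits could be added in a similar way. Note that exact.lcl() calls uniroot() and so is not vectorized; it has to be applied elementwise, which is noticeably slower (a sketch, not part of the original analysis):

## elementwise exact lower confidence limits for the simulated p = 0.1 data
g$lcl.exact <- with(g, mapply(exact.lcl, t = t, k = k, m = m))
xyplot(lcl.exact ~ k, data = g, jitter.x = TRUE, grid = TRUE, alpha = 0.2,
       abline = list(h = P), ylab = expression("exact LCL for p")) +
    layer(panel.average(x, y, fun = function(x) quantile(x, 0.1, na.rm = TRUE),
                        horizontal = FALSE, col = "red"))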