With Ken Cor.

What is the least amount of data you need to collect to estimate the population mean with a particular standard error? The simplest case is well known: to estimate the mean of a binary variable using simple random sampling, with a conservative estimate of the variance (p = .5) and a ±3 percentage point margin of error, you need n ∼ 1,000. The simplest case, however, assumes little to no information. Often, we know more. In opinion polling, we generally know the sociodemographic strata in the population, and we have historical data on the variability within strata. Take, for instance, measuring support for Mr. Obama. A polling company like YouGov will usually have a long time series, including information about respondent characteristics. Using this data, the company could derive how variable support for Mr. Obama is among different sociodemographic groups. With information about strata and strata variances, we can often poll fewer people than simple random sampling requires to estimate the population mean with a particular standard error. In a note (pdf), we show how.
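The n ∼ 1,000 figure for the simplest case can be checked directly. The snippet below is a minimal sketch, assuming a 95% confidence level (z ≈ 1.96) and reading ±3 as a margin of error of 3 percentage points; the function name `srs_sample_size` is illustrative and does not come from the note or the repo.

```python
def srs_sample_size(p=0.5, margin=0.03, z=1.96):
    """Sample size n that solves z * sqrt(p * (1 - p) / n) = margin.

    Assumes simple random sampling from a large population
    (no finite-population correction).
    """
    return p * (1 - p) * (z / margin) ** 2


# Conservative variance (p = .5), +/- 3 points, 95% confidence (assumed).
print(round(srs_sample_size()))
```

Under these assumptions the result is a little over 1,000, in line with the commonly quoted figure.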

### Why bother?

In a realistic example, we find that optimal allocation requires about 6.5% fewer respondents than simple random sampling to reach the same standard error (see the code block below).

Assume two groups, a and b. Using the notation in the note (see the pdf), wa denotes the proportion of group a in the population, vara and varb denote the variances of groups a and b respectively, pa and pb denote the group means, and p denotes the overall mean (pop_mean in the code block). For a target standard error of .015, the simple random sampling formula says you need to sample 1,095 people. If you optimally exploit the information about strata and strata variances, you need to sample just 1,024 people.

```
## The Benefit of Using Optimal Allocation Rules
## wa = .8
## vara = .25; pa = .5
## varb = .16; pb = .8
##
## SRS: pop_mean p = .8*.5 + .2*.8 = .56
## sqrt(p*(1 - p)/n) = .015
## n = p*(1 - p)/.015^2 = 1095
##
## Optimal allocation:
## optimal_n_plus_allocation(.8, .25, .16, .015)
##    n  na  nb
## 1024 853 171
```
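The arithmetic above can be reproduced in a few lines. What follows is a minimal sketch, not the repo's `optimal_n_plus_allocation` function: it assumes equal per-interview costs across strata and no finite-population correction, which gives the textbook Neyman allocation (sample each stratum in proportion to its weight times its standard deviation). The function names `neyman_allocation` and `srs_n` are illustrative.

```python
import math


def neyman_allocation(weights, variances, se):
    """Total n and per-stratum n that hit a target standard error.

    Assumes equal costs per interview and a large population.
    With n_h proportional to w_h * s_h, the variance of the stratified
    mean is (sum_h w_h * s_h)^2 / n, so n = (sum_h w_h * s_h)^2 / se^2.
    """
    sds = [math.sqrt(v) for v in variances]
    total = sum(w * s for w, s in zip(weights, sds))
    n = total ** 2 / se ** 2
    alloc = [n * w * s / total for w, s in zip(weights, sds)]
    return n, alloc


def srs_n(weights, means, se):
    """Simple random sampling n for a binary outcome: p * (1 - p) / se^2."""
    p = sum(w * m for w, m in zip(weights, means))
    return p * (1 - p) / se ** 2


# Values from the example: wa = .8, vara = .25, varb = .16, se = .015.
n_opt, (na, nb) = neyman_allocation([0.8, 0.2], [0.25, 0.16], 0.015)
n_srs = srs_n([0.8, 0.2], [0.5, 0.8], 0.015)

print(round(n_srs), round(n_opt), round(na), round(nb))
print(f"saving: {1 - n_opt / n_srs:.1%}")
```

With the inputs from the example, this gives the same totals and allocation as the code block above, and the saving works out to roughly 6.5%.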

**GitHub repo:** https://github.com/soodoku/optimal_data_collection/