There are already excellent answers to this question, but I want to address why the standard error is what it is, why we use $p = 0.5$ as the worst case, and how the standard error varies with $n$.
Suppose we conduct a poll of just one voter, let's call them voter 1, and ask "will you vote for the Purple Party?" We can code the reply as 1 for "yes" and 0 for "no". Say the probability of a "yes" is $p$. We now have a binary random variable $X_1$ which is 1 with probability $p$ and 0 with probability $1-p$. We say that $X_1$ is a Bernoulli variable with probability of success $p$, which we can write as $X_1 \sim \mathrm{Bernoulli}(p)$. The expected, or mean, value of $X_1$ is given by $\mathbb{E}(X_1) = \sum_x x \, \mathbb{P}(X_1 = x)$, where we sum over all possible outcomes $x$ of $X_1$. But there are only two outcomes, 0 with probability $1-p$ and 1 with probability $p$, so the sum is just $\mathbb{E}(X_1) = 0(1-p) + 1(p) = p$. Stop and think. This actually looks completely reasonable: if there is a 30% chance of voter 1 supporting the Purple Party, and we code the variable as 1 if they say "yes" and 0 if they say "no", then we'd expect $X_1$ to be 0.3 on average.
Let's think about what happens when we square $X_1$. If $X_1 = 0$ then $X_1^2 = 0$, and if $X_1 = 1$ then $X_1^2 = 1$, so in fact $X_1^2 = X_1$ in both cases. Since they are equal, they must have the same expected value, so $\mathbb{E}(X_1^2) = p$. This gives me an easy way of calculating the variance of a Bernoulli variable: I use $\mathrm{Var}(X_1) = \mathbb{E}(X_1^2) - \mathbb{E}(X_1)^2 = p - p^2 = p(1-p)$, and so the standard deviation is $\sigma_{X_1} = \sqrt{p(1-p)}$.
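To make this concrete, here is a minimal Python sketch (not part of the original argument, just an illustrative simulation; the value $p = 0.3$ is the running example from above) that draws many Bernoulli values and checks that the sample mean and variance land near $p$ and $p(1-p)$:

```python
import random

p = 0.3            # assumed probability of a "yes"
trials = 100_000   # number of simulated voters

# Draw Bernoulli(p) values: 1 with probability p, 0 otherwise
draws = [1 if random.random() < p else 0 for _ in range(trials)]

mean = sum(draws) / trials
var = sum((x - mean) ** 2 for x in draws) / trials

print(f"sample mean:     {mean:.4f}   (theory: p      = {p})")
print(f"sample variance: {var:.4f}   (theory: p(1-p) = {p * (1 - p):.4f})")
```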
Obviously I want to talk to other voters too: let's call them voter 2, voter 3, up to voter $n$. Let's assume they all have the same probability $p$ of supporting the Purple Party. We now have $n$ Bernoulli variables, $X_1$, $X_2$ up to $X_n$, with each $X_i \sim \mathrm{Bernoulli}(p)$ for $i$ from 1 to $n$. They all have the same mean, $p$, and variance, $p(1-p)$.
I'd like to find out how many people in my sample said "yes", and to do that I can just add up all the $X_i$. I'll write $X = \sum_{i=1}^{n} X_i$. I can calculate the mean or expected value of $X$ using the rule that $\mathbb{E}(X + Y) = \mathbb{E}(X) + \mathbb{E}(Y)$ if those expectations exist, and extending it to $\mathbb{E}(X_1 + X_2 + \ldots + X_n) = \mathbb{E}(X_1) + \mathbb{E}(X_2) + \ldots + \mathbb{E}(X_n)$. But I am adding up $n$ of those expectations, and each is $p$, so I get in total that $\mathbb{E}(X) = np$. Stop and think. If I poll 200 people and each has a 30% chance of saying they support the Purple Party, of course I'd expect 0.3 × 200 = 60 people to say "yes". So the $np$ formula looks right. Less "obvious" is how to deal with the variance.
There is a rule that says $\mathrm{Var}(X_1 + X_2 + \ldots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \ldots + \mathrm{Var}(X_n)$, but I can only use it if my random variables are independent of each other. So fine, let's make that assumption, and by a similar logic to before I can see that $\mathrm{Var}(X) = np(1-p)$. If a variable $X$ is the sum of $n$ independent Bernoulli trials, with identical probability of success $p$, then we say that $X$ has a binomial distribution, $X \sim \mathrm{Binomial}(n, p)$. We have just shown that the mean of such a binomial distribution is $np$ and the variance is $np(1-p)$.
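A quick simulation sketch (again illustrative, reusing the $n = 200$, $p = 0.3$ example) of the binomial total, to confirm the mean $np$ and variance $np(1-p)$ we just derived:

```python
import random

n, p = 200, 0.3   # poll size and assumed support
reps = 20_000     # number of simulated polls

# Each poll total is a Binomial(n, p) draw: the sum of n Bernoulli(p) voters
totals = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]

mean = sum(totals) / reps
var = sum((x - mean) ** 2 for x in totals) / reps

print(f"mean of totals:     {mean:7.2f}   (theory: np      = {n * p:.0f})")
print(f"variance of totals: {var:7.2f}   (theory: np(1-p) = {n * p * (1 - p):.0f})")
```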
Our natural estimate of the unknown $p$ is the proportion of "yes" replies in our sample, $\hat{p} = X/n$. For instance, if 64 out of our sample of 200 people said "yes", we'd estimate that $64/200 = 0.32 = 32\%$ of people support the Purple Party. You can see that $\hat{p}$ is a "scaled-down" version of our total number of yes-voters, $X$. That means it is still a random variable, but it no longer follows the binomial distribution. We can find its mean and variance, because when we scale a random variable by a constant factor $k$ it obeys the following rules: $\mathbb{E}(kX) = k\,\mathbb{E}(X)$ (so the mean scales by the same factor $k$) and $\mathrm{Var}(kX) = k^2 \mathrm{Var}(X)$. Note how the variance scales by $k^2$. That makes sense once you know that, in general, variance is measured in the square of whatever units the variable itself is measured in: not so applicable here, but if our random variable had been a height in cm then the variance would be in cm², which scales differently: if you double lengths, you quadruple areas.
Here our scale factor is $\frac{1}{n}$. This gives us $\mathbb{E}(\hat{p}) = \frac{1}{n}\mathbb{E}(X) = \frac{np}{n} = p$. This is great! On average, our estimator $\hat{p}$ is exactly what it "should" be, the true (or population) probability that a random voter says they will vote for the Purple Party. We say that our estimator is unbiased. But while it is correct on average, sometimes it will be too low and sometimes too high. We can see just how wrong it is likely to be by looking at its variance: $\mathrm{Var}(\hat{p}) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$. The standard deviation is the square root, $\sqrt{\frac{p(1-p)}{n}}$, and because it gives us a grasp of how badly our estimator will be off (it is effectively a root mean square error, a way of calculating the average error that treats positive and negative errors as equally bad, by squaring them before averaging), it is usually called the standard error. A good rule of thumb, which works well for large samples and which can be dealt with more rigorously using the famous Central Limit Theorem, is that most of the time (about 95%) the estimate will be wrong by less than two standard errors.
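To illustrate the "about 95% within two standard errors" rule of thumb, here is a hedged simulation sketch: it repeats the poll many times and counts how often $\hat{p}$ falls within two standard errors of the true $p$. The numbers 200 and 0.3 are just the running example, not anything fixed by the theory.

```python
import random

n, p = 200, 0.3                  # the running example: 200 voters, 30% support
reps = 20_000                    # number of simulated polls
se = (p * (1 - p) / n) ** 0.5    # standard error of p-hat

# Count how often the estimate lands within two standard errors of the truth
hits = 0
for _ in range(reps):
    p_hat = sum(1 for _ in range(n) if random.random() < p) / n
    if abs(p_hat - p) < 2 * se:
        hits += 1

print(f"standard error: {se:.4f}")
print(f"within 2 SE:    {hits / reps:.3f}   (rule of thumb: about 0.95)")
```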
Since it appears in the denominator of the fraction, higher values of $n$ (bigger samples) make the standard error smaller. That is great news: if I want a small standard error, I just make the sample size big enough. The bad news is that $n$ is inside a square root, so if I quadruple the sample size I will only halve the standard error. Very small standard errors are going to involve very, very large, hence expensive, samples. There's another problem: if I want to target a particular standard error, say 1%, then I need to know what value of $p$ to use in my calculation. I might use historic values if I have past polling data, but I would like to prepare for the worst possible case. Which value of $p$ is most problematic? A graph is instructive.
The worst-case (highest) standard error will occur when $p = 0.5$. To prove that I could use calculus, but some high school algebra will do the trick, so long as I know how to "complete the square".
$$\sqrt{p(1-p)} = \sqrt{p - p^2} = \sqrt{\tfrac{1}{4} - \left(p^2 - p + \tfrac{1}{4}\right)} = \sqrt{\tfrac{1}{4} - \left(p - \tfrac{1}{2}\right)^2}$$
The expression in the brackets is squared, so it will always return a zero or positive answer, which then gets taken away from a quarter. In the worst case (largest standard error) as little as possible gets taken away. I know the least that can be subtracted is zero, and that will occur when $p - \frac{1}{2} = 0$, i.e. when $p = \frac{1}{2}$. The upshot of this is that I get bigger standard errors when trying to estimate support for, e.g., political parties near 50% of the vote, and lower standard errors for estimating support for propositions which are substantially more or substantially less popular than that. In fact the symmetry of my graph and equation show me that I would get the same standard error for my estimates of support for the Purple Party, whether they had 30% popular support or 70%.
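If you prefer a numerical check to the algebra, this small sketch tabulates $\sqrt{p(1-p)}$ over a few values of $p$; the maximum sits at $p = 0.5$, and the values for $p$ and $1-p$ agree, matching the symmetry argument above:

```python
# Tabulate sqrt(p(1-p)): largest at p = 0.5, symmetric about it
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}  ->  sqrt(p(1-p)) = {(p * (1 - p)) ** 0.5:.4f}")
```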
So how many people do I need to poll to keep the standard error below 1%? This would mean that, the vast majority of the time, my estimate will be within 2% of the correct proportion. I now know that the worst-case standard error is $\sqrt{\frac{0.25}{n}} = \frac{0.5}{\sqrt{n}} < 0.01$, which gives me $\sqrt{n} > 50$ and so $n > 2500$. That would explain why you see polling figures in the thousands.
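As a sketch, this calculation generalises to a one-line function (the name `sample_size` is mine, not anything standard) giving the smallest $n$ that keeps the standard error at or below a target:

```python
import math

def sample_size(target_se: float, p: float = 0.5) -> int:
    """Smallest n with sqrt(p(1-p)/n) at or below target_se (p = 0.5 is the worst case)."""
    return math.ceil(p * (1 - p) / target_se ** 2)

print(sample_size(0.01))         # worst case, 1% standard error -> 2500
print(sample_size(0.02))         # tolerating twice the error needs a quarter the sample -> 625
print(sample_size(0.01, 0.3))    # easier when true support is near 30% -> 2100
```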
In reality a low standard error is not a guarantee of a good estimate. Many problems in polling are of a practical rather than theoretical nature. For instance, I assumed that the sample was of random voters each with the same probability $p$, but taking a "random" sample in real life is fraught with difficulty. You might try telephone or online polling, but not only does not everybody have a phone or internet access, those who don't may have very different demographics (and voting intentions) to those who do. To avoid introducing bias into their results, polling firms actually do all kinds of complicated weighting of their samples, not the simple average $\frac{\sum X_i}{n}$ that I took. Also, people lie to pollsters! The different ways that pollsters have compensated for this possibility are, obviously, controversial. You can see a variety of approaches in how polling firms have dealt with the so-called Shy Tory Factor in the UK. One method of correction involved looking at how people voted in the past to judge how plausible their claimed voting intention is, but it turns out that even when they're not lying, many voters simply fail to remember their electoral history. When you've got this stuff going on, there's frankly very little point getting the "standard error" down to 0.00001%.
To finish, here are some graphs showing how the required sample size, according to my simplistic analysis, is influenced by the desired standard error, and how bad the "worst case" value of $p = 0.5$ is compared to the more amenable proportions. Remember that the curve for $p = 0.7$ would be identical to the one for $p = 0.3$, due to the symmetry of the earlier graph of $\sqrt{p(1-p)}$.
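For reference, a minimal sketch (assuming numpy and matplotlib are available; any plotting tool would do) that reproduces the kind of graphs described, by rearranging the standard error formula to get the required $n$ for each target:

```python
import numpy as np
import matplotlib.pyplot as plt

target_se = np.linspace(0.005, 0.05, 200)   # desired standard error, 0.5% to 5%

for p in (0.5, 0.3, 0.1):
    n_required = p * (1 - p) / target_se ** 2   # rearranged from SE = sqrt(p(1-p)/n)
    plt.plot(target_se, n_required, label=f"p = {p}")

plt.xlabel("target standard error")
plt.ylabel("required sample size n")
plt.legend()
plt.show()
```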