Variance of Cohen's $d$ statistic

Cohen's $d$ is one of the most common ways to measure the size of an effect (see Wikipedia). It simply measures the distance between two means in units of the pooled standard deviation. How can we derive the mathematical formula for the variance estimate of Cohen's $d$?

December 2015 edit: related to this question is the idea of calculating confidence intervals around $d$. This article states that

$\sigma_{d}^2 = \dfrac{n_{+}}{n_{\times}} + \dfrac{d^2}{2n_{+}}$

where $n_{+}$ is the sum of the two sample sizes and $n_{\times}$ is the product of the two sample sizes.

How is this formula derived?
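For concreteness, the quoted formula can be evaluated numerically. This is only a minimal sketch of the expression as stated in the question; the function name `var_d_question` and the example values are illustrative assumptions, not taken from the linked article.

```python
def var_d_question(d: float, n1: int, n2: int) -> float:
    """Variance of d as quoted in the question: n_plus / n_times + d^2 / (2 * n_plus)."""
    n_plus = n1 + n2    # sum of the two sample sizes
    n_times = n1 * n2   # product of the two sample sizes
    return n_plus / n_times + d ** 2 / (2 * n_plus)


# Hypothetical example: d = 0.5 with 20 subjects per group.
v = var_d_question(d=0.5, n1=20, n2=20)
print(v)
```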

variance effect-size cohens-d

— JRK

@Clarinetist: It's a bit controversial to edit someone else's question to add more substance and more questions to it (as opposed to improving the wording). I took the liberty of approving your edit (since you offered a generous bounty and I think your edit improves the question), but others may decide to roll it back.

— amoeba says Reinstate Monica

@amoeba No problem. As long as the formula for $\sigma^2_d$ is there (which it wasn't before) and it's clear that we're looking for a mathematical derivation of the formula, that's fine.

— Clarinetist

I think the denominator of the second fraction should be $2(n_{+}-2)$. See my answer below.

Note that the variance expression in the question is an approximation. Hedges (1981) derived the large-sample variance of $d$, and its approximation, in a general setting (i.e. multiple experiments/studies), and my answer pretty much walks through the derivations in the paper.

First, the assumptions we will use are the following:

Let's assume that we have two independent groups, $T$ (treatment) and $C$ (control). Let $Y_{Ti}$ and $Y_{Cj}$ be the scores/responses/whatever of subject $i$ in group $T$ and subject $j$ in group $C$, respectively.

We assume that the responses are normally distributed and that the treatment and control groups share a common variance, i.e.

$\begin{align*} Y_{Ti} &\sim N(\mu_T, \sigma^2), \quad i = 1, \dots, n_T \\ Y_{Cj} &\sim N(\mu_C, \sigma^2), \quad j = 1, \dots, n_C \end{align*}$

The effect size we are interested in estimating in each study is $\delta = \frac{\mu_T - \mu_C}{\sigma}$. The estimator of the effect size that we will use is

$\begin{equation*} d = \frac{\bar{Y}_T - \bar{Y}_C}{\sqrt{\frac{(n_T - 1)S_T^2 + (n_C - 1)S_C^2}{n_T + n_C - 2}}} \end{equation*}$

where $S_k^2$ is the unbiased sample variance for group $k$.
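As a minimal sketch of this estimator, assuming the data arrive as two plain lists of numbers (the function name `cohens_d` and the sample values are made up for illustration):

```python
import math
from statistics import mean, variance  # variance() is the unbiased (n - 1) sample variance


def cohens_d(treatment, control):
    """Difference in sample means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical data: both groups have the same spread, means differ by 2.
treatment = [5.0, 6.0, 7.0, 8.0]
control = [3.0, 4.0, 5.0, 6.0]
d = cohens_d(treatment, control)
print(d)
```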

Let's consider the large-sample properties of $d$.

First, note that

$\begin{equation*} \bar{Y}_T - \bar{Y}_C \sim N \Bigg( \mu_T - \mu_C, \,\sigma^2\frac{n_T + n_C}{n_T n_C} \Bigg) \end{equation*}$

and (being loose with my notation):

$\begin{equation} \frac{(n_T - 1)S_T^{2}}{\sigma^2(n_T + n_C - 2)} = \frac{1}{n_T + n_C - 2}\frac{(n_T - 1)S_T^{2}}{\sigma^2} \sim \frac{1}{n_T + n_C- 2}\chi_{n_T - 1}^2 \tag{1} \end{equation}$

$\begin{equation} \frac{(n_C - 1)S_C^{2}}{\sigma^2(n_T + n_C - 2)} = \frac{1}{n_T + n_C - 2}\frac{(n_C - 1)S_C^{2}}{\sigma^2} \sim \frac{1}{n_T + n_C- 2}\chi_{n_C - 1}^2 \tag{2} \end{equation}$

Equations (1) and (2) lead to the fact that (again, being loose with my notation):

$\begin{equation*} \frac{1}{\sigma^2}\frac{(n_T - 1)S_T^{2} + (n_C - 1)S_C^{2}}{n_T + n_C - 2} \sim \frac{1}{n_T + n_C - 2}\chi_{n_T + n_C - 2}^2 \end{equation*}$

Now, some clever algebra:

$\begin{align*} d &= \frac{\bar{Y}_T - \bar{Y}_C}{\sqrt{\frac{(n_T - 1)S_T^2 + (n_C - 1)S_C^2}{n_T + n_C - 2}}} \\ &= \frac{\left(\sigma\sqrt{\frac{n_T + n_C}{n_T n_C}}\right)^{-1}(\bar{Y}_T - \bar{Y}_C)}{\left(\sigma\sqrt{\frac{n_T + n_C}{n_T n_C}}\right)^{-1}\sqrt{\frac{(n_T - 1)S_T^2 + (n_C - 1)S_C^2}{n_T + n_C - 2}}} \\ &= \frac{\frac{(\bar{Y}_T - \bar{Y}_C) - (\mu_T - \mu_C)}{\sigma\sqrt{\frac{n_T + n_C}{n_T n_C}}} + \frac{\mu_T - \mu_C}{\sigma\sqrt{\frac{n_T + n_C}{n_T n_C}}}}{\left(\sqrt{\frac{n_T + n_C}{n_T n_C}}\right)^{-1}\sqrt{\frac{(n_T - 1)S_T^2 + (n_C - 1)S_C^2}{\sigma^2(n_T + n_C - 2)}}} \\ &= \sqrt{\frac{n_T + n_C}{n_T n_C}}\left(\frac{\theta + \delta\sqrt{\frac{n_T n_C}{n_T + n_C}}}{\sqrt{\frac{V}{\nu}}}\right) \end{align*}$

where $\theta \sim N(0,1)$, $V \sim \chi^2_{\nu}$, and $\nu = n_T+n_C-2$. Thus, $d$ is $\sqrt{\frac{n_T + n_C}{n_T n_C}}$ times a variable which follows a non-central $t$-distribution with $n_T + n_C - 2$ degrees of freedom and non-centrality parameter $\delta\sqrt{\frac{n_T n_C}{n_T + n_C}}$.
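One way to sanity-check this representation is a small Monte Carlo comparison: compute $d$ from simulated normal data, and separately draw it from $\sqrt{\frac{n_T + n_C}{n_T n_C}}\,(\theta + \text{ncp})/\sqrt{V/\nu}$ with $\theta$ and $V$ drawn independently. The sketch below is an illustration only; the sample sizes, the value of $\delta$, the seed, and all function names are assumptions chosen for the example.

```python
import math
import random

random.seed(0)

n_t, n_c = 10, 10
mu_t, mu_c, sigma = 0.5, 0.0, 1.0          # assumed values, so delta = 0.5
delta = (mu_t - mu_c) / sigma
nu = n_t + n_c - 2
c = math.sqrt((n_t + n_c) / (n_t * n_c))   # scale factor in front of the non-central t
reps = 20000


def d_from_data():
    """Cohen's d computed from freshly simulated normal samples."""
    yt = [random.gauss(mu_t, sigma) for _ in range(n_t)]
    yc = [random.gauss(mu_c, sigma) for _ in range(n_c)]
    mt, mc = sum(yt) / n_t, sum(yc) / n_c
    # Pooled sum of squares = (n_T - 1) S_T^2 + (n_C - 1) S_C^2
    ss = sum((y - mt) ** 2 for y in yt) + sum((y - mc) ** 2 for y in yc)
    return (mt - mc) / math.sqrt(ss / nu)


def d_from_representation():
    """d drawn as c * (theta + delta / c) / sqrt(V / nu)."""
    theta = random.gauss(0.0, 1.0)
    v = random.gammavariate(nu / 2, 2.0)   # chi-square(nu) = Gamma(shape nu/2, scale 2)
    return c * (theta + delta / c) / math.sqrt(v / nu)


def mean_and_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


mean_data, var_data = mean_and_var([d_from_data() for _ in range(reps)])
mean_repr, var_repr = mean_and_var([d_from_representation() for _ in range(reps)])
print(mean_data, var_data)
print(mean_repr, var_repr)
```

If the representation holds, the two printed mean/variance pairs should agree up to simulation noise.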

Using the moment properties of the non-central $t$ distribution, it follows that:

$\begin{equation*} \mathrm{Var}(d) = \frac{(n_T + n_C - 2)}{(n_T + n_C - 4)}\frac{(n_T + n_C)}{n_T n_C}\left(1+ \delta^2\frac{n_T n_C}{n_T + n_C}\right) - \frac{\delta^2}{b^2} \tag{3} \end{equation*}$

where

$\begin{equation*} b = \frac{\Gamma\left(\frac{n_T + n_C - 2}{2}\right)}{\sqrt{\frac{n_T+n_C-2}{2}}\,\Gamma\left(\frac{n_T+n_C-3}{2}\right)} \approx 1 - \frac{3}{4(n_T+n_C-2)-1} \end{equation*}$
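Equation (3) and the two expressions for $b$ can be evaluated directly. The following is a rough sketch using log-gamma to avoid overflow for larger samples; the function names are hypothetical, and the code assumes $n_T + n_C > 4$ so that the variance exists.

```python
import math


def hedges_b(nu: float) -> float:
    """Exact b = Gamma(nu/2) / (sqrt(nu/2) * Gamma((nu-1)/2)), with nu = n_T + n_C - 2."""
    log_ratio = math.lgamma(nu / 2) - math.lgamma((nu - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(nu / 2)


def hedges_b_approx(nu: float) -> float:
    """Approximation b ~ 1 - 3 / (4 * nu - 1)."""
    return 1 - 3 / (4 * nu - 1)


def exact_var_d(delta: float, n_t: int, n_c: int) -> float:
    """Variance of d from Equation (3), evaluated at a known delta."""
    nu = n_t + n_c - 2
    if nu <= 2:
        raise ValueError("Equation (3) needs n_T + n_C > 4")
    c2 = (n_t + n_c) / (n_t * n_c)
    b = hedges_b(nu)
    return nu / (nu - 2) * c2 * (1 + delta ** 2 / c2) - delta ** 2 / b ** 2


# Hypothetical example: 10 subjects per group, so nu = 18.
print(hedges_b(18), hedges_b_approx(18))
print(exact_var_d(0.5, 10, 10))
```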

So Equation (3) provides the exact large sample variance. Note that an unbiased estimator for $\delta$ is $b d$ , with variance:

$\begin{equation*} \mathrm{Var}(bd) = b^2\frac{(n_T + n_C - 2)}{(n_T + n_C - 4)}\frac{(n_T + n_C)}{n_T n_C}\left(1+ \delta^2\frac{n_T n_C}{n_T + n_C}\right) - \delta^2 \end{equation*}$
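The unbiasedness of $bd$ can be illustrated by simulation. This is only a sketch under assumed settings ($\sigma = 1$, $\mu_C = 0$, $\delta = 0.5$, 10 subjects per group, a fixed seed); none of these values come from Hedges' paper, and the helper names are made up.

```python
import math
import random

random.seed(1)


def hedges_b(nu: float) -> float:
    """Exact correction factor b for nu = n_T + n_C - 2 degrees of freedom."""
    log_ratio = math.lgamma(nu / 2) - math.lgamma((nu - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(nu / 2)


def simulate_d(n_t: int, n_c: int, delta: float) -> float:
    """One draw of d from normal data with sigma = 1, mu_C = 0, mu_T = delta."""
    yt = [random.gauss(delta, 1.0) for _ in range(n_t)]
    yc = [random.gauss(0.0, 1.0) for _ in range(n_c)]
    mt, mc = sum(yt) / n_t, sum(yc) / n_c
    # Pooled sum of squares = (n_T - 1) S_T^2 + (n_C - 1) S_C^2
    ss = sum((y - mt) ** 2 for y in yt) + sum((y - mc) ** 2 for y in yc)
    return (mt - mc) / math.sqrt(ss / (n_t + n_c - 2))


n_t, n_c, delta, reps = 10, 10, 0.5, 20000
mean_d = sum(simulate_d(n_t, n_c, delta) for _ in range(reps)) / reps
mean_bd = hedges_b(n_t + n_c - 2) * mean_d   # b is a constant, so E[b d] = b E[d]
print(mean_d, mean_bd)
```

The average of $bd$ should land close to the assumed $\delta$, up to simulation noise, while the average of the raw $d$ is expected to sit slightly above it.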

For large degrees of freedom (i.e. large $n_T+n_C-2$ ), the variance of a non-central $t$ variate with $\nu$ degrees of freedom and non-centrality parameter $p$ can be approximated by $1 + \frac{p^2}{2\nu}$ (Johnson, Kotz, Balakrishnan, 1995). Thus, we have:

$\begin{align*} \mathrm{Var}(d) &\approx \frac{n_T + n_C}{n_T n_C}\left(1 + \frac{\delta^2\left(\frac{n_T n_C}{n_T + n_C}\right)}{2(n_T+n_C-2)}\right) \\ &= \frac{n_T + n_C}{n_T n_C} + \frac{\delta^2}{2(n_T+n_C-2)} \end{align*}$

Plug in our estimator for $\delta$ and we're done.
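The plug-in step might look like the sketch below, which substitutes an observed $d$ for $\delta$ in the large-sample approximation and reports the corresponding standard error. The function name and the example numbers are illustrative assumptions.

```python
import math


def approx_var_d(d: float, n_t: int, n_c: int) -> float:
    """Large-sample approximation (n_T + n_C)/(n_T n_C) + d^2 / (2 (n_T + n_C - 2)),
    with the observed d plugged in for delta."""
    return (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c - 2))


# Hypothetical example: observed d = 0.5 with 20 subjects per group.
var_hat = approx_var_d(0.5, 20, 20)
se_hat = math.sqrt(var_hat)
print(var_hat, se_hat)
```

Compared with the expression quoted in the question, only the denominator of the second term differs ($2(n_{+}-2)$ instead of $2n_{+}$), so the two versions are numerically close for moderate sample sizes.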

Very, very nice derivation. Just a few questions: 1) could you clarify what the notation $\bar{Y}^{T}_{i} - \bar{Y}^{C}_{i}$ means (I know it's something to do with difference of sample means, but how can they both have the same index?)? 2) could you clarify how the approximation for $b$ is done (I don't need all of the details, a source is fine and maybe a brief explanation)? Otherwise, I'm quite pleased with this. (+1) This also agrees with the observation that I've made that $d$ doesn't follow a normal distribution, contrary to the explanation in the linked article in the OP.

— Clarinetist

@Clarinetist Thanks! 1) How can they have the same index? Typo, that's how! :P They're an artifact of my first draft of the answer. I'll fix that. 2) I pulled it from the Hedges paper -- don't know its derivation at the moment but will think about it some more.

I'm looking into the derivation now, but FYI, the numerator of $b$ should be $\Gamma\left(\dfrac{n_T+n_C-2}{2}\right)$.

— Clarinetist

Derivation provided for reference: math.stackexchange.com/questions/1564587/… . Turns out there's likely a sign error.

— Clarinetist

@mike: very impressive answer. Thanks for taking the time to share it with us.

— Denis Cousineau