Understanding the t-test for linear regression

17

I am trying to work out how to run hypothesis tests on a linear regression (null hypothesis of no correlation). Every guide and page I find on the subject seems to use a t-test, but I don't understand what a t-test means for linear regression. Unless my understanding or mental model is completely wrong, a t-test is used to compare two populations. But the regressor and the regressand are not samples from similar populations and may not even be in the same units, so it makes no sense to compare them.

So when we use a t-test on a linear regression, what are we actually doing?

regression t-test

— jaymmer - Reinstate Monica

37

You are probably thinking of the two-sample $t$-test, because that is usually the first place the $t$ distribution shows up. But all a $t$-test really means is that the reference distribution for the test statistic is a $t$ distribution. If $Z \sim \mathcal N(0,1)$ and $S^2 \sim \chi^2_d$ with $Z$ and $S^2$ independent, then

$\frac{Z}{\sqrt{S^2 / d}} \sim t_d$

by definition. I am writing this out to emphasize that the $t$ distribution is just a name given to the distribution of this ratio because it comes up a lot, and anything of this form will have a $t$ distribution. For the two-sample $t$-test, this ratio appears because, under the null, the difference in means is a zero-mean Gaussian and the variance estimate for independent Gaussians is an independent $\chi^2$ (the independence can be shown via Basu's theorem, which uses the fact that the standard variance estimate in a Gaussian sample is ancillary to the population mean, while the sample mean is complete and sufficient for that same quantity).
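The definition above can be checked numerically. The following is a minimal sketch, not part of the original answer: it builds the ratio $Z / \sqrt{S^2/d}$ from independent standard normals and compares the simulated variance with the known variance $d/(d-2)$ of a $t_d$ distribution. The function names, the choice $d = 10$, the seed, and the number of draws are all illustrative assumptions.

```python
import math
import random


def t_ratio_sample(d, rng):
    """Draw one value of Z / sqrt(S^2 / d) built from independent standard normals."""
    z = rng.gauss(0.0, 1.0)
    # A chi-squared variable with d degrees of freedom is a sum of d squared standard normals.
    s2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
    return z / math.sqrt(s2 / d)


def simulate_variance(d=10, n_draws=20000, seed=0):
    """Return the sample variance of n_draws simulated ratios."""
    rng = random.Random(seed)
    draws = [t_ratio_sample(d, rng) for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    return sum((v - mean) ** 2 for v in draws) / (n_draws - 1)


if __name__ == "__main__":
    d = 10
    print("simulated variance:", simulate_variance(d))
    print("theoretical variance d / (d - 2):", d / (d - 2))
```

The two printed numbers should be close but not identical, since the first is a Monte Carlo estimate.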

With linear regression we get basically the same thing. In vector form, $\hat \beta \sim \mathcal N(\beta, \sigma^2 (X^T X)^{-1})$. Let $S^2_j = (X^T X)^{-1}_{jj}$ and assume the predictors $X$ are non-random. If we knew $\sigma^2$, then under the null $H_0 : \beta_j = 0$ we would have

$\frac{\hat \beta_j - 0}{\sigma S_j} \sim \mathcal N(0, 1)$

so we would have a Z test. But once we estimate $\sigma^2$ we end up with a $\chi^2$ random variable that, under our normality assumptions, turns out to be independent of our statistic $\hat \beta_j$, and then we get a $t$ distribution.
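As a worked special case (an illustration added here, assuming a single predictor $x$ plus an intercept, so that $X$ has columns $(1, x_i)$ and $\bar x$ denotes the sample mean of the $x_i$), the quantity $S_j$ for the slope has a familiar closed form:

$X^TX = \left(\begin{array}{cc} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{array}\right), \qquad (X^TX)^{-1}_{22} = \frac{n}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \frac{1}{\sum_i (x_i - \bar x)^2}$

so for the slope $S_j = 1 / \sqrt{\sum_i (x_i - \bar x)^2}$, and $\sigma S_j$ is the usual textbook standard deviation of the slope estimate in simple linear regression.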

Here are the details: assume $y \sim \mathcal N(X\beta, \sigma^2 I)$. Letting $H = X(X^TX)^{-1}X^T$ be the hat matrix, we have

$\|e\|^2 = \|(I-H)y\|^2 = y^T(I-H)y.$

$H$ is idempotent, so we have the really nice result that

$y^T(I-H)y / \sigma^2 \sim \chi_{n-p}^2(\delta)$

with non-centrality parameter $\delta = \beta^TX^T(I-H)X\beta = \beta^T(X^TX - X^T X)\beta = 0$, so this is actually a central $\chi^2$ with $n-p$ degrees of freedom (this is a special case of Cochran's theorem). I am using $p$ to denote the number of columns of $X$, so if one column of $X$ gives the intercept, we have $p-1$ non-intercept predictors. Some authors use $p$ for the number of non-intercept predictors, so you may sometimes see something like $n-p-1$ in the degrees of freedom, but it is all the same thing.
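The properties of $H$ used above can be checked numerically in a small case. This is a sketch under stated assumptions rather than anything from the original answer: it takes a design with an intercept and one predictor ($p = 2$), for which the hat matrix has the closed form $H_{ij} = 1/n + (x_i - \bar x)(x_j - \bar x) / \sum_k (x_k - \bar x)^2$, and the helper names and data values are made up for illustration.

```python
def hat_matrix(x):
    """Hat matrix for the design with columns (1, x), using the one-predictor closed form."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [[1.0 / n + (xi - x_bar) * (xj - x_bar) / sxx for xj in x] for xi in x]


def residual_maker(x):
    """Return I - H, the matrix that maps y to the residual vector e."""
    h = hat_matrix(x)
    n = len(x)
    return [[(1.0 if i == j else 0.0) - h[i][j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def max_abs_diff(a, b):
    """Largest absolute entrywise difference between two square matrices."""
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))


def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))


if __name__ == "__main__":
    x = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0]  # illustrative predictor values
    n = len(x)
    h = hat_matrix(x)
    m = residual_maker(x)
    print("max |HH - H| (idempotence):", max_abs_diff(matmul(h, h), h))
    print("trace(I - H):", trace(m), "vs n - p =", n - 2)
    # (I - H) X = 0 is what makes the non-centrality parameter vanish.
    print("max |(I - H) x|:", max(abs(sum(m[i][j] * x[j] for j in range(n))) for i in range(n)))
```

The trace of $I - H$ equals its rank because it is idempotent, which is where the $n-p$ degrees of freedom come from.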

The result of this is that $E(e^Te / \sigma^2) = n-p$, so $\hat \sigma^2 := \frac{1}{n-p} e^T e$ is an unbiased estimator of $\sigma^2$.

This means that

$\frac{\hat \beta_j}{\hat \sigma S_j}= \frac{\hat \beta_j}{S_j\sqrt{e^Te / (n-p)}} = \frac{\hat \beta_j}{\sigma S_j\sqrt{\frac{e^Te}{\sigma^2(n-p)}}}$

is the ratio of a standard Gaussian to the square root of a chi-squared divided by its degrees of freedom. To finish, we need to show independence, and for that we can use the following result.

Result: for $Z \sim \mathcal N_k(\mu, \Sigma)$ and matrices $A$ and $B$ in $\mathbb R^{l\times k}$ and $\mathbb R^{m\times k}$ respectively, $AZ$ and $BZ$ are independent if and only if $A\Sigma B^T = 0$.

We have $\hat \beta = (X^TX)^{-1}X^T y$ and $e = (I-H)y$, where $y \sim \mathcal N(X\beta, \sigma^2 I)$. This means

$(X^TX)^{-1}X^T \cdot \sigma^2 I \cdot (I-H)^T = \sigma^2 \left((X^TX)^{-1}X^T - (X^TX)^{-1}X^TX(X^TX)^{-1}X^T\right) = 0$

so $\hat \beta \perp e$, and therefore $\hat \beta \perp e^T e$.

The upshot is that

$\frac{\hat \beta_j}{\hat \sigma S_j} \sim t_{n-p}$

under $H_0$ and all of the assumptions above.
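The claim that the studentized coefficient follows a $t_{n-p}$ distribution under the null can be sanity-checked by simulation. The sketch below is an illustration under assumptions of my own choosing: one fixed predictor plus an intercept ($p = 2$), $n = 12$ points, true slope zero, Gaussian errors with $\sigma = 2$, and hypothetical function names. It compares the simulated variance with $10/8$, the variance of a $t_{10}$ variable, and the share of statistics beyond the tabulated two-sided 5% critical value 2.228 with 0.05.

```python
import math
import random


def slope_t_statistic(x, y):
    """Return (t, df) for H0: slope = 0 in the model y = b0 + b1 * x + noise."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))  # e^T e
    df = n - 2                      # n - p with p = 2 columns in X
    sigma_hat = math.sqrt(sse / df)
    s_j = 1.0 / math.sqrt(sxx)      # square root of the slope's diagonal entry of (X^T X)^{-1}
    return b1 / (sigma_hat * s_j), df


def simulate_null(n_reps=20000, seed=1):
    """Simulate slope t statistics when the true slope is zero and errors are Gaussian."""
    rng = random.Random(seed)
    x = [float(i) for i in range(12)]   # fixed, non-random predictor
    stats = []
    for _ in range(n_reps):
        y = [3.0 + rng.gauss(0.0, 2.0) for _ in x]  # intercept 3, slope 0, sigma 2
        t, _df = slope_t_statistic(x, y)
        stats.append(t)
    return stats


if __name__ == "__main__":
    stats = simulate_null()
    mean = sum(stats) / len(stats)
    var = sum((t - mean) ** 2 for t in stats) / (len(stats) - 1)
    print("simulated variance:", var, "vs variance of t with 10 df:", 10 / 8)
    # 2.228 is the usual tabulated 97.5% point of the t distribution with 10 df.
    print("share with |t| > 2.228:", sum(abs(t) > 2.228 for t in stats) / len(stats))
```

Changing the intercept or $\sigma$ in the simulation should leave the distribution of the statistic unchanged, which is the point of studentizing.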

Here is a proof of the independence result for $AZ$ and $BZ$. Let $C = {A \choose B}$ be the $(l+m)\times k$ matrix formed by stacking $A$ on top of $B$. Then

$CZ = {AZ \choose BZ} \sim \mathcal N \left({A\mu \choose B\mu}, C\Sigma C^T \right)$

where

$C\Sigma C^T = {A \choose B} \Sigma \left(\begin{array}{cc} A^T & B^T \end{array}\right) = \left(\begin{array}{cc}A\Sigma A^T & A\Sigma B^T \\ B\Sigma A^T & B\Sigma B^T\end{array}\right).$

$CZ$ is a multivariate Gaussian, and it is a well-known result that two components of a multivariate Gaussian are independent if and only if they are uncorrelated, so the condition $A\Sigma B^T = 0$ is exactly equivalent to the components $AZ$ and $BZ$ of $CZ$ being uncorrelated, and hence independent.

$\square$

— jld

3

+1 always enjoy reading your answer.

— Haitao Du

9

@Chaconne's answer is great. But here is a much shorter nonmathematical version!

Since the goal is to compute a P value, you first need to define a null hypothesis. Almost always, the null hypothesis is that the true line is horizontal, so the numerical value of the slope (beta) is 0.0.

The slope fit from your data is not 0.0. Is that discrepancy due to random chance or due to the null hypothesis being wrong? You can't ever answer that for sure, but a P value is one way to sort-of-kind-of get at an answer.

The regression program reports a standard error of the slope. Compute the t ratio as the slope divided by its standard error. Actually, it is (slope minus null hypothesis slope) divided by the standard error, but the null hypothesis slope is nearly always zero.

Now you have a t ratio. The number of degrees of freedom (df) equals the number of data points minus the number of parameters fit by the regression (two for simple linear regression: the slope and the intercept).

With those values (t and df) you can determine the P value with an online calculator or table.

It is essentially a one-sample t-test, comparing an observed computed value (the slope) with a hypothetical value (the null hypothesis).
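The recipe above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the procedure of any particular regression program: the function names and the data are made up, and instead of a table or online calculator the P value is obtained by numerically integrating the $t$ density with Simpson's rule.

```python
import math


def t_two_sided_p(t, df, n_steps=2000):
    """Two-sided P value for a t ratio, by Simpson integration of the t density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return const * (1.0 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / n_steps                       # n_steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|), using symmetry about zero
    return max(0.0, 1.0 - central_mass)


def slope_test(x, y, null_slope=0.0):
    """Return (slope, standard error, t ratio, df, two-sided P value) for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2                            # data points minus the two fitted parameters
    se = math.sqrt(sse / df / sxx)        # standard error of the slope
    t = (slope - null_slope) / se         # (slope minus null slope) over standard error
    return slope, se, t, df, t_two_sided_p(t, df)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]         # made-up data for illustration
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    slope, se, t, df, p = slope_test(x, y)
    print("slope:", slope, "SE:", se)
    print("t ratio:", t, "df:", df, "P value:", p)
```

In practice a statistics library would supply the $t$ distribution function; the hand-rolled integration is only there to keep the sketch self-contained.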

— Harvey Motulsky

3

The real question is why this is "essentially a one-sample t-test", and I don't see how it can become clear from your answer...

— amoeba says Reinstate Monica