How can you deal with unstable β estimates in linear regression with high multicollinearity?


13

Say that in a linear regression, the variables x1 and x2 have high multicollinearity (their correlation is about 0.9).

We are concerned about the stability of the β coefficients, so we need to treat the multicollinearity.

The textbook solution would be simply to throw away one of the variables.

But we don't want to lose useful information by simply throwing variables away.

Any suggestions?


5
Have you tried some kind of regularization scheme (e.g., ridge regression)?
Néstor

Answers:


11

You can try the ridge regression approach in cases where the correlation matrix is close to singular (i.e., the variables have high correlations). It will provide a robust estimate of β.

The only question is how to choose the regularization parameter λ. It is not a simple problem, though I suggest trying different values.

Hope this helps!


2
Cross-validation is the usual thing to do to choose λ ;-).
Néstor

Indeed (+1 for the answer and Néstor's comment), and if you perform the calculations in "canonical form" (using an eigendecomposition of X^T X), you can find the λ that minimizes the leave-one-out cross-validation error very cheaply via Newton's method.
Dikran Marsupial
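For the curious, a minimal R sketch of that shortcut (everything below is invented for illustration; it assumes X and y have been centered so no intercept is needed, and uses a plain grid search where a Newton step on λ could be substituted):

    # Leave-one-out CV error for ridge regression, computed cheaply via the
    # SVD of X (equivalent to working with the eigendecomposition of X^T X)
    loocv_ridge <- function(X, y, lambda) {
      s <- svd(X)
      shrink <- s$d^2 / (s$d^2 + lambda)       # per-component shrinkage factors
      yhat   <- s$u %*% (shrink * crossprod(s$u, y))
      h      <- colSums(shrink * t(s$u^2))     # diagonal of the hat matrix
      mean(((y - yhat) / (1 - h))^2)           # closed-form LOO residuals
    }

    set.seed(1)
    x1 <- rnorm(100); x2 <- 0.9 * x1 + 0.44 * rnorm(100)
    y  <- 1 + 2 * x1 + 2 * x2 + rnorm(100)
    X  <- scale(cbind(x1, x2), scale = FALSE)  # center the predictors
    yc <- y - mean(y)

    lambdas <- seq(0.01, 10, by = 0.01)
    cv <- sapply(lambdas, function(l) loocv_ridge(X, yc, l))
    lambdas[which.min(cv)]                     # lambda minimizing the LOO error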

Thanks a lot! Are there any tutorials/notes on how to do this, including the cross-validation, in R?
Luna

Check out chapter 3 of this book: stanford.edu/~hastie/local.ftp/Springer/ESLII_print5.pdf . Ridge regression is implemented in R by some of the authors (Google is your friend!).
Néstor

2
You can use the lm.ridge routine in the MASS package. If you pass it a range of values for λ, e.g., a call like foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1)), you get the generalized cross-validation statistics back in foo$GCV and can plot them against λ with plot(foo$GCV ~ foo$lambda) to pick the minimum.
jbowman
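Putting that call into a complete, runnable snippet (the simulated data are invented purely to make the example self-contained):

    library(MASS)                               # provides lm.ridge

    set.seed(123)
    x1 <- rnorm(100)
    x2 <- 0.9 * x1 + 0.44 * rnorm(100)          # cor(x1, x2) is around 0.9
    y  <- 1 + 2 * x1 + 2 * x2 + rnorm(100)

    # Fit ridge regression over a grid of lambda values
    foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))

    # Plot the generalized cross-validation statistic and pick its minimum
    plot(foo$GCV ~ foo$lambda, type = "b", xlab = "lambda", ylab = "GCV")
    foo$lambda[which.min(foo$GCV)]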

10

Well, there is one ad hoc method I've used before. I'm not sure whether this procedure has a name, but it makes sense intuitively.

Suppose your goal is to fit the model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$$

where the two predictors, $X_i$ and $Z_i$, are highly correlated. As you pointed out, using them both in the same model can do strange things to the coefficient estimates and the $p$-values. An alternative is to fit the model

$$Z_i = \alpha_0 + \alpha_1 X_i + \eta_i$$

Then the residual $\eta_i$ will be uncorrelated with $X_i$ and can, in some sense, be thought of as the part of $Z_i$ that is not absorbed by its linear relationship with $X_i$. You can then proceed to fit the model

$$Y_i = \theta_0 + \theta_1 X_i + \theta_2 \eta_i + \nu_i$$

which will capture all of the effects of the first model (and will, in fact, have exactly the same $R^2$ as the first model), but the predictors are no longer collinear.
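A minimal R sketch of the procedure on simulated data (all names and numbers are invented for illustration):

    set.seed(42)
    n <- 200
    X <- rnorm(n)
    Z <- 0.9 * X + 0.44 * rnorm(n)         # highly correlated with X
    Y <- 1 + 2 * X + 3 * Z + rnorm(n)

    eta <- residuals(lm(Z ~ X))            # part of Z not explained by X

    fit_orig  <- lm(Y ~ X + Z)             # collinear predictors
    fit_resid <- lm(Y ~ X + eta)           # orthogonalized predictors

    cor(X, eta)                            # zero by construction
    c(summary(fit_orig)$r.squared,         # identical R^2 in both fits
      summary(fit_resid)$r.squared)

Note that the estimate of θ2 equals the estimate of β2 from the original model, while θ1 = β1 + β2·α1 absorbs the part of Z's effect that tracks X.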

Edit: The OP asked for an explanation of why the residuals do not, by definition, have zero sample correlation with the predictor when you omit the intercept, as they do when the intercept is included. Since that is too long to post in the comments, I've made an edit here. This derivation is not particularly enlightening (unfortunately I could not come up with a reasonable intuitive argument), but it shows what the OP requested:

When the intercept is omitted in simple linear regression, $\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$, so the residuals are $e_i = y_i - x_i \frac{\sum x_i y_i}{\sum x_i^2}$. The sample correlation between the $x_i$ and the $e_i$ is zero exactly when $\overline{xe} - \bar{x}\,\bar{e} = 0$.

First we have

$$\overline{xe} = \frac{1}{n}\left(\sum x_i y_i - \sum x_i^2 \cdot \frac{\sum x_i y_i}{\sum x_i^2}\right) = \overline{xy}\left(1 - \frac{\sum x_i^2}{\sum x_i^2}\right) = 0$$

but

$$\bar{x}\,\bar{e} = \bar{x}\left(\bar{y} - \bar{x}\,\frac{\overline{xy}}{\overline{x^2}}\right) = \bar{x}\,\bar{y} - \bar{x}^2\,\frac{\overline{xy}}{\overline{x^2}}$$

so in order for the $e_i$ and $x_i$ to have a sample correlation of exactly 0, we need $\bar{x}\,\bar{e}$ to be 0. That is, we need

$$\bar{y} = \bar{x}\,\frac{\overline{xy}}{\overline{x^2}}$$

which does not hold in general for two arbitrary sets of data $x, y$.
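A quick numerical check of this in R (simulated data; the nonzero mean of x is what makes the condition above fail):

    set.seed(7)
    x <- rnorm(50, mean = 3)               # nonzero mean, so the condition fails
    y <- 1 + 2 * x + rnorm(50)

    e_with    <- residuals(lm(y ~ x))      # intercept included
    e_without <- residuals(lm(y ~ x - 1))  # intercept omitted

    cor(x, e_with)                         # zero up to rounding error
    cor(x, e_without)                      # generally nonzero
    sum(x * e_without)                     # orthogonality still holds: ~ 0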


This reminds me of partial regression plots.
Andy W

3
This sounds like an approximation to replacing (X,Z) by their principal components.
whuber

3
One thing I had in mind is that PCA generalizes easily to more than two variables. Another is that it treats X and Z symmetrically, whereas your proposal appears arbitrarily to single out one of these variables. Another thought was that PCA provides a disciplined way to reduce the number of variables (although one must be cautious about that, because a small principal component may be highly correlated with the dependent variable).
whuber
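For concreteness, a sketch of the principal-components version in R (data invented; keeping both components leaves the fit unchanged, while dropping the small one is the dimension-reduction step whuber cautions about):

    set.seed(3)
    X <- rnorm(200)
    Z <- 0.9 * X + 0.44 * rnorm(200)
    Y <- 1 + 2 * X + 3 * Z + rnorm(200)

    pc <- prcomp(cbind(X, Z), scale. = TRUE)   # uncorrelated components
    fit_pc <- lm(Y ~ pc$x[, 1] + pc$x[, 2])

    round(cor(pc$x), 10)                       # identity: no collinearity left
    all.equal(summary(fit_pc)$r.squared,       # same fit as lm(Y ~ X + Z)
              summary(lm(Y ~ X + Z))$r.squared)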

1
Hi Macro, thank you for the excellent proof. Now I understand it: when we talk about the sample correlation between x and the residuals, the intercept term must be included for the sample correlation to be 0. On the other hand, when we talk about orthogonality between x and the residuals, the intercept term need not be included for the orthogonality to hold.
Luna

1
@Luna, I don't particularly disagree with using ridge regression - this was just what first occurred to me (I answered before that was suggested). One thing I can say is that ridge regression estimates are biased, so, in some sense, you're actually estimating a slightly different (shrunken) quantity than you are with ordinary regression, making the interpretation of the coefficients perhaps more challenging (as gung alludes to). Also, what I've described here only requires an understanding of basic linear regression and may be more intuitively appealing to some.
Macro

4

I like both of the answers given thus far. Let me add a few things.

Another option is that you can also combine the variables. This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable. This would be a good approach when you believe they are two different measures of the same underlying construct. In that case, you have two measurements that are contaminated with error. The most likely true value for the variable you really care about is in between them, thus averaging them gives a more accurate estimate. You standardize them first to put them on the same scale, so that nominal issues don't contaminate the result (e.g., you wouldn't want to average several temperature measurements if some are Fahrenheit and some are Celsius). Of course, if they are already on the same scale (e.g., several highly-correlated public opinion polls), you can skip that step. If you think one of your variables might be more accurate than the other, you could do a weighted average (perhaps using the reciprocals of the measurement errors).
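A short R sketch of the composite approach (data, noise levels, and weights are invented for illustration):

    set.seed(11)
    construct <- rnorm(200)                    # the underlying construct
    x1 <- construct + rnorm(200, sd = 0.3)     # two noisy measures of it
    x2 <- construct + rnorm(200, sd = 0.3)
    y  <- 1 + 2 * construct + rnorm(200)

    composite <- (scale(x1) + scale(x2)) / 2   # standardize, then average
    summary(lm(y ~ composite))

    # Weighted version if, say, x2 were believed noisier (hypothetical sds):
    w <- 1 / c(0.3, 0.6)^2                     # reciprocal error variances
    composite_w <- (w[1] * scale(x1) + w[2] * scale(x2)) / sum(w)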

If your variables are just different measures of the same construct, and are sufficiently highly correlated, you really could just throw one out without losing much information. As an example, I was actually in a situation once, where I wanted to use a covariate to absorb some of the error variance and boost power, but where I didn't care about that covariate--it wasn't germane substantively. I had several options available and they were all correlated with each other r>.98. I basically picked one at random and moved on, and it worked fine. I suspect I would have lost power burning two extra degrees of freedom if I had included the others as well by using some other strategy. Of course, I could have combined them, but why bother? However, this depends critically on the fact that your variables are correlated because they are two different versions of the same thing; if there's a different reason they are correlated, this could be totally inappropriate.

As that implies, I suggest you think about what lies behind your correlated variables. That is, you need a theory of why they're so highly correlated to do the best job of picking which strategy to use. In addition to different measures of the same latent variable, some other possibilities are a causal chain (i.e., X1 → X2 → Y) and more complicated situations in which your variables are the result of multiple causal forces, some of which are the same for both. Perhaps the most extreme case is that of a suppressor variable, which @whuber describes in his comment below. @Macro's suggestion, for instance, assumes that you are primarily interested in X and wonder about the additional contribution of Z after having accounted for X's contribution. Thus, thinking about why your variables are correlated and what you want to know will help you decide which (i.e., x1 or x2) should be treated as X and which Z. The key is to use theoretical insight to inform your choice.

I agree that ridge regression is arguably better, because it allows you to use the variables you had originally intended and is likely to yield betas that are very close to their true values (although they will be biased--see here or here for more information). Nonetheless, I think it also has two potential downsides: It is more complicated (requiring more statistical sophistication), and the resulting model is more difficult to interpret, in my opinion.

I gather that perhaps the ultimate approach would be to fit a structural equation model. That's because it would allow you to formulate the exact set of relationships you believe to be operative, including latent variables. However, I don't know SEM well enough to say anything about it here, other than to mention the possibility. (I also suspect it would be overkill in the situation you describe with just two covariates.)


4
Re the first point: Let vector X1 have a range of values and let vector e have small values completely uncorrelated with X1, so that X2 = X1 + e is highly correlated with X1. Set Y = e. In the regression of Y against either X1 or X2 you will see no significant or important results. In the regression of Y against X1 and X2 you will get an extremely good fit, because Y = X2 - X1. Thus, if you throw out either of X1 or X2, you will have lost essentially all information about Y. Whence, "highly correlated" does not mean "have equivalent information about Y".
whuber
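whuber's construction is easy to reproduce numerically (a sketch; the sd of e is an arbitrary small value):

    set.seed(5)
    X1 <- rnorm(100)
    e  <- rnorm(100, sd = 0.1)             # small and uncorrelated with X1
    X2 <- X1 + e                           # cor(X1, X2) is close to 1
    Y  <- e                                # hence Y = X2 - X1 exactly

    summary(lm(Y ~ X1))$r.squared          # essentially zero
    summary(lm(Y ~ X2))$r.squared          # tiny
    summary(lm(Y ~ X1 + X2))$r.squared     # exactly 1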

Thanks a lot, gung! Q1. Why does this approach work: "This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable."? Q2. Why would ridge regression be better? Q3. Why would SEM be better? Could anybody please shed some light on this? Thank you!
Luna

Hi Luna, glad to help. I'm actually going to re-edit this; @whuber was more right than I had initially realized. I'll try to put in more to help w/ your additional questions, but it'll take a lot, so it might be a while. We'll see how it goes.
gung - Reinstate Monica