Difference between Primal, Dual and Kernel Ridge Regression

What is the difference between Primal, Dual and Kernel Ridge Regression? People use all three, and because every source uses a different notation, it is hard for me to follow.

So can someone tell me in simple words what the difference between these three is? In addition, what are some advantages or disadvantages of each, and what is their complexity?

regression kernel-trick ridge-regression

— Jim Blum

Short answer: there is no difference between Primal and Dual; they are just two ways of arriving at the same solution. Kernel ridge regression is essentially the same as the usual ridge regression, but uses the kernel trick to become non-linear.

Linear Regression

First of all, the usual least squares linear regression tries to fit a straight line to the set of data points in such a way that the sum of squared errors is minimal.


We parametrize the best-fit line with $\mathbf w$, and for each data point $(\mathbf x_i, y_i)$ we want $\mathbf w^T \mathbf x_i \approx y_i$. Let $e_i = y_i - \mathbf w^T \mathbf x_i$ be the error, i.e. the distance between the predicted and the true values. So our goal is to minimize the sum of squared errors $\sum e_i^2 = \| \mathbf e \|^2 = \| X \mathbf w - \mathbf y \|^2$, where $X = \begin{bmatrix} \mathbf x_1^T \\ \mathbf x_2^T \\ \vdots \\ \mathbf x_n^T \end{bmatrix}$ is the data matrix with each $\mathbf x_i$ being a row, and $\mathbf y = (y_1, \dots, y_n)$ is a vector with all the $y_i$'s.

Thus, the objective is $\min\limits_{\mathbf w} \| X \mathbf w - \mathbf y \|^2$, and the solution is $\mathbf w = (X^T X)^{-1} X^T \mathbf y$ (known as the "Normal Equation").

For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y = \mathbf w^T \mathbf x$.
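As a rough illustration of the Normal Equation, here is a minimal NumPy sketch. The synthetic data, the seed and the names `w_true` and `x_new` are assumptions made up for this example, not part of the answer.

```python
import numpy as np

# Synthetic data (assumed for illustration): n points, d features.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))            # each row is one data point x_i
w_true = np.array([1.5, -2.0, 0.5])    # hypothetical "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal Equation: solve (X^T X) w = X^T y instead of forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new, unseen point: y_hat = w^T x.
x_new = np.array([0.2, -0.1, 1.0])
y_hat = w @ x_new
```

Solving the linear system is usually preferred to computing $(X^T X)^{-1}$ explicitly, but both express the same formula.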

Ridge Regression

When there are many correlated variables in a linear regression model, the coefficients $\mathbf w$ can become poorly determined and have high variance. One solution to this problem is to restrict the weights $\mathbf w$ so that they do not exceed some budget $C$. This is equivalent to using $L_2$-regularization, also known as "weight decay": it decreases the variance at the cost of sometimes missing the correct results (i.e. by introducing some bias).

The objective now becomes $\min\limits_{\mathbf w} \| X \mathbf w - \mathbf y \|^2 + \lambda \, \| \mathbf w \|^2$, with $\lambda$ being the regularization parameter. Working through the math, we get the following solution: $\mathbf w = (X^T X + \lambda \, I )^{-1} X^T \mathbf y$. It is very similar to the usual linear regression, but here we add $\lambda$ to each diagonal element of $X^T X$.
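A small sketch of the closed-form ridge solution, under the assumption of two nearly collinear synthetic features (the data and the helper name `ridge_primal` are invented for this example):

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Two almost identical features, so X^T X is close to singular.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

w_small = ridge_primal(X, y, 1e-8)  # practically unregularized
w_ridge = ridge_primal(X, y, 1.0)   # regularized

# The norm of the ridge solution never increases as lam grows.
norm_small = np.linalg.norm(w_small)
norm_ridge = np.linalg.norm(w_ridge)
```

Comparing `norm_small` and `norm_ridge` shows the shrinkage that the $\lambda \, \| \mathbf w \|^2$ term imposes on the weights.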

Note that we can rewrite $\mathbf w$ as $\mathbf w = X^T \, (X X^T + \lambda \, I)^{-1} \mathbf y$ (see here for details), since $(X^T X + \lambda \, I) X^T = X^T (X X^T + \lambda \, I)$. For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y = \mathbf x^T \mathbf w = \mathbf x^T X^T \, (X X^T + \lambda \, I)^{-1} \mathbf y$. Let $\boldsymbol \alpha = (X X^T + \lambda \, I)^{-1} \mathbf y$. Then $\hat y = \mathbf x^T X^T \boldsymbol \alpha = \sum\limits_{i=1}^{n} \alpha_i \cdot \mathbf x^T \mathbf x_i$.
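The equivalence of the two expressions for $\mathbf w$ can be checked numerically. The sketch below uses random data invented for the example. It also hints at the cost trade-off: the first form solves a $d \times d$ system and the second an $n \times n$ system, so which one is cheaper depends on whether there are more features or more data points.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 30, 5, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# First form: (X^T X + lam I)^{-1} X^T y, a d x d linear system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Second form: alpha = (X X^T + lam I)^{-1} y, an n x n linear system.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

# Prediction written only in terms of dot products x^T x_i.
x_new = rng.normal(size=d)
y_hat_primal = x_new @ w_primal
y_hat_dual = sum(alpha[i] * (x_new @ X[i]) for i in range(n))
```

Both `w_primal` and `w_dual`, and both predictions, agree up to floating-point error.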

Ridge Regression Dual Form

We can take a different view of our objective and define the following quadratic programming problem:

$\min\limits_{\mathbf e, \mathbf w} \sum\limits_{i = 1}^n e_i^2$ s.t. $e_i = y_i - \mathbf w^T \mathbf x_i$ for $i = 1, \dots, n$ and $\| \mathbf w \|^2 \leqslant C$.

It is the same objective, but expressed somewhat differently, and here the constraint on the size of $\mathbf w$ is explicit. To solve it, we define the Lagrangian $\mathcal L_p(\mathbf w, \mathbf e ; C)$; this is the primal form, which contains the primal variables $\mathbf w$ and $\mathbf e$. Then we optimize it w.r.t. $\mathbf e$ and $\mathbf w$. To get the dual formulation, we put the optimal $\mathbf e$ and $\mathbf w$ back into $\mathcal L_p(\mathbf w, \mathbf e ; C)$.

Thus, $\mathcal L_p(\mathbf w, \mathbf e ; C) = \| \mathbf e \|^2 + \boldsymbol \beta^T (\mathbf y - X \mathbf w - \mathbf e) + \lambda \, (\| \mathbf w \|^2 - C)$, where $\boldsymbol \beta$ are the Lagrange multipliers of the equality constraints and $\lambda \geqslant 0$ is the multiplier of the norm constraint. Taking derivatives w.r.t. $\mathbf w$ and $\mathbf e$ and setting them to zero, we obtain $\mathbf e = \cfrac{1}{2} \boldsymbol \beta$ and $\mathbf w = \cfrac{1}{2 \lambda} X^T \boldsymbol \beta$. By letting $\boldsymbol \alpha = \cfrac{1}{2 \lambda} \boldsymbol \beta$ and putting $\mathbf e$ and $\mathbf w$ back into $\mathcal L_p(\mathbf w, \mathbf e ; C)$, we get the dual Lagrangian $\mathcal L_d(\boldsymbol \alpha, \lambda; C) = -\lambda^2 \| \boldsymbol \alpha \|^2 + 2 \lambda \, \boldsymbol \alpha^T \mathbf y - \lambda \| X^T \boldsymbol \alpha \|^2 - \lambda C$. If we take a derivative w.r.t. $\boldsymbol \alpha$ and set it to zero, we get $-2 \lambda^2 \boldsymbol \alpha + 2 \lambda \, \mathbf y - 2 \lambda \, X X^T \boldsymbol \alpha = 0$, i.e. $\boldsymbol \alpha = (XX^T + \lambda I)^{-1} \mathbf y$, the same answer as for the usual Kernel Ridge Regression. There is no need to take a derivative w.r.t. $\lambda$: it depends on $C$, which is a regularization parameter, and this makes $\lambda$ a regularization parameter as well.

Next, put $\boldsymbol \alpha$ into the primal-form solution for $\mathbf w$ to get $\mathbf w = \cfrac{1}{2 \lambda} X^T \boldsymbol \beta = X^T \boldsymbol \alpha$. Thus, the dual form gives the same solution as the usual Ridge Regression; it is just a different way of arriving at the same solution.
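One way to sanity-check the relations between the primal and dual variables is numerically. The sketch below assumes random data and takes $\boldsymbol \alpha = (X X^T + \lambda I)^{-1} \mathbf y$ as in the ridge regression section. It then confirms that $\mathbf e = \frac{1}{2} \boldsymbol \beta$, with $\boldsymbol \beta = 2 \lambda \boldsymbol \alpha$, really is the residual $\mathbf y - X \mathbf w$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 4, 2.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Dual variables, using alpha from the ridge regression section.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
beta = 2 * lam * alpha            # since alpha = beta / (2 * lam)

# Primal variables recovered from the dual ones.
w = X.T @ alpha                   # w = X^T beta / (2 * lam)
e = y - X @ w                     # equality constraint e_i = y_i - w^T x_i

# Reference: the usual closed-form ridge solution.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Here `e` matches `beta / 2` and `w` matches `w_ridge`, which is the sense in which the dual route arrives at the same solution.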

Kernel Ridge Regression

Kernels are used to compute the inner product of two vectors in some feature space without ever visiting it. We can view a kernel $k$ as $k(\mathbf x_1, \mathbf x_2) = \phi(\mathbf x_1)^T \phi(\mathbf x_2)$, although we may not know what $\phi(\cdot)$ is; we only know that it exists. There are many kernels, e.g. RBF, Polynomial, etc.
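To make the idea concrete, here is a toy sketch with a kernel whose feature map happens to be known: the homogeneous polynomial kernel of degree 2 in two dimensions. The choice of kernel, the function names and the sample vectors are assumptions for illustration only.

```python
import numpy as np

def poly2_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without leaving the input space."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2-d inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_val = poly2_kernel(x, z)      # kernel evaluation
explicit = phi(x) @ phi(z)      # inner product in the feature space
```

`k_val` and `explicit` coincide up to rounding. For kernels such as RBF the feature space is infinite-dimensional, so only the kernel evaluation is practical.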

Suppose we have a kernel $k(\mathbf x_1, \mathbf x_2) = \phi(\mathbf x_1)^T \phi(\mathbf x_2)$. Let $\Phi(X)$ be a matrix where each row is $\phi(\mathbf x_i)$, i.e. $\Phi(X) = \begin{bmatrix} \phi(\mathbf x_1)^T \\ \phi(\mathbf x_2)^T \\ \vdots \\ \phi(\mathbf x_n)^T \end{bmatrix}$.

Now we can just take the solution for Ridge Regression and replace every $X$ with $\Phi(X)$: $\mathbf w = \Phi(X)^T \, (\Phi(X) \Phi(X)^T + \lambda \, I)^{-1} \mathbf y$. For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y = \phi(\mathbf x)^T \Phi(X)^T \, (\Phi(X) \Phi(X)^T + \lambda \, I)^{-1} \mathbf y$.

First, we can replace $\Phi(X) \Phi(X)^T$ by a matrix $K$, calculated as $(K)_{ij} = k(\mathbf x_i, \mathbf x_j)$. Then, $\phi(\mathbf x)^T \Phi(X)^T$ is a row vector whose $i$-th entry is $\phi(\mathbf x)^T \phi(\mathbf x_i) = k(\mathbf x, \mathbf x_i)$. So here we managed to express every dot product of the problem in terms of kernels.

Finally, by letting $\boldsymbol \alpha = (K + \lambda \, I)^{-1} \mathbf y$ (as previously), we obtain $\hat y = \sum\limits_{i = 1}^n \alpha_i k(\mathbf x, \mathbf x_i)$.
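Putting the pieces together, a minimal NumPy sketch of kernel ridge regression could look like the following. The RBF kernel form $k(\mathbf x, \mathbf z) = \exp(-\gamma \| \mathbf x - \mathbf z \|^2)$, the values of `gamma` and `lam`, the helper names and the noisy sine data are all assumptions picked for the example.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Matrix of k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives

def krr_fit(X, y, lam, gamma):
    """alpha = (K + lam I)^{-1} y with K_ij = k(x_i, x_j)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma):
    """y_hat = sum_i alpha_i k(x, x_i) for every row x of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Noisy samples of a non-linear function (assumed toy data).
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=60)

alpha = krr_fit(X, y, lam=1e-2, gamma=1.0)
X_test = np.linspace(-2, 2, 9)[:, None]
y_pred = krr_predict(X, alpha, X_test, gamma=1.0)
```

Note that fitting solves an $n \times n$ system and every prediction needs the kernel against all $n$ training points, which is the price paid for never computing $\phi(\cdot)$ explicitly.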

References

Machine Learning I class at TU Berlin
Elements of Statistical Learning, http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://0agr.ru/wiki/index.php/Normal_Equation
http://stat.wikia.com/wiki/Kernel_Ridge_Regression
http://stat.rutgers.edu/home/tzhang/papers/ml02_dual.pdf
http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf
http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf

— Alexey Grigorev

I am impressed by the well-organized discussion. However, your early reference to "outliers" confused me. It appears the weights $w$ apply to the variables rather than the cases, so how exactly would ridge regression help make the solution robust to outlying cases, as suggested by the illustration?

— whuber

Excellent answer, Alexey (though I wouldn't call it "simple words")! +1 with no questions asked. You like to write in LaTeX, don't you?

— Aleksandr Blekh

I suspect you might be confusing some basic things here. AFAIK, ridge regression is neither a response to nor a way of coping with "noisy observations." OLS already does that. Ridge regression is a tool used to cope with near-collinearity among regressors. Those phenomena are completely different from noise in the dependent variable.

— whuber

+1 whuber. Alexey, you are right that it is overfitting, i.e. too many parameters for the available data, not really noise [and add enough dimensions for a fixed sample size and 'any' data set becomes collinear]. So a better 2-d picture for RR would be all the points clustered around (0,1) with a single point at (1,0) ['justifying' the slope parameter]. See ESL fig 3.9, page 67 web.stanford.edu/~hastie/local.ftp/Springer/OLD/…. Also look at the primal cost function: to increase the weight by 1 unit, the error must decrease by $1/\lambda$ unit.

— seanv507

I believe you meant add $\lambda$ to the diagonal elements of $X^TX$, not subtract(?), in the ridge regression section. I applied the edit.

— Heteroskedastic Jim