15

Estou tentando entender a intuição por trás dos SVMs do kernel. Agora, entendo como o SVM é linear, pelo qual é feita uma linha de decisão que divide os dados da melhor maneira possível. Também entendo o princípio por trás da transferência de dados para um espaço de maior dimensão e como isso pode facilitar a localização de uma linha de decisão linear nesse novo espaço. O que não entendo é como um kernel é usado para projetar pontos de dados para esse novo espaço.

O que eu sei sobre um kernel é que ele representa efetivamente a "similaridade" entre dois pontos de dados. Mas como isso se relaciona com a projeção?

machine-learning svm kernel-trick

— Karnivaurus
fonte

3

Se você for para um espaço dimensional alto o suficiente, todos os pontos de dados de treinamento poderão ser perfeitamente separados por um plano. Isso não significa que terá qualquer poder preditivo. Eu acho que ir para um espaço dimensional muito alto é o equivalente moral (uma forma de) de sobreajuste.

— Mark L. Stone

@ Mark L. Stone: isso está correto (+1), mas ainda pode ser uma boa pergunta perguntar como um kernel pode mapear no espaço dimensional infinito? Como isso funciona? Eu tentei, veja a minha resposta

Eu seria cuidadoso em chamar o recurso de mapeamento de "projeção". O mapeamento de recursos geralmente é uma transformação não linear.

— Paul

Uma publicação muito útil sobre o truque do kernel visualiza o espaço interno do produto do kernel e descreve como os vetores de recursos de alta dimensão são usados para conseguir isso, espero que isso responda de forma concisa à pergunta: eric-kim.net/eric-kim-net/ posts / 1 / kernel_trick.html

— JStrahl 22/11

6

Deixe $h(x)$ ser a projecção de espaço alto dimensão $\mathcal{F}$ . Basicamente, a função de núcleo $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$ , que é o produto interno. Portanto, não é usado para projetar pontos de dados, mas um resultado da projeção. Pode ser considerado uma medida de semelhança, mas em um SVM, é mais do que isso.

A otimização para encontrar o melhor hiperplano de separação em $\mathcal{F}$ envolve $h(x)$ somente através da forma do produto interno. Ou seja, se você conhece $K(\cdot,\cdot)$ , não precisa saber a forma exata de $h(x)$ , o que facilita a otimização.

Cada kernel $K(\cdot,\cdot)$ possui um correspondente $h(x)$ . Portanto, se você estiver usando um SVM com esse kernel, encontrará implicitamente a linha de decisão linear no espaço em que $h(x)$ mapeia.

O capítulo 12 dos Elementos da aprendizagem estatística fornece uma breve introdução ao SVM. Isso fornece mais detalhes sobre a conexão entre o kernel e o mapeamento de recursos: http://statweb.stanford.edu/~tibs/ElemStatLearn/

— Lii
fonte

você quer dizer que para um kernel

K (x, y)

$K(x,y)$ existe um único

subjacente

h (x)

$h(x)$ ?

2

@fcoppens No; para um exemplo trivial, considere

e

. No entanto, existe um espaço único de Hilbert em reprodução, correspondente a esse kernel.

h

$h$

- h

$-h$

— Dougal

@ Dougal: Então eu posso concordar com você, mas na resposta acima foi dito 'um

correspondente ', então eu queria ter certeza. Para o RKHS, mas você acha que é possível explicar de uma maneira "intuitiva" como essa transformação

parece com um Kernel

?

h

$h$

h

$h$

K (x, y)

$K(x,y)$

@fcoppens Em geral, não; é difícil encontrar representações explícitas desses mapas. Para certos núcleos, isso não é muito difícil ou já foi feito antes.

— Dougal

1

@fcoppens você está certo, o h (x) não é único. Você pode fazer alterações facilmente em h (x), mantendo o produto interno <h (x), h (x ')> igual. No entanto, você pode considerá-las como funções básicas, e o espaço que elas ocupam (ou seja, o RKHS) é único.

— Lii

4

As propriedades úteis do SVM do kernel não são universais - elas dependem da escolha do kernel. Para obter intuição, é útil olhar para um dos kernels mais usados, o kernel Gaussiano. Notavelmente, esse kernel transforma o SVM em algo muito parecido com um classificador vizinho k-mais próximo.

Esta resposta explica o seguinte:

Por que a separação perfeita de dados positivos e negativos de treinamento é sempre possível com um kernel gaussiano de largura de banda suficientemente pequena (ao custo de sobreajuste)
Como essa separação pode ser interpretada como linear em um espaço de feição.
Como o kernel é usado para construir o mapeamento do espaço de dados para o espaço de recursos. Spoiler: o espaço de feição é um objeto matematicamente abstrato, com um produto interno abstrato incomum baseado no kernel.

1. Alcançar a separação perfeita

A separação perfeita é sempre possível com um kernel gaussiano devido às propriedades de localização do kernel, que levam a um limite de decisão arbitrariamente flexível. Para uma largura de banda suficientemente pequena do kernel, o limite de decisão parecerá que você desenhou pequenos círculos em torno dos pontos sempre que necessário para separar os exemplos positivos e negativos:

(Crédito: curso de aprendizado de máquina on-line de Andrew Ng ).

Então, por que isso ocorre de uma perspectiva matemática?

Considere a configuração padrão: você tem um kernel gaussiano e dados de treinamento $K(\mathbf{x},\mathbf{z}) = \exp(- ||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ onde os valores de são . Queremos aprender uma função classificadora $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$ $y^{(i)}$ $\pm 1$

\hat{y} (x) = \sum_{i} w_{i} y^{(i)} K (x^{(i)}, x)

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$

Agora, como vamos atribuir os pesos ? Precisamos de espaços dimensionais infinitos e um algoritmo de programação quadrática? Não, porque eu só quero mostrar que posso separar os pontos perfeitamente. Então eu faço um bilhão de vezes menor que a menor separação entre dois exemplos de treinamento, e apenas defino . Isto significa que todos os pontos de treinamento são um bilhão de sigmas além, tanto quanto o kernel está em causa, e cada ponto controla completamente o sinal de $w_i$ $\sigma$ $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ $w_i = 1$ $\hat{y}$ no seu bairro. Formalmente, temos

\hat{y} (x^{(k)}) = \sum_{i = 1}^{n} y^{(k)} K (x^{(i)}, x^{(k)}) = y^{(k)} K (x^{(k)}, x^{(k)}) + \sum_{i \neq k} y^{(i)} K (x^{(i)}, x^{(k)}) = y^{(k)} + ϵ

$\hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(k)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$

onde é algum valor arbitrariamente pequeno. Sabemos é pequena porque é um bilhão de sigmas longe de qualquer outro ponto, assim, para todos temos $\epsilon$ $\epsilon$ $\mathbf{x}^{(k)}$ $i \neq k$

K (x^{(i)}, x^{(k)}) = \exp (- | | x^{(i)} - x^{(k)} | |^{2} / σ^{2}) \approx 0.

$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(- ||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$

Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$ , and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can act very similar to a nearest neighbor classifier.

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product some new feature space:

K (x^{(i)}, x^{(j)}) = ⟨ Φ (x^{(i)}), Φ (x^{(j)}) ⟩

$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that the $\hat{y}(\mathbf{x})$ function as a linear function in the feature space:

\hat{y} (x) = \sum_{i} w_{i} y^{(i)} ⟨ Φ (x^{(i)}), Φ (x) ⟩ = L (Φ (x))

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

L (v) = \sum_{i} w_{i} y^{(i)} ⟨ Φ (x^{(i)}), v ⟩

$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$

This function is linear in $\mathbf{v}$ because it's just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$ , the level set of a linear function. This is the very definition of a hyperplane in the feature space.

3. How the kernel is used to construct the feature space

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$ . It is possible to write down a formula for $\Phi$ but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you're still interested, here's how it works.

Basically we define an abstract vector space $V$ where each vector is a function from $\mathcal{X}$ to $\mathbb{R}$ . A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

f (x) = \sum_{i = 1}^{n} α_{i} K (x^{(i)}, x)

$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$ (Here the

x^{(i)}

$\mathbf{x}^{(i)}$ are just an arbitrary set of points and need not be the same as the training set.) It is convenient to write

f

$f$ more compactly as

f = \sum_{i = 1}^{n} α_{i} K_{x^{(i)}}

$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$ where

K_{x} (y) = K (x, y)

$K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at

x

$\mathbf{x}$ .

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

⟨ \sum_{i = 1}^{n} α_{i} K_{x^{(i)}}, \sum_{j = 1}^{n} β_{j} K_{x^{(j)}} ⟩ = \sum_{i, j} α_{i} β_{j} K (x^{(i)}, x^{(j)})

$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$

This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$ .

With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \rightarrow V$ , taking each point $\mathbf{x}$ to the "kernel slice" at that point:

Φ (x) = K_{x}, where K_{x} (y) = K (x, y) .

$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}).$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.

— Paul
fonte

Great explanation, but I think you have missed a minus for the definition of the gaussian kernel. K(x,z)=exp(-||x−z||2/σ2) . As it's written, it does not make sense with the ϵ found in the part (1)

— hqxortn

1

For the background and the notations I refer to How to calculate decision boundary from support vectors?.

So the features in the 'original' space are the vectors $x_i$ , the binary outcome $y_i \in \{-1, +1\}$ and the Lagrange multipliers are $\alpha_i$ .

As said by @Lii (+1) the Kernel can be written as $K(x,y)=h(x) \cdot h(y)$ (' $\cdot$ ' represents the inner product.

I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof, it just wants to give some feeling of how I think that this works. Do not hesitate to correct me if I am wrong.

I have to 'transform' my feature space (so my $x_i$ ) into some 'new' feature space in which the linear separation will be solved.

For each observation $x_i$ , I define functions $\phi_i(x)=K(x_i,x)$ , so I have a function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space. The vector space spanned by the $\phi_i$ , note it $V=span(\phi_{i, i=1,2,\dots N})$ .

I will try to argue that is the vector space in which linear separation will be possible. By definition of the span, each vector in the vector space $V$ can be written as as a linear combination of the $\phi_i$ , i.e.: $\sum_{i=1}^N \gamma_i \phi_i$ , where $\gamma_i$ are real numbers.

$N$ is the size of the training sample and therefore the dimension of the vector space $V$ can go up to $N$ , depending on whether the $\phi_i$ are linear independent. As $\phi_i(x)=K(x_i,x)$ (see supra, we defined $\phi$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.

The transformation, that maps my original feature space to $V$ is defined as

$\Phi: x_i \to \phi(x)=K(x_i, x)$ .

This map $\Phi$ maps my original feature space onto a vector space that can have a dimension that goed up to the size of my training sample.

Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample and (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample and (d) the vectors of $V$ look like $\sum_{i=1}^N \gamma_i \phi_i$ , where $\gamma_i$ , $\gamma_i$ are real numbers.

Looking at the function $f(x)$ in How to calculate decision boundary from support vectors? it can be seen that $f(x)=\sum_i y_i \alpha_i \phi_i(x)+b$ .

In other words, $f(x)$ is a linear combination of the $\phi_i$ and this is a linear separator in the V-space : it is a particular choice of the $\gamma_i$ namely $\gamma_i=\alpha_i y_i$ !

The $y_i$ are known from our observations, the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words SVM find, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$ -spave.

This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ , with a different dimension. This dimension depends on the kernel you use and for the RBF kernel this dimension can go up to the size of the training sample.

So kernels are a technique that allows SVM to transform your feature space , see also What makes the Gaussian kernel so magical for PCA, and also in general?

— Community
fonte

"for each element of my training sample" -- is element here referring to a row or column (i.e. feature )

— user1761806

what is x and x_i? If my X is an input of 5 columns, and 100 rows, what would x and x_i be?

— user1761806

@user1761806 an element is a row. The notation is explained in the link at the beginning of the answer

1

Transforme preditores (dados de entrada) em um espaço de recurso de alta dimensão. É suficiente especificar apenas o kernel para esta etapa e os dados nunca são explicitamente transformados no espaço de recursos. Esse processo é conhecido como truque do kernel.

Deixe-me explicar. O truque do kernel é a chave aqui. Considere o caso de um núcleo de função de base radial (RBF) aqui. Transforma a entrada em espaço dimensional infinito. A transformação da entrada $x$ para $\phi(x)$ pode ser representado como mostrado abaixo (extraído de http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf )

O espaço de entrada é dimensional finito, mas o espaço transformado é dimensional infinito. Transformar a entrada em um espaço dimensional infinito é algo que acontece como resultado do truque do kernel. Aqui $x$ qual é a entrada e $\phi$ é a entrada transformada. Mas $\phi$ não é computado como é, em vez disso, o produto $\phi(x_i)^T\phi(x)$ é calculado, que é apenas o exponencial da norma entre $x_i$ e $x$ .

Existe uma pergunta relacionada Mapa de recursos para o kernel Gaussiano, para o qual existe uma boa resposta /stats//a/69767/86202 .

A função de saída ou decisão é uma função da matriz do kernel $K(x_i,x)=\phi(x_i)^T\phi(x)$ e não da entrada $x$ ou entrada transformada $\phi$ diretamente.

— prashanth
fonte

0

Mapear para uma dimensão superior é apenas um truque para resolver um problema definido na dimensão original; portanto, preocupações como sobreajustar seus dados entrando em uma dimensão com muitos graus de liberdade não são um subproduto do processo de mapeamento, mas são inerentes à sua definição de problema.

Basicamente, tudo o que o mapeamento faz é converter a classificação condicional na dimensão original em uma definição de plano na dimensão superior e, como existe uma relação de 1 para 1 entre o plano na dimensão superior e suas condições na dimensão inferior, você sempre pode mover entre os dois.

Tomando o problema de sobreajuste, claramente, é possível sobreaquecer qualquer conjunto de observações, definindo condições suficientes para isolar cada observação em sua própria classe, o que equivale a mapear seus dados para (n-1) D onde n é o número de suas observações .

Tomando o problema mais simples, onde suas observações são [[1, -1], [0,0], [1,1]] [[recurso, valor]], movendo-se para a dimensão 2D e separando seus dados com uma linha , você está simplesmente transformando a classificação condicional em feature < 1 && feature > -1 : 0para definir uma linha que passa (-1 + epsilon, 1 - epsilon). Se você tivesse mais pontos de dados e necessitasse de mais condições, basta adicionar mais um grau de liberdade à sua dimensão superior a cada nova condição que você definir.

Você pode substituir o processo de mapeamento para uma dimensão superior por qualquer processo que forneça uma relação de 1 para 1 entre as condições e os graus de liberdade do seu novo problema. Os truques do kernel simplesmente fazem isso.

— Hou
fonte

1

Como um exemplo diferente, considere o problema em que o fenômeno resulta em observações da forma de [x, floor(sin(x))]. Mapear seu problema em uma dimensão 2D não é útil aqui; de fato, o mapeamento para qualquer plano não será útil aqui, porque definir o problema como um conjunto de x < a && x > b : znão é útil nesse caso. O mapeamento mais simples nesse caso é o mapeamento em uma coordenada polar ou no plano imaginário.

— Hou

Kernel SVM: Eu quero uma compreensão intuitiva do mapeamento para um espaço de maior dimensão e como isso torna possível a separação linear

1. Alcançar a separação perfeita

2. Kernel SVM learning as linear separation

3. How the kernel is used to construct the feature space