First problem: Minimizing $\|w\|$ or $\|w\|^2$:
It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the 2, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.
$\|w\|$ can be calculated as $\sqrt{w^T w}$. As the square root is a monotonically increasing function, any point $x$ which minimizes $\sqrt{f(x)}$ also minimizes $f(x)$ (for non-negative $f$). We thus don't have to compute the square root at all and can minimize $w^T w = \|w\|^2$ instead.
Finally, as we often have to calculate derivatives, we multiply the whole expression by a factor $\frac{1}{2}$. This is done very often because of the derivative: $\frac{d}{dx}x^2 = 2x$, and thus $\frac{d}{dx}\frac{1}{2}x^2 = x$, so the factor of 2 cancels.
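The same cancellation, written out for the gradient with respect to $w$ (a worked step I'm adding here, using the convention $\|w\|^2 = w^T w$):
$$\nabla_w \left(\tfrac{1}{2}\|w\|^2\right) = \tfrac{1}{2}\,\nabla_w\!\left(w^T w\right) = \tfrac{1}{2}\cdot 2w = w.$$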
This is how we end up with the problem: minimize $\frac{1}{2}\|w\|^2$.
tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2}\|w\|^2$ would work.
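If you want to convince yourself of this numerically, here is a minimal sketch (mine, not part of the derivation above): it solves both problems with SciPy's general-purpose SLSQP solver on a tiny hand-made data set (the data points, starting value and tolerance are just assumptions for illustration) and checks that both objectives return the same hyperplane.

```python
# My own sanity check, not part of the original answer: under the same
# constraints y_i(<w, x_i> + b) >= 1, minimizing ||w|| and minimizing
# 1/2 ||w||^2 should yield the same hyperplane (w, b).
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data (chosen only for illustration).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# One inequality constraint per training point; t = (w_1, w_2, b).
cons = [{"type": "ineq",
         "fun": lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0}
        for i in range(len(y))]

def solve(objective):
    # Start from a feasible point; SLSQP handles the inequality constraints.
    res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
                   constraints=cons, method="SLSQP")
    return res.x

sol_norm = solve(lambda t: np.linalg.norm(t[:2]))   # minimize ||w||
sol_sq   = solve(lambda t: 0.5 * t[:2] @ t[:2])     # minimize 1/2 ||w||^2

print(sol_norm, sol_sq)                             # both should be ~ (0.5, 0.5, 0.0)
print(np.allclose(sol_norm, sol_sq, atol=1e-3))     # expected: True
```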
Second problem: $\geq 0$ or $\geq 1$:
As already stated in the question, $y_i(\langle w, x_i\rangle + b) \geq 0$ only means that the point is on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away as the margin (in which case the point is a support vector), or even further away.
Remember the definition of the hyperplane,
$$H = \{x \mid \langle w, x\rangle + b = 0\}.$$
This description, however, is not unique: if we scale $w$ and $b$ by a constant $c \neq 0$, we get an equivalent description of the same hyperplane. To make sure our optimization algorithm cannot simply rescale $w$ and $b$ to make $\frac{1}{\|w\|}$ larger without actually moving the hyperplane, we fix the scale: we require that the points closest to the hyperplane, the support vectors, satisfy $|\langle w, x_i\rangle + b| = 1$, so that their geometric distance to the hyperplane, the margin, is $\frac{1}{\|w\|}$. A support vector is thus characterized by $y_i(\langle w, x_i\rangle + b) = 1$.
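A quick numerical illustration of this non-uniqueness (my own sketch; the numbers are arbitrary): the signed distance of a point to the hyperplane, $\frac{\langle w, x\rangle + b}{\|w\|}$, does not change when $(w, b)$ is replaced by $(cw, cb)$ for $c > 0$.

```python
# My own illustration with arbitrary numbers: scaling (w, b) by c > 0 describes
# the same hyperplane -- the signed distance (<w, x> + b) / ||w|| is unchanged.
import numpy as np

w, b, c = np.array([2.0, -1.0]), 0.5, 7.3   # arbitrary values for illustration
x = np.array([1.0, 3.0])                    # an arbitrary test point

d_original = (w @ x + b) / np.linalg.norm(w)
d_scaled   = (c * w @ x + c * b) / np.linalg.norm(c * w)
print(np.isclose(d_original, d_scaled))     # True
```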
As already mentioned earlier, we want all points to either be support vectors or lie even further away from the hyperplane. In training, we thus add the constraint $y_i(\langle w, x_i\rangle + b) \geq 1$, which ensures exactly that.
tl;dr: Training points don't only need to be classified correctly, they have to lie on the margin or further away.
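Putting both pieces together, here is a minimal sketch of the resulting optimization problem (again mine, not from the answer above): it solves $\min \frac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i\rangle + b) \geq 1$ with a general-purpose solver on a hypothetical toy data set (data, seed and tolerances are assumptions for illustration; in practice one would use a dedicated QP or SVM solver) and then reads off the support vectors as the points with $y_i(\langle w, x_i\rangle + b) \approx 1$.

```python
# My own sketch of the hard-margin primal, not the answer's code:
#   minimize 1/2 ||w||^2   subject to   y_i (<w, x_i> + b) >= 1
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data (assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# t = (w_1, w_2, b); one inequality constraint per training point.
cons = [{"type": "ineq",
         "fun": lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0}
        for i in range(len(y))]

res = minimize(lambda t: 0.5 * t[:2] @ t[:2],        # 1/2 ||w||^2
               x0=np.array([1.0, 1.0, 0.0]),
               constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]

functional_margins = y * (X @ w + b)                  # all should be >= 1
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:", np.where(np.isclose(functional_margins, 1.0, atol=1e-3))[0])
```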