Which loss function is correct for logistic regression?

31

I have read about two versions of the loss function for logistic regression. Which of them is correct, and why?

From Machine Learning by Zhou Z.H. (in Chinese), with $\beta = (w, b)$ and $\beta^Tx=w^Tx +b$:

$l(\beta) = \sum\limits_{i=1}^{m}\Big(-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})\Big) \tag 1$

From my college course, with $z_i = y_if(x_i)=y_i(w^Tx_i + b)$:

$L(z_i)=\log(1+e^{-z_i}) \tag 2$

I know that the first one is an accumulation over all samples and the second one is for a single sample, but I am more curious about the difference in the form of the two loss functions. Somehow I have a feeling that they are equivalent.

logistic loss-functions

— xtt

31

The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.

Define the logistic function as $f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$. Or in other words:

$\frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}.$

If you take the reciprocal of both sides and then take the log, you get:

$\ln(1+e^{z}) = \ln(1+e^{-z}) + z.$

Subtract $z$ from both sides and, with $z = y_i\beta^Tx_i$, you should see this:

$-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i}) = L(z_i).$
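The identity $\ln(1+e^{z}) = \ln(1+e^{-z}) + z$ used above can be checked numerically. The following is only a minimal sketch in plain Python; the helper name `softplus` and the sample values of `z` are illustrative choices, not part of the original answer.

```python
import math

def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    # max(z, 0) + ln(1 + e^{-|z|}) avoids overflow in exp for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Illustrative sample points (assumed values, chosen to include large |z|).
for z in (-30.0, -2.5, 0.0, 1.0, 40.0):
    lhs = softplus(z)         # ln(1 + e^z)
    rhs = softplus(-z) + z    # ln(1 + e^{-z}) + z
    assert math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-12)
```

Subtracting `z` from both sides of the checked identity is exactly the last step of the derivation.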

Edit:

At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$. Maybe there is a typo in the original question.

Edit 2:

In case there was no typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am rewriting it differently here, because he introduces a new equivocation in the notation $z_i$. The rest follows by taking the negative log-likelihood for each coding of $y$. See his answer below for more details.

— Taylor

42

The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single one vs. all of them). However, the actual difference is simply how we select our training labels.

In the case of binary classification we may assign the labels $y=\pm1$ or $y=0,1$ .

As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, approaching $0$ as $z\rightarrow -\infty$ and $1$ as $z\rightarrow +\infty$. If we pick the labels $y=0,1$ we may assign

$\begin{equation} \begin{aligned} \mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\ \mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\ \end{aligned} \end{equation}$

which can be written more compactly as $\mathbb{P}(y|z) =\sigma(z)^y(1-\sigma(z))^{1-y}$ .

It is easier to maximize the log-likelihood. Maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$ , after taking the natural logarithm and some simplification, we will find out:

$\begin{equation} \begin{aligned} l(z)=-\log\big(\prod_i^m\mathbb{P}(y_i|z_i)\big)=-\sum_i^m\log\big(\mathbb{P}(y_i|z_i)\big)=\sum_i^m\big(-y_iz_i+\log(1+e^{z_i})\big) \end{aligned} \end{equation}$
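
The simplification can be spelled out for a single sample. This is only a sketch using the definitions above, with $y_i \in \{0,1\}$:

$\begin{equation} \begin{aligned} -\log \mathbb{P}(y_i|z_i) &= -y_i\log\sigma(z_i)-(1-y_i)\log\big(1-\sigma(z_i)\big)\\ &= -y_i\log\frac{\sigma(z_i)}{1-\sigma(z_i)}-\log\big(1-\sigma(z_i)\big)\\ &= -y_iz_i+\log(1+e^{z_i}) \end{aligned} \end{equation}$

where the last line uses $\frac{\sigma(z)}{1-\sigma(z)}=e^{z}$ and $1-\sigma(z)=\frac{1}{1+e^{z}}$.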

Full derivation and additional information can be found on this jupyter notebook. On the other hand, we may have instead used the labels $y=\pm 1$ . It is pretty obvious then that we can assign

$\begin{equation} \mathbb{P}(y|z)=\sigma(yz). \end{equation}$

It is also obvious that $\mathbb{P}(y=0|z)=\mathbb{P}(y=-1|z)=\sigma(-z)$ . Following the same steps as before we minimize in this case the loss function

$\begin{equation} \begin{aligned} L(z)=-\log\big(\prod_j^m\mathbb{P}(y_j|z_j)\big)=-\sum_j^m\log\big(\mathbb{P}(y_j|z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}) \end{aligned} \end{equation}$

where the last step follows because the negative sign turns the log of $\sigma$ into the log of its reciprocal. While we should not equate these two forms, given that $y$ takes different values in each, they are nevertheless equivalent:

$\begin{equation} \begin{aligned} -y_iz_i+\log(1+e^{z_i})\equiv \log(1+e^{-y_jz_j}) \end{aligned} \end{equation}$

The case $y_i=1$ is trivial to show. If $y_i \neq 1$ , then $y_i=0$ on the left hand side and $y_i=-1$ on the right hand side.
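
The equivalence of the two per-sample forms can also be checked numerically by mapping the labels $-1 \mapsto 0$ and $+1 \mapsto 1$. The sketch below is an illustration in plain Python; the function names `loss_01` and `loss_pm1` and the sample values of `z` are assumptions made for this example.

```python
import math

def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def loss_01(y, z):
    """Per-sample loss with y in {0, 1}: -y*z + ln(1 + e^z)."""
    return -y * z + softplus(z)

def loss_pm1(y, z):
    """Per-sample loss with y in {-1, +1}: ln(1 + e^{-y*z})."""
    return softplus(-y * z)

# Illustrative scores z (assumed values); both label codings are compared.
for z in (-4.0, -0.3, 0.0, 2.0, 7.5):
    for y_pm1 in (-1, 1):
        y_01 = (y_pm1 + 1) // 2   # map -1 -> 0 and +1 -> 1
        assert math.isclose(loss_01(y_01, z), loss_pm1(y_pm1, z), abs_tol=1e-12)
```

Note that $z$ here is the raw score $w^Tx+b$ in both functions; only the coding of the label changes.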

While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2l(z)$ , both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
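
As an illustration of that property, for a single sample with $y\in\{0,1\}$ the derivatives of $-yz+\log(1+e^{z})$ with respect to the scalar $z$ are $\sigma(z)-y$ and $\sigma(z)(1-\sigma(z))$. The sketch below compares them against central finite differences; the step size `h`, the test points, and the function names are assumptions for this example, and the derivatives are taken in $z$ only (the gradient in the weights would follow by the chain rule).

```python
import math

def sigmoid(z):
    """Logistic function, evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss(y, z):
    """Per-sample loss for y in {0, 1}: -y*z + ln(1 + e^z)."""
    return -y * z + max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dloss(y, z):
    """First derivative in z: sigma(z) - y."""
    return sigmoid(z) - y

def d2loss(z):
    """Second derivative in z: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

h = 1e-5  # finite-difference step (assumed value)
for y in (0, 1):
    for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
        fd1 = (loss(y, z + h) - loss(y, z - h)) / (2 * h)
        fd2 = (dloss(y, z + h) - dloss(y, z - h)) / (2 * h)
        assert abs(fd1 - dloss(y, z)) < 1e-6
        assert abs(fd2 - d2loss(z)) < 1e-6
        assert d2loss(z) >= 0.0   # non-negative curvature in z
```

The non-negative second derivative is what makes the per-sample loss convex in $z$.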

— Manuel Morales

Is logistic loss function convex?

— user85361

2

Log reg $l(z)$ IS convex, but not $\alpha$-convex. Thus we can't place a bound on how long gradient descent takes to converge. We can adjust the form of $l$ to make it strongly convex by adding a regularization term: with positive constant $\lambda$ define our new function to be $l'(z)=l(z)+\lambda\|z\|^2$ s.t. $l'(z)$ is $\lambda$-strongly convex and we can now prove the convergence bound of $l'$. Unfortunately, we are now minimizing a different function! Luckily, we can show that the value of the optimum of the regularized function is close to the value of the optimum of the original.

— Manuel Morales

The notebook you referred to is gone; I found another proof: statlect.com/fundamentals-of-statistics/…

— Domi.Zhang

2

I found this to be the most helpful answer.

— mohit6up

@ManuelMorales Do you have a link to the regularized function's optimum value being close to the original?

— Mark

19

I learned the loss function for logistic regression as follows.

Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let $P(y=1|x)$ be the probability that the binary output $y$ is 1 given the input feature vector $x$ . The coefficients $w$ are the weights that the algorithm is trying to learn.

$P(y=1|x) = \frac{1}{1 + e^{-w^{T}x}}$

Because logistic regression is binary, the probability $P(y=0|x)$ is simply 1 minus the term above.

$P(y=0|x) = 1- \frac{1}{1 + e^{-w^{T}x}}$

For one training example, the loss function $J(w)$ adds (A) the label $y$ multiplied by $\log P(y=1)$ and (B) $1-y$ multiplied by $\log P(y=0)$; this is then summed over $m$ training examples.

$J(w) = \sum_{i=1}^{m} y^{(i)} \log P(y=1) + (1 - y^{(i)}) \log P(y=0)$

where $y^{(i)}$ indicates the $i^{th}$ label in your training data. If a training instance has a label of $1$ , then $y^{(i)}=1$ , leaving the left summand in place but making the right summand with $1-y^{(i)}$ become $0$ . On the other hand, if a training instance has $y=0$ , then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes $0$ . Log probability is used for ease of calculation.

If we then replace $P(y=1)$ and $P(y=0)$ with the earlier expressions, then we get:

$J(w) = \sum_{i=1}^{m} y^{(i)} \log \left(\frac{1}{1 + e^{-w^{T}x^{(i)}}}\right) + (1 - y^{(i)}) \log \left(1- \frac{1}{1 + e^{-w^{T}x^{(i)}}}\right)$

You can read more about this form in these Stanford lecture notes.
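
As a rough illustration, $J(w)$ can be evaluated directly from this formula. The sketch below uses plain Python lists; the toy data `X`, `y` and the weights `w` are made-up values for this example. As written, the sum is a log-likelihood, so it is never positive and larger values mean a better fit.

```python
import math

def sigmoid(z):
    """Logistic function, evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(w, X, y):
    """Sum over samples of y*log P(y=1|x) + (1-y)*log P(y=0|x)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_k * x_k for w_k, x_k in zip(w, x_i))   # w^T x
        p1 = sigmoid(z)                                   # P(y=1|x)
        # Caveat: math.log fails if p1 rounds to exactly 0 or 1 (very large |z|).
        total += y_i * math.log(p1) + (1 - y_i) * math.log(1.0 - p1)
    return total

# Hypothetical toy data: first feature is a constant 1 acting as a bias term.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
w = [0.1, 0.8]
J = log_likelihood(w, X, y)
```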

— stackoverflowuser2010

This answer also provides some relevant perspective here.

— GeoMatt22

6

The expression you have is not a loss (to be minimized), but rather a log-likelihood (to be maximized).

— xenocyon

2

@xenocyon true - this same formulation is typically written with a negative sign applied to the full summation.

— Alex Klibisz

1

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

$\begin{align}\newcommand{\Cost}{{\rm Cost}}\newcommand{\if}{{\rm if}} j(\theta) &= \frac 1 m \sum_{i=1}^m \Cost(h_\theta(x^{(i)}), y^{(i)}) & & \\ \Cost(h_\theta(x), y) &= -\log(h_\theta(x)) & \if\ y &= 1 \\ \Cost(h_\theta(x), y) &= -\log(1-h_\theta(x)) & \if\ y &= 0 \end{align}$

When we put them together we have:

$j(\theta) = -\frac 1 m \sum_{i=1}^m \big[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \big]$

Multiplying by $y$ and $(1−y)$ in the above equation is a trick that lets us use the same equation for both the $y=1$ and $y=0$ cases. If $y=0$, the first term vanishes. If $y=1$, the second term vanishes. In both cases we only perform the operation we need to perform.

If you don't want to use a for loop, you can try a vectorized form of the equation above:

$\begin{align} h &= g(X\theta) \\ J(\theta) &= \frac 1 m \cdot \big(-y^T\log(h)-(1-y)^T\log(1-h)\big) \end{align}$
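
The two lines above can be mirrored step by step in code. The sketch below uses plain Python lists so that it runs without extra packages; with an array library the two sums would become inner products. The function names and the toy data `X`, `y` are assumptions for illustration only.

```python
import math

def g(z):
    """Sigmoid, evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cost(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = len(y)
    # h = g(X theta): one predicted probability per row of X
    h = [g(sum(t * x for t, x in zip(theta, row))) for row in X]
    # -y^T log(h)
    term1 = -sum(y_i * math.log(h_i) for y_i, h_i in zip(y, h))
    # -(1 - y)^T log(1 - h)
    term0 = -sum((1 - y_i) * math.log(1.0 - h_i) for y_i, h_i in zip(y, h))
    return (term1 + term0) / m

# Hypothetical toy data: first column is a constant 1 acting as a bias term.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
y = [1, 0, 1, 0]

baseline = cost([0.0, 0.0], X, y)   # every prediction is 0.5 at theta = 0
fitted = cost([0.0, 1.0], X, y)     # a theta that separates the toy data
```

At $\theta = 0$ every prediction is $0.5$, so the cost is $\log 2$; any $\theta$ that pushes predictions toward the correct labels gives a smaller value.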

The entire explanation can be viewed in the Machine Learning Cheatsheet.

— Emanuel Fontelles