Backpropagation with Softmax / Cross Entropy

40

I am trying to understand how backpropagation works for a softmax / cross-entropy output layer.

The cross-entropy error function is

$E(t,o)=-\sum_j t_j \log o_j$

with $t$ and $o$ as the target and output at neuron $j$, respectively. The sum is over each neuron in the output layer. $o_j$ itself is the result of the softmax function:

$o_j=\mathrm{softmax}(z_j)=\frac{e^{z_j}}{\sum_j e^{z_j}}$

Again, the sum is over each neuron in the output layer and $z_j$ is the input to neuron $j$:

$z_j=\sum_i w_{ij}o_i+b$

That is the sum over all neurons in the previous layer, with their corresponding outputs $o_i$ and weights $w_{ij}$ towards neuron $j$, plus a bias $b$.
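The three definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the example numbers, and the use of one bias per output neuron (`b[j]` instead of a single `b`) are assumptions, and the softmax subtracts `max(z)` purely for numerical stability, which does not change its value.

```python
import math

def softmax(z):
    """o_j = exp(z_j) / sum_k exp(z_k), shifted by max(z) for stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(t, o):
    """E(t, o) = -sum_j t_j * log(o_j)."""
    return -sum(tj * math.log(oj) for tj, oj in zip(t, o))

def layer_input(w, prev, b):
    """z_j = sum_i w[i][j] * prev[i] + b[j]; w[i][j] links neuron i to output j."""
    return [sum(w[i][j] * prev[i] for i in range(len(prev))) + b[j]
            for j in range(len(b))]

# Hypothetical values: 2 neurons in the previous layer, 3 output neurons.
prev = [0.5, -1.0]
w = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
b = [0.0, 0.0, 0.0]
t = [0.0, 1.0, 0.0]  # one-hot target

z = layer_input(w, prev, b)
o = softmax(z)
E = cross_entropy(t, o)
print(o, E)
```

The outputs `o` sum to 1, and with a one-hot target the error reduces to minus the log of the probability assigned to the correct class.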

Now, to update a weight $w_{ij}$ that connects a neuron $j$ in the output layer with a neuron $i$ in the previous layer, I need to calculate the partial derivative of the error function using the chain rule:

$\frac{\partial E} {\partial w_{ij}}=\frac{\partial E} {\partial o_j} \frac{\partial o_j} {\partial z_{j}} \frac{\partial z_j} {\partial w_{ij}}$

with $z_j$ as the input to neuron $j$.

The last term is quite simple. Since there is only one weight between $i$ and $j$, the derivative is:

$\frac{\partial z_j} {\partial w_{ij}}=o_i$

The first term is the derivative of the error function with respect to the output $o_j$:

$\frac{\partial E} {\partial o_j} = \frac{-t_j}{o_j}$

The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:

$\frac{\partial o_j} {\partial z_{j}}=\frac{\partial} {\partial z_{j}} \frac{e^{z_j}}{\sum_j e^{z_j}}$

Let's say we have three output neurons corresponding to the classes $a,b,c$; then $o_b = \mathrm{softmax}(b)$ is:

$o_b=\frac{e^{z_b}}{\sum e^{z}}=\frac{e^{z_b}}{e^{z_a}+e^{z_b}+e^{z_c}}$

and its derivative using the quotient rule:

$\frac{\partial o_b} {\partial z_{b}}=\frac{e^{z_b}\cdot\sum e^z - (e^{z_b})^2}{(\sum e^{z})^2}=\frac{e^{z_b}}{\sum e^z}-\frac{(e^{z_b})^2}{(\sum e^z)^2}$

$=\mathrm{softmax}(b)-\mathrm{softmax}^2(b)=o_b-o_b^2=o_b(1-o_b)$

Back to the middle term for backpropagation this means:

$\frac{\partial o_j} {\partial z_{j}}=o_j(1-o_j)$

Putting it all together I get

$\frac{\partial E} {\partial w_{ij}}= \frac{-t_j}{o_j}\cdot o_j(1-o_j)\cdot o_i=-t_j(1-o_j)\cdot o_i$

which means, if the target for this class is $t_j=0$ , then I will not update the weights for this. That does not sound right.
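One way to test this suspicion is to compare the formula against a finite-difference estimate. The sketch below is an illustration with made-up numbers and hypothetical helper names; for brevity it checks the derivative with respect to $z_j$ only, since the factor $o_i$ is common to both sides.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, t):
    """Cross entropy of softmax(z) against target t."""
    return -sum(tj * math.log(oj) for tj, oj in zip(t, softmax(z)))

# Hypothetical inputs to three output neurons; class 1 is the true class.
z = [0.2, -0.4, 1.0]
t = [0.0, 1.0, 0.0]
o = softmax(z)

j = 0        # an output neuron whose target is 0
eps = 1e-5

z_plus = list(z)
z_plus[j] += eps
z_minus = list(z)
z_minus[j] -= eps

# Central finite difference of E with respect to z_j.
numeric = (loss(z_plus, t) - loss(z_minus, t)) / (2 * eps)

# The expression derived above, without the common factor o_i.
diagonal_only = -t[j] * (1 - o[j])

print("finite difference:", numeric)
print("derived formula:  ", diagonal_only)
```

The finite difference is clearly nonzero while the derived expression vanishes, so the derivation is dropping a contribution somewhere.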

Investigating on this I found people having two variants for the softmax derivation, one where $i=j$ and the other for $i\ne j$ , like here or here.

But I can't make any sense out of this. Also, I'm not even sure if this is the cause of my error, which is why I'm posting all of my calculations. I hope someone can clarify where I am missing something or going wrong.

— micha

The links you have given are calculating the derivative relative to the input, whilst you're calculating the derivative relative to the weights.

— Jenkar

35

Note: I am not an expert on backprop, but now having read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (include a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!

I noticed a couple of inconsistencies in your derivation. I do not do neural networks really, so the following may be incorrect. However, here is how I would go about the problem.

First, you need to take account of the summation in $E$ , and you cannot assume each term only depends on one weight. So taking the gradient of $E$ with respect to component $k$ of $z$ , we have

$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$

Then, writing $o_j$ as

$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$

we have

$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$

where $\delta_{jk}$ is the Kronecker delta. The derivative of the softmax denominator is

$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$

which gives

$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$

or, expanding the log,

$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$

Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
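The Jacobian $\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$ can be checked numerically. The following is a minimal sketch in plain Python; the function names and the test vector are assumptions chosen for illustration.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(o):
    """jac[j][k] = d o_j / d z_k = o_j * (delta_jk - o_k)."""
    n = len(o)
    return [[o[j] * ((1.0 if j == k else 0.0) - o[k]) for k in range(n)]
            for j in range(n)]

def numeric_jacobian(z, eps=1e-5):
    """Central finite differences, one column (one z_k) at a time."""
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        z_plus = list(z)
        z_plus[k] += eps
        z_minus = list(z)
        z_minus[k] -= eps
        o_plus, o_minus = softmax(z_plus), softmax(z_minus)
        for j in range(n):
            jac[j][k] = (o_plus[j] - o_minus[j]) / (2 * eps)
    return jac

z = [0.5, -1.0, 2.0]  # hypothetical logits
analytic = softmax_jacobian(softmax(z))
numeric = numeric_jacobian(z)
max_diff = max(abs(analytic[j][k] - numeric[j][k])
               for j in range(3) for k in range(3))
print(max_diff)
```

The off-diagonal entries ($j\ne k$) equal $-o_jo_k$ and are generally nonzero, which is exactly the part a diagonal-only derivation leaves out.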

So the gradient of $E$ with respect to $z$ is then

$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$

where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).

This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$ . Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
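As a sanity check of $\frac{\partial E}{\partial z_k}=o_k\tau-t_k$, here is a small sketch that compares it with finite differences, once for a deliberately unnormalised target (so that $\tau\ne1$ is visible) and once for a one-hot target. All names and numbers are illustrative assumptions.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_z(t, z):
    return -sum(tj * math.log(oj) for tj, oj in zip(t, softmax(z)))

def grad_z(t, z):
    """dE/dz_k = o_k * tau - t_k, with tau = sum_j t_j."""
    o = softmax(z)
    tau = sum(t)
    return [ok * tau - tk for ok, tk in zip(o, t)]

def numeric_grad_z(t, z, eps=1e-5):
    grad = []
    for k in range(len(z)):
        z_plus = list(z)
        z_plus[k] += eps
        z_minus = list(z)
        z_minus[k] -= eps
        grad.append((cross_entropy_from_z(t, z_plus)
                     - cross_entropy_from_z(t, z_minus)) / (2 * eps))
    return grad

def max_error(t, z):
    return max(abs(a - n) for a, n in zip(grad_z(t, z), numeric_grad_z(t, z)))

z = [0.3, -1.2, 0.8]
t_general = [1.0, 2.0, 0.5]   # not normalised: tau = 3.5
t_one_hot = [0.0, 0.0, 1.0]   # tau = 1, gradient reduces to o - t

err_general = max_error(t_general, z)
err_one_hot = max_error(t_one_hot, z)
print(err_general, err_one_hot)
```

For the one-hot target every component of the gradient is nonzero, including those with $t_k=0$, where it equals $o_k$.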

A second inconsistency, if I understand correctly, is that the " $o$ " that is input to $z$ seems unlikely to be the " $o$ " that is output from the softmax. I would think that it makes more sense that this is actually "further back" in network architecture?

Calling this vector $y$ , we then have

$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$

Finally, to get the gradient of $E$ with respect to the weight-matrix $w$ , we use the chain rule

$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$

giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$)

$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$

where $y$ is the input on the lowest level (of your example).

So this shows a second difference from your result: the " $o_i$ " should presumably be from the level below $z$ , which I call $y$ , rather than the level above $z$ (which is $o$ ).
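To make the final expression concrete, the sketch below computes $\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$ for a tiny layer and compares every entry with a finite-difference estimate. The layer sizes, values, and function names are assumptions for illustration, with `w[i][j]` playing the role of $w_{ij}$ and a one-hot `t`.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def outputs(w, b, y):
    """o = softmax(z) with z_j = sum_i w[i][j] * y[i] + b[j]."""
    z = [sum(w[i][j] * y[i] for i in range(len(y))) + b[j]
         for j in range(len(b))]
    return softmax(z)

def loss(w, b, y, t):
    return -sum(tj * math.log(oj) for tj, oj in zip(t, outputs(w, b, y)))

def grad_w(w, b, y, t):
    """dE/dw[i][j] = y[i] * (o[j] - t[j]), valid for one-hot t."""
    o = outputs(w, b, y)
    return [[y[i] * (o[j] - t[j]) for j in range(len(b))]
            for i in range(len(y))]

def numeric_grad_w(w, b, y, t, eps=1e-5):
    grad = [[0.0] * len(b) for _ in range(len(y))]
    for p in range(len(y)):
        for q in range(len(b)):
            w_plus = [row[:] for row in w]
            w_plus[p][q] += eps
            w_minus = [row[:] for row in w]
            w_minus[p][q] -= eps
            grad[p][q] = (loss(w_plus, b, y, t)
                          - loss(w_minus, b, y, t)) / (2 * eps)
    return grad

# Hypothetical layer: 2 inputs y, 3 output classes, class 1 is correct.
y = [1.0, -2.0]
w = [[0.1, -0.3, 0.2],
     [0.4, 0.0, -0.5]]
b = [0.05, -0.1, 0.0]
t = [0.0, 1.0, 0.0]

analytic = grad_w(w, b, y, t)
numeric = numeric_grad_w(w, b, y, t)
max_diff = max(abs(analytic[i][j] - numeric[i][j])
               for i in range(2) for j in range(3))
print(max_diff)
```

Note that the columns for classes with $t_j=0$ still receive a nonzero gradient, $y_io_j$, which resolves the concern raised in the question.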

Hopefully this helps. Does this result seem more consistent?

Update: In response to a query from the OP in the comments, here is an expansion of the first step. First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have

$\frac{\partial E}{\partial w_{pq}}=\sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$

and

$\frac{\partial o_i}{\partial w_{pq}}=\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$

so

$\frac{\partial E}{\partial w_{pq}}=\sum_i \left[ \frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right) \right]$

In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
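
As a sketch of how those summations collapse, one can substitute the three factors already derived in this thread, namely $\frac{\partial E}{\partial o_i}=-\frac{t_i}{o_i}$, $\frac{\partial o_i}{\partial z_k}=o_i(\delta_{ik}-o_k)$ and $\frac{\partial z_k}{\partial w_{pq}}=\delta_{kq}y_p$, with $\tau=\sum_i t_i$ as before:

$\begin{aligned} \frac{\partial E}{\partial w_{pq}} &= \sum_i \left(-\frac{t_i}{o_i}\right) \sum_k o_i(\delta_{ik}-o_k)\,\delta_{kq}\,y_p \\ &= -y_p\sum_i t_i(\delta_{iq}-o_q) \\ &= y_p\,(o_q\tau-t_q) \end{aligned}$

The $o_i$ in the denominator cancels inside the sum over $i$, which is why going through $\frac{\partial E}{\partial z_k}$ directly gives the same result as the three-factor chain rule, provided the sum over all outputs is kept.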

— GeoMatt22

I am not certain how the "Backprop/AutoDiff" community does these problems, but I find any time I try to take shortcuts, I am liable to make errors. So I end up doing as here, writing everything out in terms of summations with full subscripting, and always introducing new subscripts for every derivative. (Similar to my answer here ... I hope I am at least giving correct results in the end!)

— GeoMatt22

I personally find that you writing everything down makes it much easier to follow. The results look correct to me.

— Jenkar

Although I'm still trying to fully understand each of your steps, I got some valuable insights that helped me with the overall picture. I guess I need to read more into the topic of derivatives and sums. But taking your advice to take account of the summation in E, I came up with this:

— micha

for two outputs $o_{j_1}=\frac{e^{z_{j_1}}}{\Omega}$ and $o_{j_2}=\frac{e^{z_{j_2}}}{\Omega}$ with $\Omega=e^{z_{j_1}}+e^{z_{j_2}}$ the cross entropy error is

$E=-(t_1 \log o_{j_1}+t_2 \log o_{j_2})=-(t_1(z_{j_1}-\log\Omega)+t_2(z_{j_2}-\log\Omega))$

Then the derivative is

$\frac{\partial E}{\partial z_{j_1}}=-\left(t_1-t_1 \frac{e^{z_{j_1}}}{\Omega}-t_2 \frac{e^{z_{j_1}}}{\Omega}\right)=-t_1+o_{j_1}(t_1+t_2)$

which conforms with your result... taking into account that you didn't have the minus sign before the error sum

— micha

But a further question I have is: Instead of

$\frac{\partial E} {\partial w_{ij}}=\frac{\partial E} {\partial o_j} \frac{\partial o_j} {\partial z_{j}} \frac{\partial z_j} {\partial w_{ij}}$

which is generally what you're introduced to with backpropagation, you calculated:

$\frac{\partial E} {\partial w_{ij}}=\frac{\partial E} {\partial z_{j}} \frac{\partial z_j} {\partial w_{ij}}$

as if to cancel out the $\partial o_j$. Why is this way leading to the right result?

— micha

12

While @GeoMatt22's answer is correct, I personally found it very useful to reduce the problem to a toy example and draw a picture:

I then defined the operations each node was computing, treating the $h$ 's and $w$ 's as inputs to a "network" ( $\mathbf{t}$ is a one-hot vector representing the class label of the data point):

$L=-t_1\log o_1 -t_2\log o_2$

$o_1 = \frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)}$

$o_2 = \frac{\exp(y_2)}{\exp(y_1) + \exp(y_2)}$

$y_1 = w_{11}h_1 + w_{21}h_2 + w_{31}h_3$

$y_2 = w_{12}h_1 + w_{22}h_2 + w_{32}h_3$

Say I want to calculate the derivative of the loss with respect to $w_{21}$ . I can just use my picture to trace back the path from the loss to the weight I'm interested in (removed the second column of $w$ 's for clarity):

Then, I can just calculate the desired derivatives. Note that there are two paths through $y_1$ that lead to $w_{21}$ , so I need to sum the derivatives that go through each of them.

$\frac{\partial L}{\partial o_1} = -\frac{t_1}{o_1}$

$\frac{\partial L}{\partial o_2} = -\frac{t_2}{o_2}$

$\frac{\partial o_1}{\partial y_1} = \frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)} - \left(\frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)}\right)^2 = o_1(1 - o_1)$

$\frac{\partial o_2}{\partial y_1} = \frac{-\exp(y_2)\exp(y_1)}{(\exp(y_1) + \exp(y_2))^2} = -o_2o_1$

$\frac{\partial y_1}{\partial w_{21}} = h_2$

Finally, putting the chain rule together:

$\begin{aligned} \frac{\partial L}{\partial w_{21}} &= \frac{\partial L}{\partial o_1}\frac{\partial o_1}{\partial y_1}\frac{\partial y_1}{\partial w_{21}} + \frac{\partial L}{\partial o_2}\frac{\partial o_2}{\partial y_1}\frac{\partial y_1}{\partial w_{21}}\\ &= \frac{-t_1}{o_1}[o_1(1 - o_1)]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2\\ &= h_2(t_2 o_1 - t_1 + t_1 o_1)\\ &= h_2(o_1(t_1 + t_2) - t_1)\\ &= h_2(o_1 - t_1) \end{aligned}$

Note that in the last step, $t_1 + t_2 = 1$ because the vector $\mathbf{t}$ is a one-hot vector.
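The two-path sum can also be evaluated numerically. The sketch below uses made-up values for $h$, $w$ and $t$ (with `w[i][j]` standing for $w_{(i+1)(j+1)}$, so $w_{21}$ is `w[1][0]`), computes each path separately, and compares the total with the closed form $h_2(o_1-t_1)$ and with a finite-difference estimate.

```python
import math

# Hypothetical toy network: 3 hidden activations, 2 output classes.
h = [0.5, -1.5, 2.0]
w = [[0.2, -0.1],
     [0.4, 0.3],
     [-0.6, 0.1]]
t = [0.0, 1.0]  # one-hot: class 2 is the true class

def forward(weights):
    """Return (o, L) for the given weights, following the node definitions."""
    y = [sum(weights[i][j] * h[i] for i in range(3)) for j in range(2)]
    shift = max(y)
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    o = [e / total for e in exps]
    L = -sum(tj * math.log(oj) for tj, oj in zip(t, o))
    return o, L

o, L = forward(w)
h2 = h[1]

# Path through o_1: dL/do_1 * do_1/dy_1 * dy_1/dw_21
path_o1 = (-t[0] / o[0]) * (o[0] * (1 - o[0])) * h2
# Path through o_2: dL/do_2 * do_2/dy_1 * dy_1/dw_21
path_o2 = (-t[1] / o[1]) * (-o[1] * o[0]) * h2

two_path_sum = path_o1 + path_o2
closed_form = h2 * (o[0] - t[0])

# Finite-difference check on w_21.
eps = 1e-5
w_plus = [row[:] for row in w]
w_plus[1][0] += eps
w_minus = [row[:] for row in w]
w_minus[1][0] -= eps
numeric = (forward(w_plus)[1] - forward(w_minus)[1]) / (2 * eps)

print(two_path_sum, closed_form, numeric)
```

In this example $t_1=0$, so the path through $o_1$ contributes nothing and the whole gradient for $w_{21}$ comes from the path through $o_2$; dropping that path would wrongly give zero.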

— Vivek Subramanian

This is what finally cleared this up for me! Excellent and Elegant explanation!!!!

— SantoshGupta7

2

I’m glad you both enjoyed and benefited from reading my post! It was also helpful for me to write it out and explain it.

— Vivek Subramanian

@VivekSubramanian should it be

$= \frac{-t_1}{o_1}[o_1(1 - o_1)]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2$

instead?

— koryakinp

You’re right - it was a typo! I will make the change.

— Vivek Subramanian

The thing I do not understand here is that you also assign logits (unscaled scores) to some neurons ($o$ is the softmaxed logits, i.e. the predictions, and $y$ is the logits in your case). However, this is not normally the case, is it? Look at this picture (o_out1 is the prediction and o_in1 is the logits): in that case, how can you find the partial derivative of o2 with respect to y1?

— ARAT

6

In place of the $\{o_i\},\,$ I want a letter whose uppercase is visually distinct from its lowercase. So let me substitute $\{y_i\}$ . Also, let's use the variable $\{p_i\}$ to designate the $\{o_i\}$ from the previous layer.

Let $Y$ be the diagonal matrix whose diagonal equals the vector $y$, i.e.

$Y={\rm Diag}(y)$

Using this new matrix variable and the Frobenius Inner Product we can calculate the gradient of $E$ wrt $W$.

$\eqalign{ z &= Wp+b &dz= dWp \cr y &= {\rm softmax}(z) &dy = (Y-yy^T)\,dz \cr E &= -t:\log(y) &dE = -t:Y^{-1}dy \cr\cr dE &= -t:Y^{-1}(Y-yy^T)\,dz \cr &= -t:(I-1y^T)\,dz \cr &= -t:(I-1y^T)\,dW\,p \cr &= (y1^T-I)tp^T:dW \cr &= ((1^Tt)yp^T - tp^T):dW \cr\cr \frac{\partial E}{\partial W} &= (1^Tt)yp^T - tp^T \cr }$
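
For readers less used to the colon notation, the steps above appear to rely on the following standard properties of the Frobenius inner product. They are stated here only as a reminder; $A$, $B$, $C$, $a$ and $M$ are generic placeholders rather than symbols from the derivation, and $1$ is the all-ones vector.

$\begin{aligned} A:B &= {\rm Tr}(A^TB) = \sum_{i,j} A_{ij}B_{ij} \\ A:BC &= B^TA:C = AC^T:B \\ a:M\,dW\,p &= M^Ta\,p^T:dW \\ Y^{-1}y &= 1 \implies Y^{-1}(Y-yy^T) = I-1y^T \end{aligned}$

Taking $a=-t$ and $M=I-1y^T$ in the third line gives $(y1^T-I)\,t\,p^T:dW$, matching the derivation. Since $z=Wp+b$, the matrix $W$ here has one row per output, so in components the result reads $\frac{\partial E}{\partial W_{ji}}=\left((1^Tt)\,y_j-t_j\right)p_i$, the same expression as the index-notation answers up to the ordering of the two indices.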

— frank

6

Here is one of the cleanest and best-written sets of notes I have come across on the web explaining the "calculation of derivatives in backpropagation algorithm with cross entropy loss function".

— yottabytt

In the given pdf how did equation 22 become equation 23? As in how did the Summation(k!=i) get a negative sign. Shouldn't it get a positive sign? Like Summation(Fn)(For All K) = Fn(k=i) + Summation(Fn)(k!=i) should be happening according to my understanding.

— faizan

1

Here's a link explaining the softmax and its derivative.

It explains the reason for using i=j and i!=j.

— S. Muhammad H. Mustafa

It is recommended to provide a minimal, stand-alone answer, in case that link gets broken in the future. Otherwise, this might no longer help other users in the future.

— luchonacho

0

Other answers have provided the correct way of calculating the derivative, but they do not point out where you went wrong. In fact, $t_j$ is always 1 in your last equation, because you have assumed that $o_j$ is the output node whose target is 1. The $o_j$ of the other nodes have a different functional form with respect to $z_j$, and so lead to a different form of derivative, which is why other people treat $i=j$ and $i\neq j$ differently.

— kuixiong